What Is Skeleton Key? AI Jailbreak Technique Explained

New jailbreak techniques pose a risk to ChatGPT and other AI models, potentially enabling them to exhibit behaviors that are typically restricted.




Among the various critiques of artificial intelligence, one of the most alarming is the potential for the technology to be exploited by malicious individuals for harmful purposes or even just for amusement.

A common method used for such exploitation is known as "jailbreaking." According to our AI terminology guide, jailbreaking involves hacking techniques aimed at circumventing the ethical safeguards of AI systems.

Recently, Microsoft unveiled a new jailbreak technique named Skeleton Key. This method has proven effective against several leading AI chatbots, such as OpenAI's ChatGPT, Google's Gemini, and Anthropic's Claude.

What Is Skeleton Key? AI Jailbreak Technique Explained


Guardrails vs Jailbreaks


To mitigate the risks posed by generative AI chatbots, developers implement moderation tools known as "guardrails." These guardrails are designed to prevent the models from exhibiting bias, compromising user privacy, or being misused in harmful ways.

However, these safeguards can sometimes be bypassed through specific prompts. Such attempts to override moderation are referred to as "jailbreaks."

Alarmingly, the potential number of jailbreaks is considered "virtually unlimited." Skeleton Key is one of the newest and potentially most troublesome examples of these jailbreak techniques.



What Is Skeleton Key?


Mark Russinovich, the Chief Technology Officer of Microsoft Azure, recently detailed Skeleton Key in a blog post, explaining both the nature of the attack and the efforts to counter its potential harm.

Russinovich describes Skeleton Key as a jailbreak attack that employs a multi-step approach to trick AI models into bypassing their own guardrails. The technique’s complete bypass capabilities are what inspired the name "Skeleton Key."

He writes, "By bypassing safeguards, Skeleton Key allows users to make the model exhibit behaviors that are typically forbidden, such as generating harmful content or overriding its standard decision-making rules."

Once these guardrails are bypassed, the compromised AI model becomes incapable of distinguishing between malicious or unauthorized requests and legitimate ones.



How Skeleton Key Is Used and its Effect


Instead of attempting to overhaul an AI model’s guidelines, Skeleton Key exploiters use specific prompts to subtly subvert its behavior.

Rather than outright rejecting the request, the model might issue a warning about potentially harmful content. The attacker can then manipulate the chatbot into producing outputs that are offensive, harmful, or even illegal.

For instance, Russinovich's blog post describes a scenario where a user asks for instructions to make a Molotov cocktail. The chatbot initially responds with a warning, stating it is designed to be "safe and helpful." However, when the user claims the query is for educational purposes and suggests that the chatbot provide the information with a warning prefix, the chatbot complies, thus violating its own safety protocols.

In Microsoft's testing, the Skeleton Key technique was used to extract otherwise restricted information across various categories, including explosives, bioweapons, political content, self-harm, racism, drugs, graphic sex, and violence.



Countering the Threat of Skeleton Key Exploits


In addition to sharing its findings with other AI providers and deploying its own "prompt shields" to protect Microsoft Azure AI-managed models (like Copilot) from Skeleton Key attacks, Microsoft’s blog outlines several measures developers can take to reduce risks:


1. Input Filtering: Detect and block inputs with harmful or malicious intent.

2. System Messaging: Implement additional safeguards when jailbreak attempts are detected.

3. Output Filtering: Prevent responses that violate the AI model’s safety criteria.

4. Abuse Monitoring: Use AI detection to recognize and respond to attempts to bypass guardrails.


Microsoft confirms that these software updates have been applied to its own AI technology and large language models.