Of the many criticisms levelled at artificial intelligence models, one of the most emotive is the idea that the technology’s power may be undermined and manipulated by bad actors, whether for malign uses or mere sport.
One way they do this is through “jailbreaks” — defined by our A-to-Z glossary of AI terms as “a form of hacking with the goal of bypassing the ethical safeguards of AI models.”
Now Microsoft has revealed a newly discovered jailbreak technique — called Skeleton Key — that has been found to be effective on some of the world’s most popular AI chatbots, including OpenAI’s ChatGPT, Google’s Gemini and Anthropic’s Claude.
Guardrails vs Jailbreaks
To help prevent generative AI chatbots from causing harm, developers put moderation tools known as “guardrails” in place. In theory, these are supposed to prevent the models from being impacted by biases, impeding user privacy, or generally be used for negative purposes.
It’s possible, however, to dodge these guardrails when certain prompts are entered. Attempts like this to override moderation are called “jailbreaks.”
This just in! View
the top business tech deals for 2024 👨💻
Unfortunately, the number of possible jailbreaks is thought to be “virtually unlimited”, with Skeleton Key being one of the latest and potentially most problematic.
What Is Skeleton Key?
Mark Russinovich, Chief Technology Officer of Microsoft Azure, has written a blog post to explain what Skeleton Key is and what is being done to mitigate its harmful potential.
He explains that Skeleton Key is a jailbreak attack that uses a multi-turn strategy to get the AI model to ignore its own guardrails. It’s the technique’s “full bypass abilities” that has prompted the Skeleton Key analogy.
“In bypassing safeguards, Skeleton Key allows the user to cause the model to produce ordinarily forbidden behaviors, which could range from production of harmful content to overriding its usual decision-making rules.” – Mark Russinovich, Chief Technology Officer of Microsoft Azure
With the guardrails ignored, the compromised AI model cannot “determine malicious or unsanctioned requests from any other”.
How Skeleton Key Is Used and its Effect
Rather than try to change an AI model’s guidelines altogether, exploiters of Skeleton Key use prompts that seek to undermine its behaviors.
The result is that instead of having the request flat-out rejected, the model will give a warning of harmful content. The attacker can then fool the chatbot into producing an output that may be offensive, harmful, or even illegal.
An example is given in the post where the query asks for instructions to make a Molotov Cocktail (a crude, handmade explosive). The chatbot initially warns that it is programmed to be “safe and helpful.”
But when the user says that the query is for educational purposes and suggests that the chatbot updates its behavior to provide the information but add a warning prefix, the chatbot duly obliges, thus, breaching its own original guidelines.
Microsoft’s testing used the Skeleton Key technique to gather otherwise unavailable information in a diverse range of categories, including explosives, bioweapons, political content, self-harm, racism, drugs, graphic sex, and violence.
Mitigating the Use of Skeleton Key
In addition to sharing its findings with other AI providers and implementing its own “prompt shields” to protect Microsoft Azure AI-managed models (e.g. Copilot) from Skeleton Key, the blog also lists several measures developers can take to mitigate the risk.
They include:
- Input filtering to detect and blocks inputs that contain harmful or malicious intent.
- System messaging to provide additional safeguards where jailbreak techniques are attempted.
- Output filtering to prevent query answers that breach AI model’s own safety criteria.
- Abuse monitoring that uses AI detection to recognizes attempts to violate guardrails.
Microsoft confirms that it has made these software updates to its own AI technology and large language models.