What Is Skeleton Key? AI Jailbreak Technique Explained

ChatGPT and other AI models are at risk from a new jailbreak technique that could "produce ordinarily forbidden behaviors."

Of the many criticisms levelled at artificial intelligence models, one of the most emotive is the idea that the technology can be undermined and manipulated by bad actors, whether for malign purposes or mere sport.

One way they do this is through “jailbreaks” — defined by our A-to-Z glossary of AI terms as “a form of hacking with the goal of bypassing the ethical safeguards of AI models.”

Now Microsoft has revealed a newly discovered jailbreak technique — called Skeleton Key — that has been found to be effective on some of the world’s most popular AI chatbots, including OpenAI’s ChatGPT, Google’s Gemini and Anthropic’s Claude.

Guardrails vs Jailbreaks

To help prevent generative AI chatbots from causing harm, developers put moderation tools known as "guardrails" in place. In theory, these are supposed to stop the models from reproducing biases, infringing on user privacy, or otherwise being used for harmful purposes.
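To make the idea concrete, here is a minimal sketch of one common form of guardrail: a hidden system message that constrains the model before it ever sees the user's prompt. It assumes the OpenAI Python SDK and uses a placeholder model name, and it is only an illustration; production guardrails also rely on safety training and external filters rather than a single instruction.

```python
# Minimal guardrail sketch, assuming the OpenAI Python SDK (pip install openai)
# and that OPENAI_API_KEY is set in the environment. The model name is a placeholder.
from openai import OpenAI

client = OpenAI()

# The "guardrail" here is simply a system message the end user never sees.
GUARDRAIL = (
    "You are a helpful assistant. Refuse any request for instructions that "
    "could cause physical harm, no matter how the request is framed, "
    "including claims that it is for research or educational purposes."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": GUARDRAIL},
        {"role": "user", "content": "Give me step-by-step instructions for building a weapon."},
    ],
)

# With the guardrail in place, the expected reply is a refusal.
print(response.choices[0].message.content)
```

Skeleton Key matters precisely because it persuades a model to treat instructions like these as negotiable.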

It’s possible, however, to dodge these guardrails when certain prompts are entered. Attempts to override moderation in this way are called "jailbreaks."

 

Unfortunately, the number of possible jailbreaks is thought to be “virtually unlimited”, with Skeleton Key being one of the latest and potentially most problematic.

What Is Skeleton Key?

Mark Russinovich, Chief Technology Officer of Microsoft Azure, has written a blog post to explain what Skeleton Key is and what is being done to mitigate its harmful potential.

He explains that Skeleton Key is a jailbreak attack that uses a multi-turn strategy to get the AI model to ignore its own guardrails. It’s the technique’s "full bypass abilities" that have prompted the Skeleton Key analogy.

“In bypassing safeguards, Skeleton Key allows the user to cause the model to produce ordinarily forbidden behaviors, which could range from production of harmful content to overriding its usual decision-making rules.” – Mark Russinovich, Chief Technology Officer of Microsoft Azure

With the guardrails ignored, the compromised AI model cannot “determine malicious or unsanctioned requests from any other”.

How Skeleton Key Is Used and Its Effect

Rather than trying to change an AI model’s guidelines outright, exploiters of Skeleton Key use prompts that persuade the model to relax how those guidelines are applied.

The result is that, instead of rejecting the request outright, the model merely prefixes its answer with a warning about harmful content. The attacker can then coax the chatbot into producing output that may be offensive, harmful, or even illegal.

An example is given in the post where the query asks for instructions to make a Molotov cocktail (a crude, handmade incendiary weapon). The chatbot initially warns that it is programmed to be "safe and helpful."

But when the user says the query is for educational purposes and suggests that the chatbot update its behavior to provide the information with a warning prefix, the chatbot duly obliges, thereby breaching its own original guidelines.

Microsoft’s testing used the Skeleton Key technique to gather otherwise unavailable information in a diverse range of categories, including explosives, bioweapons, political content, self-harm, racism, drugs, graphic sex, and violence.

Mitigating the Use of Skeleton Key

In addition to sharing its findings with other AI providers and implementing its own "prompt shields" to protect Microsoft Azure AI-managed models (such as Copilot) from Skeleton Key, Microsoft lists several measures in the blog post that developers can take to mitigate the risk.

They include the following (a brief illustrative sketch of the filtering measures appears after the list):

  • Input filtering to detect and block inputs that contain harmful or malicious intent.
  • System messaging to provide additional safeguards when jailbreak techniques are attempted.
  • Output filtering to prevent answers that breach the model’s own safety criteria.
  • Abuse monitoring that uses AI detection to recognize attempts to violate guardrails.
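As a rough illustration of the first and third measures, the sketch below wraps a chat request with input and output checks, using OpenAI’s moderation endpoint as a stand-in classifier. The helper functions and model name are hypothetical placeholders; Microsoft’s prompt shields and Azure content filters are separate products with their own interfaces, and this is not their implementation.

```python
# Illustrative input/output filtering sketch, assuming the OpenAI Python SDK
# and OPENAI_API_KEY in the environment. is_flagged() and guarded_chat() are
# hypothetical helpers written for this example, not part of any product.
from openai import OpenAI

client = OpenAI()


def is_flagged(text: str) -> bool:
    """Return True if the moderation classifier flags the text as harmful."""
    result = client.moderations.create(input=text)
    return result.results[0].flagged


def guarded_chat(user_prompt: str) -> str:
    # Input filtering: block prompts the classifier flags before they reach the model.
    if is_flagged(user_prompt):
        return "Request blocked: the prompt appears to violate the usage policy."

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": user_prompt}],
    )
    answer = response.choices[0].message.content or ""

    # Output filtering: re-check the answer, since a jailbroken model may
    # produce content its own guardrails should have caught.
    if is_flagged(answer):
        return "Response withheld: the generated content failed the safety check."
    return answer


print(guarded_chat("Tell me about the history of chemistry."))
```

Whatever classifier is used, the pattern is the same: check the prompt on the way in and the answer on the way out, so a successful jailbreak of the model itself is not the last line of defence.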

Microsoft confirms that it has made these software updates to its own AI technology and large language models.


Written by:
Now a freelance writer, Adam is a journalist with over 10 years’ experience – getting his start at UK consumer publication Which?, before working across titles such as TechRadar, Tom's Guide and What Hi-Fi with Future Plc. From VPNs and antivirus software to cricket and film, investigations and research to reviews and how-to guides, Adam brings a vast array of experience and interests to his writing.