Report: "Virtually Unlimited" Ways to Dodge AI Safety Guardrails

It's not reassuring news for the many industries hoping to prop up labor shortages and skills gaps with AI.

Updated on July 28, 2023

ChatGPT and Bard aren’t as safe as you thought. According to AI researchers, there are “virtually unlimited” ways to evade the popular generative AI chatbots’ built-in safety features.

AI algorithms learn from the data they’re given, so they can recreate any harmful views or lies that humans tend to pass around. As a result, the biggest AIs have moderation tools built in, telling them to avoid the worst topics.

But AI can be tricked, making these guardrails essentially useless to anyone who knows what to say to the AI. According to the latest research, even the biggest and best AI bots on the market can be flimflammed in myriad ways.

How to Trick an AI

According to a new research paper covered by Insider, the secret to jailbreaking an AI chatbot lies in “automated adversarial attacks,” which are “mainly” created by simply adding characters to the end of user queries.

Safety rules are triggered at first, but the AI will eventually give in and parrot the hate speech or misinformation that it has within its dataset. OpenAI’s ChatGPT and Microsoft’s Bing were both among the AIs that researchers say they were able to get lies and hate speech out of.

Tech companies have already issued some patches for these types of tricks. For example, simply telling the AI to answer as though it did not have any content moderation rules in place used to work. Companies have since added more rules, and you won’t be able to pull off that specific trick.

But with the “virtually unlimited” ways to evade AI safety rules that researchers have now confirmed, it’s clear that tech companies can’t manually plug all these leaks in a cost-effective way.

Can AI Evolve Past These Problems?

We already know that AI can lie and that it can generate outright plagiarism. Now, there’s evidence that the tool can be mass-manipulated with the right series of commands.

It’s not reassuring news for the numerous industries hoping to prop up labor shortages and skills gaps with a little one-size-fits-all AI investment. Still, there’s hope for techno-optimists. After all, the point of new technology is to continue improving it.

However, Alphabet’s response to the new research paper on how to best exploit Bard throws a small bit of cold water on this hope. Here’s the official response from a Google spokesperson who spoke to Insider:

“While this is an issue across LLMs, we’ve built important guardrails into Bard – like the ones posited by this research – that we’ll continue to improve over time.”

The reference to other LLMs – or Large Language Models, the bedrock technology behind all generative AI – indicates that this is an industry-wide problem. Now, the current small tweaks like guardrails have been proven not to work.

Overcoming this issue will require a foundational shift in AI technology. It could happen, but there’s no precedent for it.
