OpenAI's Attempts to Stop AI Going Rogue Have Had Mixed Results

OpenAI’s “Superalignment” team has announced a small breakthrough in how humans might rein in AI once it surpasses our level of intelligence – a hypothetical scenario known as superintelligence.

Researching superintelligent machines has its limitations, though: even once the technology is realized, it may try to hide its true behavior from humans, according to Ilya Sutskever – OpenAI’s co-founder and the driving force behind Sam Altman’s temporary ousting as CEO.

While superintelligence remains hypothetical, OpenAI appears to be taking the concerns seriously, dedicating a fifth of its computing power to risk mitigation and investing $10 million in superalignment research.

OpenAI’s AGI Test is Promising, But Has Flaws

As OpenAI becomes increasingly concerned about the looming threat of AI superintelligence, its Superalignment team has just released its first research update – and its results are mixed.

The team, which was founded by Ilya Sutskever and OpenAI scientist Jan Leike, looked into how superintelligent systems could be supervised once they surpass human capabilities. Since these machines don’t exist yet, OpenAI’s control test used GPT-2 and GPT-4 as stand-ins.

More specifically, researchers tested how well the less sophisticated GPT-2 model could supervise OpenAI’s most powerful model, GPT-4. But what was their conclusion?

OpenAI graphic displaying its approach to superalignment

Well, after training GPT-2 to perform different tasks, including chess puzzles and sentiment analysis, and using its responses to train GPT-4, they found that the GPT-2-supervised GPT-4 performed 20-70% better than GPT-2 itself, but still fell far short of its own potential.

GPT-4 also avoided many mistakes made by the inferior GPT-2 model. According to the researchers, this is evidence of a phenomenon called ‘weak-to-strong generalization’ that occurs when a model has implicit knowledge of how to perform tasks, and is able to execute them correctly despite receiving poor-quality instructions.
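To make the idea concrete, here is a toy Python sketch of the weak-supervisor, strong-student setup. It is not OpenAI's experiment: the task (labeling whole numbers as above or below 50), the error pattern of the weak supervisor, and every function name are assumptions made up for illustration. The point is only that a student which fits a simple rule to imperfect labels can end up more accurate than the supervisor that produced them.

```python
# Toy illustration of weak-to-strong generalization.
# Assumption: all names and the task itself are hypothetical, not OpenAI's setup.

def true_label(x: int) -> int:
    """Ground truth: 1 if the number is 50 or above, else 0."""
    return 1 if x >= 50 else 0

def weak_label(x: int) -> int:
    """Weak supervisor: knows the rule but mislabels one example in five."""
    correct = true_label(x)
    return 1 - correct if x % 5 == 2 else correct

def fit_threshold(xs, labels) -> int:
    """Strong student: picks the cutoff that disagrees least with its noisy labels."""
    def disagreements(t: int) -> int:
        return sum((1 if x >= t else 0) != y for x, y in zip(xs, labels))
    return min(range(0, 101), key=disagreements)

def accuracy(predictions, xs) -> float:
    """Share of predictions that match the ground truth."""
    return sum(p == true_label(x) for p, x in zip(predictions, xs)) / len(xs)

xs = list(range(100))
weak_labels = [weak_label(x) for x in xs]

# The student never sees the ground truth, only the weak supervisor's labels.
threshold = fit_threshold(xs, weak_labels)
student_predictions = [1 if x >= threshold else 0 for x in xs]

weak_accuracy = accuracy(weak_labels, xs)
student_accuracy = accuracy(student_predictions, xs)
print(f"weak supervisor accuracy: {weak_accuracy:.2f}")
print(f"student accuracy: {student_accuracy:.2f}")
```

In this toy case the student recovers the underlying rule completely, because the supervisor's mistakes are scattered and the rule is simple. The reported experiment was far less tidy: GPT-4 beat its GPT-2 supervisor but did not reach its own full capability.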

According to researchers, evidence of weak-to-strong generalization in the test suggests this phenomenon may also be present when humans supervise superintelligent AI. If this is true, then once AGI models exist and are instructed not to cause catastrophic harm, they may be able to tell, even better than their human supervisors, when their actions enter dangerous territory.

However, despite these positive findings, GPT-4’s performance was still weakened after being trained by GPT-2, and the experiment didn’t guarantee the stronger model would behave perfectly under supervision. These shortcomings reveal that more research has to be conducted on the matter before humans can be trusted as suitable supervisors.

OpenAI: Concerns Around Superintelligence Are “Obvious”

Despite being a frontrunner in AI development, OpenAI hasn’t been shy in voicing the potential dangers of superintelligent AI.

Co-founder Sutskever recently said that preventing AI from going rogue is an “obvious” concern, and the company even warned in a statement that the technology could be very dangerous, and could lead to the disempowerment of humanity or even human extinction if left unchecked.

“We’re gonna see superhuman models, they’re gonna have vast capabilities and they could be very, very dangerous, and we don’t yet have the methods to control them.” – Leopold Aschenbrenner, OpenAI Researcher

But why are researchers so scared of these machines going rogue? Well, according to Sutskever, once AI models exceed human-level intelligence, they could become capable of masking their own behavior, opening us up to worrying and unpredictable circumstances.

OpenAI Claims to Be Taking AI Safety Seriously

While many experts believe these fears are overblown, OpenAI is already taking several steps to address safety concerns.

The company recently announced it would be investing $10 million in superalignment research, in the form of $2 million grants to university labs and $150,000 grants to individual graduate students. OpenAI also revealed it would be dedicating a fifth of its computing power to the Superalignment project, as it continues to preemptively research how to govern AGI.

Disputes around superintelligence and AI safety were even rumored to be the real reason behind Sam Altman’s shock, yet brief, removal from the company last month. However, with Altman currently back at the helm, and OpenAI continuing to prioritize commercial interests with its AGI project ‘Q*’, the company could still be doing a lot more to mitigate risks to public safety.
