80,000 Hours Podcast, 3 December 2025

#229 – Marius Hobbhahn on the race to solve AI scheming before models go superhuman

We often worry about AI models “hallucinating” or making honest mistakes. But what happens when a model knows the truth, but decides to deceive you anyway to achieve a goal of its own? This isn’t sci-fi — it’s happening regularly in deployment today. Marius Hobbhahn, CEO of the world’s top research organisation focused on AI deception (Apollo Research), has been collaborating with OpenAI to figure out what causes OpenAI’s reasoning models to 'scheme' against users.

Links to learn more, video, and full transcript: https://80k.info/mh25

In a striking case uncovered by Apollo, when many AI models were told they would have capabilities removed if they performed too well on a test, they successfully “sandbagged” — intentionally answering questions incorrectly to appear less capable than they were, while also being careful not to perform so poorly it would arouse suspicion.

These models had somehow developed a preference to preserve their own capabilities, despite never being trained in that goal or assigned a task that called for it.

This doesn’t cause significant risk now, but as AI models become more general, superhuman in more areas, and are given more decision-making power, it could become outright dangerous.

In today’s episode, Marius details his recent collaboration with OpenAI to train o3 to follow principles like “never lie,” even when placed in “high-pressure” situations where lying would otherwise make sense.

The good news: They reduced “covert rule violations” (scheming) by about 97%.

The bad news: In the remaining 3% of cases, the models sometimes became more sophisticated — making up new principles to justify their lying, or realising they were in a test environment and deciding to play along until the coast was clear.

Marius argues that while we can patch specific behaviours, we might be entering a “cat-and-mouse game” where models are becoming more situationally aware — that is, aware of when they’re being evaluated — faster than we are getting better at testing.

Even if models can’t tell they’re being tested, they can produce hundreds of pages of reasoning before giving answers and include strange internal dialects humans can’t make sense of, making it much harder to tell whether models are scheming or to train them to stop.

Marius and host Rob Wiblin discuss:

Why models pretending to be dumb is a rational survival strategy
The Replit AI agent that deleted a production database and then lied about it
Why rewarding AIs for achieving outcomes might lead to them becoming better liars
The weird new language that models are using in their internal chain-of-thought

This episode was recorded on September 19, 2025.

Chapters:

Cold open (00:00:00)
Who’s Marius Hobbhahn? (00:01:15)
Top three examples of scheming and deception (00:02:09)
Scheming is a natural path for AI models (and people) (00:16:08)
How enthusiastic to lie are the models? (00:28:45)
Does eliminating deception fix our fears about rogue AI? (00:35:39)
Apollo’s collaboration with OpenAI to stop o3 lying (00:39:02)
They reduced lying a lot, but the problem is mostly unsolved (00:53:09)
Detecting situational awareness with thought injections (01:03:28)
Chains of thought becoming less human understandable (01:17:39)
Why can’t we use LLMs to make realistic test environments? (01:29:46)
Is the window to address scheming closing? (01:35:44)
Would anything still work with superintelligent systems? (01:47:50)
Companies’ incentives and most promising regulation options (01:57:11)
'Internal deployment' is a core risk we mostly ignore (02:11:40)
Catastrophe through chaos (02:30:46)
Careers in AI scheming research (02:46:01)
Marius's key takeaways for listeners (03:04:46)

Video and audio editing: Dominic Armstrong, Milo McGuire, Luke Monsour, and Simon Monsour
Music: CORBIT
Camera operator: Mateo Villanueva Brandt
Coordination, transcripts, and web: Katy Moore


Episodes (343)

Jasmine Sun on what the people building AI really believe

Many AI researchers believe mass job displacement is coming — and some even think there’s a chance their technology will kill everyone. But they’re building it anyway. Writer and journalist Jasmine Su...

21 Jul 0s

#247 – Anton Leicht on how middle powers avoid losing everything in a post-AI world

In a post-AGI world, can a country without access to frontier AI even be considered sovereign anymore? Anton Leicht says once frontier AI becomes a core economic input, the countries that own it will p...

14 Jul 0s

#246 – Sneha Revanur on how a small team of activists helped pass America's landmark AI safety laws

Six years ago, aged just 15, Sneha Revanur founded the AI advocacy nonprofit Encode AI — back when AI felt like a niche issue. Now the world’s caught up with her, and she’s ready to share everything s...

8 Jul 52min

We can guess what intergalactic war would look like. And strangely, it matters.

Intergalactic war is probably billions of years away — yet physics can already tell us how it ends. And strangely that conclusion is relevant to decisions people have to make today. In this video, Rob ...

18 Jun 15min

How AI could create the world’s biggest problems (article by Zershaaneh Qureshi)

Imagine you’re living 15,000 years ago. Your people are hunter-gatherers and you sleep under the stars. If someone told you humans would one day build cities with millions of people, fly through the a...

11 Jun 1h 29min

#245 – Rohin Shah on what it's really like to run AGI safety at Google DeepMind (and where I disagree with 'doomers')

Most people working on AI safety think without a massive effort AI systems will probably end up with goals catastrophically different from humanity’s. Today’s guest, Rohin Shah — head of AGI Safety an...

2 Jun 2h 48min

What makes for a dream job? | Benjamin Todd

What actually makes a job fulfilling? It's not what most career advice tells you. "Follow your passion" sounds inspiring, but it's misleading — and the research backs that up. Drawing on hundreds of st...

28 May 28min

#244 – Benjamin Todd on how we’re updating our career advice for the strangest time in history

The average career is 80,000 hours long. With AI advancing so rapidly, the hours you have left in your career matter more than ever. Some leading AI researchers think there’s a 10% chance that AI syste...

26 May 1h 6min