#214 – Buck Shlegeris on controlling AI that wants to take over – so we can use it anyway

#214 – Buck Shlegeris on controlling AI that wants to take over – so we can use it anyway

Most AI safety conversations centre on alignment: ensuring AI systems share our values and goals. But despite progress, we’re unlikely to know we’ve solved the problem before the arrival of human-level and superhuman systems in as little as three years.

So some are developing a backup plan to safely deploy models we fear are actively scheming to harm us — so-called “AI control.” While this may sound mad, given the reluctance of AI companies to delay deploying anything they train, not developing such techniques is probably even crazier.

Today’s guest — Buck Shlegeris, CEO of Redwood Research — has spent the last few years developing control mechanisms, and for human-level systems they’re more plausible than you might think. He argues that given companies’ unwillingness to incur large costs for security, accepting the possibility of misalignment and designing robust safeguards might be one of our best remaining options.

Links to learn more, highlights, video, and full transcript.

As Buck puts it: "Five years ago I thought of misalignment risk from AIs as a really hard problem that you’d need some really galaxy-brained fundamental insights to resolve. Whereas now, to me the situation feels a lot more like we just really know a list of 40 things where, if you did them — none of which seem that hard — you’d probably be able to not have very much of your problem."

Of course, even if Buck is right, we still need to do those 40 things — which he points out we’re not on track for. And AI control agendas have their limitations: they aren’t likely to work once AI systems are much more capable than humans, since greatly superhuman AIs can probably work around whatever limitations we impose.

Still, AI control agendas seem to be gaining traction within AI safety. Buck and host Rob Wiblin discuss all of the above, plus:

  • Why he’s more worried about AI hacking its own data centre than escaping
  • What to do about “chronic harm,” where AI systems subtly underperform or sabotage important work like alignment research
  • Why he might want to use a model he thought could be conspiring against him
  • Why he would feel safer if he caught an AI attempting to escape
  • Why many control techniques would be relatively inexpensive
  • How to use an untrusted model to monitor another untrusted model
  • What the minimum viable intervention in a “lazy” AI company might look like
  • How even small teams of safety-focused staff within AI labs could matter
  • The moral considerations around controlling potentially conscious AI systems, and whether it’s justified

Chapters:

  • Cold open |00:00:00|
  • Who’s Buck Shlegeris? |00:01:27|
  • What's AI control? |00:01:51|
  • Why is AI control hot now? |00:05:39|
  • Detecting human vs AI spies |00:10:32|
  • Acute vs chronic AI betrayal |00:15:21|
  • How to catch AIs trying to escape |00:17:48|
  • The cheapest AI control techniques |00:32:48|
  • Can we get untrusted models to do trusted work? |00:38:58|
  • If we catch a model escaping... will we do anything? |00:50:15|
  • Getting AI models to think they've already escaped |00:52:51|
  • Will they be able to tell it's a setup? |00:58:11|
  • Will AI companies do any of this stuff? |01:00:11|
  • Can we just give AIs fewer permissions? |01:06:14|
  • Can we stop human spies the same way? |01:09:58|
  • The pitch to AI companies to do this |01:15:04|
  • Will AIs get superhuman so fast that this is all useless? |01:17:18|
  • Risks from AI deliberately doing a bad job |01:18:37|
  • Is alignment still useful? |01:24:49|
  • Current alignment methods don't detect scheming |01:29:12|
  • How to tell if AI control will work |01:31:40|
  • How can listeners contribute? |01:35:53|
  • Is 'controlling' AIs kind of a dick move? |01:37:13|
  • Could 10 safety-focused people in an AGI company do anything useful? |01:42:27|
  • Benefits of working outside frontier AI companies |01:47:48|
  • Why Redwood Research does what it does |01:51:34|
  • What other safety-related research looks best to Buck? |01:58:56|
  • If an AI escapes, is it likely to be able to beat humanity from there? |01:59:48|
  • Will misaligned models have to go rogue ASAP, before they're ready? |02:07:04|
  • Is research on human scheming relevant to AI? |02:08:03|

This episode was originally recorded on February 21, 2025.

Video: Simon Monsour and Luke Monsour
Audio engineering: Ben Cordell, Milo McGuire, and Dominic Armstrong
Transcriptions and web: Katy Moore

Avsnitt(333)

'95% of AI Pilots Fail': The hidden agenda behind the viral stat that misled millions

'95% of AI Pilots Fail': The hidden agenda behind the viral stat that misled millions

You might have heard that '95% of corporate AI pilots' are failing. It was one of the most widely cited AI statistics of 2025, parroted by media outlets everywhere. It helped trigger a Nasdaq selloff ...

28 Apr 10min

#242 – Will MacAskill on how we survive the 'intelligence explosion,' AI character, and the case for 'viatopia'

#242 – Will MacAskill on how we survive the 'intelligence explosion,' AI character, and the case for 'viatopia'

Hundreds of millions already turn to AI on the most personal of topics — therapy, political opinions, and how to treat others. And as AI takes over more of the economy, the character of these systems ...

22 Apr 3h 9min

Risks from power-seeking AI systems (article narration by Zershaaneh Qureshi)

Risks from power-seeking AI systems (article narration by Zershaaneh Qureshi)

Hundreds of prominent AI scientists and other notable figures signed a statement in 2023 saying that mitigating the risk of extinction from AI should be a global priority. At 80,000 Hours, we’ve consi...

16 Apr 1h 29min

How scary is Claude Mythos? 303 pages in 21 minutes

How scary is Claude Mythos? 303 pages in 21 minutes

With Claude Mythos we have an AI that knows when it's being tested, can obscure its reasoning when it wants, and is better at breaking into (and out of) computers than any human alive. Rob Wiblin work...

10 Apr 21min

Village gossip, pesticide bans, and gene drives: 17 experts on the future of global health

Village gossip, pesticide bans, and gene drives: 17 experts on the future of global health

What does it really take to lift millions out of poverty and prevent needless deaths?In this special compilation episode, 17 past guests — including economists, nonprofit founders, and policy advisors...

7 Apr 4h 6min

What everyone is missing about Anthropic vs the Pentagon. And: The Meta leaks are worse than you think.

What everyone is missing about Anthropic vs the Pentagon. And: The Meta leaks are worse than you think.

When the Pentagon tried to strong-arm Anthropic into dropping its ban on AI-only kill decisions and mass domestic surveillance, the company refused. Its critics went on the attack: Anthropic and its s...

3 Apr 20min

#241 – Richard Moulange on how now AI codes viable genomes from scratch and outperforms virologists at lab work — what could go wrong?

#241 – Richard Moulange on how now AI codes viable genomes from scratch and outperforms virologists at lab work — what could go wrong?

Last September, scientists used an AI model to design genomes for entirely new bacteriophages (viruses that infect bacteria). They then built them in a lab. Many were viable. And despite being entirel...

31 Mars 3h 7min

#240 – Samuel Charap on how a Ukraine ceasefire could accidentally set Europe up for a bigger war

#240 – Samuel Charap on how a Ukraine ceasefire could accidentally set Europe up for a bigger war

Many people believe a ceasefire in Ukraine will leave Europe safer. But today's guest lays out how a deal could potentially generate insidious new risks — leaving us in a situation that's equally dang...

24 Mars 1h 12min

Populärt inom Utbildning

historiepodden-se
rss-bara-en-till-om-missbruk-medberoende-2
det-skaver
harrisons-dramatiska-historia
nu-blir-det-historia
johannes-hansen-podcast
roda-vita-rosen
not-fanny-anymore
sektledare
rss-viktmedicinpodden
allt-du-velat-veta
sa-in-i-sjalen
rss-foraldramotet-bring-lagercrantz
rss-sjalsligt-avkladd
i-vantan-pa-katastrofen
rss-dr-bjorklund
rikatillsammans-om-privatekonomi-rikedom-i-livet
rss-max-tant-med-max-villman
rss-basta-livet
rss-om-vi-ska-vara-arliga