80,000 Hours Podcast4 Huhti 2025

#214 – Buck Shlegeris on controlling AI that wants to take over – so we can use it anyway

Most AI safety conversations centre on alignment: ensuring AI systems share our values and goals. But despite progress, we’re unlikely to know we’ve solved the problem before the arrival of human-level and superhuman systems in as little as three years.

So some are developing a backup plan to safely deploy models we fear are actively scheming to harm us — so-called “AI control.” While this may sound mad, given the reluctance of AI companies to delay deploying anything they train, not developing such techniques is probably even crazier.

Today’s guest — Buck Shlegeris, CEO of Redwood Research — has spent the last few years developing control mechanisms, and for human-level systems they’re more plausible than you might think. He argues that given companies’ unwillingness to incur large costs for security, accepting the possibility of misalignment and designing robust safeguards might be one of our best remaining options.

Links to learn more, highlights, video, and full transcript.

As Buck puts it: "Five years ago I thought of misalignment risk from AIs as a really hard problem that you’d need some really galaxy-brained fundamental insights to resolve. Whereas now, to me the situation feels a lot more like we just really know a list of 40 things where, if you did them — none of which seem that hard — you’d probably be able to not have very much of your problem."

Of course, even if Buck is right, we still need to do those 40 things — which he points out we’re not on track for. And AI control agendas have their limitations: they aren’t likely to work once AI systems are much more capable than humans, since greatly superhuman AIs can probably work around whatever limitations we impose.

Still, AI control agendas seem to be gaining traction within AI safety. Buck and host Rob Wiblin discuss all of the above, plus:

Why he’s more worried about AI hacking its own data centre than escaping
What to do about “chronic harm,” where AI systems subtly underperform or sabotage important work like alignment research
Why he might want to use a model he thought could be conspiring against him
Why he would feel safer if he caught an AI attempting to escape
Why many control techniques would be relatively inexpensive
How to use an untrusted model to monitor another untrusted model
What the minimum viable intervention in a “lazy” AI company might look like
How even small teams of safety-focused staff within AI labs could matter
The moral considerations around controlling potentially conscious AI systems, and whether it’s justified

Chapters:

Cold open |00:00:00|
Who’s Buck Shlegeris? |00:01:27|
What's AI control? |00:01:51|
Why is AI control hot now? |00:05:39|
Detecting human vs AI spies |00:10:32|
Acute vs chronic AI betrayal |00:15:21|
How to catch AIs trying to escape |00:17:48|
The cheapest AI control techniques |00:32:48|
Can we get untrusted models to do trusted work? |00:38:58|
If we catch a model escaping... will we do anything? |00:50:15|
Getting AI models to think they've already escaped |00:52:51|
Will they be able to tell it's a setup? |00:58:11|
Will AI companies do any of this stuff? |01:00:11|
Can we just give AIs fewer permissions? |01:06:14|
Can we stop human spies the same way? |01:09:58|
The pitch to AI companies to do this |01:15:04|
Will AIs get superhuman so fast that this is all useless? |01:17:18|
Risks from AI deliberately doing a bad job |01:18:37|
Is alignment still useful? |01:24:49|
Current alignment methods don't detect scheming |01:29:12|
How to tell if AI control will work |01:31:40|
How can listeners contribute? |01:35:53|
Is 'controlling' AIs kind of a dick move? |01:37:13|
Could 10 safety-focused people in an AGI company do anything useful? |01:42:27|
Benefits of working outside frontier AI companies |01:47:48|
Why Redwood Research does what it does |01:51:34|
What other safety-related research looks best to Buck? |01:58:56|
If an AI escapes, is it likely to be able to beat humanity from there? |01:59:48|
Will misaligned models have to go rogue ASAP, before they're ready? |02:07:04|
Is research on human scheming relevant to AI? |02:08:03|

This episode was originally recorded on February 21, 2025.

Video: Simon Monsour and Luke Monsour
Audio engineering: Ben Cordell, Milo McGuire, and Dominic Armstrong
Transcriptions and web: Katy Moore

Kokeile Premiumia

Nauti 14 päivää ilmaiseksi

Tilaa Premium

Jaksot(325)

#12 - Beth Cameron works to stop you dying in a pandemic. Here’s what keeps her up at night.

“When you're in the middle of a crisis and you have to ask for money, you're already too late.” That’s Dr Beth Cameron, who leads Global Biological Policy and Programs at the Nuclear Threat Initiative...

25 Loka 20171h 45min

#11 - Spencer Greenberg on speeding up social science 10-fold & why plenty of startups cause harm

Do most meat eaters think it’s wrong to hurt animals? Do Americans think climate change is likely to cause human extinction? What is the best, state-of-the-art therapy for depression? How can we make ...

17 Loka 20171h 29min

#10 - Nick Beckstead on how to spend billions of dollars preventing human extinction

What if you were in a position to give away billions of dollars to improve the world? What would you do with it? This is the problem facing Program Officers at the Open Philanthropy Project - people l...

11 Loka 20171h 51min

#9 - Christine Peterson on how insecure computers could lead to global disaster, and how to fix it

Take a trip to Silicon Valley in the 70s and 80s, when going to space sounded like a good way to get around environmental limits, people started cryogenically freezing themselves, and nanotechnology l...

4 Loka 20171h 45min

#8 - Lewis Bollard on how to end factory farming in our lifetimes

Every year tens of billions of animals are raised in terrible conditions in factory farms before being killed for human consumption. Over the last two years Lewis Bollard – Project Officer for Farm An...

27 Syys 20173h 16min

#7 - Julia Galef on making humanity more rational, what EA does wrong, and why Twitter isn’t all bad

The scientific revolution in the 16th century was one of the biggest societal shifts in human history, driven by the discovery of new and better methods of figuring out who was right and who was wrong...

13 Syys 20171h 14min

#6 - Toby Ord on why the long-term future matters more than anything else & what to do about it

Of all the people whose well-being we should care about, only a small fraction are alive today. The rest are members of future generations who are yet to exist. Whether they’ll be born into a world th...

6 Syys 20172h 8min

#5 - Alex Gordon-Brown on how to donate millions in your 20s working in quantitative trading

Quantitative financial trading is one of the highest paying parts of the world’s highest paying industry. 25 to 30 year olds with outstanding maths skills can earn millions a year in an obscure set of...

28 Elo 20171h 45min

Kaikki yhdessä sovelluksessa

Kuuntele kaikki suosikkipodcastisi ja -äänikirjasi yhdessä paikassa.

Sinulle valikoitua sisältöä

Podme-sovelluksessa kokoat suosikkisi helposti omaan kirjastoosi. Saat meiltä myös kuuntelusuosituksia!

Jatka kuuntelua koska tahansa

Voit jatkaa siitä mihin jäit, myös offline-tilassa.

Premium

9,99 €/kk

Kaikki premium-podcastit
Ei mainoksia
Ei sitoutumista, peruuta koska tahansa

Aloita 14 päivän kokeilu

Premium

13,99 €/kk

Kaikki premium-podcastit
Ei mainoksia
Ei sitoutumista, peruuta koska tahansa
Yksi lisäkäyttäjä

Kokeile 14 päivää maksutta

Suosittua kategoriassa Koulutus

rss-arkea-ja-aurinkoa-podcast-espanjasta

Tarinat ja äänet, joita rakastat kuunnella

Kuuntele kaikki suosikkipodcastisi ja -äänikirjasi

Lue lisää