#151 – Ajeya Cotra on accidentally teaching AI models to deceive us

#151 – Ajeya Cotra on accidentally teaching AI models to deceive us

Imagine you are an orphaned eight-year-old whose parents left you a $1 trillion company, and no trusted adult to serve as your guide to the world. You have to hire a smart adult to run that company, guide your life the way that a parent would, and administer your vast wealth. You have to hire that adult based on a work trial or interview you come up with. You don't get to see any resumes or do reference checks. And because you're so rich, tonnes of people apply for the job — for all sorts of reasons.

Today's guest Ajeya Cotra — senior research analyst at Open Philanthropy — argues that this peculiar setup resembles the situation humanity finds itself in when training very general and very capable AI models using current deep learning methods.

Links to learn more, summary and full transcript.

As she explains, such an eight-year-old faces a challenging problem. In the candidate pool there are likely some truly nice people, who sincerely want to help and make decisions that are in your interest. But there are probably other characters too — like people who will pretend to care about you while you're monitoring them, but intend to use the job to enrich themselves as soon as they think they can get away with it.

Like a child trying to judge adults, at some point humans will be required to judge the trustworthiness and reliability of machine learning models that are as goal-oriented as people, and greatly outclass them in knowledge, experience, breadth, and speed. Tricky!

Can't we rely on how well models have performed at tasks during training to guide us? Ajeya worries that it won't work. The trouble is that three different sorts of models will all produce the same output during training, but could behave very differently once deployed in a setting that allows their true colours to come through. She describes three such motivational archetypes:

  • Saints — models that care about doing what we really want
  • Sycophants — models that just want us to say they've done a good job, even if they get that praise by taking actions they know we wouldn't want them to
  • Schemers — models that don't care about us or our interests at all, who are just pleasing us so long as that serves their own agenda

And according to Ajeya, there are also ways we could end up actively selecting for motivations that we don't want.

In today's interview, Ajeya and Rob discuss the above, as well as:

  • How to predict the motivations a neural network will develop through training
  • Whether AIs being trained will functionally understand that they're AIs being trained, the same way we think we understand that we're humans living on planet Earth
  • Stories of AI misalignment that Ajeya doesn't buy into
  • Analogies for AI, from octopuses to aliens to can openers
  • Why it's smarter to have separate planning AIs and doing AIs
  • The benefits of only following through on AI-generated plans that make sense to human beings
  • What approaches for fixing alignment problems Ajeya is most excited about, and which she thinks are overrated
  • How one might demo actually scary AI failure mechanisms

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type ‘80,000 Hours’ into your podcasting app. Or read the transcript below.

Producer: Keiran Harris

Audio mastering: Ryan Kessler and Ben Cordell

Transcriptions: Katy Moore

Jaksot(318)

#179 Classic episode – Randy Nesse on why evolution left us so vulnerable to depression and anxiety

#179 Classic episode – Randy Nesse on why evolution left us so vulnerable to depression and anxiety

Mental health problems like depression and anxiety affect enormous numbers of people and severely interfere with their lives. By contrast, we don’t see similar levels of physical ill health in young p...

3 Helmi 2h 51min

Why 'Aligned AI' Would Still Kill Democracy | David Duvenaud, ex-Anthropic team lead

Why 'Aligned AI' Would Still Kill Democracy | David Duvenaud, ex-Anthropic team lead

Democracy might be a brief historical blip. That’s the unsettling thesis of a recent paper, which argues AI that can do all the work a human can do inevitably leads to the “gradual disempowerment” of ...

27 Tammi 2h 31min

#145 Classic episode – Christopher Brown on why slavery abolition wasn't inevitable

#145 Classic episode – Christopher Brown on why slavery abolition wasn't inevitable

In many ways, humanity seems to have become more humane and inclusive over time. While there’s still a lot of progress to be made, campaigns to give people of different genders, races, sexualities, et...

20 Tammi 2h 56min

#233 – James Smith on how to prevent a mirror life catastrophe

#233 – James Smith on how to prevent a mirror life catastrophe

When James Smith first heard about mirror bacteria, he was sceptical. But within two weeks, he’d dropped everything to work on it full time, considering it the worst biothreat that he’d seen described...

13 Tammi 2h 9min

#144 Classic episode – Athena Aktipis on why cancer is a fundamental universal phenomena

#144 Classic episode – Athena Aktipis on why cancer is a fundamental universal phenomena

What’s the opposite of cancer? If you answered “cure,” “antidote,” or “antivenom” — you’ve obviously been reading the antonym section at www.merriam-webster.com/thesaurus/cancer.But today’s guest Athe...

9 Tammi 3h 30min

#142 Classic episode – John McWhorter on why the optimal number of languages might be one, and other provocative claims about language

#142 Classic episode – John McWhorter on why the optimal number of languages might be one, and other provocative claims about language

John McWhorter is a linguistics professor at Columbia University specialising in research on creole languages. He's also a content-producing machine, never afraid to give his frank opinion on anything...

6 Tammi 1h 35min

2025 Highlight-o-thon: Oops! All Bests

2025 Highlight-o-thon: Oops! All Bests

It’s that magical time of year once again — highlightapalooza! Stick around for one top bit from each episode we recorded this year, including:Kyle Fish explaining how Anthropic’s AI Claude descends i...

29 Joulu 20251h 40min

#232 – Andreas Mogensen on what we owe 'philosophical Vulcans' and unconscious beings

#232 – Andreas Mogensen on what we owe 'philosophical Vulcans' and unconscious beings

Most debates about the moral status of AI systems circle the same question: is there something that it feels like to be them? But what if that’s the wrong question to ask? Andreas Mogensen — a senior ...

19 Joulu 20252h 37min

Suosittua kategoriassa Koulutus

rss-murhan-anatomia
psykopodiaa-podcast
voi-hyvin-meditaatiot-2
rss-niinku-asia-on
kesken
rss-duodecim-lehti
adhd-podi
aamukahvilla
aloita-meditaatio
rss-liian-kuuma-peruna
rss-valo-minussa-2
ihminen-tavattavissa-tommy-hellsten-instituutti
rss-elamankoulu
rss-psykalab
rss-narsisti
rahapuhetta
salainen-paivakirja
rss-uskonto-on-tylsaa
rss-vapaudu-voimaasi
rss-hereilla