#226 – Holden Karnofsky on unexploited opportunities to make AI safer — and all his AGI takes

For years, working on AI safety usually meant theorising about the ‘alignment problem’ or trying to convince other people to give a damn. Even if you could find a way to help, the work was frustrating and offered little feedback.

According to Anthropic’s Holden Karnofsky, this situation has now reversed completely.

There are now many useful, concrete, shovel-ready projects with clear goals and deliverables. Holden thinks people haven’t appreciated the scale of the shift, and wants everyone to see the wide range of ‘well-scoped object-level work’ they could personally help with, in both technical and non-technical areas.

Video, full transcript, and links to learn more: https://80k.info/hk25

In today’s interview, Holden — previously cofounder and CEO of Open Philanthropy (now Coefficient Giving) — lists 39 projects he’s excited to see happening, including:

  • Training deceptive AI models to study deception and how to detect it
  • Developing classifiers to block jailbreaking
  • Implementing security measures to stop ‘backdoors’ or ‘secret loyalties’ from being added to models in training
  • Developing policies on model welfare, AI-human relationships, and what instructions to give models
  • Training AIs to work as alignment researchers

And that’s all just stuff he’s happened to observe directly, which is probably only a small fraction of the options available.

Holden makes a case that, for many people, working at an AI company like Anthropic will be the best way to steer AGI in a positive direction. He notes there are “ways that you can reduce AI risk that you can only do if you’re a competitive frontier AI company.” At the same time, he believes external groups have their own advantages and can be equally impactful.

Critics worry that Anthropic’s efforts to stay at that frontier encourage competitive racing towards AGI — significantly or entirely offsetting any useful research they do. Holden thinks this seriously misunderstands the strategic situation we’re in — and explains his case in detail with host Rob Wiblin.

Chapters:

  • Cold open (00:00:00)
  • Holden is back! (00:02:26)
  • An AI Chernobyl we never notice (00:02:56)
  • Is rogue AI takeover easy or hard? (00:07:32)
  • The AGI race isn't a coordination failure (00:17:48)
  • What Holden now does at Anthropic (00:28:04)
  • The case for working at Anthropic (00:30:08)
  • Is Anthropic doing enough? (00:40:45)
  • Can we trust Anthropic, or any AI company? (00:43:40)
  • How can Anthropic compete while paying the “safety tax”? (00:49:14)
  • What, if anything, could prompt Anthropic to halt development of AGI? (00:56:11)
  • Holden's retrospective on responsible scaling policies (00:59:01)
  • Overrated work (01:14:27)
  • Concrete shovel-ready projects Holden is excited about (01:16:37)
  • Great things to do in technical AI safety (01:20:48)
  • Great things to do on AI welfare and AI relationships (01:28:18)
  • Great things to do in biosecurity and pandemic preparedness (01:35:11)
  • How to choose where to work (01:35:57)
  • Overrated AI risk: Cyberattacks (01:41:56)
  • Overrated AI risk: Persuasion (01:51:37)
  • Why AI R&D is the main thing to worry about (01:55:36)
  • The case that AI-enabled R&D wouldn't speed things up much (02:07:15)
  • AI-enabled human power grabs (02:11:10)
  • Main benefits of getting AGI right (02:23:07)
  • The world is handling AGI about as badly as possible (02:29:07)
  • Learning from targeting companies for public criticism in farm animal welfare (02:31:39)
  • Will Anthropic actually make any difference? (02:40:51)
  • “Misaligned” vs “misaligned and power-seeking” (02:55:12)
  • Success without dignity: how we could win despite being stupid (03:00:58)
  • Holden sees less dignity but has more hope (03:08:30)
  • Should we expect misaligned power-seeking by default? (03:15:58)
  • Will reinforcement learning make everything worse? (03:23:45)
  • Should we push for marginal improvements or big paradigm shifts? (03:28:58)
  • Should safety-focused people cluster or spread out? (03:31:35)
  • Is Anthropic vocal enough about strong regulation? (03:35:56)
  • Is Holden biased because of his financial stake in Anthropic? (03:39:26)
  • Have we learned clever governance structures don't work? (03:43:51)
  • Is Holden scared of AI bioweapons? (03:46:12)
  • Holden thinks AI companions are bad news (03:49:47)
  • Are AI companies too hawkish on China? (03:56:39)
  • The frontier of infosec: confidentiality vs integrity (04:00:51)
  • How often does AI work backfire? (04:03:38)
  • Is AI clearly more impactful to work in? (04:18:26)
  • What's the role of earning to give? (04:24:54)

This episode was recorded on July 25 and 28, 2025.

Video editing: Simon Monsour, Luke Monsour, Dominic Armstrong, and Milo McGuire
Audio engineering: Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: CORBIT
Coordination, transcriptions, and web: Katy Moore
