#226 – Holden Karnofsky on unexploited opportunities to make AI safer — and all his AGI takes

#226 – Holden Karnofsky on unexploited opportunities to make AI safer — and all his AGI takes

For years, working on AI safety usually meant theorising about the ‘alignment problem’ or trying to convince other people to give a damn. If you could find any way to help, the work was frustrating and low feedback.

According to Anthropic’s Holden Karnofsky, this situation has now reversed completely.

There are now large amounts of useful, concrete, shovel-ready projects with clear goals and deliverables. Holden thinks people haven’t appreciated the scale of the shift, and wants everyone to see the large range of ‘well-scoped object-level work’ they could personally help with, in both technical and non-technical areas.

Video, full transcript, and links to learn more: https://80k.info/hk25

In today’s interview, Holden — previously cofounder and CEO of Open Philanthropy (now Coefficient Giving) — lists 39 projects he’s excited to see happening, including:

  • Training deceptive AI models to study deception and how to detect it
  • Developing classifiers to block jailbreaking
  • Implementing security measures to stop ‘backdoors’ or ‘secret loyalties’ from being added to models in training
  • Developing policies on model welfare, AI-human relationships, and what instructions to give models
  • Training AIs to work as alignment researchers

And that’s all just stuff he’s happened to observe directly, which is probably only a small fraction of the options available.

Holden makes a case that, for many people, working at an AI company like Anthropic will be the best way to steer AGI in a positive direction. He notes there are “ways that you can reduce AI risk that you can only do if you’re a competitive frontier AI company.” At the same time, he believes external groups have their own advantages and can be equally impactful.

Critics worry that Anthropic’s efforts to stay at that frontier encourage competitive racing towards AGI — significantly or entirely offsetting any useful research they do. Holden thinks this seriously misunderstands the strategic situation we’re in — and explains his case in detail with host Rob Wiblin.

Chapters:

  • Cold open (00:00:00)
  • Holden is back! (00:02:26)
  • An AI Chernobyl we never notice (00:02:56)
  • Is rogue AI takeover easy or hard? (00:07:32)
  • The AGI race isn't a coordination failure (00:17:48)
  • What Holden now does at Anthropic (00:28:04)
  • The case for working at Anthropic (00:30:08)
  • Is Anthropic doing enough? (00:40:45)
  • Can we trust Anthropic, or any AI company? (00:43:40)
  • How can Anthropic compete while paying the “safety tax”? (00:49:14)
  • What, if anything, could prompt Anthropic to halt development of AGI? (00:56:11)
  • Holden's retrospective on responsible scaling policies (00:59:01)
  • Overrated work (01:14:27)
  • Concrete shovel-ready projects Holden is excited about (01:16:37)
  • Great things to do in technical AI safety (01:20:48)
  • Great things to do on AI welfare and AI relationships (01:28:18)
  • Great things to do in biosecurity and pandemic preparedness (01:35:11)
  • How to choose where to work (01:35:57)
  • Overrated AI risk: Cyberattacks (01:41:56)
  • Overrated AI risk: Persuasion (01:51:37)
  • Why AI R&D is the main thing to worry about (01:55:36)
  • The case that AI-enabled R&D wouldn't speed things up much (02:07:15)
  • AI-enabled human power grabs (02:11:10)
  • Main benefits of getting AGI right (02:23:07)
  • The world is handling AGI about as badly as possible (02:29:07)
  • Learning from targeting companies for public criticism in farm animal welfare (02:31:39)
  • Will Anthropic actually make any difference? (02:40:51)
  • “Misaligned” vs “misaligned and power-seeking” (02:55:12)
  • Success without dignity: how we could win despite being stupid (03:00:58)
  • Holden sees less dignity but has more hope (03:08:30)
  • Should we expect misaligned power-seeking by default? (03:15:58)
  • Will reinforcement learning make everything worse? (03:23:45)
  • Should we push for marginal improvements or big paradigm shifts? (03:28:58)
  • Should safety-focused people cluster or spread out? (03:31:35)
  • Is Anthropic vocal enough about strong regulation? (03:35:56)
  • Is Holden biased because of his financial stake in Anthropic? (03:39:26)
  • Have we learned clever governance structures don't work? (03:43:51)
  • Is Holden scared of AI bioweapons? (03:46:12)
  • Holden thinks AI companions are bad news (03:49:47)
  • Are AI companies too hawkish on China? (03:56:39)
  • The frontier of infosec: confidentiality vs integrity (04:00:51)
  • How often does AI work backfire? (04:03:38)
  • Is AI clearly more impactful to work in? (04:18:26)
  • What's the role of earning to give? (04:24:54)

This episode was recorded on July 25 and 28, 2025.

Video editing: Simon Monsour, Luke Monsour, Dominic Armstrong, and Milo McGuire
Audio engineering: Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: CORBIT
Coordination, transcriptions, and web: Katy Moore

Jaksot(320)

#197 – Nick Joseph on whether Anthropic's AI safety policy is up to the task

#197 – Nick Joseph on whether Anthropic's AI safety policy is up to the task

The three biggest AI companies — Anthropic, OpenAI, and DeepMind — have now all released policies designed to make their AI models less likely to go rogue or cause catastrophic damage as they approach...

22 Elo 20242h 29min

#196 – Jonathan Birch on the edge cases of sentience and why they matter

#196 – Jonathan Birch on the edge cases of sentience and why they matter

"In the 1980s, it was still apparently common to perform surgery on newborn babies without anaesthetic on both sides of the Atlantic. This led to appalling cases, and to public outcry, and to campaign...

15 Elo 20242h 1min

#195 – Sella Nevo on who's trying to steal frontier AI models, and what they could do with them

#195 – Sella Nevo on who's trying to steal frontier AI models, and what they could do with them

"Computational systems have literally millions of physical and conceptual components, and around 98% of them are embedded into your infrastructure without you ever having heard of them. And an inordin...

1 Elo 20242h 8min

#194 – Vitalik Buterin on defensive acceleration and how to regulate AI when you fear government

#194 – Vitalik Buterin on defensive acceleration and how to regulate AI when you fear government

"If you’re a power that is an island and that goes by sea, then you’re more likely to do things like valuing freedom, being democratic, being pro-foreigner, being open-minded, being interested in trad...

26 Heinä 20243h 4min

#193 – Sihao Huang on navigating the geopolitics of US–China AI competition

#193 – Sihao Huang on navigating the geopolitics of US–China AI competition

"You don’t necessarily need world-leading compute to create highly risky AI systems. The biggest biological design tools right now, like AlphaFold’s, are orders of magnitude smaller in terms of comput...

18 Heinä 20242h 23min

#192 – Annie Jacobsen on what would happen if North Korea launched a nuclear weapon at the US

#192 – Annie Jacobsen on what would happen if North Korea launched a nuclear weapon at the US

"Ring one: total annihilation; no cellular life remains. Ring two, another three-mile diameter out: everything is ablaze. Ring three, another three or five miles out on every side: third-degree burns ...

12 Heinä 20241h 54min

#191 (Part 2) – Carl Shulman on government and society after AGI

#191 (Part 2) – Carl Shulman on government and society after AGI

This is the second part of our marathon interview with Carl Shulman. The first episode is on the economy and national security after AGI. You can listen to them in either order!If we develop artificia...

5 Heinä 20242h 20min

#191 (Part 1) – Carl Shulman on the economy and national security after AGI

#191 (Part 1) – Carl Shulman on the economy and national security after AGI

This is the first part of our marathon interview with Carl Shulman. The second episode is on government and society after AGI. You can listen to them in either order!The human brain does what it does ...

27 Kesä 20244h 14min

Suosittua kategoriassa Koulutus

rss-murhan-anatomia
psykopodiaa-podcast
voi-hyvin-meditaatiot-2
rss-narsisti
rss-niinku-asia-on
adhd-podi
rss-liian-kuuma-peruna
aamukahvilla
psykologia
rss-valo-minussa-2
rss-vapaudu-voimaasi
kesken
rss-koira-haudattuna
aloita-meditaatio
dear-ladies
esa-saarinen-filosofia-ja-systeemiajattelu
ihminen-tavattavissa-tommy-hellsten-instituutti
leveli
rss-luonnollinen-synnytys-podcast
filocast-filosofian-perusteet