#226 – Holden Karnofsky on unexploited opportunities to make AI safer — and all his AGI takes

For years, working on AI safety usually meant theorising about the ‘alignment problem’ or trying to convince other people to give a damn. If you could find a way to help at all, the work was frustrating and offered little feedback.

According to Anthropic’s Holden Karnofsky, this situation has now reversed completely.

There is now a wealth of useful, concrete, shovel-ready projects with clear goals and deliverables. Holden thinks people haven’t appreciated the scale of this shift, and wants everyone to see the wide range of ‘well-scoped object-level work’ they could personally help with, in both technical and non-technical areas.

Video, full transcript, and links to learn more: https://80k.info/hk25

In today’s interview, Holden — previously cofounder and CEO of Open Philanthropy (now Coefficient Giving) — lists 39 projects he’s excited to see happening, including:

  • Training deceptive AI models to study deception and how to detect it
  • Developing classifiers to block jailbreaking
  • Implementing security measures to stop ‘backdoors’ or ‘secret loyalties’ from being added to models in training
  • Developing policies on model welfare, AI-human relationships, and what instructions to give models
  • Training AIs to work as alignment researchers

And that’s all just stuff he’s happened to observe directly, which is probably only a small fraction of the options available.

Holden makes a case that, for many people, working at an AI company like Anthropic will be the best way to steer AGI in a positive direction. He notes there are “ways that you can reduce AI risk that you can only do if you’re a competitive frontier AI company.” At the same time, he believes external groups have their own advantages and can be equally impactful.

Critics worry that Anthropic’s efforts to stay at that frontier encourage competitive racing towards AGI — significantly or entirely offsetting any useful research they do. Holden thinks this seriously misunderstands the strategic situation we’re in — and explains his case in detail with host Rob Wiblin.

Chapters:

  • Cold open (00:00:00)
  • Holden is back! (00:02:26)
  • An AI Chernobyl we never notice (00:02:56)
  • Is rogue AI takeover easy or hard? (00:07:32)
  • The AGI race isn't a coordination failure (00:17:48)
  • What Holden now does at Anthropic (00:28:04)
  • The case for working at Anthropic (00:30:08)
  • Is Anthropic doing enough? (00:40:45)
  • Can we trust Anthropic, or any AI company? (00:43:40)
  • How can Anthropic compete while paying the “safety tax”? (00:49:14)
  • What, if anything, could prompt Anthropic to halt development of AGI? (00:56:11)
  • Holden's retrospective on responsible scaling policies (00:59:01)
  • Overrated work (01:14:27)
  • Concrete shovel-ready projects Holden is excited about (01:16:37)
  • Great things to do in technical AI safety (01:20:48)
  • Great things to do on AI welfare and AI relationships (01:28:18)
  • Great things to do in biosecurity and pandemic preparedness (01:35:11)
  • How to choose where to work (01:35:57)
  • Overrated AI risk: Cyberattacks (01:41:56)
  • Overrated AI risk: Persuasion (01:51:37)
  • Why AI R&D is the main thing to worry about (01:55:36)
  • The case that AI-enabled R&D wouldn't speed things up much (02:07:15)
  • AI-enabled human power grabs (02:11:10)
  • Main benefits of getting AGI right (02:23:07)
  • The world is handling AGI about as badly as possible (02:29:07)
  • Learning from targeting companies for public criticism in farm animal welfare (02:31:39)
  • Will Anthropic actually make any difference? (02:40:51)
  • “Misaligned” vs “misaligned and power-seeking” (02:55:12)
  • Success without dignity: how we could win despite being stupid (03:00:58)
  • Holden sees less dignity but has more hope (03:08:30)
  • Should we expect misaligned power-seeking by default? (03:15:58)
  • Will reinforcement learning make everything worse? (03:23:45)
  • Should we push for marginal improvements or big paradigm shifts? (03:28:58)
  • Should safety-focused people cluster or spread out? (03:31:35)
  • Is Anthropic vocal enough about strong regulation? (03:35:56)
  • Is Holden biased because of his financial stake in Anthropic? (03:39:26)
  • Have we learned clever governance structures don't work? (03:43:51)
  • Is Holden scared of AI bioweapons? (03:46:12)
  • Holden thinks AI companions are bad news (03:49:47)
  • Are AI companies too hawkish on China? (03:56:39)
  • The frontier of infosec: confidentiality vs integrity (04:00:51)
  • How often does AI work backfire? (04:03:38)
  • Is AI clearly more impactful to work in? (04:18:26)
  • What's the role of earning to give? (04:24:54)

This episode was recorded on July 25 and 28, 2025.

Video editing: Simon Monsour, Luke Monsour, Dominic Armstrong, and Milo McGuire
Audio engineering: Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: CORBIT
Coordination, transcriptions, and web: Katy Moore

Episodes (320)

#172 – Bryan Caplan on why you should stop reading the news

Is following important political and international news a civic duty — or is it our civic duty to avoid it? It's common to think that 'staying informed' and checking the headlines every day is just wha...

17 Nov 2023 · 2h 23min

#171 – Alison Young on how top labs have jeopardised public health with repeated biosafety failures

"Rare events can still cause catastrophic accidents. The concern that has been raised by experts going back over time, is that really, the more of these experiments, the more labs, the more opportunit...

9 Nov 2023 · 1h 46min

#170 – Santosh Harish on how air pollution is responsible for ~12% of global deaths — and how to get that number down

"One [outrageous example of air pollution] is municipal waste burning that happens in many cities in the Global South. Basically, this is waste that gets collected from people's homes, and instead of ...

1 Nov 2023 · 2h 57min

#169 – Paul Niehaus on whether cash transfers cause economic growth, and keeping theft to acceptable levels

"One of our earliest supporters and a dear friend of mine, Mark Lampert, once said to me, “The way I think about it is, imagine that this money were already in the hands of people living in poverty. I...

26 Oct 2023 · 1h 47min

#168 – Ian Morris on whether deep history says we're heading for an intelligence explosion

"If we carry on looking at these industrialised economies, not thinking about what it is they're actually doing and what the potential of this is, you can make an argument that, yes, rates of growth a...

23 Oct 2023 · 2h 43min

#167 – Seren Kell on the research gaps holding back alternative proteins from mass adoption

"There have been literally thousands of years of breeding and living with animals to optimise these kinds of problems. But because we're just so early on with alternative proteins and there's so much ...

18 Oct 2023 · 1h 54min

#166 – Tantum Collins on what he’s learned as an AI policy insider at the White House, DeepMind and elsewhere

"If you and I and 100 other people were on the first ship that was going to go settle Mars, and were going to build a human civilisation, and we have to decide what that government looks like, and we ...

12 Oct 2023 · 3h 8min

#165 – Anders Sandberg on war in space, whether civilisations age, and the best things possible in our universe

"Now, the really interesting question is: How much is there an attacker-versus-defender advantage in this kind of advanced future? Right now, if somebody's sitting on Mars and you're going to war agai...

6 Oct 2023 · 2h 48min
