#226 – Holden Karnofsky on unexploited opportunities to make AI safer — and all his AGI takes

For years, working on AI safety usually meant theorising about the ‘alignment problem’ or trying to convince other people to give a damn. Even if you could find a way to help, the work was frustrating and offered little feedback.

According to Anthropic’s Holden Karnofsky, this situation has now reversed completely.

There are now many useful, concrete, shovel-ready projects with clear goals and deliverables. Holden thinks people haven’t appreciated the scale of the shift, and wants everyone to see the large range of ‘well-scoped object-level work’ they could personally help with, in both technical and non-technical areas.

In today’s interview, Holden — previously cofounder and CEO of Open Philanthropy (now Coefficient Giving) — lists 39 projects he’s excited to see happening, including:

  • Training deceptive AI models to study deception and how to detect it
  • Developing classifiers to block jailbreaking
  • Implementing security measures to stop ‘backdoors’ or ‘secret loyalties’ from being added to models in training
  • Developing policies on model welfare, AI-human relationships, and what instructions to give models
  • Training AIs to work as alignment researchers

And that’s all just stuff he’s happened to observe directly, which is probably only a small fraction of the options available.

All this low-hanging fruit is one factor behind his decision to join Anthropic this year. That said, his wife is also a cofounder and president of the company, giving him a big financial stake in its success, and making it impossible for him to be seen as independent no matter where he worked.

Holden makes a case that, for many people, working at an AI company like Anthropic will be the best way to steer AGI in a positive direction. He notes there are “ways that you can reduce AI risk that you can only do if you’re a competitive frontier AI company.” At the same time, he believes external groups have their own advantages and can be equally impactful.

Outside critics worry that Anthropic’s efforts to stay at that frontier encourage competitive racing towards AGI — significantly or entirely offsetting any useful research they do. Holden thinks this seriously misunderstands the strategic situation we’re in.

“I work at an AI company, and a lot of people think that’s just inherently unethical,” he says. “They’re imagining [that] everyone wishes they could go slowly, but they’re going fast so they can beat everyone else. […] But I emphatically think this is not what’s going on in AI.”

The reality, in Holden’s view:

I think there’s too many players in AI who […] don’t want to slow down. They don’t believe in the risks. Maybe they don’t even care about the risks. […] If Anthropic were to say, “We’re out, we’re going to slow down,” they would say, “This is awesome! Now we have a better chance of winning, and this is even good for our recruiting,” because they have a better chance of getting people who want to be on the frontier and want to win.

Holden believes a frontier AI company can reduce risk by:

  • Developing cheap, practical safety measures other companies might adopt
  • Prototyping policies regulators could mandate
  • Gathering crucial data about what advanced AI can actually do

Host Rob Wiblin and Holden discuss the case for and against those strategies, and much more, in today’s episode.

Learn more and read the full transcript on the 80,000 Hours website.


Chapters:

• Cold open (00:00:00)
• Holden is back! (00:02:28)
• An AI Chernobyl we never notice (00:02:58)
• Is rogue AI takeover easy or hard? (00:07:39)
• The AGI race isn't a coordination failure (00:18:01)
• What Holden now does at Anthropic (00:28:30)
• The case for working at Anthropic (00:30:38)
• Is Anthropic doing enough? (00:41:30)
• Can we trust Anthropic, or any AI company? (00:44:30)
• How can Anthropic compete while paying the “safety tax”? (00:50:11)
• What, if anything, could prompt Anthropic to halt development of AGI? (00:57:13)
• Holden's retrospective on responsible scaling policies (01:00:04)
• Overrated work (01:15:45)
• Concrete shovel-ready projects Holden is excited about (01:17:58)
• Great things to do in technical AI safety (01:22:12)
• Great things to do on AI welfare and AI relationships (01:29:53)
• Great things to do in biosecurity and pandemic preparedness (01:36:51)
• How to choose where to work (01:37:37)
• Overrated AI risk: Cyberattacks (01:43:38)
• Overrated AI risk: Persuasion (01:53:28)
• Why AI R&D is the main thing to worry about (01:57:31)
• The case that AI-enabled R&D wouldn't speed things up much (02:09:30)
• AI-enabled human power grabs (02:13:26)
• Main benefits of getting AGI right (02:26:04)
• The world is handling AGI about as badly as possible (02:31:44)
• Learning from targeting companies for public criticism in farm animal welfare (02:34:18)
• Will Anthropic actually make any difference? (02:43:43)
• “Misaligned” vs “misaligned and power-seeking” (02:58:23)
• Success without dignity: how we could win despite being stupid (03:04:16)
• Holden sees less dignity but has more hope (
