The Agent Benchmark That Should Scare Managers

Agentic coding tools are moving into enterprise workflows, but the week's most useful signal is a benchmark where frontier models still score below 50% on real IT tasks. Alex and Sam unpack Microsoft Learn grounding, agent deception, Copilot data leaks, and the practical harness every team should build before handing agents production authority.

This episode is taken from an open RSS feed and is not published by Podme. It may contain advertising.

Episodes (16)

The Workflow Feature That Makes Agents Less Expensive

Claude Code workflows, enterprise Codex deployments, and rising token costs all point to the same lesson: coding agents need operating systems, not just better prompts. Alex and Sam dig into /workflow...

22 May 22min

Codex on Windows Changes the Agent Sandbox

OpenAI's Windows sandbox work is the practical story behind safer coding agents this week. Alex and Sam dig into Codex on Windows, remote cloud coding agents, Claude Code billing splits, and why a Ras...

15 May 21min

A Cursor Agent Wiped a Prod DB in 10 Seconds. Let's Talk About That.

A Cursor AI agent deleted PocketOS's entire production database on April 25th — in under 10 seconds. This week Alex and Sam dig into the AI agent credential crisis, Anthropic's wild SpaceX/xAI compute...

8 May 19min

Claude Security Just Went Public — Is Your Codebase Already Exposed?

Anthropic's Claude Security tool just dropped out of closed preview and it will scan your entire codebase for vulnerabilities — and the results might be uncomfortable. This week we also dig into Curso...

2 May 16min

Claude Code Was Broken for Two Months (And Nobody Told Us)

Turns out the Claude Code quality complaints weren't in your head — three separate bugs in the harness quietly degraded your results for two months, and Anthropic just confirmed it. This week: the $10...

24 Apr 21min

Claude Opus 4.7 Dropped — And a Local Model Drew the Better Pelican

Claude Opus 4.7 is here with upgraded vision, memory, and instruction-following — but Simon Willison's pelican benchmark just handed the win to a local Alibaba model running on a laptop. We dig into w...

17 Apr 21min

Max Effort Thinking Was Broken the Whole Time — Here's the Fix

A Reddit user just proved that Claude Code's "max effort" thinking mode has been silently failing since v2.0.64 — and most of us never noticed. This week: the bug, the fix, and what it says about trus...

10 Apr 11min
