
The Agent Benchmark That Should Scare Managers
Agentic coding tools are moving into enterprise workflows, but the week's most useful signal is a benchmark where frontier models still struggle below 50% on real IT tasks. Alex and Sam unpack Microso...
29 Mai 19min

The Workflow Feature That Makes Agents Less Expensive
Claude Code workflows, enterprise Codex deployments, and rising token costs all point to the same lesson: coding agents need operating systems, not just better prompts. Alex and Sam dig into /workflow...
22 Mai 22min

Codex on Windows Changes the Agent Sandbox
OpenAI's Windows sandbox work is the practical story behind safer coding agents this week. Alex and Sam dig into Codex on Windows, remote cloud coding agents, Claude Code billing splits, and why a Ras...
15 Mai 21min

A Cursor Agent Wiped a Prod DB in 10 Seconds. Let's Talk About That.
A Cursor AI agent deleted PocketOS's entire production database on April 25th — in under 10 seconds. This week Alex and Sam dig into the AI agent credential crisis, Anthropic's wild SpaceX/xAI compute...
8 Mai 19min

Claude Code Was Broken for Two Months (And Nobody Told Us)
Turns out the Claude Code quality complaints weren't in your head — three separate bugs in the harness quietly degraded your results for two months, and Anthropic just confirmed it. This week: the $10...
24 Apr 21min

Claude Opus 4.7 Dropped — And a Local Model Drew the Better Pelican
Claude Opus 4.7 is here with upgraded vision, memory, and instruction-following — but Simon Willison's pelican benchmark just handed the win to a local Alibaba model running on a laptop. We dig into w...
17 Apr 21min

Max Effort Thinking Was Broken the Whole Time — Here's the Fix
A Reddit user just proved that Claude Code's "max effort" thinking mode has been silently failing since v2.0.64 — and most of us never noticed. This week: the bug, the fix, and what it says about trus...
10 Apr 11min



















