Benchmarking AI Agents on Full-Stack Coding
AI + a16z · 28 March 2025

In this episode, a16z General Partner Martin Casado sits down with Sujay Jayakar, co-founder and Chief Scientist at Convex, to talk about his team’s latest work benchmarking AI agents on full-stack coding tasks. From the design of Fullstack-Bench to the quirks of agent behavior, the two dig into what is actually hard about autonomous software development, and why robust evals, along with guardrails like type safety, matter more than ever. They also get tactical: Which models perform best for real-world app building? How should developers think about trajectory management and variance across runs? And what changes when you treat your toolchain as part of the prompt? Whether you're a hobbyist developer or building the next generation of AI-powered devtools, Sujay’s systems-level insights are not to be missed.

Drawing on Sujay’s work developing Fullstack-Bench, they cover:

  • Why full-stack coding is still a frontier task for autonomous agents
  • How type safety and other “guardrails” can significantly reduce variance and failure
  • What makes a good eval—and why evals might matter more than clever prompts
  • How different models perform on real-world app-building tasks (and what to watch out for)
  • Why your toolchain might be the most underrated part of the prompt
  • And what all of this means for devs—from hobbyists to infra teams building with AI in the loop

Learn More:

Introducing Fullstack-Bench

Follow everyone on X:

Sujay Jayakar

Martin Casado

Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.



Episodes (81)

Scaling AI for the Coming Data Deluge

In this episode of the AI + a16z podcast, Anyscale cofounder and CEO Robert Nishihara joins a16z's Jennifer Li and Derrick Harris to discuss the challenges of training and running AI models at scale; ...

19 July 2024 · 37 min

ARCHIVE: The Dream of AI Is Alive in AlphaGo

In this archive episode from 2015, a16z's Sonal Chokshi, Frank Chen, and Steven Sinofsky discuss DeepMind's breakthrough AlphaGo system, which mastered the ancient Chinese game Go and introduced the p...

5 July 2024 · 33 min

Beyond Language: Inside a Hundred-Trillion-Token Video Model

In this episode of the AI + a16z podcast, Luma Chief Scientist Jiaming Song joins a16z General Partner Anjney Midha to discuss Jiaming's esteemed career in video models, culminating thus far in Luma's...

3 July 2024 · 1 h 5 min

Developer Tool UX in the Age of Generative AI

In this episode, design engineer Alasdair Monk joins a16z's Yoko Li and Derrick Harris to discuss how generative AI is changing how developers, and those building for developers, interact with t...

21 June 2024 · 37 min

Building Production Workflows for AI Applications

In this episode, Inngest cofounder and CEO Tony Holdstock-Brown joins a16z partner Yoko Li, as well as Derrick Harris, to discuss the reality and complexity of running AI agents and other multistep AI...

14 June 2024 · 43 min

The Future of Image Models Is Multimodal

In this episode, Ideogram CEO Mohammad Norouzi joins a16z General Partner Jennifer Li, as well as Derrick Harris, to share his story of growing up in Iran, helping build influential text-to-image mode...

7 June 2024 · 37 min

ARCHIVE: Open Models (with Arthur Mensch) and Video Models (with Stefano Ermon)

For this holiday weekend (in the United States) episode, we've stitched together two archived episodes from the a16z Podcast, both featuring General Partner Anjney Midha. In the first half, from Decem...

24 May 2024 · 1 h 5 min

Open Models and Maturation: Assessing the Generative AI Market

a16z partners Guido Appenzeller and Matt Bornstein join Derrick Harris to discuss the state of the generative AI market, about 18 months after it really kicked into high gear with the release of ChatG...

17 May 2024 · 40 min
