Benchmarking AI Agents on Full-Stack Coding
AI + a16z, 28 March 2025

In this episode, a16z General Partner Martin Casado sits down with Sujay Jayakar, co-founder and Chief Scientist at Convex, to talk about his team’s latest work benchmarking AI agents on full-stack coding tasks. From the design of Fullstack-Bench to the quirks of agent behavior, the two dig into what’s actually hard about autonomous software development, and why robust evals, along with guardrails like type safety, matter more than ever. They also get tactical: Which models perform best for real-world app building? How should developers think about trajectory management and variance across runs? And what changes when you treat your toolchain as part of the prompt? Whether you’re a hobbyist developer or building the next generation of AI-powered devtools, Sujay’s systems-level insights are not to be missed.

Drawing on Sujay’s work developing Fullstack-Bench, they cover:

  • Why full-stack coding is still a frontier task for autonomous agents
  • How type safety and other “guardrails” can significantly reduce variance and failure
  • What makes a good eval—and why evals might matter more than clever prompts
  • How different models perform on real-world app-building tasks (and what to watch out for)
  • Why your toolchain might be the most underrated part of the prompt
  • And what all of this means for devs—from hobbyists to infra teams building with AI in the loop

Learn More:

Introducing Fullstack-Bench

Follow everyone on X:

Sujay Jayakar

Martin Casado

Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.



Episodes (82)

Open Models and Maturation: Assessing the Generative AI Market

a16z partners Guido Appenzeller and Matt Bornstein join Derrick Harris to discuss the state of the generative AI market, about 18 months after it really kicked into high gear with the release of ChatG...

17 May 2024, 40 min

Security Founders Talk Shop About Generative AI

In this bonus episode, recorded live at our San Francisco office, security-startup founders Dean De Beer (Command Zero), Kevin Tian (Doppel), and Travis McPeak (Resourcely) share their thoughts on gen...

15 May 2024, 22 min

How to Think About Foundation Models for Cybersecurity

In this episode of the AI + a16z podcast, a16z General Partner Zane Lackey and a16z Partner Joel de la Garza sit down with Derrick Harris to discuss how generative AI — LLMs, in particular — and found...

10 May 2024, 37 min

Securing the Software Supply Chain with LLMs

Socket Founder and CEO Feross Aboukhadijeh joins a16z's Joel de la Garza and Derrick Harris to discuss the open-source software supply chain. Feross and Joel share their thoughts and insights on topic...

3 May 2024, 38 min

ARCHIVE: GPT-3 Hype

In this episode, though, we’re traveling back in time to the distant (in AI years, at least) past of 2020. Because amid all the news over the past 18 or so months, it’s easy to forget that generative AI...

1 May 2024, 33 min

Vector Databases and the Power of RAG

Pinecone Founder and CEO Edo Liberty joins a16z's Satish Talluri and Derrick Harris to discuss the promises, challenges, and opportunities for vector databases and retrieval augmented generation (RAG)...

26 April 2024, 36 min

Remaking the UI for AI

a16z General Partner Anjney Midha joins the podcast to discuss what's happening with hardware for artificial intelligence. Nvidia might have cornered the market on training workloads for now, but he b...

19 April 2024, 38 min

Scoping the Enterprise LLM Market

Naveen Rao, vice president of generative AI at Databricks, joins a16z's Matt Bornstein and Derrick Harris to discuss enterprise usage of LLMs and generative AI. Naveen is particularly knowledgeable ab...

12 April 2024, 44 min
