Beyond Leaderboards: LMArena’s Mission to Make AI Reliable
AI + a16z, 30 May 2025

LMArena cofounders Anastasios N. Angelopoulos, Wei-Lin Chiang, and Ion Stoica sit down with a16z general partner Anjney Midha to talk about the future of AI evaluation. As benchmarks struggle to keep up with the pace of real-world deployment, LMArena is reframing the problem: what if the best way to test AI models is to put them in front of millions of users and let them vote? The team discusses how Arena evolved from a research side project into a key part of the AI stack, why fresh and subjective data is crucial for reliability, and what it means to build a CI/CD pipeline for large models.

They also explore:

  • Why expert-only benchmarks are no longer enough.
  • How user preferences reveal model capabilities — and their limits.
  • What it takes to build personalized leaderboards and evaluation SDKs.
  • Why real-time testing is foundational for mission-critical AI.

Follow everyone on X:

Anastasios N. Angelopoulos

Wei-Lin Chiang

Ion Stoica

Anjney Midha

Timestamps

0:04 - LLM evaluation: From consumer chatbots to mission-critical systems

6:04 - Style and substance: Crowdsourcing expertise

18:51 - Building immunity to overfitting and gaming the system

29:49 - The roots of LMArena

41:29 - Proving the value of academic AI research

48:28 - Scaling LMArena and starting a company

59:59 - Benchmarks, evaluations, and the value of ranking LLMs

1:12:13 - The challenges of measuring AI reliability

1:17:57 - Expanding beyond binary rankings as models evolve

1:28:07 - A leaderboard for each prompt

1:31:28 - The LMArena roadmap

1:34:29 - The importance of open source and openness

1:43:10 - Adapting to agents (and other AI evolutions)

Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.


