Generative Benchmarking with Kelly Hong - #728
In this episode, Kelly Hong, a researcher at Chroma, joins us to discuss "Generative Benchmarking," a novel approach to evaluating retrieval systems, like RAG applications, using synthetic data. Kelly explains how traditional benchmarks like MTEB fail to represent real-world query patterns and how embedding models that perform well on public benchmarks often underperform in production. The conversation explores the two-step process of Generative Benchmarking: filtering documents to focus on relevant content and generating queries that mimic actual user behavior. Kelly shares insights from applying this approach to Weights & Biases' technical support bot, revealing how domain-specific evaluation provides more accurate assessments of embedding model performance. We also discuss the importance of aligning LLM judges with human preferences, the impact of chunking strategies on retrieval effectiveness, and how production queries differ from benchmark queries in ambiguity and style. Throughout the episode, Kelly emphasizes the need for systematic evaluation approaches that go beyond "vibe checks" to help developers build more effective RAG applications. The complete show notes for this episode can be found at https://twimlai.com/go/728.

Denne episoden er hentet fra en åpen RSS-feed og er ikke publisert av Podme. Den kan derfor inneholde annonser.

Episoder(788)

Why AI Agents Break the GenAI Security Model with Devvret Rishi - #770

Why AI Agents Break the GenAI Security Model with Devvret Rishi - #770

In this episode, Sam talks with Dev Rishi, GM of AI at Rubrik, about what happens when agents move beyond answering questions and start taking action across tools, systems, and business processes. We...

16 Jun 56min

Is RAG Dead? Lessons from Building AI for Tax Law with Alex Bowcut - #769

Is RAG Dead? Lessons from Building AI for Tax Law with Alex Bowcut - #769

As context windows grow into the millions of tokens, many AI practitioners are questioning whether retrieval-augmented generation (RAG) is still necessary. If modern models can ingest entire libraries...

9 Jun 51min

Relational Foundation Models for Enterprise Data with Jure Leskovec - #768

Relational Foundation Models for Enterprise Data with Jure Leskovec - #768

In this episode, Jure Leskovec, co-founder and chief scientist at Kumo and professor of computer science at Stanford, joins us to explore two fronts of his work: AI for science and relational deep lea...

21 Mai 1h 6min

How to Find the Agent Failures Your Evals Miss with Scott Clark - #767

How to Find the Agent Failures Your Evals Miss with Scott Clark - #767

In this episode, Scott Clark, co-founder and CEO of Distributional, joins us to explore how teams can reliably operate and improve complex LLM systems and agents in production. Scott introduces a Masl...

7 Mai 53min

How to Engineer AI Inference Systems with Philip Kiely - #766

How to Engineer AI Inference Systems with Philip Kiely - #766

In this episode, Philip Kiely, head of AI education at Baseten, joins us to unpack the fast-evolving discipline of inference engineering. We explore why inference has become the stickiest and most cri...

30 Apr 54min

How Capital One Delivers Multi-Agent Systems with Rashmi Shetty - #765

How Capital One Delivers Multi-Agent Systems with Rashmi Shetty - #765

In this episode, Rashmi Shetty, senior director of enterprise generative AI platform at Capital One, joins us to explore how the company is designing, deploying, and scaling multi-agent systems in a h...

16 Apr 54min

The Race to Production-Grade Diffusion LLMs with Stefano Ermon - #764

The Race to Production-Grade Diffusion LLMs with Stefano Ermon - #764

Today, we're joined by Stefano Ermon, associate professor at Stanford University and CEO of Inception Labs to discuss diffusion language models. We dig into how diffusion approaches—traditionally used...

26 Mar 1h 3min

Agent Swarms and Knowledge Graphs for Autonomous Software Development with Siddhant Pardeshi - #763

Agent Swarms and Knowledge Graphs for Autonomous Software Development with Siddhant Pardeshi - #763

In this episode, Sid Pardeshi, co-founder and CTO of Blitzy, joins us to discuss building autonomous development systems able to deliver production-ready software at enterprise scale. Sid contrasts AI...

10 Mar 1h 16min

Populært innen Politikk og nyheter

giver-og-gjengen-vg
aftenpodden
aftenpodden-usa
fotballpodden-2
forklart
stopp-verden
popradet
nokon-ma-ga
lydartikler-fra-aftenposten
rss-espen-lee-usensurert
det-store-bildet
rss-gukild-johaug
dine-penger-pengeradet
rss-ness
aftenbla-bla
hanna-de-heldige
rss-utenrikskomiteen-med-bogen-og-grasvik
frokostshowet-pa-p5
e24-podden
rss-penger-polser-og-politikk