How I AI

13 Oct 2025

Evals, error analysis, and better prompts: A systematic approach to improving your AI products | Hamel Husain (ML engineer)

Hamel Husain, an AI consultant and educator, shares his systematic approach to improving AI product quality through error analysis, evaluation frameworks, and prompt engineering. In this episode, he demonstrates how product teams can move beyond “vibe checking” their AI systems to implement data-driven quality improvement processes that identify and fix the most common errors. Using real examples from client work with Nurture Boss (an AI assistant for property managers), Hamel walks through practical techniques that product managers can implement immediately to dramatically improve their AI products.

What you’ll learn:

1. A step-by-step error analysis framework that helps identify and categorize the most common AI failures in your product

2. How to create custom annotation systems that make reviewing AI conversations faster and more insightful

3. Why binary evaluations (pass/fail) are more useful than arbitrary quality scores for measuring AI performance

4. Techniques for validating your LLM judges to ensure they align with human quality expectations

5. A practical approach to prioritizing fixes based on frequency counting rather than intuition

6. Why looking at real user conversations (not just ideal test cases) is critical for understanding AI product failures

7. How to build a comprehensive quality system that spans from manual review to automated evaluation
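
As a rough illustration of items 3 to 5 above, the sketch below counts failure categories from human-reviewed traces and checks how often an LLM judge agrees with human pass/fail labels. It is a minimal sketch, not the tooling shown in the episode: the annotation fields (`trace_id`, `passed`, `category`), the category names, and the helper functions are all hypothetical.

```python
from collections import Counter

# Hypothetical annotations: each reviewed trace gets a binary verdict and,
# when it fails, a short failure category written by the human reviewer.
annotations = [
    {"trace_id": "t1", "passed": False, "category": "missed handoff to human"},
    {"trace_id": "t2", "passed": True, "category": None},
    {"trace_id": "t3", "passed": False, "category": "wrong date"},
    {"trace_id": "t4", "passed": False, "category": "missed handoff to human"},
    {"trace_id": "t5", "passed": True, "category": None},
]


def failure_counts(rows):
    """Return (category, count) pairs for failed traces, most frequent first."""
    return Counter(r["category"] for r in rows if not r["passed"]).most_common()


def judge_agreement(human, judge):
    """Compare judge verdicts with human verdicts (True = pass).

    Returns the true positive rate (judge says pass when the human says pass)
    and the true negative rate (judge says fail when the human says fail).
    A rate is None when the human labels contain no examples of that class.
    """
    pairs = list(zip(human, judge))
    human_pass = [j for h, j in pairs if h]
    human_fail = [j for h, j in pairs if not h]
    tpr = sum(human_pass) / len(human_pass) if human_pass else None
    tnr = sum(not j for j in human_fail) / len(human_fail) if human_fail else None
    return {"tpr": tpr, "tnr": tnr}


if __name__ == "__main__":
    # The most frequent failure category is the first candidate to fix.
    for category, count in failure_counts(annotations):
        print(f"{count:>3}  {category}")

    # Hypothetical labels for four traces: human verdicts vs. judge verdicts.
    print(judge_agreement([True, True, False, False], [True, False, False, False]))
```

Reporting the two rates separately, rather than a single accuracy number, keeps a judge that passes almost everything from looking well aligned when most traces are passes.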

—

Brought to you by:

GoFundMe Giving Funds—One account. Zero hassle: https://gofundme.com/howiai

Persona—Trusted identity verification for any use case: https://withpersona.com/lp/howiai

—

Where to find Hamel Husain:

Website: https://hamel.dev/

Twitter: https://twitter.com/HamelHusain

Course: https://maven.com/parlance-labs/evals

GitHub: https://github.com/hamelsmu

—

Where to find Claire Vo:

ChatPRD: https://www.chatprd.ai/

Website: https://clairevo.com/

LinkedIn: https://www.linkedin.com/in/clairevo/

X: https://x.com/clairevo

—

In this episode, we cover:

(00:00) Introduction to Hamel Husain

(03:05) The fundamentals: why data analysis is critical for AI products

(06:58) Understanding traces and examining real user interactions

(13:35) Error analysis: a systematic approach to finding AI failures

(17:40) Creating custom annotation systems for faster review

(22:23) The impact of this process

(25:15) Different types of evaluations

(29:30) LLM-as-a-Judge

(33:58) Improving prompts and system instructions

(38:15) Analyzing agent workflows

(40:38) Hamel’s personal AI tools and workflows

(48:02) Lightning round and final thoughts

—

Tools referenced:

• Claude: https://claude.ai/

• Braintrust: https://www.braintrust.dev/docs/start

• Phoenix: https://phoenix.arize.com/

• AI Studio: https://aistudio.google.com/

• ChatGPT: https://chat.openai.com/

• Gemini: https://gemini.google.com/

—

Other references:

• Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://dl.acm.org/doi/10.1145/3654777.3676450

• Nurture Boss: https://nurtureboss.io

• Rechat: https://rechat.com/

• Your AI Product Needs Evals: https://hamel.dev/blog/posts/evals/

• A Field Guide to Rapidly Improving AI Products: https://hamel.dev/blog/posts/field-guide/

• Creating a LLM-as-a-Judge That Drives Business Results: https://hamel.dev/blog/posts/llm-judge/

• Lenny’s List on Maven: https://maven.com/lenny

—

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email jordan@penname.co.
