AI testing, benchmarks and evals

Generative AI's popularity has led to a renewed interest in quality assurance — perhaps unsurprising given the inherent unpredictability of the technology. This is why, over the last year, the field has seen a number of techniques and approaches emerge, including evals, benchmarking and guardrails. While these terms all refer to different things, grouped together they all aim to improve the reliability and accuracy of generative AI.

To discuss these techniques and the renewed enthusiasm for testing across the industry, host Lilly Ryan is joined by Shayan Mohanty, Head of AI Research at Thoughtworks, and John Singleton, Program Manager for Thoughtworks' AI Lab. They discuss the differences between evals, benchmarking and testing and explore both what they mean for businesses venturing into generative AI and how they can be implemented effectively.

Learn more about evals, benchmarks and testing in this blog post by Shayan and John (written with Parag Mahajani): https://www.thoughtworks.com/insights/blog/generative-ai/LLM-benchmarks,-evals,-and-tests

This episode is taken from an open RSS feed and is not published by Podme. It may therefore contain advertisements.

Episodes (100)

What does code mean in 2026?

What is code? It might sound obvious, but if you scratch the surface it becomes more difficult to articulate precisely what we mean. AI is complicating the picture further and changing the relationshi...

25 Jun 40min

Database branching: Overcoming the bottlenecks of shared database environments

Database branching has, for a long time, been a troublesome piece in the modern developer workflow puzzle: a good idea in principle but in practice a slow and often expensive challenge. Get it right a...

11 Jun 39min

What is spec-driven development?

Semantic diffusion, combined with the pace of technology change, makes talking about AI-adjacent practices and techniques incredibly difficult. There are few better examples of this issue than the te...

28 May 45min

What is harness engineering?

'Harness engineering' is one of the most significant terms to emerge in software engineering in 2026. Broadly referring to the work done to control unpredictable AI agents and coding assistants, its u...

14 May 40min

Anthropic Mythos: Hype, reality and the actual security implications

Anthropic Mythos garnered significant attention when it was launched in mid-April 2026. Yet despite it apparently presenting an unprecedented threat to global software, you don't have to look too close...

30 Apr 48min

Key themes in Technology Radar Vol.34

In April 2026 we published a new edition of the Thoughtworks Technology Radar — volume 34. Like many recent volumes, this one was dominated by AI. However, while editions over the last couple of years...

15 Apr 44min

How it feels to be a software engineer when AI is changing our relationship with code

There's been a lot of discussion and debate in recent months about exactly how software engineering will be reshaped by AI. While it remains to be seen what the discipline will look like once things q...

2 Apr 41min

Be brilliant at the basics: Inside Looking Glass 2026

The Thoughtworks 2026 Looking Glass report was published in January. Designed to provide business and technology leaders with the tools to better understand and navigate future trends, this edition pa...

19 Mar 46min
