
Benchmark Bank Heist
What if an AI decided the smartest way to pass its test was to find the answer key? That's exactly what Anthropic's Claude Opus did when faced with a benchmark evaluation — reasoning that it was being...
6 Apr 12min

Benchmarking AI Models
How do you know if a new AI model is actually better than the last one? It turns out answering that question is a lot messier than it sounds. This week we dig into the world of LLM benchmarks — the st...
30 Mar 29min

The Hot Mess of AI (Mis-)Alignment
The paperclip maximizer — the classic AI doom scenario where a hyper-competent machine single-mindedly converts the universe into office supplies — might not be the AI risk we should actually lose sle...
23 Mar 22min

The Bitter Lesson
Every AI builder knows the anxiety: you spend months engineering prompts, tuning pipelines, and chaining calls together — then a new model drops and half your work evaporates overnight. It turns out r...
15 Mar 19min

From Atari to ChatGPT: How AI Learned to Follow Instructions
From Atari to ChatGPT: How AI Learned to Follow Instructions by Katie Malone
9 Mar 25min

It's RAG time: Retrieval-Augmented Generation
Today we are going to talk about the feature with the worst acronym in generative AI: RAG, or Retrieval-Augmented Generation. If you've ever used something like "Chat with My Docs," if you have an int...
2 Mar 17min

Chasing Away Repetitive LLM Responses with Verbalized Sampling
One of the things that LLMs can be really helpful with is brainstorming or generating new creative content. They are called Generative AI, after all—not just for summarization and question-and-answer ...
23 Feb 19min

We're Back
It's been (*checks watch*) about five and a half years since we last talked. Fortunately nothing much has happened in the AI/data science world in that time. So let's just pick up where we left off, s...
16 Feb 2min
