Using the Smartest AI to Rate Other AI

In this episode, I walk through a Fabric Pattern that assesses how well a given model performs a task relative to humans. It uses your smartest AI model to evaluate other AIs by scoring their output across a range of tasks and comparing it to human intelligence levels.

I talk about:

1. Using One AI to Evaluate Another
The core idea is simple: use your most capable model (like Claude 3 Opus or GPT-4) to judge the output of another model (like GPT-3.5 or Haiku), given the task and the input. This gives you a way to benchmark quality without reviewing every output by hand.

2. A Human-Centric Grading System
Models are scored on a human scale, from “uneducated” and “high school” up to “PhD” and “world-class human.” As expected, stronger models consistently land higher on that scale than weaker ones.

3. Custom Prompts That Push for Deeper Evaluation
The rating prompt tells the evaluator to emulate a scoring system with 16,000+ dimensions, applying expert-level heuristics and attention to nuance. It also asks the evaluator to describe what the response would have needed to score higher, which turns the rating into a feedback loop for improving future performance.
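
The three ideas above can be sketched in a few lines of Python. This is a rough illustration, not the actual Fabric Pattern: the prompt wording, the two-line reply format, and the helper names (`build_judge_prompt`, `parse_rating`, `rate`) are all assumptions made for this example, and the scale lists only the four levels named above (the real pattern may define more). The judge is passed in as a plain function, so any model client could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable

# Only the four levels named in these notes, lowest to highest.
# Assumption: the real pattern may define additional levels in between.
HUMAN_LEVELS = ["uneducated", "high school", "PhD", "world-class human"]
_CANONICAL = {level.lower(): level for level in HUMAN_LEVELS}


def build_judge_prompt(task: str, task_input: str, candidate_output: str) -> str:
    """Assemble the text sent to the judge model (hypothetical wording)."""
    levels = ", ".join(HUMAN_LEVELS)
    return (
        "You are an expert evaluator of AI output.\n"
        f"Rate the RESPONSE on this human scale (lowest to highest): {levels}.\n"
        "Reply with exactly two lines:\n"
        "LEVEL: <one level from the scale>\n"
        "TO SCORE HIGHER: <what the response would have needed>\n\n"
        f"TASK:\n{task}\n\nINPUT:\n{task_input}\n\nRESPONSE:\n{candidate_output}\n"
    )


@dataclass
class Rating:
    level: str            # one of HUMAN_LEVELS
    to_score_higher: str  # the judge's improvement feedback

    @property
    def rank(self) -> int:
        """Position on the scale; 0 is the lowest level."""
        return HUMAN_LEVELS.index(self.level)


def parse_rating(judge_reply: str) -> Rating:
    """Parse the two-line reply format requested in the prompt."""
    fields = {}
    for line in judge_reply.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().upper()] = value.strip()
    level = _CANONICAL.get(fields.get("LEVEL", "").lower())
    if level is None:
        raise ValueError(f"unrecognized level in judge reply: {judge_reply!r}")
    return Rating(level, fields.get("TO SCORE HIGHER", ""))


def rate(judge: Callable[[str], str], task: str, task_input: str,
         candidate_output: str) -> Rating:
    """Ask the judge (any function from prompt to reply) to rate one output."""
    return parse_rating(judge(build_judge_prompt(task, task_input, candidate_output)))


# Stand-in for a call to your strongest model; swap in a real client here.
def fake_judge(prompt: str) -> str:
    return "LEVEL: high school\nTO SCORE HIGHER: Cite sources and cover edge cases."


rating = rate(fake_judge, "Summarize the article", "(article text)", "It is about AI.")
print(rating.level, rating.rank, rating.to_score_higher)
```

Keeping the judge as an injected function means the same code can rate outputs from any weaker model, and the `to_score_higher` text can be fed back into the next attempt.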

Note: This episode was recorded a few months ago, so the AI models mentioned may not be the latest, but the framework and methodology still work with current models.

Subscribe to the newsletter at:
https://danielmiessler.com/subscribe

Join the UL community at:
https://danielmiessler.com/upgrade

Follow on X:
https://x.com/danielmiessler

Follow on LinkedIn:
https://www.linkedin.com/in/danielmiessler

See you in the next one!

See omnystudio.com/listener for privacy information.

This episode comes from an open RSS feed and is not published by Podme. It may therefore contain ads.
