Ep.6: How Real-World Messiness Impacts AI Agent Success

In this episode of the Cisco AI Insights Podcast, hosts Rafael Herrera and Sónia Marques are joined by Cisco ML Engineer Paul Mutawe to explore the paper "Measuring AI Ability to Complete Long Software Tasks," which introduces a novel time horizon metric for evaluating how autonomously AI agents can execute complex, multi-hour engineering projects. The discussion looks at the rapid evolution of these agents, highlighting the key finding that the fifty percent success time horizon is doubling every two hundred and seven days. It also covers how unbiased benchmarking environments like the Modular Public harness are used to evaluate frontier models, how real-world complexities captured by the sixteen-item messiness factor significantly reduce agent success rates, and why human-in-the-loop oversight is critical to combat context rot. A special thank you to the research team at Model Evaluation and Threat Research, who developed this paper. If you are interested in reading the paper yourself, please visit the link: https://arxiv.org/pdf/2503.14499
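
As a rough illustration of what a fixed doubling period implies, the sketch below extrapolates a time horizon forward under steady exponential growth. Only the 207-day doubling figure comes from the description above. The function name `project_horizon` and the 60-minute starting value are hypothetical, chosen for illustration, and are not taken from the paper or the episode.

```python
# Illustrative sketch only: extrapolates a time horizon under a fixed doubling period.
# The 207-day figure is the one quoted in the episode description; everything else
# (function name, starting horizon) is an assumption made for this example.

DOUBLING_DAYS = 207.0


def project_horizon(current_minutes: float,
                    days_ahead: float,
                    doubling_days: float = DOUBLING_DAYS) -> float:
    """Return the projected horizon in minutes after `days_ahead` days,
    assuming the horizon doubles every `doubling_days` days."""
    return current_minutes * 2 ** (days_ahead / doubling_days)


if __name__ == "__main__":
    start_minutes = 60.0  # hypothetical starting horizon, not a figure from the paper
    for days in (0, 207, 414, 621):
        projected = project_horizon(start_minutes, days)
        print(f"after {days} days: {projected:.0f} minutes")
```

The projection assumes the trend continues unchanged, which is a strong assumption. It is meant to make the arithmetic concrete, not to forecast actual agent capability.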

This episode was added to Podme via an open RSS feed and is not produced by Podme. It may therefore contain advertising.

Episodes (500)

Shift Happens Episode 38: Rolling Out AI to 90,000 People w/Greg Sylvester

Most companies are still figuring out how to use AI. Greg Sylvester is figuring out how to deliver it to nearly 90,000 employees. As Cisco's VP of Enterprise AI Platform & Infrastructure, Greg is lead...

30 Jun 43 min

404 Script Not Found: Cisco Live 2026

Strap in, folks—we’re recapping all the chaos, genius, and buffet lines from Cisco Live (just a few weeks late)! Kat and Ian are cutting through the noise to tell you what actually mattered (and wha...

25 Jun 12 min

S7 E7: Talking insights, strategy, and the human side of AI with Pascal Bornet and Sam Charrington

Join AB for a candid conversation with AI luminaries Pascal Bornet and Sam Charrington. From the sci-fi films that sparked their early interest in tech to the strategic frameworks for mastering today'...

23 Jun 31 min

What Deep Space Operations Can Teach Us About Agentic AI

The Internet Report podcast examines lessons from deep space operations that can help developers build more resilient agentic AI systems that are capable of managing intermittent connectivity and stal...

22 Jun 22 min

404 Script Not Found: The Evolution of WiFi

This week starts with Ian having the kind of week that feels almost made up: his AC is out, the gym AC is out, the other gym AC is out, and somehow he is just supposed to act like everything is fine! ...

18 Jun 15 min

S7 E6: Talking strategy, M&A, and accelerating Cisco innovation with Ammar Maraqa

AB sits down with Ammar Maraqa, Cisco’s Chief Strategy Officer, to discuss topics such as turning an acquisition into a competitive advantage, fostering an agile and customer-centric culture, aligning...

16 Jun 28 min

404 Script Not Found: Talking Tech in Sports with Bryan Bedford

Ever wonder how 80,000 people can upload to TikTok at the exact same time during a halftime show? Or how NFL teams are using AI to draft the next superstar? This week, Kat flies solo (while Ian is off...

11 Jun 26 min