The One With Damion Yates and Building AI systems

How do you introduce Site Reliability Engineering to an AI research lab, bringing concepts of scale to engineers who are at the leading edge of AI systems?

In the latest episode of The Prodcast, hosts Steve McGhee and Florian Rathgeber chat with Damion Yates, who helped establish the reliability engineering culture at Google DeepMind. Damion shares his journey of bringing scalable infrastructure to DeepMind, supporting massive machine learning experiments.

Discover the unique challenges of supporting AI research, such as managing highly expensive "lockstep" training models where a single machine failure halts the entire process. Damion also explains why he believes "luck is our enemy" in systems engineering, and why protecting a research scientist's time is the ultimate metric for success.

Episodes (51)

The One With Data Centers and Peter Pellerzi

This episode features guest Peter Pellerzi (Distinguished Engineer, Google). Peter and the hosts, Matt Siegler and Steve McGhee, focus on the physical infrastructure side of SRE, discussing topics su...

28 May 2025, 36 min

The One With Security and Jessica Theodat

Jessica Theodat (Senior SRE & Security Tech Lead, Google) joins hosts Jordan Greenberg and Steve McGhee to discuss the intersection of security and site reliability engineering at Google. Jessica touc...

21 May 2025, 19 min

We're back with Season 4!

In this "bumpisode", hosts and producers of the Prodcast (including our new co-host, Matt Siegler!) reflect on the previous season and introduce the new season's focus on upcoming trends in Site Reliabili...

16 Apr 2025, 15 min

Special Episode: You Missed a Page from Telebot

This episode features Javi Beltran, a Google engineering lead who created the "Telebot" theme song. With our beloved hosts, Steve McGhee and Jordan Greenberg, Beltran discusses the origins of the song...

29 Jan 2025, 16 min

Imperative vs. Declarative Change Workflows with Dominic Hutton & Niccolo' Cascarano

In this episode of the Prodcast, guests Dominic Hutton (Staff SRE, HashiCorp) and Niccolo' Cascarano (Senior Staff SRE at Google) join hosts Steve McGhee and Jordan Greenberg to dive into configuratio...

11 Dec 2024, 36 min

Human Factors in Complex Systems with Casey Rosenthal and John Allspaw

This episode features Casey Rosenthal (Founder, Cirrusly.ai) and John Allspaw (Founder and Principal, Adaptive Capacity Labs), joining our hosts Steve McGhee and Jordan Greenberg. Together they discus...

4 Dec 2024, 41 min

Embracing Complexity with Christina Schulman & Dr. Laura Maguire

In this episode of the Prodcast, we are joined by guests Christina Schulman (Staff SRE, Google) and Dr. Laura Maguire (Principal Engineer, Trace Cognitive Engineering). They emphasize the human elemen...

20 Nov 2024, 33 min

Maglev: load balancing at Google with Cody Smith and Trisha Weir

In this episode, Cody Smith (CTO and Co-founder, Camus Energy) and Trisha Weir (SRE Department Lead, Google) join hosts Steve McGhee and Jordan Greenberg to discuss their experience developing Maglev, ...

13 Nov 2024, 32 min
