The Stack Overflow Podcast

22 Nov 2021

Who owns this outage? Building intelligent, automated escalation chains

Maxwell, a solutions architect at xMatters, took a winding road to get to where he is. After a computer engineering education, he held jobs as a field support engineer, product manager, and SRE before landing in his current role, where he serves as something of an SRE for SREs, helping them solve incident management problems with xMatters.

When he moved to the SRE role, Maxwell wanted to get back to doing technical work. It was a lateral move within his company, which was migrating an on-prem solution into the cloud. It’s a journey that plenty of companies are making now: breaking an application into microservices, running processes in containers, and using Kubernetes to orchestrate the whole thing. Non-production environments would go down and waste SRE time, making it harder to address problems in the production pipeline.

At the heart of their issues was the incident response process. They had several bottlenecks that prevented them from delivering value to their customers quickly. Incidents would send emails to the relevant engineers, sometimes 20 on a single email, which made it easy for any one engineer to ignore the problem—someone else has got this. They had a bad silo problem, where escalating to the right person across groups became an issue of its own. And of course, most of this was manual. Their MTTR—mean time to resolve—was lagging.

Maxwell moved over to xMatters because they managed to solve these problems through clever automation. Their product automates the scheduling and notification process so that the right person knows about the incident as soon as possible. At the core of this process was a different MTTR—mean time to respond. Once an engineer started working to resolve a problem, it was all down to runbooks and skill. But the lag between the initial incident and that start was the real slowdown.
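To make the difference between the two MTTRs concrete, here is a minimal sketch of how each could be computed from incident timestamps. The `Incident` record, its field names, and the sample times are hypothetical and made up for illustration; they are not from the episode or from xMatters.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions for illustration.
    opened: datetime        # the alert fired
    acknowledged: datetime  # an engineer started working on it
    resolved: datetime      # the problem was fixed


def mean_delta(deltas):
    """Average a sequence of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_respond(incidents):
    """Average lag between the alert and someone picking it up."""
    return mean_delta(i.acknowledged - i.opened for i in incidents)


def mean_time_to_resolve(incidents):
    """Average lag between the alert and the fix."""
    return mean_delta(i.resolved - i.opened for i in incidents)


# Made-up sample data: two incidents opened at the same time.
t0 = datetime(2021, 11, 22, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=30), t0 + timedelta(minutes=50)),
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=40)),
]

respond = mean_time_to_respond(incidents)
resolve = mean_time_to_resolve(incidents)
```

In this invented sample, time to respond accounts for a large share of time to resolve, which is the kind of lag that faster notification is meant to shrink.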

It’s not just the response from the first SRE on call. It’s the other escalations down the line, to data engineers for example, that can eat away time. xMatters has worked hard to make escalation configuration easy. The configuration covers not only who’s responsible for specific services and metrics, but also who’s in the escalation chain from there. When the incident hits, the notifications go out through a series of configured channels; maybe it tries a chat program first, then email, then SMS.
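The escalation behavior described above can be sketched roughly as follows. This is not the xMatters API or configuration format; `Step`, `escalate`, the responder names, and the `notify` callback are all hypothetical, and a real system would also wait for a timeout before moving to the next channel.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    # Hypothetical escalation step: one responder and the channels
    # to try for that responder, in order.
    responder: str
    channels: Tuple[str, ...]


def escalate(
    chain: Sequence[Step],
    notify: Callable[[str, str], bool],
) -> Optional[str]:
    """Walk the chain in order and return the first responder who
    acknowledges, or None if nobody does.

    `notify(responder, channel)` is an assumed callback that sends the
    notification and returns True if it was acknowledged.
    """
    for step in chain:
        for channel in step.channels:
            if notify(step.responder, channel):
                return step.responder
    return None


# Example chain: first the on-call SRE, then a data engineer.
CHAIN = [
    Step("primary-sre", ("chat", "email", "sms")),
    Step("data-engineer", ("chat", "email", "sms")),
]

attempts = []


def fake_notify(responder: str, channel: str) -> bool:
    """Stand-in for a real notifier: only the data engineer answers,
    and only by email."""
    attempts.append((responder, channel))
    return (responder, channel) == ("data-engineer", "email")


owner = escalate(CHAIN, fake_notify)
```

Keeping the chain as plain data and passing the notifier in as a function means the same escalation logic can be exercised with a fake notifier, as here, without sending real messages.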

The on-call process is often a source of dread, but automating the escalation process can take some of the sting out of it. Check out the episode to learn more.


This episode is taken from an open RSS feed and is not published by Podme. It may contain advertising.

Episodes (966)

Partnerships can keep open source sustainable

Ryan welcomes VoidZero’s Evan You and Cloudflare’s Dane Knecht back to the show to discuss Cloudflare’s recent acquisition of VoidZero and what it means for JavaScript development, how partnerships li...

24 July 0s

The future of development is full-stack

Live from Snowflake Summit, Ryan talks with Snowflake’s Head of Developer Experience Umesh Unnikrishnan about the industry-wide shift from “vibe coding” for quick prototypes to agentic engineering for...

21 July 0s

Developers who move fast still need to do it together

At MS Build, Ryan is joined by Cassidy Williams, Senior Director of Developer Advocacy at GitHub and former Stack Overflow Podcast host, to discuss how agentic coding is shifting dev work towards high...

17 July 28 min

Your AI is only as responsible as you are

Recorded at Microsoft Build, Ryan welcomes Sarah Bird, Microsoft’s Chief Product Officer for Responsible AI, about how we can build and use AI responsibly with the NIST approach, why most irresponsibl...

14 July 28 min

Building more than just an agent harness

Live from Microsoft Build, Ryan is joined by Jay Parikh, Microsoft’s VP of AI Core, for a conversation on what enterprises need to build, deploy, and run AI agents at scale with demonstrable ROI; how ...

10 July 31 min

What's left for infrastructure-as-code after AI moves in?

SPONSORED BY IBM. Ryan is joined by Rosemary Wang, Developer Advocate at IBM, to explore what infrastructure as code looks like once AI starts writing and deploying it. They discuss why guardrails still...

8 July 30 min

Agent orchestration is so two-years ago

Ryan welcomes Saahil Jain, CTO of You.com, to discuss why building agents with a 2024 mindset is a mistake as modern models improve at long-horizon tasks, why heavy orchestration layers can hurt model...

7 July 31 min

The good, the bad, and the AI apps

Ryan welcomes Benny Chen, co-founder of Fireworks AI, to the show to explore what actually makes an AI application good or not, how to balance qualitative signals with quantitative metrics when evalua...

3 July 25 min