Who owns this outage? Building intelligent, automated escalation chains

Maxwell, a solutions architect at xMatters, took a winding road to get to where he is. After studying computer engineering, he worked as a field support engineer, product manager, and SRE before landing in his current role, where he serves as something of an SRE for SREs, helping them solve incident management problems with xMatters.

When he moved to the SRE role, Maxwell wanted to get back to doing technical work. It was a lateral move within his company, which was migrating an on-prem solution into the cloud. It’s a journey that plenty of companies are making now: breaking an application into microservices, running processes in containers, and using Kubernetes to orchestrate the whole thing. The migration had its costs, though: non-production environments would go down and waste SRE time, making it harder to address problems in the production pipeline.

At the heart of the team’s issues was the incident response process. Several bottlenecks kept them from delivering value to their customers quickly. Incident alerts went out by email to the relevant engineers, sometimes 20 on a single message, which made it easy for any one engineer to ignore the problem on the assumption that someone else had it covered. The teams were also badly siloed, so escalating to the right person across groups became an issue of its own. And of course, most of this was manual. Their MTTR (mean time to resolve) was lagging.

Maxwell moved over to xMatters because the company had managed to solve these problems through clever automation. Its product automates the scheduling and notification process so that the right person knows about the incident as soon as possible. At the core of this approach is a different MTTR: mean time to respond. Once an engineer starts working to resolve a problem, it is all down to runbooks and skill. The lag between the initial incident and that start was the real slowdown.

It’s not just the response from the first SRE on call. The other escalations down the line, to data engineers for example, can eat away time too. xMatters has worked hard to make escalation configuration easy: it covers not only who’s responsible for specific services and metrics, but also who’s next in the escalation chain from there. When an incident hits, the notifications go out through a series of configured channels; the system might try a chat program first, then email, then SMS.
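To make the idea concrete, here is a minimal sketch of an escalation chain with per-person channel fallback. It is not the xMatters API or configuration format; the names `Responder`, `escalate`, and the `send` callback are hypothetical, and the chat, email, SMS order is simply the example mentioned above. A real system would also wait for a timeout before moving to the next channel, which this sketch leaves to the `send` callback.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical default channel order, taken from the example in the text.
DEFAULT_CHANNELS = ["chat", "email", "sms"]


@dataclass
class Responder:
    """One person (or role) in the escalation chain. Illustrative only."""
    name: str
    channels: List[str] = field(default_factory=lambda: list(DEFAULT_CHANNELS))


# A sender delivers one notification and returns True if it was acknowledged.
Sender = Callable[[Responder, str, str], bool]


def escalate(
    chain: List[Responder], incident: str, send: Sender
) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Walk the chain in order, trying each responder's channels in turn.

    Returns the name of whoever acknowledged (or None if nobody did),
    plus the list of (responder, channel) attempts that were made.
    """
    attempts: List[Tuple[str, str]] = []
    for responder in chain:
        for channel in responder.channels:
            attempts.append((responder.name, channel))
            if send(responder, channel, incident):
                return responder.name, attempts
    return None, attempts


if __name__ == "__main__":
    # Simulated acknowledgements: only the data engineer answers, and only by SMS.
    acknowledged = {("data-engineer", "sms")}

    def fake_send(responder: Responder, channel: str, incident: str) -> bool:
        return (responder.name, channel) in acknowledged

    chain = [Responder("primary-sre"), Responder("data-engineer")]
    owner, attempts = escalate(chain, "checkout latency high", fake_send)
    print(owner, len(attempts))  # prints: data-engineer 6
```

Keeping the chain as ordered data, rather than hard-coding who gets paged, is what lets the escalation path be changed without touching the notification logic.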

The on-call process is often a source of dread, but automating the escalation process can take some of the sting out of it. Check out the episode to learn more.

