Get Smarter with Every Incident

I had the opportunity to discuss building sustainable, proactive incident-response practices on a Lead Dev panel in January 2025. Arjit, Dave, and I, along with our moderator Najla, dug into several questions about improving incident response, reducing impact and burnout, and setting on-call teams up for success.

Some of the questions we addressed:

  • How are you using AI to catch early warning signs and tune alerts—and where does it still create noise or blind spots?
  • In the middle of an incident, what slows triage down the most, and what would it take to get the right context in one place faster?
  • After an outage, why do the same problems keep coming back—and what’s actually worked to turn post-incident learnings into real change?
  • How do you shift from nonstop firefighting to actually fixing—what do you change in planning, priorities, or resourcing to make that real?
  • When does on-call fatigue stop being an individual problem and become a systemic one—and how do you pinpoint what’s driving it?
  • What should engineering leaders do differently to make on-call sustainable—especially protecting remediation time and backing teams when they pause roadmap work?

The recording is available on LeadDev.
