Everything looked fine on dashboards. Error rates normal. Latency steady. But inside the system, cracks were spreading. And inside the team, trust was thinning.
Auditing an SRE team is not about checking a box. It’s about finding what’s under the surface before it breaks. You need to know whether your incident response runbooks are actually used, whether alerts serve the operators or enslave them, and whether toil is creeping in like rust, slowing every fix and clouding every decision.
A proper SRE audit starts with clarity. Almost nothing hides forever when you track the right metrics, talk to the right people, and follow every lead. Start with on-call load and resolution times. How many times a week are people getting paged? Are they solving root causes or just clearing noise? Then move into service-level objectives (SLOs). Are they defined? Measured? Respected? Numbers that live on a slide are meaningless if no one owns them or believes in them.
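The questions above can be turned into numbers with very little tooling. What follows is a minimal sketch, not a prescribed method: it assumes you can export pages from your paging tool as records with an open time, a resolve time, and a flag for whether a human actually had to act. The `Page` class, its field names, and both function names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Page:
    opened: datetime
    resolved: datetime
    actionable: bool  # hypothetical flag: did this page need human action?


def paging_summary(pages, weeks):
    """Summarize on-call load over an audit window of `weeks` weeks."""
    if not pages or weeks <= 0:
        raise ValueError("need at least one page and a positive window")
    minutes = [(p.resolved - p.opened).total_seconds() / 60 for p in pages]
    return {
        "pages_per_week": len(pages) / weeks,
        "median_minutes_to_resolve": median(minutes),
        # Share of pages that needed no action, i.e. noise.
        "noise_ratio": sum(not p.actionable for p in pages) / len(pages),
    }


def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget left; a negative value means overspent."""
    allowed_bad = (1 - slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("a 100% target or an empty window leaves no budget")
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    sample = [
        Page(start, start + timedelta(minutes=10), True),
        Page(start, start + timedelta(minutes=20), False),
        Page(start, start + timedelta(minutes=30), True),
        Page(start, start + timedelta(minutes=40), True),
    ]
    print(paging_summary(sample, weeks=2))
    print(error_budget_remaining(0.99, good_events=9950, total_events=10000))
```

A high noise ratio suggests people are clearing alerts rather than fixing causes, and an SLO whose budget nobody can compute is probably one that lives only on a slide. The median is used instead of the mean so that a single marathon incident does not hide the typical case.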
Culture is part of the audit. You measure burnout the same way you measure latency: with data and patterns. Look for signs like skipped postmortems, unreviewed pull requests, and operational shortcuts. These are early warnings that trust in the process is breaking down.
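The same approach works for the cultural signals. The sketch below assumes you can count, per quarter, how many incidents occurred, how many got a postmortem, and how many pull requests merged without review. The `Quarter` record, its fields, and the function names are hypothetical; the point is only that a rate tracked over time shows a pattern that a single snapshot cannot.

```python
from dataclasses import dataclass


@dataclass
class Quarter:
    label: str
    incidents: int
    postmortems_written: int
    prs_merged: int
    prs_merged_unreviewed: int


def process_signals(q):
    """Turn raw counts into rates so quarters of different size compare fairly."""
    skipped = q.incidents - q.postmortems_written
    return {
        "postmortem_skip_rate": skipped / q.incidents if q.incidents else 0.0,
        "unreviewed_pr_rate": (
            q.prs_merged_unreviewed / q.prs_merged if q.prs_merged else 0.0
        ),
    }


def is_worsening(quarters, key):
    """True if the chosen rate rose in every consecutive quarter."""
    values = [process_signals(q)[key] for q in quarters]
    return len(values) >= 2 and all(b > a for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    history = [
        Quarter("Q1", incidents=10, postmortems_written=9,
                prs_merged=200, prs_merged_unreviewed=4),
        Quarter("Q2", incidents=10, postmortems_written=7,
                prs_merged=200, prs_merged_unreviewed=10),
        Quarter("Q3", incidents=10, postmortems_written=5,
                prs_merged=200, prs_merged_unreviewed=30),
    ]
    for key in ("postmortem_skip_rate", "unreviewed_pr_rate"):
        print(key, "worsening:", is_worsening(history, key))
```

A steadily rising skip rate is not proof of burnout, but it is the kind of pattern worth raising in interviews with the team.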