Every SRE organization has a break-glass procedure. Very few have measured how much it actually costs them.
The metric nobody tracks
Organizations measure time-to-detect and time-to-resolve religiously but completely ignore what I'd call "access latency": the gap between an engineer knowing what to fix and actually being able to touch the system.
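Access latency is easy to compute once you can join your incident tracker with your access-proxy logs. A minimal sketch, with hypothetical event names and invented timestamps purely for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical incident timeline; in practice these timestamps come from
# your incident tracker and your access-proxy audit logs.
events = {
    "detected": datetime(2024, 5, 1, 2, 4),            # alert fired
    "diagnosed": datetime(2024, 5, 1, 2, 12),          # engineer knows the fix
    "first_prod_access": datetime(2024, 5, 1, 2, 39),  # first successful login
    "resolved": datetime(2024, 5, 1, 2, 51),
}

def access_latency(ev: dict) -> timedelta:
    """Gap between knowing what to fix and being able to touch the system."""
    return ev["first_prod_access"] - ev["diagnosed"]

def time_to_resolve(ev: dict) -> timedelta:
    return ev["resolved"] - ev["detected"]

lat = access_latency(events)
print(f"access latency: {lat}")                        # 0:27:00
print(f"share of TTR: {lat / time_to_resolve(events):.0%}")  # 57%
```

In this invented timeline, more than half of time-to-resolve is spent getting in, and none of it shows up in the metrics most teams report.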
Picture a bad migration that just hit production. The on-call engineer knows exactly what to do: query the affected table, assess the blast radius, run a corrective script. The technical work might take ten minutes. But first they need to find the right credentials, connect through the VPN, figure out which bastion to use, and hope their access hasn't been rotated since the last time they needed it. That's the mechanical delay.
Then there's the human delay. The engineer pings Slack for help getting in. The person who manages access is asleep, or in a different timezone, or on PTO. Someone suggests a shared service account. Someone else isn't sure if that still works. Fifteen minutes of troubleshooting access later, nobody has looked at the actual problem yet.
I've seen total access latency range from five minutes on a good day to forty-five on a bad one. During a P1, every one of those minutes is customer-facing downtime. Three months later, your compliance team asks: who accessed that database, what queries did they run, was any customer data exposed? That fast 2 a.m. fix just became a two-week audit scramble.
The compliance debt underneath
Access latency is the visible cost. The invisible one is the compliance debt from every workaround engineers use to get in fast.
Every break-glass procedure I've seen has two versions: the official one, with a ticket, an approval, and a scoped credential; and the real one, with a shared password in 1Password and a service account that "everyone knows about." When the site is down at 2 a.m., nobody files a ticket.
This puts directors in an impossible position. You're accountable for both reliability (fix it fast) and security posture (fix it safely). Your tooling forces engineers to pick one. The fallout is predictable: your SOC 2 auditor asks for evidence of time-bound production access, and you're reconstructing what happened from Slack messages and VPN logs. I've talked to directors who burn one to two full engineering weeks per quarter on this kind of retroactive audit prep.
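The evidence your auditor wants is just a structured record per grant: who, what, why, and when it expired. A minimal sketch of that record, with field names invented for illustration rather than taken from any specific PAM product:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical grant record; field names are illustrative, not from any
# specific PAM tool. The key properties: time-bound, tied to an incident,
# and queryable later.
def record_grant(engineer: str, resource: str, reason: str,
                 ttl_minutes: int = 30) -> dict:
    now = datetime.now(timezone.utc)
    return {
        "engineer": engineer,
        "resource": resource,
        "reason": reason,          # an incident ID, not free text
        "granted_at": now.isoformat(),
        "expires_at": (now + timedelta(minutes=ttl_minutes)).isoformat(),
        "queries": [],             # appended by the access proxy
    }

def who_accessed(grants: list[dict], resource: str,
                 start: str, end: str) -> list[dict]:
    """Answer the auditor's question with a filter, not Slack archaeology."""
    return [g for g in grants
            if g["resource"] == resource and start <= g["granted_at"] <= end]

grant = record_grant("alice", "prod-orders-db", "INC-4821")
```

With records like this, "who touched that database in March" is a one-line query instead of a two-week reconstruction from VPN logs.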
Why tightening controls makes it worse
The instinct is to add more gates: MFA on every request, extra approval steps, shorter credential rotation, a PAM tool in front of everything. This improves security on paper but makes access latency worse. Your 2 a.m. incident now requires waking two approvers and authenticating through three systems. The engineer finds a workaround, and you're back to shadow access, only now you also have an expensive PAM license.