Traffic was spiking, the external load balancer wasn’t routing, and downstream services were timing out. Every second felt like a hammer on the system. We pulled logs, checked health checks, and hit API endpoints manually. Nothing moved.
External load balancer incidents are brutal because they sit at the front door of your system. When that door jams, no one gets in. It’s not just downtime. It’s a full lockout. The key to surviving it is to have a tested, fast, and repeatable incident response.
The first step is detection. Automated alerts on latency, failed health checks, and 5xx rates are non‑negotiable. Layer them: watch from inside the network and from the public internet. If internal probes pass while external ones fail, the problem is at the edge, and you’ll see it before your customers do.
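As a minimal sketch of one layer of that probing, here is a Python check that hits an endpoint, measures latency, and flags connection failures, 5xx responses, or a blown latency budget. The URL, timeout, and 500 ms budget are illustrative assumptions, not recommended defaults; the idea is to run one copy from inside the network and one from an external vantage point and compare what they see.

```python
import time
import urllib.request
from urllib.error import HTTPError, URLError

# Illustrative threshold; tune to your own latency SLO.
LATENCY_BUDGET_MS = 500.0

def probe(url, timeout=3.0):
    """Hit an endpoint; return (status_code, latency_ms). status is None on connect failure."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except HTTPError as exc:
        status = exc.code  # server answered, but with an error status
    except (URLError, OSError):
        status = None      # no answer at all: DNS, connect, or timeout failure
    return status, (time.monotonic() - start) * 1000.0

def should_alert(status, latency_ms):
    """Alert on connection failure, any 5xx, or a blown latency budget."""
    if status is None or status >= 500:
        return True
    return latency_ms > LATENCY_BUDGET_MS
```

In practice you would feed `should_alert` into whatever pager or metrics pipeline you already run; the split between `probe` and `should_alert` keeps the alert policy testable without a network.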
Next comes verification. Too many teams waste time chasing phantom issues. Always confirm it’s the external load balancer and not an upstream API, origin server, or DNS resolution failure. Hit each point manually. Use cURL, dig, or browser dev tools. Know the path.
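The DNS leg of that verification can be scripted rather than typed into dig each time. A small sketch, assuming nothing beyond the standard library: resolve the public hostname yourself and compare the answers against the load balancer's known address pool. The hostname and addresses here are placeholders for your own.

```python
import socket

def resolve(hostname):
    """Return the set of addresses DNS currently gives for hostname (empty on failure)."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return set()
    return {info[4][0] for info in infos}

def dns_points_at_lb(hostname, lb_addresses):
    """True if at least one resolved address belongs to the load balancer pool."""
    return bool(resolve(hostname) & set(lb_addresses))
```

If `dns_points_at_lb` returns False, you are chasing a DNS or recent-cutover problem, not the load balancer itself; if it returns True, move on to hitting the balancer and the origin directly.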
Containment is your race against the clock. Shift traffic to a working region, change DNS records, or fail over to a backup load balancer. Cache aggressively if the app allows it. Every request served from cache is one less hit on a faulty route.
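The failover decision above boils down to walking a priority list with a health probe: serve from the first healthy endpoint and fall through to backups. This Python sketch shows only that decision logic; the endpoint names are hypothetical, and in a real system `is_healthy` would wrap an HTTP probe like the detection check, with the actual switch executed as a DNS or traffic-manager update.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the name of the first healthy endpoint in priority order, else None.

    endpoints:  list of (name, url) pairs, primary first.
    is_healthy: callable(url) -> bool, e.g. a wrapped HTTP health probe.
    """
    for name, url in endpoints:
        if is_healthy(url):
            return name
    return None
```

Keeping the priority order explicit in data, rather than buried in conditionals, makes the failover path easy to exercise in a drill: swap in a fake probe and confirm traffic would land where you expect.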