You sit up, eyes burning, laptop already waking. The Slack channel is lit. The build was fine. The deploy was clean. But now your service is dead in production, and the error logs are screaming the same line over and over:
rpc error: code = Unavailable
When a gRPC error hits during an on-call shift, it’s not the time to dig through outdated docs or half-remembered blog posts. You need answers now. But the truth is that most gRPC errors look the same on the surface and hide a minefield of causes underneath.
How a Simple Unavailable Becomes a Night Killer
gRPC errors can stem from network failures, DNS misfires, connection pooling issues, deadline mismatches, or load balancer quirks. The Unavailable code is especially brutal because it’s a catch-all for “something went wrong in the transport layer.” It’s non-specific. It’s a moving target.
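Because the code itself carries so little information, one thing you can do ahead of time is be explicit about which status codes your client treats as transient. A minimal sketch in plain Python (the numeric values come from the gRPC spec; the retryability table is a common convention, not an official API, and which codes you treat as transient is ultimately a policy decision):

```python
# gRPC status codes (numeric values are fixed by the gRPC specification).
# Only a subset shown -- the ones you are most likely to meet at 3 a.m.
GRPC_CODES = {
    0: "OK",
    1: "CANCELLED",
    2: "UNKNOWN",
    3: "INVALID_ARGUMENT",
    4: "DEADLINE_EXCEEDED",
    5: "NOT_FOUND",
    8: "RESOURCE_EXHAUSTED",
    13: "INTERNAL",
    14: "UNAVAILABLE",
}

# Codes usually treated as transient (safe-ish to retry); everything else
# points at a bug or a config problem and needs a human. Your policy may differ.
TRANSIENT = {"UNAVAILABLE", "DEADLINE_EXCEEDED", "RESOURCE_EXHAUSTED"}

def is_transient(code: int) -> bool:
    """Return True if this status code is typically a transient transport failure."""
    return GRPC_CODES.get(code, "UNKNOWN") in TRANSIENT

print(is_transient(14))  # UNAVAILABLE -> True
print(is_transient(3))   # INVALID_ARGUMENT -> False
```

Writing this table down before the incident means the on-call engineer isn't deciding retry policy under pressure.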
During on-call, the challenge isn’t just fixing the current outage—it’s knowing where to start. Was it a bad deploy creating a memory leak that killed connections? Is your backend actually rejecting calls? Is TLS failing silently? Was there a sudden spike in client retries hammering the service into collapse? You can waste hours chasing the wrong lead.
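That last failure mode, clients retrying in lockstep and hammering a struggling backend into full collapse, is also the easiest to prevent up front. A sketch of capped exponential backoff with full jitter, in plain Python (parameter names and defaults are illustrative):

```python
import random

def backoff_schedule(attempts: int, base: float = 0.1, cap: float = 5.0,
                     seed: int = 0) -> list:
    """Compute retry delays: delay_n = uniform(0, min(cap, base * 2**n)).

    Capping both the per-attempt delay and the total number of attempts
    means a failing backend sees bounded, de-synchronized load instead of
    a synchronized retry storm.
    """
    rng = random.Random(seed)  # seeded only so this sketch is reproducible
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

for n, delay in enumerate(backoff_schedule(5)):
    print(f"attempt {n}: sleep {delay:.3f}s")
```

The jitter is the important part: without it, every client that failed at the same instant retries at the same instant, and the spike repeats on schedule.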
Why Most Logging Won’t Save You
If you rely on raw server logs, you’ll see the error, but not the root-cause pattern behind it. By the time you gather enough context, the incident has escalated, the SLA is dust, and the postmortem will be a confession: “We didn’t know fast enough.”
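Raw logs tell you that something failed; structured logs tell you where to look. A sketch of what one enriched error entry might carry, assuming a JSON log pipeline (the field names here are illustrative, not a standard):

```python
import json
import datetime

def log_rpc_error(method: str, code: str, peer: str,
                  deadline_ms: int, attempt: int) -> str:
    """Emit one JSON log line with enough context to spot a pattern
    across thousands of errors: which method, which backend, which retry."""
    entry = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "severity": "ERROR",
        "grpc.method": method,
        "grpc.code": code,
        "peer.address": peer,
        "deadline_ms": deadline_ms,
        "retry.attempt": attempt,
    }
    return json.dumps(entry)

# Aggregating on peer.address vs. retry.attempt quickly separates
# "one bad backend" from "every client is in a retry storm".
print(log_rpc_error("/orders.OrderService/Get", "UNAVAILABLE",
                    "10.0.3.17:443", 250, 3))
```

With fields like these, a single group-by query during the incident replaces an hour of grepping.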