The server went dark. Logs were clean. The only clue was a single line: grpc: failed with error code=UNAVAILABLE, desc=all SubConns are in TransientFailure. It looked simple. It wasn’t.
When gRPC calls fail, and especially when the endpoint starts with the grpcs:// prefix, trouble hides in the details. By convention, the grpcs scheme signals that the connection runs gRPC over HTTP/2 secured with TLS. One wrong certificate, DNS mismatch, or transport setting, and you’re staring at errors that look harmless but cut deep into production availability.
The most common cause of grpcs-related errors is a mismatch between client and server on connection security. If the hostname in the server certificate doesn’t match the name the client expects, the TLS handshake fails before any RPC reaches the server’s handlers. The client then retries the connection with backoff. Each failed attempt leaves the subchannel in a transient-failure state, and once every subchannel is in that state, calls fail with UNAVAILABLE.
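The hostname check can be reproduced offline with Go's standard library, without involving gRPC at all. The sketch below is only an illustration: it generates a throwaway self-signed certificate for a hypothetical name, api.internal.example, and asks crypto/x509 whether that certificate is valid for a few names a client might dial. The helper names selfSignedCert and hostnameMatches, and all hostnames, are assumptions made for this example.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSignedCert builds a throwaway certificate whose only SAN is dnsName.
// It exists purely for this demonstration; the name is hypothetical.
func selfSignedCert(dnsName string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: dnsName},
		DNSNames:     []string{dnsName},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
	}
	// Using tmpl as both template and parent makes the certificate self-signed.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// hostnameMatches reports whether cert is valid for the name the client dials.
// This is the same name check a TLS client applies during the handshake.
func hostnameMatches(cert *x509.Certificate, dialed string) bool {
	return cert.VerifyHostname(dialed) == nil
}

func main() {
	cert, err := selfSignedCert("api.internal.example")
	if err != nil {
		panic(err)
	}
	// Only the first name is listed in the certificate's SANs.
	for _, name := range []string{"api.internal.example", "api-lb.internal.example", "10.0.0.5"} {
		fmt.Printf("%-26s match=%v\n", name, hostnameMatches(cert, name))
	}
}
```

Dialing the load balancer name or a raw IP address fails the check even though the certificate itself is fine, which is typically how a hostname mismatch sneaks into production.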
Another root cause is a CA bundle that was never loaded properly. Many teams assume the default system pool contains everything needed for TLS. That holds for publicly trusted certificates, but it fails when your production stack uses private CAs, self-signed certs, or corporate PKI. In those cases, the grpcs client needs explicit configuration: a credentials.TransportCredentials built from the correct CA certificates.
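One way to build that trust configuration is sketched below, using only Go's standard library so it runs without the gRPC module. It assumes a grpc-go client; the file path ca.pem, the server name api.internal.example, and the helper tlsConfigWithCA are hypothetical. The resulting tls.Config is what would be wrapped with credentials.NewTLS and passed to the client through grpc.WithTransportCredentials.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// tlsConfigWithCA returns a TLS config that trusts the system roots plus
// every CA certificate found in caPEM. serverName is the name the server
// certificate is expected to carry.
func tlsConfigWithCA(caPEM []byte, serverName string) (*tls.Config, error) {
	// Start from the system pool when available, so public CAs keep working.
	pool, err := x509.SystemCertPool()
	if err != nil || pool == nil {
		pool = x509.NewCertPool()
	}
	// AppendCertsFromPEM returns false when no certificate could be parsed.
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no valid CA certificate found in PEM data")
	}
	return &tls.Config{
		RootCAs:    pool,
		ServerName: serverName,
		MinVersion: tls.VersionTLS12,
	}, nil
}

func main() {
	// Hypothetical path to the private CA bundle.
	caPEM, err := os.ReadFile("ca.pem")
	if err != nil {
		fmt.Println("could not read CA bundle:", err)
		return
	}
	cfg, err := tlsConfigWithCA(caPEM, "api.internal.example")
	if err != nil {
		fmt.Println("could not build TLS config:", err)
		return
	}
	// With grpc-go, cfg would be wrapped as credentials.NewTLS(cfg) and
	// supplied to the client via grpc.WithTransportCredentials.
	fmt.Println("TLS config ready for", cfg.ServerName)
}
```

Failing loudly when the PEM data contains no certificate is deliberate: an empty or malformed bundle otherwise surfaces much later as the same opaque UNAVAILABLE error.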