Kerberos exists to verify identity over an untrusted network. It relies on a Key Distribution Center (KDC) that issues time-limited tickets to clients, which present them to services. These tickets prove who you are without sending passwords over the wire. Kerberos SRE work is about keeping that chain intact: monitoring, scaling, and securing it under production load.
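A Kerberos SRE must keep clocks tightly synchronized. Skew kills authentication: Kerberos rejects requests whose timestamps fall outside the permitted clock skew, which is five minutes by default in common implementations. Systems drift; tickets appear expired or not yet valid; failures spread as more hosts cross the limit. Tight NTP configuration and regular skew auditing are baseline practice.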
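A skew audit can be sketched in a few lines. The example below is a minimal illustration, not a production check: it assumes a 300-second tolerance (a common default for the `clockskew` setting) and alerts at half of that, so drift is caught before authentication starts failing. The names `skew_seconds`, `within_tolerance`, and the `margin` parameter are hypothetical, and how the reference time is obtained is left out.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 300 seconds mirrors a common default clock skew tolerance.
# Adjust to match the realm's actual configuration.
MAX_SKEW = timedelta(seconds=300)


def skew_seconds(local: datetime, reference: datetime) -> float:
    """Return the signed offset of `local` from `reference`, in seconds."""
    return (local - reference).total_seconds()


def within_tolerance(local: datetime, reference: datetime,
                     max_skew: timedelta = MAX_SKEW,
                     margin: float = 0.5) -> bool:
    """Return True if the absolute skew is at most `margin` times `max_skew`.

    The margin (hypothetical, 0.5 here) makes the check fail well before
    the hard limit is reached.
    """
    return abs(skew_seconds(local, reference)) <= max_skew.total_seconds() * margin


# Usage: a host running 200 seconds fast is still under the 300-second
# limit but already over the 150-second alert threshold.
reference = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
drifted = reference + timedelta(seconds=200)
needs_attention = not within_tolerance(drifted, reference)
```

Alerting at a fraction of the limit is a design choice: it turns a hard authentication failure into an early warning.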
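Ticket granting is the next fault line. The Ticket Granting Ticket (TGT) lifetime is limited by design, so clients request and renew tickets continually. Observability here means tracking issuance rates, failure rates, and unusual patterns in real time. A sudden spike in requests or failures can signal an outage, such as a misconfigured client or an unreachable KDC, or an attack, such as password guessing.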
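One way to track issuance and failure rates is a sliding time window over ticket events. The sketch below is illustrative only: the `TicketEventWindow` class and its methods are hypothetical, it assumes events arrive with non-decreasing timestamps, and it does not show how events are extracted from KDC logs.

```python
from collections import deque


class TicketEventWindow:
    """Sliding window over ticket-issuance events (hypothetical helper)."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, timestamp: float, succeeded: bool) -> None:
        """Add one event and drop anything that has left the window."""
        self.events.append((timestamp, succeeded))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Remove events at or before the start of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            self.events.popleft()

    def failure_rate(self, now: float) -> float:
        """Fraction of events in the window that failed (0.0 if empty)."""
        self._evict(now)
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)

    def requests_per_second(self, now: float) -> float:
        """Average request rate over the window."""
        self._evict(now)
        return len(self.events) / self.window_seconds


# Usage: two of four requests fail within the first 30 seconds.
window = TicketEventWindow(window_seconds=60.0)
for ts, ok in [(0.0, True), (10.0, False), (20.0, True), (30.0, False)]:
    window.record(ts, ok)
rate_now = window.failure_rate(30.0)
```

An alert would compare these values against a baseline; the threshold depends on the environment and is not something this sketch can supply.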
Scaling Kerberos under high request volume requires careful tuning of KDC performance. That means tuning each KDC host, adding replicas to spread the load, and keeping the key database consistent across them. Any KDC pool running behind load balancers must preserve session affinity wherever an exchange depends on it, or risk breaking authentication flows.
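Consistency across replicas can be monitored by comparing how recently each KDC applied a database update. The sketch below assumes such a timestamp is available for every KDC; the `KdcStatus` type, its `last_update` field, the host names, and the 300-second lag threshold are all hypothetical, and collecting the timestamps depends on the Kerberos implementation in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KdcStatus:
    name: str
    # Hypothetical field: Unix timestamp of the most recent database
    # update this KDC has applied.
    last_update: float


def stale_replicas(primary: KdcStatus, replicas: List[KdcStatus],
                   max_lag_seconds: float = 300.0) -> List[str]:
    """Return names of replicas lagging the primary by more than the limit."""
    return [
        replica.name
        for replica in replicas
        if primary.last_update - replica.last_update > max_lag_seconds
    ]


# Usage: kdc-b is 400 seconds behind the primary and gets flagged.
primary = KdcStatus("kdc-primary", last_update=1000.0)
replicas = [
    KdcStatus("kdc-a", last_update=990.0),
    KdcStatus("kdc-b", last_update=600.0),
]
lagging = stale_replicas(primary, replicas)
```

A lagging replica may still hold old keys for principals whose passwords changed recently, which is why lag is worth alerting on before clients see intermittent failures.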