The cause was simple: the database needed a new column. The fix should have taken minutes, but the migration pipeline was slow, and the schema change touched too many parts of the application.
Adding a new column is not just an ALTER TABLE statement. It is an operation that can affect performance, locking, replication, and application compatibility. In high-traffic systems, schema migrations must be designed to avoid downtime. Even a single blocking column add can stall critical requests.
Best practice is to run migrations in a way that is backward-compatible. That means you can deploy the schema change without breaking existing code, then roll forward to the new logic once the change is complete. For example, first add the new column with a default value and NULL allowed, backfill data asynchronously, and only then enforce NOT NULL constraints or remove old columns.
In Postgres, adding a column without a default is fast, as it updates only metadata. Adding a column with a default on a large table can lock the table for a long time unless you use Postgres 11+ with optimized defaults. MySQL behaves differently: depending on engine and version, even a simple new column addition can copy and rebuild the table. Plan your DDL accordingly.