The root cause was simple: a new column, added without a plan.
Adding a new column to a database table looks like a small change, but it can cripple uptime if executed carelessly. The steps, timing, and tooling you choose matter more than the code itself. A careless ALTER TABLE on a high-traffic system can lock the table, stall the queries queued behind it, and force a rollback.
A clean deployment begins with understanding the impact on schema and data. Check the table size. Measure write and read frequency. For large tables, consider adding the new column as nullable with no default value, which on many engines avoids rewriting every existing row while a lock is held. Then backfill the values in batches to avoid spikes in load.
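As a rough illustration of the batched backfill, here is a minimal sketch using Python's built-in sqlite3 module so it runs anywhere. The `users` table, the `display_name` column, the batch size, and the pause between batches are all hypothetical choices made for the example, not values from any particular system. A production database would need its own driver and tuning.

```python
import sqlite3
import time


def backfill_in_batches(conn, batch_size=1000, pause_seconds=0.0):
    """Fill users.display_name from users.name in small, id-ordered batches.

    Table and column names are hypothetical. Each batch is committed on its
    own so no single transaction holds locks for long.
    """
    total = 0
    last_id = 0
    while True:
        # Find the next slice of rows that still need a value.
        rows = conn.execute(
            "SELECT id FROM users "
            "WHERE id > ? AND display_name IS NULL "
            "ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        first_id, last_id = rows[0][0], rows[-1][0]
        conn.execute(
            "UPDATE users SET display_name = name "
            "WHERE id BETWEEN ? AND ? AND display_name IS NULL",
            (first_id, last_id),
        )
        conn.commit()  # short transaction per batch
        total += len(rows)
        time.sleep(pause_seconds)  # give other traffic room between batches
    return total


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO users (name) VALUES (?)",
        [(f"user{i}",) for i in range(2500)],
    )
    # Step 1: add the column as nullable with no default.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.commit()
    # Step 2: backfill in batches.
    updated = backfill_in_batches(conn, batch_size=1000)
    remaining = conn.execute(
        "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
    ).fetchone()[0]
    return updated, remaining


if __name__ == "__main__":
    print(demo())
```

Walking the primary key in order keeps each batch a bounded range scan, and the `IS NULL` guard makes the job safe to stop and rerun.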
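In systems with strict SLAs, zero-downtime techniques are essential. On MySQL, online schema change tools such as pt-online-schema-change or gh-ost build and populate a copy of the table in the background and then swap it in, so the original table stays available for reads and writes for nearly the entire change. Align your migrations with a release that also updates the application code to handle the new field gracefully, including rows where it has not been populated yet.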
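One way to read "handle the new field gracefully" is that application code should keep working whether the column is missing, NULL, or populated. The sketch below assumes rows arrive as dictionaries; the `User` class, the `display_name` field, and the fallback to `name` are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class User:
    id: int
    name: str
    # New field: optional so that old rows and old schemas still load.
    display_name: Optional[str] = None

    @property
    def label(self) -> str:
        # Fall back to the existing field until the backfill has finished.
        return self.display_name or self.name


def user_from_row(row: Mapping[str, Any]) -> User:
    """Build a User from a row that may or may not include the new column."""
    return User(
        id=row["id"],
        name=row["name"],
        display_name=row.get("display_name"),  # absent key becomes None
    )


if __name__ == "__main__":
    print(user_from_row({"id": 1, "name": "ada"}).label)
    print(user_from_row({"id": 2, "name": "bob", "display_name": "Bobby"}).label)
```

Code written this way can ship before the migration runs and keeps behaving the same during the backfill, which loosens the ordering constraint between the schema change and the release.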