The migration broke in the dark hours before deployment. Logs showed nothing unusual. Tests still passed. The failure came from a new column that everyone assumed was already in the production table. It was not.
Adding a new column sounds simple. But in systems with live traffic, it can trigger long locks, downtime, or silent data corruption. The key is to design schema changes that are atomic, verifiable, and reversible.
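Start by creating the new column in a way that avoids blocking writes. Many relational databases can add a nullable column as a metadata-only change, without rewriting the table. Columns with default values are less uniform: PostgreSQL 11 and later skip the rewrite for constant defaults, and MySQL 8.0.12 and later can add a column to an InnoDB table with ALGORITHM=INSTANT, but older versions may rewrite the whole table while holding a lock. Confirm the behavior of your database version, then use it to stage the change with zero downtime.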
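As a rough illustration of staging the column safely, here is a minimal sketch that adds a nullable column only if it is not already there, so the step can be rerun without failing. It uses Python's built-in sqlite3 module as a stand-in for a production database. The helper name `add_column_if_missing` and the `users` and `email_lower` names in the usage note are assumptions made for the example, not part of any real schema.

```python
import sqlite3


def add_column_if_missing(conn, table, column, ddl_type):
    """Add a nullable column unless it already exists. Returns True if added.

    Hypothetical helper. Table and column names are interpolated into SQL,
    so they must come from trusted migration code, never from user input.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False  # Already staged: rerunning the migration is a no-op.
    # No NOT NULL and no default, so existing rows simply read as NULL.
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")
    conn.commit()
    return True
```

Calling `add_column_if_missing(conn, "users", "email_lower", "TEXT")` twice adds the column once and then does nothing. The existence check also doubles as the verification that was missing in the opening story: the deploy can assert the column is present rather than assume it.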
Next, backfill the column in controlled batches. Run the backfill as an idempotent task so you can retry without side effects. Monitor query performance and replication lag during this step to avoid cascading slowdowns.
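The batched, idempotent backfill might look something like the sketch below. It again uses sqlite3 as a stand-in, and the `users` table, the `email` and `email_lower` columns, and the function name `backfill_in_batches` are all hypothetical. The idea is that each batch only touches rows whose new column is still NULL, so a retry after a crash picks up where the last committed batch stopped.

```python
import sqlite3


def backfill_in_batches(conn, batch_size=1000):
    """Fill users.email_lower from users.email. Returns total rows updated.

    Assumes a hypothetical users(id, email, email_lower) table.
    """
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE users
               SET email_lower = lower(email)
             WHERE id IN (
                   SELECT id FROM users
                    WHERE email_lower IS NULL
                      AND email IS NOT NULL  -- skip rows that would stay NULL
                    ORDER BY id
                    LIMIT ?)
            """,
            (batch_size,),
        )
        # Commit per batch so each transaction, and any lock it holds, stays short.
        conn.commit()
        if cur.rowcount == 0:
            break  # Nothing left to fill; a second run exits here immediately.
        total += cur.rowcount
    return total
```

A production job would likely also pause between batches and check replication lag before continuing. That part is left out here because it depends on the database and the monitoring in use.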