The migration broke at 2:14 a.m. because a single table needed a new column.
Adding a new column should be simple. In SQL, it can be. But in production systems with terabytes of data and strict uptime requirements, the smallest change can trigger cascading failures. The key is to treat schema changes as part of the release cycle, not as an afterthought.
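A new column can be added with ALTER TABLE, but execution time, table locking, and replication lag make timing critical. On MySQL, ALTER TABLE ... ADD COLUMN may lock the table for the full duration of the operation unless the storage engine supports online DDL, for example InnoDB with ALGORITHM=INPLACE (or ALGORITHM=INSTANT in MySQL 8.0). PostgreSQL adds a nullable column without a default as a metadata-only change, and since version 11 the same holds for constant defaults. Volatile defaults still force a full table rewrite, and adding a NOT NULL or CHECK constraint to an existing column scans the whole table while holding a lock.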
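One way to make the locking behaviour explicit is to generate the DDL with the online-DDL clauses spelled out, so MySQL rejects the statement instead of quietly falling back to a blocking table copy. The following is a minimal sketch, not a definitive implementation: the helper `add_column_sql`, the table `orders`, and the column `shipped_at` are hypothetical names chosen for illustration, and the column type is passed through unchecked on the assumption that it comes from trusted migration code.

```python
import re

# Conservative identifier check; assumes plain unquoted names (illustrative only).
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def add_column_sql(table, column, col_type, dialect):
    """Build an ADD COLUMN statement for a nullable column with no default."""
    for ident in (table, column):
        if not _IDENT.match(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")

    # Nullable and without a default: the cheapest form on both engines.
    base = f"ALTER TABLE {table} ADD COLUMN {column} {col_type} NULL"

    if dialect == "mysql":
        # Requesting the algorithm and lock level explicitly makes MySQL raise
        # an error if the change cannot run in place without blocking writes.
        return base + ", ALGORITHM=INPLACE, LOCK=NONE"
    if dialect == "postgresql":
        # A nullable column without a default is a metadata-only change.
        return base
    raise ValueError(f"unsupported dialect: {dialect!r}")


if __name__ == "__main__":
    print(add_column_sql("orders", "shipped_at", "DATETIME", "mysql"))
    print(add_column_sql("orders", "shipped_at", "timestamptz", "postgresql"))
```

Failing fast here is a deliberate choice: an error during a planned migration is usually easier to handle than an unexpected lock on a busy table.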
Plan for backwards-compatible changes first. Deploy the new column without defaults or non-null constraints. Then backfill the data in small batches, verifying replication health between steps. When all rows are populated, add constraints in a separate migration. This staged approach reduces downtime and rollback complexity.
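The staged approach can be sketched end to end with Python's built-in sqlite3 module standing in for a production database. Everything named here is an assumption for illustration: the `users` table, the `display_name` column, the batch size, and the rule that the new column is filled from `name`. A real deployment would also check replication lag between batches, which an in-memory SQLite database cannot demonstrate.

```python
import sqlite3


def add_nullable_column(conn):
    """Stage 1: add the column with no default and no NOT NULL constraint."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.commit()


def backfill_in_batches(conn, batch_size=1000):
    """Stage 2: populate display_name in small batches; returns rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE users
               SET display_name = COALESCE(name, '')
             WHERE id IN (SELECT id
                            FROM users
                           WHERE display_name IS NULL
                           ORDER BY id
                           LIMIT ?)
            """,
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            break
        total += cur.rowcount
        # A replication-health check would go here before the next batch.
    return total


def count_unpopulated(conn):
    """Stage 3 precondition: constraints are safe to add only when this is 0."""
    row = conn.execute(
        "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (name) VALUES (?)",
        [(f"user{i}",) for i in range(2500)],
    )
    conn.commit()

    add_nullable_column(conn)
    print("rows backfilled:", backfill_in_batches(conn, batch_size=1000))
    print("rows still NULL:", count_unpopulated(conn))
```

COALESCE guarantees that every updated row leaves the NULL set, so the loop always terminates. The constraint itself is left to a separate migration, as described above, and should run only once the unpopulated count reaches zero.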