Common DB schema change mistakes

In his article "Lesser Known PostgreSQL Features", @be_haki describes 18 Postgres features many people don't know. I enjoyed that article, and it inspired me to write about "anti-features" – things that everyone should avoid when working in probably the riskiest field of application development – so-called "schema migrations".

Moreover, a strong desire to help people avoid such mistakes led me to invent the Database Lab Engine – a technology for thin cloning of databases, essential for development and testing. With it, you can clone a 10 TiB database in 10 seconds, test schema changes, and understand the risks before deployment. Most cases discussed in this article can be easily detected by such testing, and it can be done automatically in CI/CD pipelines.

As usual, I'll be focusing on OLTP use cases (mobile and web apps), for which query execution that exceeds 1 second is normally considered too slow. Some cases discussed here are hard to notice in small databases with low activity. But I'm pretty confident that you'll encounter most of them when your database grows to ~10 TiB in size and its load reaches ~10⁴–10⁵ transactions per second (of course, some cases will be seen – unless deliberately prevented – much, much earlier).

I advise you to read GitLab's great documentation – their Migration Style Guide is full of wisdom written by those who have experience in deploying numerous Postgres schema changes in a fully automated fashion to a huge number of instances, including GitLab.com itself.

I also encourage everyone to watch PGCon-2022 – one of the key Postgres conferences; this time, it's happening online again. On Thursday, May 26, I'll give two talks, and one of them is called "Common DB schema change mistakes"; you can find the slide deck here. If you miss it, no worries – @DLangille, who has organized the conference since 2006 (thank you, Dan!), promises to publish the talk videos in a few weeks.

Terminology

The term "DB migration" may be confusing; it's often used to describe the task of switching from one database system to another, moving the database while minimizing possible negative effects (such as long downtime).

In this article, I'm going to talk about another meaning of the term – DB schema changes with the following properties:

  1. Versioned and automated – each change has a unique identifier (version), and the process of applying changes is automated
  2. Immutable – once applied to production, a change cannot be modified (you can only add new changes on top)
  3. Incremental – the database schema evolves step by step

My personal score: 1000+ migrations designed/reviewed/executed. Here are the most common mistakes I've seen (or made myself).

Mistake 1: Adding a column with DEFAULT and NOT NULL

What: Adding a column with both DEFAULT and NOT NULL constraints.

Why it's bad: In PostgreSQL versions before 11, adding a column with a DEFAULT requires a full table rewrite, which can take hours or even days for large tables; the rewrite holds an ACCESS EXCLUSIVE lock, blocking both reads and writes for its whole duration.

Example:

-- Bad (pre-PG11)
ALTER TABLE users ADD COLUMN status text DEFAULT 'active' NOT NULL;

Better approach:

-- Step 1: Add the column without DEFAULT or NOT NULL (metadata-only, fast)
ALTER TABLE users ADD COLUMN status text;

-- Step 2: Set the DEFAULT; it applies only to new rows, so no rewrite
ALTER TABLE users ALTER COLUMN status SET DEFAULT 'active';

-- Step 3: Backfill existing rows (in batches on large tables)
-- UPDATE users SET status = 'active' WHERE status IS NULL;

-- Step 4: Add the NOT NULL constraint; note that this still scans the
-- whole table under an ACCESS EXCLUSIVE lock
ALTER TABLE users ALTER COLUMN status SET NOT NULL;

PostgreSQL 11+ improvement: Starting with PostgreSQL 11, adding a column with a non-volatile DEFAULT value no longer requires a table rewrite.
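
On Postgres 12+, the expensive final SET NOT NULL step can also be avoided: add a CHECK constraint as NOT VALID, validate it separately, and SET NOT NULL then becomes a metadata-only change because the validated constraint already proves there are no NULLs. A minimal sketch, reusing the users.status column from above (the constraint name is illustrative):

-- Step A: Add the constraint without checking existing rows (fast)
ALTER TABLE users
ADD CONSTRAINT users_status_not_null CHECK (status IS NOT NULL) NOT VALID;

-- Step B: Validate; this scans the table but holds only a light lock
-- (SHARE UPDATE EXCLUSIVE), so reads and writes continue
ALTER TABLE users VALIDATE CONSTRAINT users_status_not_null;

-- Step C: Now SET NOT NULL skips the full scan (PG12+), and the helper
-- CHECK constraint can be dropped
ALTER TABLE users ALTER COLUMN status SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_status_not_null;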

Mistake 2: Adding an index without CONCURRENTLY

What: Creating indexes without the CONCURRENTLY option.

Why it's bad: Standard index creation takes a SHARE lock on the table, which blocks all writes (reads still work) until the index is built.

Example:

-- Bad
CREATE INDEX idx_users_email ON users(email);

Better approach:

-- Good
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

Important notes:

  • CONCURRENTLY takes longer (it scans the table twice) but doesn't block writes
  • Cannot be used inside a transaction block
  • May fail and leave an invalid index that needs to be dropped (see the cleanup sketch below)
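
If a concurrent build does fail (for example, on a deadlock or a unique-constraint violation), the leftover index is marked INVALID: queries won't use it, but it is still maintained on every write. A quick way to spot and clean up such leftovers:

-- Find invalid indexes left behind by failed CREATE INDEX CONCURRENTLY runs
-- (note: this also lists concurrent builds currently in progress)
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Drop and retry; DROP INDEX CONCURRENTLY avoids blocking writes too
DROP INDEX CONCURRENTLY idx_users_email;
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);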

Mistake 3: Dropping a column without a multi-step process

What: Dropping a column directly in production.

Why it's bad: Can cause application errors if the application code still references the column.

Better approach:

  1. Deploy application code that doesn't use the column
  2. Wait for all old application instances to be replaced
  3. Drop the column in a separate migration (see the sketch below)
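
A minimal sketch of the final step, assuming the application no longer references the column (the column name here is hypothetical):

-- DROP COLUMN is a metadata-only change (space is reclaimed lazily),
-- but it still needs a brief ACCESS EXCLUSIVE lock, so cap the wait
SET lock_timeout = '2s';
ALTER TABLE users DROP COLUMN legacy_flag;  -- hypothetical column name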

Mistake 4: Changing column types unsafely

What: Changing a column's type in a way that requires a table rewrite or breaks compatibility with the application.

Why it's bad: Can cause downtime, data loss, or application errors.

Examples of problematic changes:

-- These require full table rewrites
ALTER TABLE users ALTER COLUMN id TYPE bigint;
ALTER TABLE users ALTER COLUMN email TYPE varchar(100);

Safer approaches:

  • For integer to bigint: use a multi-step process with a new column (see the sketch below)
  • For reducing varchar length: check the existing data first, and consider whether the change is really necessary
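
Here is a simplified sketch of the int-to-bigint conversion with a new column; the function and trigger names are illustrative, and a production version must also handle the primary key, indexes, foreign keys, and the sequence:

-- Step 1: Add the new column (fast, metadata-only)
ALTER TABLE users ADD COLUMN id_new bigint;

-- Step 2: Keep freshly written rows in sync via a trigger
-- (EXECUTE FUNCTION is PG11+ syntax; older versions use EXECUTE PROCEDURE)
CREATE FUNCTION users_id_sync() RETURNS trigger AS $$
BEGIN
  NEW.id_new := NEW.id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_id_sync_trg
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION users_id_sync();

-- Step 3: Backfill old rows in batches from a script
UPDATE users SET id_new = id
WHERE id_new IS NULL AND id <= 10000;  -- advance the range each batch

-- Step 4: In one short transaction: SET NOT NULL on id_new, rebuild the
-- primary key on it, rename id_new to id, and drop the old column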

Mistake 5: Adding foreign key constraints without validation

What: Adding foreign key constraints directly on large tables.

Why it's bad: Requires scanning the entire table to validate existing data, causing long locks.

Better approach:

-- Step 1: Add constraint as NOT VALID (fast)
ALTER TABLE orders
ADD CONSTRAINT fk_orders_user_id
FOREIGN KEY (user_id) REFERENCES users(id)
NOT VALID;

-- Step 2: Validate the constraint (can be done during a low-traffic period)
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_user_id;

Mistake 6: Not considering transaction size

What: Running migrations that modify too much data in a single transaction.

Why it's bad: Large transactions can cause:

  • Lock contention
  • Increased memory usage
  • Longer recovery times if something goes wrong
  • Replication lag

Better approach: Break large data migrations into smaller batches, as in the sketch below.
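
A sketch of one common batching pattern, reusing the users.status backfill from Mistake 1 (batch size and pacing need tuning per workload):

-- Update at most 1000 rows per transaction; repeat until it reports
-- UPDATE 0, pausing between runs so replicas and autovacuum keep up
UPDATE users
SET status = 'active'
WHERE id IN (
  SELECT id
  FROM users
  WHERE status IS NULL
  ORDER BY id
  LIMIT 1000
);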

Mistake 7: Ignoring the impact on replication

What: Not considering how migrations affect read replicas and replication lag.

Why it's bad: Large migrations can cause significant replication lag, affecting read replica performance.
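
One way to watch this while a migration runs is to measure lag on the primary (column names are for PG10+; acceptable thresholds are workload-specific):

-- Replication lag in bytes for each standby, measured on the primary
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;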

Mistake 8: Not testing migrations on production-like data

What: Testing migrations only on small development datasets.

Why it's bad: Performance characteristics can be completely different on large datasets.

Better approach: Use tools like Database Lab Engine to test on production-scale data clones.

Mistake 9: Forgetting about dependent objects

What: Not considering views, functions, triggers, or other objects that depend on the columns/tables being modified.

Why it's bad: Can cause cascade failures or require additional manual intervention.
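
There is no single command that lists every dependent object, but the catalogs can be queried; a rough sketch (based on the well-known pg_depend/pg_rewrite pattern) for finding views that reference users.status:

-- Views depend on table columns through their rewrite rules
SELECT DISTINCT dependent_view.relname AS view_name
FROM pg_depend AS d
JOIN pg_rewrite AS r ON d.objid = r.oid
JOIN pg_class AS dependent_view ON r.ev_class = dependent_view.oid
JOIN pg_attribute AS a ON d.refobjid = a.attrelid
                      AND d.refobjsubid = a.attnum
WHERE d.refobjid = 'users'::regclass
  AND a.attname = 'status';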

Mistake 10: Not having a rollback plan

What: Not planning how to reverse a migration if something goes wrong.

Why it's bad: Can lead to extended downtime when issues occur in production.

Better approach: Always have a tested rollback plan, especially for destructive operations.
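
It helps that most DDL in Postgres is transactional, so a multi-statement migration can be made all-or-nothing (CREATE/DROP INDEX CONCURRENTLY is the notable exception). A minimal sketch with a hypothetical change:

BEGIN;
SET lock_timeout = '2s';  -- fail fast instead of queueing behind other locks
ALTER TABLE users ADD COLUMN nickname text;  -- hypothetical change
-- any error before this point aborts the transaction and leaves the
-- schema untouched
COMMIT;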

Conclusion

Schema migrations are one of the riskiest aspects of application development. The key to avoiding these mistakes is:

  1. Test on production-scale data using tools like Database Lab Engine
  2. Plan multi-step approaches for risky operations
  3. Use PostgreSQL-specific features like CONCURRENTLY and NOT VALID
  4. Monitor impact on performance and replication
  5. Always have a rollback plan

Remember: it's better to take a few extra steps and deploy safely than to cause production outages with risky shortcuts.

Nikolay Samokhvalov

CEO & Founder of PostgresAI

Working on tools to balance Dev with Ops in DevOps
