Many DDL statements in PostgreSQL support modifiers
IF EXISTS /
IF NOT EXISTS. For example:
test=# create table if not exists mytable();CREATE TABLEtest=# drop table if exists mytable;DROP TABLE
I recommend using
IF EXISTS /
IF NOT EXISTS in DDL only when necessary. Here are three examples that demonstrate how the overuse of these words may lead to negative consequences.
Let's assume we have the following to be deployed using some DB schema versioning tool (such as Flyway, Sqitch, or something embedded to ORM/framework like Ruby on Rails Active Record migrations):
create table if not exists mytable( id int8 primary key, title text not null, created_at timestamptz not null default now());
The problem here is that strictly speaking, we cannot guarantee the result – there might be an existing table with the same name and a different structure. It might be created by a previous version of the same DB schema change attempt deployed by the same tool or something that was executed by another tool or manually:
test=# -- something from the pasttest=# create table if not exists mytable( id int8 primary key);CREATE TABLEtest=#test=# -- our attempt to create a tabletest=# create table if not exists mytable( id int8 primary key, title text not null, created_at timestamptz not null default now());NOTICE: relation "mytable" already exists, skippingCREATE TABLEtest=#test=# -- what do we have?test=# \d mytable Table "public.mytable" Column | Type | Collation | Nullable | Default--------+--------+-----------+----------+--------- id | bigint | | not null |Indexes: "mytable_pkey" PRIMARY KEY, btree (id)
Why do engineers add
IF NOT EXISTS in such cases? Because they are uncertain if their change was already deployed to lower environments: dev, QA, staging, and so on.
IF NOT EXISTS suppresses errors and allows to "deploy" the change many times. However, this approach just hides the imperfection of the management of those environments – instead of improving the workflow, we now have increased risks of having various schema change anomalies and even production incidents. Further, we will discuss two more cases similar in nature.
Database Lab by Postgres.ai
An open-source experimentation platform for PostgreSQL databases. Instantly create full-size clones of your production database and use them to test your database migrations, optimize SQL, or deploy full-size staging apps.
Most of the modern schema version control tools allow having both "do" and "undo" steps (they can be called somehow else – e.g., "deploy" and "revert"). There are opinions that the "undo" steps are useless – they may seem more or less valuable depending on circumstances, deployment workflow, and DevOps culture. However, if they are used, in some cases, we can find the following:
drop table if exists mytable;
What's bad with this approach? It can hide logical issues in the schema change scripts. For instance, let's assume we develop some software to be deployed in thousands of places, and our schema change scripts – both "do" and "undo" parts of them – should be reliably executed everywhere. We rely on the "undo" steps for the cases when our clients need to revert the upgrade; it may happen because of various issues not necessarily related to the database.
We have well-developed CI/CD pipelines testing our changes in advanced form to ensure that logic is correct and detect anomalies earlier – they always get tested using the chain "do -> undo -> do" in CI/CD pipelines.
Now let's consider this:
-- "do"create table table2( id int8 primary key);
-- "undo"drop table if exists table2;drop table if exists mytable; -- someone left it here just by mistake
What's wrong here? The CI/CD pipelines will not see any problems – the chain "do -> undo -> do" will work pretty well. The attempt to drop
mytable will not fail; we'll just get a notice:
test=# drop table if exists mytable;NOTICE: table "mytable" does not exist, skippingDROP TABLE
But what if
mytable was created by another DB schema migration long ago? If the "undo" step needs to be executed somewhere, we are going to drop that table. This is not what we expected! For such time of mistakes, we do need to have an error in CI tests - bug IF EXISTS is going to "mask" the problem. As a result, automated testing is not going to catch the problem, and this wrong change has risks to be released.
For large tables under load, it is recommended to use
CREATE INDEX CONCURRENTLY – it is going to work longer that
CREATE INEDEX but it won't cause downtime.
It is not uncommon to see how DBAs try various index ideas right on the production database, trying to find a band-aid for some suboptimal query, and then, once found, they suggest developers include the index to migrations – just to "register" it. In such cases, it is tempting to use
CREATE INDEX CONCURRENTLY IF NOT EXISTS.
However, the thing is that
CREATE INDEX CONCURRENTLY is not atomic: if an attempt to deploy it fails, then the index is going to remain to be defined – in the INVALID state. Here is a simple example:
test=# create table mytable( id int8 PRIMARY KEY, title text NOT NULL, created_at timestamptz NOT NULL DEFAULT now());CREATE TABLE test=# insert into mytable(title) select random()::text from generate_series(1, 10000000) i;INSERT 0 10000000 test=# set statement_timeout to '1ms';SETtest=# create index concurrently if not exists mytable_title_idx on mytable using btree (title);ERROR: canceling statement due to statement timeouttest=# reset statement_timeout;RESET test=# \d mytable Table "public.mytable" Column | Type | Collation | Nullable | Default------------+--------------------------+-----------+----------+--------- id | bigint | | not null | title | text | | not null | created_at | timestamp with time zone | | not null | now()Indexes: "mytable_pkey" PRIMARY KEY, btree (id) "mytable_title_idx" btree (title) INVALID
In this case, another attempt to deploy
CREATE INDEX CONCURRENTLY IF NOT EXISTS is going to do nothing and report "success":
test=# create index concurrently if not exists mytable_title_idx on mytable using btree (title);NOTICE: relation "mytable_title_idx" already exists, skippingCREATE INDEX
Interestingly, in this case – actually, we talk about the creation of any index on a large table under significant load in general – the
IF EXISTS can be helpful. We can use it in the "do" step to support cleanup –
DROP INDEX CONCURRENTLY IF EXISTS, leaving
CREATE INDEX CONCURRENTLY "clean":
test=# drop index concurrently if exists mytable_title_idx;DROP INDEXtest=#test=# create index concurrently if not exists mytable_title_idx on mytable using btree (title);CREATE INDEX
This may be enough – only if we forget about DBA's actions that left behind a VALID index! (Of course, it is not a good idea to test index ideas on production - for that, thin clones provided by Database Lan Engine are better in terms of risks for production and freedom for testing.)
How to drop the index but only if it's marked as INVALID? In this case, we could implement some logic to run
DROP INDEX CONCURRENTLY only if
pg_index.indisvalid for the index is
test=# select indisvalidfrom pg_indexwhere indexrelid = 'mytable_title_idx'::regclass::oid; indisvalid------------ f(1 row)
Unfortunately, this logic cannot be scripted in the form of anonymous DO block of PL/pgSQL code because
DROP INDEX CONCURRENTLY cannot be executed inside explicitly defined transactions, functions, or anonymous DO blocks – so the logic needs to be coded externally. Running
DROP INDEX in such a block would be possible, but this is not suitable for zero-downtime changes.
If we aim to have predictable results, it is better to avoid
IF NOT EXISTS /
IF EXISTS in DB schema migrations. They need to be very rare – one of the examples was demonstrated above (
DROP INDEX CONCURRENTLY IF EXISTS).
In some other cases, such as scripts used for benchmarks, they can be really useful. For example, we can write bootstrap scripts that create DB schema used for benchmarking in an idempotent manner and use it in multiple benchmark iterations – so it does not matter if it is the first iteration of the 100th, the result will always be predictable:
drop table if exists test_table;create table test_table( ...);
Or even dropping the whole "benchmarking" schema before re-creating it:
drop schema if exists benchmark cascade;create schema ...create ...
In other cases, especially dealing with production, multiple systems, multiple people – it is better to reduce the use of
IF NOT EXISTS /
IF EXISTS to a minimum.
And what do you think? Find me on Twitter (@samokhvalov) and let me know!