Skip to main content
View rawEdit

Database- and cluster-level metrics

PostgreSQL system-level metrics. In this stack they are exported by the db_stats, bgwriter, archive_lag / pg_archiver, pg_stat_replication, and settings metric groups, with the pgwatch_ prefix and the source column name (no _total/_seconds suffix convention).

Data sources​

Metric groupUnderlying viewDescription
db_statspg_stat_databaseDatabase-level aggregates (pgwatch_db_stats_*)
bgwriterpg_stat_bgwriterBackground writer statistics (pgwatch_bgwriter_*)
archive_lag / pg_archiverpg_stat_archiverWAL archiver status
pg_stat_replicationpg_stat_replicationReplication status
settingspg_settingsConfiguration parameters (pgwatch_settings_*)

Database metrics​

Transaction metrics​

MetricTypeDescription
pgwatch_db_stats_xact_commitGaugeTransactions committed
pgwatch_db_stats_xact_rollbackGaugeTransactions rolled back
pgwatch_db_stats_deadlocksGaugeDeadlocks detected
pgwatch_db_stats_conflictsGaugeRecovery conflicts (replicas)

Connection metrics​

MetricTypeDescription
pgwatch_db_stats_numbackendsGaugeActive connections
pgwatch_settings_numeric_value{setting_name="max_connections"}GaugeMaximum allowed connections (from the settings metric; there is no pgwatch_settings_max_connections series)

Buffer metrics​

MetricTypeDescription
pgwatch_db_stats_blks_readGaugeBlocks read from disk
pgwatch_db_stats_blks_hitGaugeBlocks found in buffer cache

Temporary file metrics​

MetricTypeDescription
pgwatch_db_stats_temp_filesGaugeTemporary files created
pgwatch_db_stats_temp_bytesGaugeTemporary file bytes written

Background writer metrics​

MetricTypeDescription
pgwatch_bgwriter_checkpoints_timedCounterScheduled checkpoints
pgwatch_bgwriter_checkpoints_reqCounterRequested checkpoints
pgwatch_bgwriter_checkpoint_write_timeCounterCheckpoint write time (ms)
pgwatch_bgwriter_checkpoint_sync_timeCounterCheckpoint sync time (ms)
pgwatch_bgwriter_buffers_checkpointCounterBuffers written during checkpoints
pgwatch_bgwriter_buffers_cleanCounterBuffers written by background writer
pgwatch_bgwriter_buffers_backendCounterBuffers written by backends
pgwatch_bgwriter_buffers_allocCounterBuffers allocated

WAL archiver metrics​

MetricTypeDescription
pgwatch_archive_lag_archived_countGaugeWAL files archived
pgwatch_archive_lag_failed_countGaugeFailed archive attempts
pgwatch_archive_lag_seconds_since_archiveGaugeSeconds since last successful archive
pgwatch_archive_lag_wal_files_behindGaugeWAL files behind the latest segment

WAL directory size​

New in 0.15, the collector reports the total size of the pg_wal directory using pg_ls_waldir(). This is the on-disk WAL size, which complements the archiver and replication metrics for diagnosing disk-fill risk.

MetricTypeDescription
pgwatch_pg_wal_size_bytesGaugeTotal size of regular files in pg_wal (excludes subdirectories like pg_wal/archive_status)
pgwatch_pg_wal_size_status_codeGaugeCollection status: 0 = success, 1 = pg_ls_waldir() unavailable, 2 = monitoring role lacks EXECUTE privilege

pg_wal growth that is not matched by archive or replica progress is a disk-fill warning — typically a stuck WAL archiver, an inactive replication slot retaining WAL, or sustained high WAL generation. Correlate pgwatch_pg_wal_size_bytes with the pgwatch_archive_lag_* and replication-slot metrics, and see How to troubleshoot a growing pg_wal directory.

# Alert when pg_wal exceeds, e.g., 20 GiB while archiving is failing
pgwatch_pg_wal_size_bytes > 20 * 1024 * 1024 * 1024

Replication metrics​

LSN positions come from the pg_stat_replication group; lag is reported in replication_*_lag_ms / _lag_b fields. There is no pg_replication_lag_seconds series.

MetricTypeDescription
pgwatch_pg_stat_replication_sent_lsnGaugeWAL sent to replica
pgwatch_pg_stat_replication_write_lsnGaugeWAL written on replica
pgwatch_pg_stat_replication_flush_lsnGaugeWAL flushed on replica
pgwatch_pg_stat_replication_replay_lsnGaugeWAL replayed on replica

Labels​

The cluster label is cluster (not cluster_name).

LabelDescriptionExample
datnameDatabase namemyapp
clusterCluster identifier (from custom_tags.cluster)production
node_nameNode identifierprimary

Common queries​

Transactions per second​

sum(rate(pgwatch_db_stats_xact_commit[5m]))

Rollback ratio​

rate(pgwatch_db_stats_xact_rollback[5m])
/
(rate(pgwatch_db_stats_xact_commit[5m]) + rate(pgwatch_db_stats_xact_rollback[5m]))

Buffer cache hit ratio​

sum(rate(pgwatch_db_stats_blks_hit[5m]))
/
(sum(rate(pgwatch_db_stats_blks_hit[5m])) + sum(rate(pgwatch_db_stats_blks_read[5m])))

Connection utilization​

sum(pgwatch_db_stats_numbackends)
/
scalar(max(pgwatch_settings_numeric_value{setting_name="max_connections"}))

Checkpoint frequency​

rate(pgwatch_bgwriter_checkpoints_timed[5m])
+
rate(pgwatch_bgwriter_checkpoints_req[5m])

Backend buffer writes (problematic)​

High values indicate checkpoint tuning needed:

rate(pgwatch_bgwriter_buffers_backend[5m])
/
(
rate(pgwatch_bgwriter_buffers_checkpoint[5m])
+
rate(pgwatch_bgwriter_buffers_clean[5m])
+
rate(pgwatch_bgwriter_buffers_backend[5m])
)

WAL archive status​

pgwatch_archive_lag_seconds_since_archive

Dashboard usage​

These metrics are used in:

Interpreting system metrics​

Buffer cache hit ratio​

Target: > 99% for frequently accessed data

select
datname,
blks_hit,
blks_read,
round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) as hit_ratio
from pg_stat_database
where datname not like 'template%';

Low hit ratio causes:

  • shared_buffers too small
  • Working set larger than memory
  • Cold cache after restart

Checkpoint behavior​

Healthy checkpoint pattern:

  • Mostly timed checkpoints (scheduled)
  • Low requested checkpoints (forced by WAL volume)
  • Minimal backend buffer writes
select
checkpoints_timed,
checkpoints_req,
buffers_checkpoint,
buffers_clean,
buffers_backend
from pg_stat_bgwriter;

Temporary files​

High temporary file usage indicates:

  • work_mem too low
  • Complex queries with large sorts/joins
  • Hash joins spilling to disk
select
datname,
temp_files,
pg_size_pretty(temp_bytes) as temp_size
from pg_stat_database
where temp_files > 0;

Troubleshooting​

Metrics show zero after restart​

Background writer and database statistics reset on restart. This is expected.

Replication lag metrics missing​

Verify:

  1. Replication is configured and active
  2. Monitoring connects to primary (not replica)
  3. pg_stat_replication is accessible:
    select * from pg_stat_replication;

Connection count doesn't match application​

numbackends includes:

  • Application connections
  • Monitoring connections
  • Superuser reserved connections
  • Replication connections

Check pg_stat_activity for breakdown:

select backend_type, count(*)
from pg_stat_activity
group by backend_type;