Dashboard overview

PostgresAI monitoring includes 14 pre-built Grafana dashboards designed for expert-level PostgreSQL troubleshooting.

Dashboard categories​

Triage and overview​

| # | Dashboard | Purpose |
|---|---|---|
| 01 | Node overview | High-level node health, wait events, sessions |
| 02 | Query analysis | Top-N queries by various metrics |
| 03 | Single query | Deep-dive into specific queryid |

Wait events and locks​

| # | Dashboard | Purpose |
|---|---|---|
| 04 | Wait events | Active session history (ASH-style) |
| 13 | Lock contention | Lock waits and blocking chains |

Storage and maintenance​

| # | Dashboard | Purpose |
|---|---|---|
| 05 | Backups | Backup status and WAL archiving |
| 07 | Autovacuum & xmin horizon | Autovacuum, dead tuples, bloat, and xmin-horizon root cause analysis |
| 08 | Table stats | Aggregated table metrics |
| 09 | Single table | Deep-dive into specific table |
| 10 | Index health | Index usage and bloat |
| 11 | Single index | Deep-dive into specific index |
| 12 | SLRU | SLRU cache statistics |

Replication and HA​

| # | Dashboard | Purpose |
|---|---|---|
| 06 | Replication | Replication lag and slot status |

I/O​

| # | Dashboard | Purpose |
|---|---|---|
| 14 | I/O statistics | I/O by backend type (pg_stat_io, PostgreSQL 16+) |

Stack health​

| # | Dashboard | Purpose |
|---|---|---|
| -- | Self-monitoring | Monitoring stack health |

Common variables​

Most dashboards share these filter variables:

| Variable | Purpose | Example |
|---|---|---|
| cluster_name | Cluster identifier | production, staging |
| node_name | Node within cluster | primary, replica-1 |
| db_name | Database filter | myapp, All |

Exceptions:

  • 06. Replication and Self-monitoring have no template variables at all (06 is a placeholder; self-monitoring reports on the single monitoring instance).
  • 14. I/O statistics has only cluster_name and node_name. It has no database filter because pg_stat_io is instance-level.
  • 11. Single index names its database variable datname (label "DB name") rather than db_name.
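
As an illustration of how these variables can be preselected, the sketch below builds a dashboard link using Grafana's `var-<name>=<value>` URL parameters. The base URL, the dashboard path, and the `dashboard_url` helper are hypothetical placeholders, not part of PostgresAI monitoring; the `db_var` argument exists only to cover dashboards such as 11. Single index that name the database variable differently.

```python
from urllib.parse import urlencode

# Hypothetical base URL; substitute the address of your own Grafana instance.
GRAFANA_BASE = "https://grafana.example.com"


def dashboard_url(path, cluster_name, node_name, db_name=None, db_var="db_name"):
    """Build a Grafana link that preselects the shared filter variables."""
    params = [("var-cluster_name", cluster_name), ("var-node_name", node_name)]
    # Some dashboards have no database variable, so it is optional here.
    if db_name is not None:
        params.append((f"var-{db_var}", db_name))
    return f"{GRAFANA_BASE}{path}?{urlencode(params)}"


# The dashboard path below is a made-up example, not a real dashboard UID.
url = dashboard_url("/d/example-uid/node-overview", "production", "primary", "myapp")
print(url)
```

For 14. I/O statistics, omit `db_name`; for 11. Single index, pass `db_var="datname"`.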

Incident response​

  1. Start with 01. Node overview

    • Check wait event distribution
    • Look for session count anomalies
    • Note TPS/QPS patterns
  2. Identify the bottleneck

    • High CPU wait events: check queries (02. Query analysis)
    • High IO wait events: check disk activity and queries
    • High LWLock wait events: check the specific lock type (13. Lock contention)
  3. Drill down

    • Use 02. Query analysis to find problematic queries
    • Use 03. Single query for detailed metrics on specific queryid
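
The triage steps above can be sketched as a small lookup. This is only an illustration: the wait-event keys, the `next_dashboard` helper, and the fallback to 04. Wait events for unlisted types are assumptions, and the counts would come from whatever you read off 01. Node overview.

```python
# Mapping taken from step 2 above; the key names are illustrative.
NEXT_STEP = {
    "CPU": "02. Query analysis",
    "IO": "02. Query analysis (also check disk activity)",
    "LWLock": "13. Lock contention",
}


def next_dashboard(wait_event_counts):
    """Return the dashboard to open for the dominant wait event type."""
    if not wait_event_counts:
        # Nothing observed yet: stay on the overview.
        return "01. Node overview"
    dominant = max(wait_event_counts, key=wait_event_counts.get)
    # Assumption: unlisted wait event types go to the wait events dashboard.
    return NEXT_STEP.get(dominant, "04. Wait events")


print(next_dashboard({"CPU": 12, "IO": 3, "LWLock": 41}))
```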

Routine monitoring​

| Task | Dashboard | What to look for |
|---|---|---|
| Query review | 02. Query analysis | New slow queries, regression |
| Index health | 10. Index health | Unused indexes, bloat |
| Table health | 08. Table stats | Bloat, sequential scans |
| Vacuum status | 07. Autovacuum & xmin horizon | Dead tuple accumulation, xmin-horizon blockers |
| I/O attribution | 14. I/O statistics | Reads/writes by backend type (PG16+) |

Legend options​

02. Query analysis has a Query texts variable (legend_label) that switches how query texts are rendered in legends:

| Option | Value | Shows |
|---|---|---|
| Smart truncation (default) | displayname_long | Query text with smart truncation |
| Raw texts | displayname_raw_long | Full raw query text |

Select the format using the Query texts variable at the top of the dashboard.

Top-N filtering​

Many dashboards limit each panel to the top-N series (for example, the top_n variable on 02. Query analysis offers 5, 10, 15, 20, 50, 100, 500). These panels use plain PromQL topk($top_n, ...), which keeps only the highest-ranked series and drops the long tail; it does not sum the remainder into a separate bucket. The per-relation dashboards (08. Table stats, 10. Index health) use the same topk($top_n, ...) approach.

If the objects you care about are not visible, raise top_n or drill into the corresponding single-object dashboard to see the detail.
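
To make the effect of dropping the tail concrete, here is a minimal Python sketch that mimics the keep-the-top-k behavior at a single point in time. The `topk` function, the series names, and the values are invented for illustration; this is not how the dashboards are implemented, only a model of what the panels show.

```python
def topk(k, series):
    """Keep the k highest-valued series and drop the rest, with no 'other' bucket."""
    ranked = sorted(series.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


# Invented per-query rates, keyed by a made-up query label.
calls_per_second = {"q1": 120.0, "q2": 45.0, "q3": 30.0, "q4": 2.5, "q5": 0.5}

visible = topk(3, calls_per_second)
# Whatever falls outside the top k is simply absent from the panel.
hidden_total = sum(calls_per_second.values()) - sum(visible.values())

print(visible)
print(hidden_total)
```

Here q4 and q5 never appear in the panel, and their combined rate is not shown anywhere, which is why raising top_n or switching to a single-object dashboard is the way to see them.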

Time range tips​

As of version 0.15, dashboards default to a now-1h time range (the last hour), which keeps recent patterns readable out of the box.

  • Incident investigation: keep the default now-1h to see recent patterns, and widen it as needed
  • Trend analysis: use 24h to 7d for capacity planning
  • Comparison: use the "Compare to" feature for week-over-week analysis
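
If you share links for these workflows, the time range can be carried in the URL through Grafana's `from` and `to` query parameters, which accept relative values such as now-1h. The sketch below is one way to generate that query string; the task names and the `time_range_query` helper are illustrative assumptions.

```python
from urllib.parse import urlencode

# Ranges taken from the tips above; the task names are illustrative keys.
TIME_RANGES = {
    "incident": ("now-1h", "now"),
    "trend_day": ("now-24h", "now"),
    "trend_week": ("now-7d", "now"),
}


def time_range_query(task):
    """Return the query-string fragment that selects the range for a task."""
    start, end = TIME_RANGES[task]
    return urlencode({"from": start, "to": end})


print(time_range_query("trend_week"))
```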

Next steps​