Self-monitoring dashboard

Monitor the health of the monitoring stack itself.

[Screenshot: Self-monitoring dashboard]

Screenshot note

The screenshot shows a containerized demo environment. Some host-level panels (CPU, memory, disk, network) require node_exporter or cAdvisor with Docker socket access. In production environments with proper host metrics collection, all panels display data.

Purpose​

Ensure the monitoring infrastructure is functioning correctly:

  • Metrics collection is working
  • Storage has capacity
  • No data gaps
  • Alert pipeline is healthy

When to use​

  • Regular monitoring stack health checks
  • After monitoring stack updates
  • When dashboards show "No data"
  • Capacity planning for monitoring infrastructure

Key panels​

Scrape success rate​

What it shows:

  • Percentage of successful metric scrapes
  • Per-target breakdown

Healthy state:

  • 100% success rate
  • Consistent scrape intervals

Warning signs:

  • Scrape failures — check target availability
  • Timeouts — target may be overloaded
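
If scrape failures or timeouts show up here, the failing targets can also be listed from the command line. A minimal sketch against the same query API used elsewhere on this page, relying on the standard `up` metric (1 = scrape succeeded, 0 = failed):

curl -s 'http://localhost:8428/api/v1/query?query=up==0' | jq '.data.result[].metric'

An empty result means every target is currently being scraped successfully.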

Metrics ingestion rate​

What it shows:

  • Samples ingested per second
  • Trend over time

Use for:

  • Capacity planning
  • Detecting metric explosion
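
The ingestion rate can also be pulled ad hoc. A sketch that assumes VictoriaMetrics scrapes its own /metrics endpoint, so its `vm_rows_inserted_total` counters are queryable:

# assumes VictoriaMetrics self-scrape, so vm_* metrics are available
curl -s 'http://localhost:8428/api/v1/query' --data-urlencode 'query=sum(rate(vm_rows_inserted_total[5m]))' | jq '.data.result[0].value[1]'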

Storage usage​

What it shows:

  • VictoriaMetrics disk usage
  • Projected capacity based on retention

Warning threshold:

  • Alert when > 80% capacity
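
For a quick check outside the dashboard, VictoriaMetrics' own gauges can be queried. A sketch that assumes self-scraping is enabled, so `vm_data_size_bytes` and `vm_free_disk_space_bytes` are collected:

# assumes VictoriaMetrics self-scrape, so vm_* metrics are available
curl -s 'http://localhost:8428/api/v1/query' --data-urlencode 'query=sum(vm_data_size_bytes)' | jq '.data.result[0].value[1]'
curl -s 'http://localhost:8428/api/v1/query' --data-urlencode 'query=vm_free_disk_space_bytes' | jq '.data.result[0].value[1]'

Comparing the two values gives a rough sense of how close the data volume is to the 80% threshold.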

Active time series​

What it shows:

  • Number of unique metric series
  • Growth trend

Monitoring series growth:

  • Sudden spikes may indicate cardinality explosion
  • Gradual growth expected as you add targets
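
The current value behind this panel can be queried directly as well. A sketch assuming VictoriaMetrics self-scraping; the `storage/hour_metric_ids` cache size is the gauge commonly used for active time series:

# assumes VictoriaMetrics self-scrape, so vm_* metrics are available
curl -s 'http://localhost:8428/api/v1/query' --data-urlencode 'query=vm_cache_entries{type="storage/hour_metric_ids"}' | jq '.data.result[0].value[1]'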

Query performance​

What it shows:

  • Grafana query latency
  • Slow queries

Variables​

Variable        Purpose
cluster_name    Filter by monitored cluster

Health check commands​

Check VictoriaMetrics status​

curl http://localhost:8428/api/v1/status/tsdb

Check pgwatch status​

docker compose logs pgwatch --tail=50

Check Prometheus/VM targets​

curl http://localhost:8428/api/v1/targets

Verify metrics collection​

curl 'http://localhost:8428/api/v1/query?query=up'

Common issues​

Dashboards show "No data"​

  1. Check scrape targets are up:

    curl http://localhost:8428/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
  2. Verify metric exists:

    curl 'http://localhost:8428/api/v1/label/__name__/values' | jq '.data[]' | grep pg_
  3. Check time range alignment — confirm the dashboard's time range covers a period when data was actually written (see the check below)
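
To find when the last sample was written (and compare it against the dashboard's time range), the standard PromQL `timestamp()` function returns each series' latest sample time as a Unix epoch:

    curl -s 'http://localhost:8428/api/v1/query' --data-urlencode 'query=timestamp(up)' | jq '.data.result[].value[1]'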

High storage growth​

  1. Check for cardinality explosion:

    curl 'http://localhost:8428/api/v1/status/tsdb' | jq '.data.totalSeries'
  2. Review high-cardinality metrics:

    curl 'http://localhost:8428/api/v1/status/tsdb' | jq '.data.seriesCountByMetricName | sort_by(-.value) | .[0:10]'
  3. Adjust retention if needed:

    # docker-compose.yml
    victoriametrics:
      command:
        - "-retentionPeriod=30d" # Reduce from 90d

Scrape timeouts​

  1. Increase scrape timeout:

    # prometheus.yml
    scrape_configs:
      - job_name: 'pgwatch'
        scrape_timeout: 30s
  2. Check target database performance

  3. Review pgwatch resource allocation
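
For the last two steps, a quick look at the collector's live resource usage often shows whether it is CPU- or memory-bound. A sketch using the `pgwatch` service name from this stack:

# resolve the container ID via compose, then show one snapshot of its resource usage
docker stats --no-stream $(docker compose ps -q pgwatch)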

Capacity planning​

Estimating storage needs​

Factor                 Impact
Number of databases    Linear increase
Scrape interval        Shorter = more data
Retention period       Longer = more storage
Query cardinality      High = more series

Formula:

Daily storage ≈ (series_count × samples_per_day × bytes_per_sample) / compression_ratio

Typical values:

  • Bytes per sample: ~2-4 (compressed)
  • Compression ratio: 10-15x
  • Samples per day at 60s interval: 1,440
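
Worked example with hypothetical numbers (50,000 active series at a 60-second interval, ~3 bytes per compressed sample; the compression ratio is already reflected in that figure, so it is not applied again):

50,000 series × 1,440 samples/day × ~3 bytes ≈ 216 MB/day ≈ 19 GB over a 90-day retention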

Scaling recommendations​

Databases    Recommended resources
1-5          2 CPU, 2 GiB RAM, 20 GiB disk
5-20         4 CPU, 4 GiB RAM, 100 GiB disk
20-50        8 CPU, 8 GiB RAM, 500 GiB disk