
Prometheus monitoring

DBLab Engine exposes Prometheus metrics via the /metrics endpoint. These metrics can be used to monitor the health and performance of the DBLab instance.

Note: Prometheus metrics support was added in DBLab Engine 4.1.

Endpoint

GET /metrics

The endpoint is publicly accessible (no authentication required) and returns metrics in Prometheus text format.
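A scrape of this endpoint returns plain-text gauges and counters in the standard exposition format. An illustrative excerpt (metric names and descriptions are taken from the tables below; the values and the pool label value are made up):

```
# HELP dblab_instance_uptime_seconds Time in seconds since the DBLab instance started
# TYPE dblab_instance_uptime_seconds gauge
dblab_instance_uptime_seconds 86400
# HELP dblab_disk_free_bytes Free disk space in bytes
# TYPE dblab_disk_free_bytes gauge
dblab_disk_free_bytes{pool="dblab_pool"} 5.36870912e+10
```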

Available metrics

Instance metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| dblab_instance_info | Gauge | instance_id, version, edition | Information about the DBLab instance (always 1) |
| dblab_instance_uptime_seconds | Gauge | - | Time in seconds since the DBLab instance started |
| dblab_instance_status_code | Gauge | - | Status code of the DBLab instance (0=OK, 1=Warning, 2=Bad) |
| dblab_retrieval_status | Gauge | mode, status | Status of data retrieval (1=active for status) |
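Because the status code encodes instance health directly, one possible PromQL expression for surfacing a non-OK instance is simply (a sketch; severity mapping is up to you):

```promql
dblab_instance_status_code > 0
```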

Disk/pool metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| dblab_disk_total_bytes | Gauge | pool | Total disk space in bytes |
| dblab_disk_free_bytes | Gauge | pool | Free disk space in bytes |
| dblab_disk_used_bytes | Gauge | pool | Used disk space in bytes |
| dblab_disk_used_by_snapshots_bytes | Gauge | pool | Disk space used by snapshots in bytes |
| dblab_disk_used_by_clones_bytes | Gauge | pool | Disk space used by clones in bytes |
| dblab_disk_data_size_bytes | Gauge | pool | Size of the data directory in bytes |
| dblab_disk_compress_ratio | Gauge | pool | Compression ratio of the filesystem (ZFS) |
| dblab_pool_status | Gauge | pool, mode, status | Status of the pool (1=active for status) |

Clone metrics (aggregate)

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| dblab_clones_total | Gauge | - | Total number of clones |
| dblab_clones_by_status | Gauge | status | Number of clones by status |
| dblab_clone_max_age_seconds | Gauge | - | Maximum age of any clone in seconds |
| dblab_clone_total_diff_size_bytes | Gauge | - | Total extra disk space used by all clones (sum of diffs from snapshots) |
| dblab_clone_total_logical_size_bytes | Gauge | - | Total logical size of all clone data |
| dblab_clone_total_cpu_usage_percent | Gauge | - | Total CPU usage percentage across all clone containers |
| dblab_clone_avg_cpu_usage_percent | Gauge | - | Average CPU usage percentage across all clone containers with valid data |
| dblab_clone_total_memory_usage_bytes | Gauge | - | Total memory usage in bytes across all clone containers |
| dblab_clone_total_memory_limit_bytes | Gauge | - | Total memory limit in bytes across all clone containers |
| dblab_clone_protected_count | Gauge | - | Number of protected clones |
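The aggregate usage and limit metrics can be combined into a memory pressure ratio across all clone containers. A PromQL sketch, assuming memory limits are configured (i.e., the limit metric is non-zero):

```promql
100 * dblab_clone_total_memory_usage_bytes / dblab_clone_total_memory_limit_bytes
```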

Snapshot metrics (aggregate)

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| dblab_snapshots_total | Gauge | - | Total number of snapshots |
| dblab_snapshots_by_pool | Gauge | pool | Number of snapshots by pool |
| dblab_snapshot_max_age_seconds | Gauge | - | Maximum age of any snapshot in seconds |
| dblab_snapshot_total_physical_size_bytes | Gauge | - | Total physical disk space used by all snapshots |
| dblab_snapshot_total_logical_size_bytes | Gauge | - | Total logical size of all snapshot data |
| dblab_snapshot_max_data_lag_seconds | Gauge | - | Maximum data lag of any snapshot in seconds |
| dblab_snapshot_total_num_clones | Gauge | - | Total number of clones across all snapshots |

Branch metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| dblab_branches_total | Gauge | - | Total number of branches |

Dataset metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| dblab_datasets_total | Gauge | pool | Total number of datasets (slots) in the pool |
| dblab_datasets_available | Gauge | pool | Number of available (non-busy) dataset slots for reuse |
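For pools that recycle dataset slots, the fraction of slots still free for reuse can be tracked per pool with a query like this (a sketch):

```promql
dblab_datasets_available / dblab_datasets_total
```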

Sync instance metrics (physical mode)

These metrics are only available when DBLab is running in physical mode with a sync instance enabled. They track the WAL replay status of the sync instance.

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| dblab_sync_status | Gauge | status | Status of the sync instance (1=active for status code) |
| dblab_sync_wal_lag_seconds | Gauge | - | WAL replay lag in seconds for the sync instance |
| dblab_sync_uptime_seconds | Gauge | - | Uptime of the sync instance in seconds |
| dblab_sync_last_replayed_timestamp | Gauge | - | Unix timestamp of the last replayed transaction |
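Since dblab_sync_last_replayed_timestamp is a Unix timestamp, the age of the most recently replayed transaction can be derived with PromQL's time() function (a sketch):

```promql
time() - dblab_sync_last_replayed_timestamp
```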

Observability metrics

These metrics help monitor the health of the metrics collection system itself.

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| dblab_scrape_success_timestamp | Gauge | - | Unix timestamp of last successful metrics collection |
| dblab_scrape_duration_seconds | Gauge | - | Duration of last metrics collection in seconds |
| dblab_scrape_errors_total | Counter | - | Total number of errors during metrics collection |
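These lend themselves to self-monitoring queries. For example, seconds since the last successful collection, and the recent collection error rate (sketches; adjust the range window to your scrape interval):

```promql
time() - dblab_scrape_success_timestamp
```

```promql
rate(dblab_scrape_errors_total[5m])
```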

Prometheus configuration

Add the following to your prometheus.yml:

scrape_configs:
  - job_name: 'dblab'
    static_configs:
      - targets: ['<dblab-host>:<dblab-port>']
    metrics_path: /metrics

Replace <dblab-host> and <dblab-port> with your DBLab instance's host and API port (default: 2345).

Example queries

Free disk space percentage

100 * dblab_disk_free_bytes / dblab_disk_total_bytes

Number of active clones

dblab_clones_total

Maximum clone age in hours

dblab_clone_max_age_seconds / 3600

Data freshness (lag from current time, in minutes)

dblab_snapshot_max_data_lag_seconds / 60

WAL replay lag (physical mode)

dblab_sync_wal_lag_seconds

Alerting examples

Low disk space alert

- alert: DBLabLowDiskSpace
  expr: (dblab_disk_free_bytes / dblab_disk_total_bytes) * 100 < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "DBLab low disk space"
    description: "DBLab pool {{ $labels.pool }} has less than 20% free disk space"

Stale snapshot alert

- alert: DBLabStaleSnapshot
  expr: dblab_snapshot_max_data_lag_seconds > 86400
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "DBLab snapshot data is stale"
    description: "DBLab snapshot data is more than 24 hours old"

High clone count alert

- alert: DBLabHighCloneCount
  expr: dblab_clones_total > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "DBLab has many clones"
    description: "DBLab has {{ $value }} clones running"

High WAL replay lag alert (physical mode)

- alert: DBLabHighWALLag
  expr: dblab_sync_wal_lag_seconds > 3600
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "DBLab sync instance has high WAL lag"
    description: "DBLab sync instance WAL replay is {{ $value | humanizeDuration }} behind"

OpenTelemetry integration

DBLab metrics can be exported to OpenTelemetry-compatible backends using the OpenTelemetry Collector. This allows you to send metrics to Grafana Cloud, Datadog, New Relic, and other observability platforms.

Quick start

  1. Install the OpenTelemetry Collector:

    docker pull otel/opentelemetry-collector-contrib:latest

  2. Copy the example configuration from the DBLab Engine repository:

    cp engine/configs/otel-collector.example.yml otel-collector.yml

  3. Edit otel-collector.yml to configure your backend:

    exporters:
      otlp:
        endpoint: "your-otlp-endpoint:4317"
        headers:
          Authorization: "Bearer <your-token>"

  4. Run the collector (the contrib image reads its config from /etc/otelcol-contrib/config.yaml by default):

    docker run -v $(pwd)/otel-collector.yml:/etc/otelcol-contrib/config.yaml \
      -p 4317:4317 -p 8889:8889 \
      otel/opentelemetry-collector-contrib:latest

Supported backends

The OTel Collector can export to:

  • Grafana Cloud — use OTLP exporter with Grafana Cloud endpoint
  • Datadog — use the datadog exporter
  • New Relic — use OTLP exporter with New Relic endpoint
  • Prometheus Remote Write — use prometheusremotewrite exporter
  • AWS CloudWatch — use awsemf exporter
  • Any OTLP-compatible backend
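As one concrete sketch, a Prometheus Remote Write exporter block in the collector config might look like the following. The endpoint URL and credentials are placeholders, not real values, and this fragment still needs to be referenced from a metrics pipeline under service.pipelines:

```yaml
exporters:
  prometheusremotewrite:
    endpoint: "https://your-remote-write-endpoint/api/v1/write"
    headers:
      Authorization: "Basic <base64-credentials>"
```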