20 commits
04ef816
feat: add local telemetry playground and monitoring docs
udsmicrosoft Mar 5, 2026
26e2262
Address PR review: fix PG auth, add doc front matter, demo warnings
udsmicrosoft Mar 16, 2026
ec8e741
Replace ASCII architecture diagram with Mermaid in overview.md
udsmicrosoft Mar 16, 2026
37a5951
Add prerequisite note for gateway OTEL support (documentdb#443)
udsmicrosoft Mar 16, 2026
1c682ba
Address Copilot review comments
udsmicrosoft Mar 16, 2026
ea6352f
resolve merge conflict in mkdocs.yml
udsmicrosoft Mar 19, 2026
db29135
resolve merge conflict in mkdocs.yml
udsmicrosoft Mar 23, 2026
ffb48c7
Improve telemetry playground: OTel best practices, alerting, logs pan…
udsmicrosoft Mar 25, 2026
dae61fe
Fix monitoring docs: correct controller names, add CNPG caveat, clari…
udsmicrosoft Mar 25, 2026
8d97551
Fix namespace race in deploy.sh and update README with operator prere…
udsmicrosoft Mar 25, 2026
89745bf
Fix deploy.sh secret detection for CNPG 1.28 and add exposeViaService…
udsmicrosoft Mar 25, 2026
6355d50
Fix gateway dashboard instance variable to use db_client_operations_t…
udsmicrosoft Mar 25, 2026
dfeb8fb
Fix dashboards: remap to cnpg_* metrics, remove panels for unimplemen…
udsmicrosoft Mar 25, 2026
153a76e
Fix dashboards: remap PG ops to cnpg tup_ metrics, WAL size, split do…
udsmicrosoft Mar 25, 2026
39680c3
docs: clarify gateway OTel prerequisite — base instrumentation vs ful…
udsmicrosoft Mar 26, 2026
f28dd95
Make deploy.sh self-contained: install operator from GHCR, default to…
udsmicrosoft Mar 26, 2026
eac15b2
docs: remove stale operator prerequisite, fix latency description
udsmicrosoft Mar 26, 2026
ff27b7a
Fix alert rules to use cnpg_* metrics, replace python3 with grep in v…
udsmicrosoft Mar 26, 2026
f6f23a1
Merge remote-tracking branch 'origin/main' into users/urismiley/telem…
udsmicrosoft Apr 9, 2026
b63977b
Scope playground to merged gateway metrics, fix deploy and docs
udsmicrosoft Apr 9, 2026
370 changes: 370 additions & 0 deletions docs/operator-public-documentation/preview/monitoring/metrics.md
@@ -0,0 +1,370 @@
---
title: Metrics Reference
description: Detailed reference of all metrics available when monitoring DocumentDB clusters, with PromQL examples.
tags:
- monitoring
- metrics
- prometheus
- opentelemetry
---

# Metrics Reference

This page documents the key metrics available when monitoring a DocumentDB cluster, organized by source. Each section includes the metric name, description, labels, and example PromQL queries.
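All of the PromQL examples on this page can be executed programmatically against Prometheus's standard HTTP API (`GET /api/v1/query`). A minimal sketch, assuming Prometheus is reachable at `http://localhost:9090` (the endpoint path is standard; the base URL is an assumption for illustration):

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(base_url: str, promql: str) -> dict:
    """Execute the query and return the decoded JSON response body."""
    with urllib.request.urlopen(instant_query_url(base_url, promql)) as resp:
        return json.loads(resp.read())


# Build (but don't send) a query URL for one of the examples below:
url = instant_query_url(
    "http://localhost:9090",
    'rate(container_cpu_usage_seconds_total{container="postgres"}[5m])',
)
```

The response JSON carries results under `data.result`, one entry per label combination.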

## Container Resource Metrics

These metrics are collected via the kubelet/cAdvisor interface (or the OpenTelemetry `kubeletstats` receiver). They cover CPU, memory, network, and filesystem for the **postgres** and **documentdb-gateway** containers in each DocumentDB pod.

### CPU

| Metric | Type | Description |
|--------|------|-------------|
| `container_cpu_usage_seconds_total` | Counter | Cumulative CPU time consumed in seconds |
| `container_spec_cpu_quota` | Gauge | CPU quota (microseconds per `cpu_period`) |
| `container_spec_cpu_period` | Gauge | CPU CFS scheduling period (microseconds) |

**Common labels:** `namespace`, `pod`, `container`, `node`

#### Example Query

CPU usage rate per container over 5 minutes:

```promql
rate(container_cpu_usage_seconds_total{
  container=~"postgres|documentdb-gateway",
  pod=~".*documentdb.*"
}[5m])
```

### Memory

| Metric | Type | Description |
|--------|------|-------------|
| `container_memory_working_set_bytes` | Gauge | Current working set memory (bytes) |
| `container_memory_rss` | Gauge | Resident set size (bytes) |
| `container_memory_cache` | Gauge | Page cache memory (bytes) |
| `container_spec_memory_limit_bytes` | Gauge | Memory limit (bytes) |

**Common labels:** `namespace`, `pod`, `container`, `node`

#### Example Query

Memory utilization as a percentage of limit:

```promql
(
  container_memory_working_set_bytes{
    container=~"postgres|documentdb-gateway",
    pod=~".*documentdb.*"
  }
  /
  container_spec_memory_limit_bytes{
    container=~"postgres|documentdb-gateway",
    pod=~".*documentdb.*"
  }
) * 100
```
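The same calculation can be done client-side on raw sample values. A sketch mirroring the PromQL above (the zero-limit guard is an assumption: cAdvisor reports a limit of 0 for containers with no memory limit set):

```python
def memory_utilization_pct(working_set_bytes: float, limit_bytes: float) -> float:
    """Working-set memory as a percentage of the container memory limit.

    Returns 0.0 when no limit is set (limit == 0), since a 0 limit means
    "unlimited" rather than "no headroom".
    """
    if limit_bytes <= 0:
        return 0.0
    return working_set_bytes / limit_bytes * 100


# e.g. 512 MiB in use against a 2 GiB limit is 25% utilization
pct = memory_utilization_pct(512 * 2**20, 2 * 2**30)  # 25.0
```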

### Network

| Metric | Type | Description |
|--------|------|-------------|
| `container_network_receive_bytes_total` | Counter | Bytes received |
| `container_network_transmit_bytes_total` | Counter | Bytes transmitted |

**Common labels:** `namespace`, `pod`, `interface`

#### Example Queries

Network throughput (bytes/sec) per pod:

```promql
sum by (pod) (
  rate(container_network_receive_bytes_total{pod=~".*documentdb.*"}[5m])
  + rate(container_network_transmit_bytes_total{pod=~".*documentdb.*"}[5m])
)
```

### Filesystem

| Metric | Type | Description |
|--------|------|-------------|
| `container_fs_usage_bytes` | Gauge | Filesystem usage (bytes) |
| `container_fs_reads_bytes_total` | Counter | Filesystem read bytes |
| `container_fs_writes_bytes_total` | Counter | Filesystem write bytes |

**Common labels:** `namespace`, `pod`, `container`, `device`

#### Example Queries

Disk I/O rate for the postgres container:

```promql
rate(container_fs_writes_bytes_total{
  container="postgres",
  pod=~".*documentdb.*"
}[5m])
```

## Gateway Metrics

The DocumentDB Gateway exports application-level metrics via OTLP (OpenTelemetry Protocol) push. The gateway sidecar injector automatically sets `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_RESOURCE_ATTRIBUTES` (with `service.instance.id` set to the pod name) on each gateway container, so metrics are exported without manual configuration.

Metrics are pushed to an OpenTelemetry Collector, which exposes them in Prometheus format via its `prometheus` exporter.
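The `OTEL_RESOURCE_ATTRIBUTES` variable uses the standard OpenTelemetry `key=value,key=value` format. A sketch of how such a value can be parsed (the attribute names shown are illustrative, not a guaranteed list of what the injector sets):

```python
def parse_resource_attributes(value: str) -> dict:
    """Parse an OTEL_RESOURCE_ATTRIBUTES string ("k=v,k=v") into a dict."""
    attrs = {}
    for pair in value.split(","):
        if "=" in pair:
            key, _, val = pair.partition("=")
            attrs[key.strip()] = val.strip()
    return attrs


# Hypothetical injected value; the pod name is an example, not a real cluster's
attrs = parse_resource_attributes(
    "service.name=documentdb-gateway,service.instance.id=documentdb-cluster-1"
)
```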

!!! note "Gateway metric names may change between versions"
The metrics below are emitted by the DocumentDB Gateway binary, which is versioned independently from the operator. Metric names, labels, and semantics may change between gateway releases. Always verify metric availability against the gateway version deployed in your cluster.

### Operations

| Metric | Type | Description |
|--------|------|-------------|
| `db_client_operations_total` | Counter | Total MongoDB operations processed |
| `db_client_operation_duration_seconds_total` | Counter | Cumulative operation duration (can be broken down by `db_operation_phase`) |

**Common labels:** `db_operation_name` (e.g., `Find`, `Insert`, `Update`, `Aggregate`, `Delete`), `db_namespace`, `db_system_name`, `service_instance_id` (pod name), `error_type` (set on failed operations)

**Phase labels** (on `db_client_operation_duration_seconds_total`): `db_operation_phase` — values include `pg_query`, `cursor_iteration`, `bson_serialization`, `command_parsing`. An empty phase label represents the total operation duration.

#### Example Queries

Operations per second by command type:

```promql
sum by (db_operation_name) (
  rate(db_client_operations_total[1m])
)
```

Average latency per operation (milliseconds):

```promql
sum by (db_operation_name) (
  rate(db_client_operation_duration_seconds_total{db_operation_phase=""}[1m])
) / sum by (db_operation_name) (
  rate(db_client_operations_total[1m])
) * 1000
```
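The ratio-of-rates pattern above can be checked with plain arithmetic. A sketch, taking the two pre-computed `rate()` values as inputs:

```python
def avg_latency_ms(duration_rate_seconds: float, ops_rate: float) -> float:
    """Average per-operation latency in milliseconds.

    duration_rate_seconds: rate() of db_client_operation_duration_seconds_total
    ops_rate:              rate() of db_client_operations_total
    Returns 0.0 when there is no traffic, avoiding division by zero.
    """
    if ops_rate == 0:
        return 0.0
    return duration_rate_seconds / ops_rate * 1000


# 0.45 s of cumulative operation time accrued per second, across 300 ops/s,
# works out to 1.5 ms average latency per operation.
latency = avg_latency_ms(0.45, 300.0)
```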

Error rate as a percentage:

```promql
sum(rate(db_client_operations_total{error_type!=""}[1m]))
/ sum(rate(db_client_operations_total[1m])) * 100
```

Time spent in each operation phase per second:

```promql
sum by (db_operation_phase) (
  rate(db_client_operation_duration_seconds_total{db_operation_phase!=""}[1m])
)
```

### Request/Response Size

| Metric | Type | Description |
|--------|------|-------------|
| `db_client_request_size_bytes_total` | Counter | Cumulative request payload size |
| `db_client_response_size_bytes_total` | Counter | Cumulative response payload size |

**Common labels:** `service_instance_id` (pod name)

#### Example Queries

Average request throughput (bytes/sec):

```promql
sum(rate(db_client_request_size_bytes_total[1m]))
```

## Operator Metrics (controller-runtime)

The DocumentDB operator binary exposes standard controller-runtime metrics on its metrics endpoint. These track reconciliation performance and work queue health.

### Reconciliation

| Metric | Type | Description |
|--------|------|-------------|
| `controller_runtime_reconcile_total` | Counter | Total reconciliations |
| `controller_runtime_reconcile_errors_total` | Counter | Total reconciliation errors |
| `controller_runtime_reconcile_time_seconds` | Histogram | Time spent in reconciliation |

**Common labels:** `controller` (e.g., `documentdb-controller`, `backup-controller`, `scheduled-backup-controller`, `certificate-controller`, `pv-controller`), `result` (`success`, `error`, `requeue`, `requeue_after`)

#### Example Queries

Reconciliation error rate by controller:

```promql
sum by (controller) (
  rate(controller_runtime_reconcile_errors_total[5m])
)
```

P95 reconciliation latency for the DocumentDB controller:

```promql
histogram_quantile(0.95,
  sum by (le) (
    rate(controller_runtime_reconcile_time_seconds_bucket{
      controller="documentdb-controller"
    }[5m])
  )
)
```
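`histogram_quantile` estimates the quantile by finding the bucket whose cumulative count crosses the target rank and interpolating linearly within it. A simplified sketch of that calculation (assumes `le` bounds sorted ascending, ending in `+Inf`, with an implicit lower bound of 0; this ignores some Prometheus edge cases such as negative bounds):

```python
def histogram_quantile(q: float, buckets: list) -> float:
    """Approximate Prometheus histogram_quantile over cumulative buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    the last entry being (inf, total_count).
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Quantile falls in the +Inf bucket: Prometheus returns the
                # highest finite bucket boundary.
                return lower_bound
            # Linear interpolation inside the bucket that crosses the rank
            frac = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * frac
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# p95 over buckets le={0.1s: 60, 0.5s: 90, +Inf: 100} lands in the +Inf
# bucket, so the estimate is capped at the highest finite bound, 0.5s.
p95 = histogram_quantile(0.95, [(0.1, 60), (0.5, 90), (float("inf"), 100)])
```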

Reconciliation throughput (reconciles/sec):

```promql
sum by (controller) (
  rate(controller_runtime_reconcile_total[5m])
)
```

### Work Queue

| Metric | Type | Description |
|--------|------|-------------|
| `workqueue_depth` | Gauge | Current number of items in the queue |
| `workqueue_adds_total` | Counter | Total items added |
| `workqueue_queue_duration_seconds` | Histogram | Time items spend in queue before processing |
| `workqueue_work_duration_seconds` | Histogram | Time spent processing items |
| `workqueue_retries_total` | Counter | Total retries |

**Common labels:** `name` (queue name, maps to controller name)

#### Example Queries

Work queue depth by controller:

```promql
workqueue_depth{
  name=~"documentdb-controller|backup-controller|scheduled-backup-controller|certificate-controller"
}
```

Average time items spend waiting in queue:

```promql
rate(workqueue_queue_duration_seconds_sum{name="documentdb-controller"}[5m])
/ rate(workqueue_queue_duration_seconds_count{name="documentdb-controller"}[5m])
```

## CNPG / PostgreSQL Metrics

CloudNative-PG can expose PostgreSQL-level metrics from each managed pod. Additionally, the OpenTelemetry Collector's `postgresql` receiver collects metrics directly from PostgreSQL via SQL queries.

!!! warning "CNPG monitoring must be enabled separately"
The DocumentDB operator does **not** enable CNPG's built-in Prometheus metrics endpoint by default. The `cnpg_*` metrics listed below are only available if you manually configure CNPG monitoring on the underlying Cluster resource. The `postgresql_*` metrics from the OTel `postgresql` receiver work without additional configuration.

For the full CNPG metrics list, see the [CloudNative-PG monitoring docs](https://cloudnative-pg.io/documentation/current/monitoring/).

### Replication

| Metric | Type | Description |
|--------|------|-------------|
| `cnpg_pg_replication_lag` | Gauge | Replication lag in seconds (CNPG) |
| `postgresql_replication_data_delay_bytes` | Gauge | Replication data delay in bytes (OTel PG receiver) |

#### Example Queries

Replication lag per pod:

```promql
cnpg_pg_replication_lag{pod=~".*documentdb.*"}
```

### Connections

| Metric | Type | Description |
|--------|------|-------------|
| `cnpg_pg_stat_activity_count` | Gauge | Active backend connections by state (CNPG) |
| `postgresql_backends` | Gauge | Number of backends (OTel PG receiver) |
| `postgresql_connection_max` | Gauge | Maximum connections (OTel PG receiver) |

#### Example Queries

Active connections by state:

```promql
sum by (state) (
  cnpg_pg_stat_activity_count{pod=~".*documentdb.*"}
)
```

Backend utilization:

```promql
postgresql_backends / postgresql_connection_max * 100
```

### Storage

| Metric | Type | Description |
|--------|------|-------------|
| `cnpg_pg_database_size_bytes` | Gauge | Total database size (CNPG) |
| `postgresql_db_size_bytes` | Gauge | Database size (OTel PG receiver) |
| `postgresql_wal_age_seconds` | Gauge | WAL age (OTel PG receiver) |

#### Example Queries

Database size in GiB:

```promql
postgresql_db_size_bytes / 1024 / 1024 / 1024
```

### Operations

| Metric | Type | Description |
|--------|------|-------------|
| `postgresql_commits_total` | Counter | Total committed transactions |
| `postgresql_rollbacks_total` | Counter | Total rolled-back transactions |
| `postgresql_operations_total` | Counter | Row operations (labels: `operation`) |

#### Example Queries

Transaction rate:

```promql
rate(postgresql_commits_total[1m])
```

Row operations per second by type:

```promql
sum by (operation) (rate(postgresql_operations_total[1m]))
```

### Cluster Health

| Metric | Type | Description |
|--------|------|-------------|
| `cnpg_collector_up` | Gauge | 1 if the CNPG metrics collector is running |
| `cnpg_pg_postmaster_start_time` | Gauge | PostgreSQL start timestamp |

#### Example Queries

Detect pods where the metrics collector is down:

```promql
cnpg_collector_up{pod=~".*documentdb.*"} == 0
```

## OpenTelemetry Metric Names

When using the OpenTelemetry `kubeletstats` receiver, metric names use the OpenTelemetry naming convention. These are **not identical** to cAdvisor/Prometheus metrics — they measure similar concepts but may differ in semantics (e.g., cumulative vs. gauge, different calculation methods):

| OpenTelemetry Name | Approximate Prometheus Equivalent |
|---|---|
| `k8s.container.cpu.time` | `container_cpu_usage_seconds_total` |
| `k8s.container.memory.usage` | `container_memory_working_set_bytes` |
| `k8s.container.cpu.limit` | `container_spec_cpu_quota` |
| `k8s.container.memory.limit` | `container_spec_memory_limit_bytes` |
| `k8s.pod.network.io` | `container_network_*_bytes_total` |

!!! note
The OTel Prometheus exporter converts dots to underscores, so `k8s.container.cpu.time` becomes `k8s_container_cpu_time` in Prometheus. Use the naming convention matching your collection method. The telemetry playground uses OpenTelemetry names; a direct Prometheus scrape of cAdvisor uses Prometheus-style names.
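The dot-to-underscore conversion described above can be sketched as a one-liner. Note this covers only the dot case; the exporter's full sanitization also replaces other characters that are invalid in Prometheus metric names:

```python
def otel_to_prometheus_name(name: str) -> str:
    """Convert an OpenTelemetry metric name to its Prometheus-exported form
    by replacing dots with underscores (the common case)."""
    return name.replace(".", "_")


prom_name = otel_to_prometheus_name("k8s.container.cpu.time")  # "k8s_container_cpu_time"
```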