Commit 3d537d1

Update contrib/mixin alerting thresholds
With the help of @dgoodwin we were able to identify better thresholds based on our fleet telemetry.
Along with that, I wanted to contribute some minor improvements we have made to our alerts over the years.

Here's a summary by Claude:

Alert Severity Changes
- etcdMembersDown: Increased severity from warning to critical (alerts/alerts.libsonnet:10)

Improved Alert Descriptions
- etcdInsufficientMembers: Enhanced description with detailed troubleshooting guidance about control plane nodes, network connectivity, and the impact on Kubernetes APIs (alerts/alerts.libsonnet:20-21)

Alert Query Improvements
- etcdHighNumberOfLeaderChanges: Rewrote query to use changes(etcd_server_is_leader) instead of increase(etcd_server_leader_changes_seen_total), changed time window from 15m to 10m (alerts/alerts.libsonnet:30)

More Aggressive Disk Performance Thresholds
- etcdHighFsyncDurations (warning): Lowered threshold from 0.5s to 0.05s (alerts/alerts.libsonnet:47)
- etcdHighFsyncDurations (critical): Lowered threshold from 1s to 0.07s (alerts/alerts.libsonnet:56)
- etcdHighCommitDurations (warning): Lowered threshold from 0.25s to 0.08s (alerts/alerts.libsonnet:65)
- etcdHighCommitDurations (critical): Added new critical alert at 0.1s threshold (alerts/alerts.libsonnet:74-87)

Database Quota Alerts
- etcdDatabaseQuotaLowSpace: Added tiered alerts at 65% (info), 75% (warning), and lowered critical from 95% to 85% (alerts/alerts.libsonnet:89-121)

Signed-off-by: Thomas Jungblut <[email protected]>
1 parent 1fa64e5 commit 3d537d1

2 files changed, 65 additions and 36 deletions

contrib/mixin/alerts/alerts.libsonnet

Lines changed: 52 additions & 9 deletions
@@ -18,7 +18,7 @@
    ||| % { etcd_instance_labels: $._config.etcd_instance_labels, etcd_selector: $._config.etcd_selector, network_failure_range: $._config.scrape_interval_seconds * 4 },
    'for': '20m',
    labels: {
-     severity: 'warning',
+     severity: 'critical',
    },
    annotations: {
      description: 'etcd cluster "{{ $labels.%s }}": members are down ({{ $value }}).' % $._config.clusterLabel,
@@ -35,8 +35,8 @@
      severity: 'critical',
    },
    annotations: {
-     description: 'etcd cluster "{{ $labels.%s }}": insufficient members ({{ $value }}).' % $._config.clusterLabel,
-     summary: 'etcd cluster has insufficient number of members.',
+     description: 'etcd cluster "{{ $labels.%s }}": is reporting fewer instances are available than are needed ({{ $value }}). When etcd does not have a majority of instances available the Kubernetes APIs will reject read and write requests and operations that preserve the health of workloads cannot be performed. This can occur when multiple control plane nodes are powered off or are unable to connect to each other via the network. Check that all control plane nodes are powered on and that network connections between each machine are functional.' % $._config.clusterLabel,
+     summary: 'etcd is reporting that a majority of instances are unavailable.',
    },
  },
  {
@@ -56,14 +56,14 @@
  {
    alert: 'etcdHighNumberOfLeaderChanges',
    expr: |||
-     increase((max without (%(etcd_instance_labels)s) (etcd_server_leader_changes_seen_total{%(etcd_selector)s}) or 0*absent(etcd_server_leader_changes_seen_total{%(etcd_selector)s}))[15m:1m]) >= 4
+     avg by (job) (changes(etcd_server_is_leader{%(etcd_selector)s}[10m])) > 5
    ||| % $._config,
    'for': '5m',
    labels: {
      severity: 'warning',
    },
    annotations: {
-     description: 'etcd cluster "{{ $labels.%s }}": {{ $value }} leader changes within the last 15 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated.' % $._config.clusterLabel,
+     description: 'etcd cluster "{{ $labels.%s }}": {{ $value }} leader changes within the last 10 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated.' % $._config.clusterLabel,
      summary: 'etcd cluster has high number of leader changes.',
    },
  },
@@ -149,7 +149,7 @@
    alert: 'etcdHighFsyncDurations',
    expr: |||
      histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{%(etcd_selector)s}[5m]))
-     > 0.5
+     > 0.05
    ||| % $._config,
    'for': '10m',
    labels: {
@@ -164,7 +164,7 @@
    alert: 'etcdHighFsyncDurations',
    expr: |||
      histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{%(etcd_selector)s}[5m]))
-     > 1
+     > 0.07
    ||| % $._config,
    'for': '10m',
    labels: {
@@ -179,7 +179,7 @@
    alert: 'etcdHighCommitDurations',
    expr: |||
      histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{%(etcd_selector)s}[5m]))
-     > 0.25
+     > 0.08
    ||| % $._config,
    'for': '10m',
    labels: {
@@ -190,10 +190,53 @@
      summary: 'etcd cluster 99th percentile commit durations are too high.',
    },
  },
+ {
+   alert: 'etcdHighCommitDurations',
+   expr: |||
+     histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{%(etcd_selector)s}[5m]))
+     > 0.1
+   ||| % $._config,
+   'for': '10m',
+   labels: {
+     severity: 'critical',
+   },
+   annotations: {
+     description: 'etcd cluster "{{ $labels.%s }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.' % $._config.clusterLabel,
+     summary: 'etcd cluster 99th percentile commit durations are too high.',
+   },
+ },
+ {
+   alert: 'etcdDatabaseQuotaLowSpace',
+   expr: |||
+     (last_over_time(etcd_mvcc_db_total_size_in_bytes{%(etcd_selector)s}[5m]) / last_over_time(etcd_server_quota_backend_bytes{%(etcd_selector)s}[5m]))*100 > 65
+   ||| % $._config,
+   'for': '10m',
+   labels: {
+     severity: 'info',
+   },
+   annotations: {
+     description: 'etcd cluster "{{ $labels.%s }}": database size is 65 percent of the defined quota on etcd instance {{ $labels.instance }}, please defrag or increase the quota as the writes to etcd will be disabled when it is full.' % $._config.clusterLabel,
+     summary: 'etcd cluster database is using >= 65 percent of the defined quota.',
+   },
+ },
+ {
+   alert: 'etcdDatabaseQuotaLowSpace',
+   expr: |||
+     (last_over_time(etcd_mvcc_db_total_size_in_bytes{%(etcd_selector)s}[5m]) / last_over_time(etcd_server_quota_backend_bytes{%(etcd_selector)s}[5m]))*100 > 75
+   ||| % $._config,
+   'for': '10m',
+   labels: {
+     severity: 'warning',
+   },
+   annotations: {
+     description: 'etcd cluster "{{ $labels.%s }}": database size is 75 percent of the defined quota on etcd instance {{ $labels.instance }}, please defrag or increase the quota as the writes to etcd will be disabled when it is full.' % $._config.clusterLabel,
+     summary: 'etcd cluster database is using >= 75 percent of the defined quota.',
+   },
+ },
  {
    alert: 'etcdDatabaseQuotaLowSpace',
    expr: |||
-     (last_over_time(etcd_mvcc_db_total_size_in_bytes{%(etcd_selector)s}[5m]) / last_over_time(etcd_server_quota_backend_bytes{%(etcd_selector)s}[5m]))*100 > 95
+     (last_over_time(etcd_mvcc_db_total_size_in_bytes{%(etcd_selector)s}[5m]) / last_over_time(etcd_server_quota_backend_bytes{%(etcd_selector)s}[5m]))*100 > 85
    ||| % $._config,
    'for': '10m',
    labels: {
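
To make the tiering in this file easier to see at a glance, here is a minimal Python sketch that lists which severities have their threshold exceeded for a given reading. The thresholds and the strict `>` comparisons are copied from the diff above; the `TIERS` table and the `firing_severities` helper are hypothetical names used only for illustration and are not part of the mixin. The sketch ignores the `'for': '10m'` hold period.

```python
# Thresholds copied from the rules in alerts.libsonnet after this commit.
# The table layout and function name are illustrative, not part of the mixin.
TIERS = {
    # percent of the backend quota in use
    "etcdDatabaseQuotaLowSpace": [("info", 65), ("warning", 75), ("critical", 85)],
    # 99th percentile WAL fsync duration in seconds
    "etcdHighFsyncDurations": [("warning", 0.05), ("critical", 0.07)],
    # 99th percentile backend commit duration in seconds
    "etcdHighCommitDurations": [("warning", 0.08), ("critical", 0.1)],
}


def firing_severities(alert, value):
    """Return every severity whose threshold `value` strictly exceeds."""
    return [severity for severity, threshold in TIERS[alert] if value > threshold]


print(firing_severities("etcdDatabaseQuotaLowSpace", 80))   # ['info', 'warning']
print(firing_severities("etcdHighFsyncDurations", 0.06))    # ['warning']
print(firing_severities("etcdHighCommitDurations", 0.2))    # ['warning', 'critical']
```

Because each tier is a separate rule, a reading above the critical threshold also satisfies the lower tiers. Suppressing the lower severities would be a matter for Alertmanager inhibition rules, which are outside the two files changed here.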

contrib/mixin/test.yaml

Lines changed: 13 additions & 27 deletions
@@ -22,7 +22,7 @@ tests:
        exp_alerts:
          - exp_labels:
              job: etcd
-             severity: warning
+             severity: critical
            exp_annotations:
              description: 'etcd cluster "etcd": members are down (3).'
              summary: etcd cluster members are down.
@@ -35,17 +35,17 @@ tests:
              job: etcd
              severity: critical
            exp_annotations:
-             description: 'etcd cluster "etcd": insufficient members (1).'
-             summary: etcd cluster has insufficient number of members.
+             description: "etcd cluster \"etcd\": is reporting fewer instances are available than are needed (1). When etcd does not have a majority of instances available the Kubernetes APIs will reject read and write requests and operations that preserve the health of workloads cannot be performed. This can occur when multiple control plane nodes are powered off or are unable to connect to each other via the network. Check that all control plane nodes are powered on and that network connections between each machine are functional."
+             summary: "etcd is reporting that a majority of instances are unavailable."
      - eval_time: 15m
        alertname: etcdInsufficientMembers
        exp_alerts:
          - exp_labels:
              job: etcd
              severity: critical
            exp_annotations:
-             description: 'etcd cluster "etcd": insufficient members (0).'
-             summary: etcd cluster has insufficient number of members.
+             description: "etcd cluster \"etcd\": is reporting fewer instances are available than are needed (0). When etcd does not have a majority of instances available the Kubernetes APIs will reject read and write requests and operations that preserve the health of workloads cannot be performed. This can occur when multiple control plane nodes are powered off or are unable to connect to each other via the network. Check that all control plane nodes are powered on and that network connections between each machine are functional."
+             summary: "etcd is reporting that a majority of instances are unavailable."
  - interval: 1m
    input_series:
      - series: up{job="etcd",instance="10.10.10.0"}
@@ -60,7 +60,7 @@ tests:
        exp_alerts:
          - exp_labels:
              job: etcd
-             severity: warning
+             severity: critical
            exp_annotations:
              description: 'etcd cluster "etcd": members are down (3).'
              summary: etcd cluster members are down.
@@ -78,40 +78,26 @@ tests:
        exp_alerts:
          - exp_labels:
              job: etcd
-             severity: warning
+             severity: critical
            exp_annotations:
              description: 'etcd cluster "etcd": members are down (1).'
              summary: etcd cluster members are down.
  - interval: 1m
    input_series:
-     - series: etcd_server_leader_changes_seen_total{job="etcd",instance="10.10.10.0"}
-       values: 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0
-     - series: etcd_server_leader_changes_seen_total{job="etcd",instance="10.10.10.1"}
-       values: 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
-     - series: etcd_server_leader_changes_seen_total{job="etcd",instance="10.10.10.2"}
-       values: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+     - series: etcd_server_is_leader{job="etcd"}
+       values: 0 0 1 0 0 1 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0
    alert_rule_test:
-     - eval_time: 10m
+     - eval_time: 5m
+       alertname: etcdHighNumberOfLeaderChanges
+     - eval_time: 15m
        alertname: etcdHighNumberOfLeaderChanges
        exp_alerts:
          - exp_labels:
              job: etcd
              severity: warning
            exp_annotations:
-             description: 'etcd cluster "etcd": 4 leader changes within the last 15 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated.'
+             description: 'etcd cluster "etcd": 9 leader changes within the last 10 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated.'
              summary: etcd cluster has high number of leader changes.
- - interval: 1m
-   input_series:
-     - series: etcd_server_leader_changes_seen_total{job="etcd",instance="10.10.10.0"}
-       values: 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0
-     - series: etcd_server_leader_changes_seen_total{job="etcd",instance="10.10.10.1"}
-       values: 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
-     - series: etcd_server_leader_changes_seen_total{job="etcd",instance="10.10.10.2"}
-       values: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-   alert_rule_test:
-     - eval_time: 10m
-       alertname: etcdHighNumberOfLeaderChanges
-       exp_alerts:
  - interval: 1m
    input_series:
      - series: etcd_mvcc_db_total_size_in_bytes{job="etcd",instance="10.10.10.0"}
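
As a rough check of where the "9 leader changes" in the updated test comes from, the following Python sketch replays the `etcd_server_is_leader` test series through a simplified model of `changes(...[10m])`. It assumes one sample per minute starting at t=0 and a range window that includes the samples at both boundaries. The helper names `changes` and `changes_at` and the `left_open` flag are hypothetical, and with a single input series the `avg by (job)` aggregation is taken to be the value itself.

```python
# Sample values copied from the etcd_server_is_leader series in test.yaml,
# one sample per minute starting at t=0.
VALUES = [int(v) for v in "0 0 1 0 0 1 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0".split()]


def changes(samples):
    """Count how many times consecutive samples differ."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def changes_at(eval_minute, range_minutes=10, left_open=False):
    """Apply changes() to the samples in the range window ending at eval_minute."""
    start = eval_minute - range_minutes + (1 if left_open else 0)
    return changes(VALUES[max(start, 0):eval_minute + 1])


print(changes_at(5))                   # 3, below the "> 5" threshold
print(changes_at(15))                  # 9, the value in the expected annotation
print(changes_at(15, left_open=True))  # 8, if the sample at the 5m boundary is excluded
```

Under these assumptions the 5m evaluation stays below the threshold, which is consistent with that test case expecting no alert, and the 15m evaluation yields 9. The result is sensitive to whether the sample exactly at the start of the window is counted, so the expected annotation depends on the range-selector semantics of the Prometheus version running the test.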
