We have an SLO defined like the following:
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: xxx-leadtime
  namespace: xxx
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  target: '99.9'
  window: 30d
  description: ...
  indicator:
    latency:
      success:
        metric: xxx_leadTime_seconds_bucket{job="xxx",le="1.073741824"}
      total:
        metric: xxx_leadTime_seconds_count{job="xxx"}
  alerting:
    name: HighDownstreamLeadTimeInternal

The generated recording rule is the following:
- expr: >-
    sum(xxx_leadTime_seconds_count{job="xxx"})
  labels:
    slo: xxx-leadtime
  record: pyrra_requests_total
This is then visualized on the dashboard with the following query:

sum(rate(pyrra_requests_total{slo="xxx-leadtime"}[$__rate_interval]))
The issue is the spikes that always appear when an instance is shut down. Here is the raw recorded metric:

pyrra_requests_total{slo="xxx-leadtime"}
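Suggestion for solving the issue: in the recording rule, sum per instance (e.g. by pod) instead of summing across all instances. The dashboard query would then work correctly, because `rate()` would be applied to each instance's series before the outer `sum()`.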
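
For context on why the total sum spikes: when an instance goes away, its counter drops out of the sum, so `pyrra_requests_total` decreases. `rate()` treats any decrease as a counter reset and assumes the counter restarted from zero, which inflates the computed increase. A minimal sketch of what a per-instance recording rule could look like is shown below. It assumes a `pod` label exists on the source metric (this depends on the scrape configuration; `instance` may be the right label instead), and it is not confirmed output of Pyrra, only an illustration of the suggestion:

```yaml
# Hypothetical rule: keeps one series per pod instead of a single total.
# The `pod` label is an assumption; use whichever label identifies an instance.
- expr: >-
    sum by (pod) (xxx_leadTime_seconds_count{job="xxx"})
  labels:
    slo: xxx-leadtime
  record: pyrra_requests_total
```

With a rule like this, the existing dashboard query `sum(rate(pyrra_requests_total{slo="xxx-leadtime"}[$__rate_interval]))` would not need to change: a pod shutting down ends its own series rather than lowering a shared total.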