
Bug: Dashboard: total requests shows (wrong) spikes when instances scale down #1465

@jensbaitingerbosch

Description


We have an SLO defined like the following:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: xxx-leadtime
  namespace: xxx
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  target: '99.9'
  window: 30d
  description: ...
  indicator:
    latency:
      success:
        metric: xxx_leadTime_seconds_bucket{job="xxx",le="1.073741824"}
      total:
        metric: xxx_leadTime_seconds_count{job="xxx"}
  alerting:
    name: HighDownstreamLeadTimeInternal

The generated recording rule is the following:

- expr: sum(xxx_leadTime_seconds_count{job="xxx"})
  labels:
    slo: xxx-leadtime
  record: pyrra_requests_total

This is then visualized on the dashboard with the following query:

[Image: dashboard panel for the query below, showing spikes]
sum(rate(pyrra_requests_total{slo="xxx-leadtime"}[$__rate_interval]))

The issue is these spikes, which always occur when an instance is shut down. Here is the raw recorded metric:

[Image: graph of the raw recorded metric below, dropping when an instance is removed]
pyrra_requests_total{slo="xxx-leadtime"}


Suggested fix: sum by instance (e.g. by pod) instead of computing a complete sum; the dashboard query would then work correctly.
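For reference, a minimal sketch of the suggested rule (metric and label names follow the generated rule above; the `instance` grouping label is an assumption, any per-pod label would serve). When the recording rule sums counters across all instances into one series, the sum drops whenever an instance disappears, and `rate()` misreads that drop as a counter reset, producing a spike. Keeping one series per instance lets `rate()` handle each counter independently:

```yaml
# Sketch: preserve per-instance series instead of a single aggregate
- expr: sum by (instance) (xxx_leadTime_seconds_count{job="xxx"})
  labels:
    slo: xxx-leadtime
  record: pyrra_requests_total
```

The dashboard query `sum(rate(pyrra_requests_total{slo="xxx-leadtime"}[$__rate_interval]))` would then compute the rate per series first and sum afterwards, so a terminated instance simply drops out of the sum rather than causing a negative delta on one combined counter.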
