
Bug: Dashboard: total requests shows (wrong) spikes when instances scale down #1465

@jensbaitingerbosch

Description


We have an SLO defined like the following:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: xxx-leadtime
  namespace: xxx
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  target: '99.9'
  window: 30d
  description: ...
  indicator:
    latency:
      success:
        metric: xxx_leadTime_seconds_bucket{job="xxx",le="1.073741824"}
      total:
        metric: xxx_leadTime_seconds_count{job="xxx"}
  alerting:
    name: HighDownstreamLeadTimeInternal

The generated recording rule is the following:

- expr: sum(xxx_leadTime_seconds_count{job="xxx"})
  labels:
    slo: xxx-leadtime
  record: pyrra_requests_total

This is then visualized on the dashboard with the following query:

[Image: dashboard panel for the query below, showing spikes]
sum(rate(pyrra_requests_total{slo="xxx-leadtime"}[$__rate_interval]))

The issue is these spikes, which always occur when an instance is shut down. Here is the raw recorded metric:

[Image: graph of the raw recorded metric below, dropping when an instance is removed]
pyrra_requests_total{slo="xxx-leadtime"}


Suggested fix: sum by instance (e.g. by pod) instead of computing a complete sum; the dashboard query would then work correctly.
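For reference, a minimal sketch of the suggested rule (metric and label names follow the generated rule above; the `instance` grouping label is an assumption, any per-pod label would serve). When the recording rule sums counters across all instances into one series, the sum drops whenever an instance disappears, and `rate()` misreads that drop as a counter reset, producing a spike. Keeping one series per instance lets `rate()` handle each counter independently:

```yaml
# Sketch: preserve per-instance series instead of a single aggregate
- expr: sum by (instance) (xxx_leadTime_seconds_count{job="xxx"})
  labels:
    slo: xxx-leadtime
  record: pyrra_requests_total
```

The dashboard query `sum(rate(pyrra_requests_total{slo="xxx-leadtime"}[$__rate_interval]))` would then compute the rate per series first and sum afterwards, so a terminated instance simply drops out of the sum rather than causing a negative delta on one combined counter.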
