BoolGauge and pyrra_availability metric #1224

dharapvj · 2024-07-23T09:44:33Z

I am trying to define an SLO for platform services such as istio, nginx-ingress-controller, etc. None of the existing SLO types (ratio, latency, etc.) seems to fit, since I want to evaluate the uptime of nginx as a service, istiod as a service, and so on.

So I tried BoolGauge, which is supposed to cover blackbox exporter type situations.

Here is my SLO config:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: sample-svc-uptime-slo
  labels:
    role: alert-rules
    pyrra.dev/team: team-blue
spec:
  target: '99.95'
  window: "1w"
  alerting:
    disabled: true
  indicator:
    bool_gauge:
      metric: up{namespace="app4", job="kubernetes-pods"}

What I observe: while the pods and the svc in front of them are up, I get pyrra_availability = 100% and error budget = 100%. But once I shut down the pods to test error budget depletion, both the availability metric and the budget crash to zero. I would have expected the budget to burn down slowly.

If I change the time range to a period where the pods were up, pyrra_availability is reported as 100%.

Any idea what is misconfigured here? Or is this a bug in Pyrra's recording rule expressions?
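For reference, here is a minimal sketch of what a gradual burn-down would look like for the target and window in the config above. It uses the generic error budget definition (allowed downtime = (1 - target) × window) and is not taken from Pyrra's own rules; the function names are hypothetical.

```python
# Generic error-budget arithmetic; an illustration, not Pyrra's implementation.
WINDOW_SECONDS = 7 * 24 * 60 * 60  # window: "1w"
TARGET_PERCENT = 99.95             # target: '99.95'


def allowed_downtime_seconds(target_percent: float, window_seconds: int) -> float:
    """Total downtime the SLO tolerates over one window."""
    return (100.0 - target_percent) / 100.0 * window_seconds


def remaining_budget(downtime_seconds: float, target_percent: float,
                     window_seconds: int) -> float:
    """Fraction of the error budget left (1.0 = untouched, <= 0.0 = exhausted)."""
    allowed = allowed_downtime_seconds(target_percent, window_seconds)
    return 1.0 - downtime_seconds / allowed


budget = allowed_downtime_seconds(TARGET_PERCENT, WINDOW_SECONDS)
print(f"allowed downtime per window: {budget:.1f} s")

for minutes_down in (1, 3, 5, 10):
    left = remaining_budget(minutes_down * 60, TARGET_PERCENT, WINDOW_SECONDS)
    print(f"{minutes_down:>2} min down -> {left:.0%} of budget left")
```

Under this definition the whole budget is about 302 seconds, roughly five minutes per week, so even a correctly measured outage of a few minutes would use up most of it.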

My analysis showed that both expressions are built from
sum(up:sum1w{job="kubernetes-pods",namespace="app4",slo="sample-svc-uptime-slo"}) and sum(up:count1w{job="kubernetes-pods",namespace="app4",slo="sample-svc-uptime-slo"})

which in turn use
sum by (__name__, job, namespace) (sum_over_time(up{job="kubernetes-pods",namespace="app4"}[1w]))
and
sum by (__name__, job, namespace) (count_over_time(up{job="kubernetes-pods",namespace="app4"}[1w]))

The sum_over_time and count_over_time expressions produce identical graphs, which I think is why the availability plummets to zero.

Is my usage of up as the metric wrong for this kind of SLO evaluation?
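One possible explanation for the identical graphs, sketched in Python rather than PromQL: if a deleted pod's target drops out of service discovery, its up series stops receiving samples instead of recording 0. That the kubernetes-pods job behaves this way here is an assumption. The helpers below only mimic sum_over_time and count_over_time on a list of samples, with None standing for "no sample in this interval".

```python
from typing import Optional, Sequence

Sample = Optional[int]  # None models a scrape interval with no sample at all


def sum_over_time(samples: Sequence[Sample]) -> int:
    """Sum of the samples that exist in the window."""
    return sum(s for s in samples if s is not None)


def count_over_time(samples: Sequence[Sample]) -> int:
    """Number of samples that exist in the window."""
    return sum(1 for s in samples if s is not None)


def availability(samples: Sequence[Sample]) -> float:
    return sum_over_time(samples) / count_over_time(samples)


# Pod up for 6 intervals, then deleted: the series simply ends.
series_vanishes = [1, 1, 1, 1, 1, 1, None, None, None, None]
# Target still scraped but failing: up is recorded as 0.
series_reports_zero = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

print(sum_over_time(series_vanishes), count_over_time(series_vanishes))
print(availability(series_vanishes))
print(availability(series_reports_zero))
```

In this model, sum and count are equal whenever every existing sample is 1, so the missing intervals never register as downtime in the ratio. Only explicit 0 samples pull it below 1.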

dharapvj · 2024-07-23T10:44:01Z

Adding an or vector(0) to the sum and count queries seems to make things better:

sum_over_time(sum(up{job="kubernetes-pods",namespace="app4"} or vector(0))[1w:]) / count_over_time(sum(up{job="kubernetes-pods",namespace="app4"} or vector(0))[1w:])

But this expression uses a subquery (maybe there is a better way to add the vector(0) part).

We cannot create an expression like this via Pyrra right now, as I need [1w:] instead of [1w] in the expression.
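To illustrate why the or vector(0) variant behaves differently, here is a rough Python model of the subquery. Each step holds the up values of the series present at that evaluation time, and an empty list means no series exist. The function names and the ten-step window are made up for illustration; this is a sketch of the idea, not of how Prometheus evaluates subqueries internally.

```python
from typing import List, Sequence


def sum_up_or_zero(present: Sequence[int]) -> int:
    """Model of sum(up or vector(0)) at one evaluation step.

    `or vector(0)` keeps every up series and adds one label-less sample
    with value 0, so the sum is defined even when no up series exist.
    """
    return sum(list(present) + [0])


def subquery_availability(steps: Sequence[Sequence[int]]) -> float:
    """Model of sum_over_time(...[1w:]) / count_over_time(...[1w:])."""
    values: List[int] = [sum_up_or_zero(step) for step in steps]
    return sum(values) / len(values)


# One pod: present for 6 steps, gone for 4 steps.
one_pod = [[1]] * 6 + [[]] * 4
# Two pods with the same history.
two_pods = [[1, 1]] * 6 + [[]] * 4

print(subquery_availability(one_pod))   # 0.6
print(subquery_availability(two_pods))  # 1.2
```

Because every step now yields a value, the steps without any series count as 0 and lower the ratio. The second case shows a caveat of this model: with more than one matching pod, sum(up) is larger than 1, so the result is no longer a fraction between 0 and 1.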

1 reply

metalmatze Aug 23, 2024
Maintainer

I'm not 100% sure about how well-tested the bool_gauge and generic rule integration are.
If you find these things worth sending as a Pull Request, feel free to do so. It would be helpful!
