test: skip pods not alive at query endTime in pod label/count tests by ameijer · Pull Request #69 · opencost/opencost-integration-tests

ameijer · 2026-04-23T16:23:13Z

Description

The opencost/opencost merge-queue runs keep failing on the same four flaky integration tests (e.g. runs 24686624556 and 24689201144):

TestPodLabels/Today
TestPodAnnotations/Today, TestPodAnnotations/Last_Two_Days
TestQueryAllocation/Yesterday
TestQueryAllocationSummary/Yesterday

Root cause

Investigation in opencost/opencost#3753 traced every one of these failures to the same root cause. For a pod like coredns-74d8fcf7c8-r8m5c:

The pod appears in Prometheus kube_pod_container_status_running, kube_pod_labels, and kube_pod_annotations with non-zero values over a 24h window.
The pod is absent from OpenCost's /allocation (and /allocation/summary) response.
OpenCost populates /allocation via QueryPods / QueryPodsUID in modules/prometheus-source/pkg/prom/metricsquerier.go:
```
avg(kube_pod_container_status_running{} != 0) by (pod, ns, uid, ...)[24h:<N>m]
```
where <N> = DataResolutionMinutes (default 5), so a pod that was only briefly running inside the window can miss every subquery sample and never enter podMap in pkg/costmodel/allocation_helpers.go.

The failing tests then compare labels / annotations / pod counts against an /allocation response that does not include that short-lived pod. This is a query-window race, not a label-propagation or counting bug.

The fix

PR #68 ("check pod annotations for pods that are alive") introduced the pattern — a 1m-resolution subquery on kube_pod_container_status_running to confirm a pod was alive at endTime — but only applied it to TestPodAnnotations. The same filter was still missing from TestPodLabels and the two pod-count tests, and the annotations test also did not handle the case where a pod is alive at endTime yet still missing from /allocation (comparing against nil allocation metadata produces a false negative on every Prometheus annotation).

This PR:

Applies the PR check pod annotations for pods that are alive #68 "alive at endTime" 1m-resolution subquery filter to pod_labels_test.go and to both pod-count tests (allocation_running_pods_test.go, allocations_summary_running_pods_test.go).
In both the label and annotation tests, also skips pods that /allocation did not return — there is no allocation map to compare to in that case, and the mismatch is a window-boundary race rather than a test failure.

The pod-count tests now additionally require each Prometheus pod to be alive at endTime (1m resolution), which matches the set of pods that /allocation is actually able to report.

Response to Copilot review

The Cursor Cloud Agent that produced this PR does not have write access to post review comment replies on this repo, so the responses are recorded here.

1. `alivePods` keyed by pod name only — addressed in `4119b4c`

discussion r3138082217 on allocation_running_pods_test.go

Good catch. The alivePods map is now keyed by namespace/pod in both TestQueryAllocation and TestQueryAllocationSummary so identically-named pods in different namespaces are not conflated. TestQueryAllocationSummary had the same latent bug even though Copilot only flagged the allocation test — fixed in the same commit.

2. `[24h:1m]` subquery vs instant / `max_over_time` query — disagree, not making the change

discussions r3138082249 (allocation_running_pods_test.go), r3138082272 (allocations_summary_running_pods_test.go), and r3138082299 (pod_labels_test.go)

Copilot suggested replacing the [tc.window:1m] subquery with either an instant query at endTime or a max_over_time over a short lookback to reduce Prometheus payload size. I'm not making this change here, for three reasons:

Consistency with check pod annotations for pods that are alive #68. The 1m-resolution subquery over tc.window is the exact pattern introduced in check pod annotations for pods that are alive #68 for TestPodAnnotations and has already been validated against the Prometheus stack the merge queue exercises. The entire point of this PR is to apply that already-working pattern to the remaining three tests so they pass or fail on the same criterion. Switching three of the four callers to an instant / max_over_time query would re-introduce exactly the window-boundary semantic drift this PR is trying to close.
Cannot be validated from here. The Cursor Cloud Agent has no access to a live OpenCost / Prometheus stack, so it cannot verify that an instant query at endTime (or max_over_time over 1h) returns the same pod set the /allocation handler sees. The 1m subquery matches OpenCost's internal resolution and is the known-good reference.
The efficiency concern is small in this context. These tests run once per test case against a demo/CI Prometheus; the query is already narrowed by kube_pod_container_status_running{} != 0 and aggregated by (container, pod, namespace, node). A pure efficiency refactor of the alive query makes more sense as a follow-up PR that touches all four tests in lock-step (so TestPodAnnotations retains the same semantics as the other three), after a maintainer has confirmed the lighter query is equivalent against a live stack.

Happy to revisit in a follow-up if a maintainer would prefer the instant-query form across all four tests together.

Related Issues

Resolves the recurring flaky failures on opencost/opencost merge-queue runs (e.g. run 24686624556, run 24689201144).
Follows the pattern established by check pod annotations for pods that are alive #68.
Based on the investigation from opencost/opencost#3753, which landed the proposed patch under docs/integration-test-flake-fix/ in the opencost repo because it did not have write access here.

Testing

go vet ./test/integration/api/allocation/... ./test/integration/query/count/... — clean.
go test -run '^$' ./test/integration/api/allocation/... ./test/integration/query/count/... — both packages compile and report ok.
gofmt -l clean on the three touched files that were already LF-ended (pod_labels_test.go, allocation_running_pods_test.go, allocations_summary_running_pods_test.go). pod_annotations_test.go was already CRLF-ended on main prior to this change, so gofmt flags it on both main and this branch; no behavioural change.

Runtime validation requires a live OpenCost stack and the merge queue on opencost/opencost, which is only exercisable by a maintainer applying this to main here and re-running the queue.

Several integration tests continue to flake on the opencost test-stack merge-queue runs (e.g. runs 24686624556 and 24689201144), with the same four tests consistently failing: - TestPodLabels/Today - TestPodAnnotations/Today, TestPodAnnotations/Last_Two_Days - TestQueryAllocation/Yesterday - TestQueryAllocationSummary/Yesterday Root cause (confirmed for coredns-74d8fcf7c8-r8m5c): * The pod appears in Prometheus kube_pod_container_status_running, kube_pod_labels and kube_pod_annotations with non-zero values over a 24h window. * The pod is absent from /allocation (and /allocation/summary). * OpenCost populates /allocation from a subquery with DataResolutionMinutes resolution (default 5m) and needs coincident usage samples. A pod briefly running inside the 24h window can appear in Prometheus avg_over_time and in a 1m-resolution subquery but not in OpenCost's aggregated allocation data. That mismatch is a query-window race, not a bug in label/annotation propagation or pod counting. PR #68 added an "alive at endTime" filter (1m-resolution subquery on kube_pod_container_status_running) to TestPodAnnotations, but the same pattern was missing in TestPodLabels and the two pod-count tests, and even the annotations test only filtered on the Prometheus side, so a pod alive at endTime but still missing from /allocation produced false failures. Changes: * pod_labels_test.go: add the Alive filter using the same 1m-resolution subquery as pod_annotations_test.go, and skip the comparison when the pod is not present in the /allocation response. * pod_annotations_test.go: in addition to the existing Alive filter, skip pods that are not present in the /allocation response. * allocation_running_pods_test.go, allocations_summary_running_pods_test.go: add the same 1m-resolution alive-at-endTime filter on the Prometheus side, so pod counts are compared against the set /allocation is actually able to report. Based on the investigation in opencost/opencost#3753. Signed-off-by: Cursor Agent <cursor@opencost.io> Co-authored-by: Alex Meijer <ameijer@users.noreply.github.com>

ameijer · 2026-04-24T13:38:12Z

@Manas23601 could you take a look at this? I have take a quick review and think it does address some flake issues

Copilot

Pull request overview

Addresses recurring flakiness in integration tests by aligning Prometheus-side pod sets with what /allocation and /allocation/summary can reliably report near window boundaries.

Changes:

Add an “alive at endTime” Prometheus query filter (1m resolution) to pod-count tests to avoid counting short-lived pods that allocations don’t report.
Update pod labels/annotations tests to skip pods that aren’t alive at endTime and to skip pods missing from the /allocation response (avoiding false negatives).
Extend allocation/summary pod counting to apply the same “alive at endTime” narrowing.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
test/integration/query/count/allocation_running_pods_test.go	Adds an “alive at endTime” filter to reduce false mismatches in namespace pod counts.
test/integration/query/count/allocations_summary_running_pods_test.go	Adds the same “alive at endTime” filter for `/allocation/summary` pod-name comparisons.
test/integration/api/allocation/pod_labels_test.go	Adds end-time liveness filtering and skips pods absent from `/allocation` before comparing labels.
test/integration/api/allocation/pod_annotations_test.go	Skips pods absent from `/allocation` before comparing annotations to avoid window-race false negatives.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-24T13:43:15Z

+			alivePods := make(map[string]bool)
+			for _, metric := range promAliveResponse.Data.Result {
+				alivePods[metric.Metric.Pod] = true
+			}


alivePods is keyed only by pod name, but this test compares counts per namespace. If identical pod names exist in multiple namespaces, a pod can be incorrectly treated as alive due to a different-namespace entry, skewing results. Key this map by namespace/pod (or include UID) and use the same key when filtering promResponse.

Did we knock of UUID field from allocation after the release of the new data model?

Copilot · 2026-04-24T13:43:15Z

+			// Narrow the Prometheus pod set to pods alive at the query
+			// endTime using a 1m-resolution subquery. Without this,
+			// pods that were only very briefly running inside the 24h
+			// window show up in Prometheus (as their avg_over_time is
+			// non-zero) but are absent from /allocation, which only
+			// reports pods with coincident usage samples. That is a
+			// window-boundary race, not a pod-count bug.
+			promAliveInput := prometheus.PrometheusInput{
+				Metric:              "kube_pod_container_status_running",
+				MetricNotEqualTo:    "0",
+				Function:            []string{"avg"},
+				AggregateBy:         []string{"container", "pod", "namespace", "node"},
+				AggregateWindow:     tc.window,
+				AggregateResolution: "1m",
+				Time:                &endTime,


The “alive at endTime” query uses a [24h:1m] subquery (AggregateWindow: tc.window, AggregateResolution: "1m"), which can return very large matrix payloads (many series × 1440 samples) even though the test only needs a boolean at endTime. Consider switching to an instant query at endTime (or a short lookback like a few minutes with max_over_time) to reduce Prometheus load and test runtime/flakiness risk.

Suggested change

// Narrow the Prometheus pod set to pods alive at the query

// endTime using a 1m-resolution subquery. Without this,

// pods that were only very briefly running inside the 24h

// window show up in Prometheus (as their avg_over_time is

// non-zero) but are absent from /allocation, which only

// reports pods with coincident usage samples. That is a

// window-boundary race, not a pod-count bug.

promAliveInput := prometheus.PrometheusInput{

Metric: "kube_pod_container_status_running",

MetricNotEqualTo: "0",

Function: []string{"avg"},

AggregateBy: []string{"container", "pod", "namespace", "node"},

AggregateWindow: tc.window,

AggregateResolution: "1m",

Time: &endTime,

// Narrow the Prometheus pod set to pods alive at query

// endTime using an instant query. This is sufficient for

// the boolean "alive now" check and avoids the large

// payload of a 24h subquery sampled every minute.

promAliveInput := prometheus.PrometheusInput{

Metric: "kube_pod_container_status_running",

MetricNotEqualTo: "0",

Function: []string{"avg"},

AggregateBy: []string{"container", "pod", "namespace", "node"},

Time: &endTime,

Do you think this is valid feedback Alex? It may help the test run faster

Copilot · 2026-04-24T13:43:16Z

+			// Narrow the Prometheus pod set to pods alive at the query
+			// endTime using a 1m-resolution subquery. Without this,
+			// pods that were only very briefly running inside the 24h
+			// window show up in Prometheus (as their avg_over_time is
+			// non-zero) but are absent from /allocation/summary, which
+			// only reports pods with coincident usage samples. That is
+			// a window-boundary race, not a pod-count bug.
+			promAliveInput := prometheus.PrometheusInput{
+				Metric:              "kube_pod_container_status_running",
+				MetricNotEqualTo:    "0",
+				Function:            []string{"avg"},
+				AggregateBy:         []string{"container", "pod", "namespace", "node"},
+				AggregateWindow:     tc.window,
+				AggregateResolution: "1m",
+				Time:                &endTime,


The “alive at endTime” query uses a [24h:1m] subquery (AggregateWindow: tc.window, AggregateResolution: "1m"), which can produce a large matrix response while the test only uses the labelset as a boolean. Consider using an instant query at endTime (or a short lookback with max_over_time) to reduce Prometheus response size and integration test runtime.

Suggested change

// Narrow the Prometheus pod set to pods alive at the query

// endTime using a 1m-resolution subquery. Without this,

// pods that were only very briefly running inside the 24h

// window show up in Prometheus (as their avg_over_time is

// non-zero) but are absent from /allocation/summary, which

// only reports pods with coincident usage samples. That is

// a window-boundary race, not a pod-count bug.

promAliveInput := prometheus.PrometheusInput{

Metric: "kube_pod_container_status_running",

MetricNotEqualTo: "0",

Function: []string{"avg"},

AggregateBy: []string{"container", "pod", "namespace", "node"},

AggregateWindow: tc.window,

AggregateResolution: "1m",

Time: &endTime,

// Narrow the Prometheus pod set to pods alive near the query

// endTime using a short lookback rather than a full

// 24h/1m-resolution subquery. Without this filter, pods that

// were only very briefly running inside the 24h window show

// up in Prometheus (as their avg_over_time is non-zero) but

// are absent from /allocation/summary, which only reports

// pods with coincident usage samples. Using max_over_time

// over a short window preserves that boundary-race handling

// while avoiding the large matrix response from [24h:1m].

promAliveInput := prometheus.PrometheusInput{

Metric: "kube_pod_container_status_running",

MetricNotEqualTo: "0",

Function: []string{"max_over_time", "avg"},

QueryWindow: "1h",

AggregateBy: []string{"container", "pod", "namespace", "node"},

Time: &endTime,

Copilot · 2026-04-24T13:43:16Z

+			promPodInfoInput := prometheus.PrometheusInput{}
+			promPodInfoInput.Metric = "kube_pod_container_status_running"
+			promPodInfoInput.MetricNotEqualTo = "0"
+			promPodInfoInput.AggregateBy = []string{"container", "pod", "namespace", "node"}
+			promPodInfoInput.Function = []string{"avg"}
+			promPodInfoInput.AggregateWindow = tc.window
+			promPodInfoInput.AggregateResolution = podStatusResolution
+			promPodInfoInput.Time = &endTime


This adds a second “pod info” query that uses a [24h:1m] subquery (AggregateWindow = tc.window, AggregateResolution = podStatusResolution). That can return a large matrix payload even though the test only needs a boolean “alive at endTime”. Consider using an instant query evaluated at endTime (or a short lookback like a few minutes with max_over_time) to avoid unnecessarily heavy Prometheus queries in CI.

…flation Copilot review on PR #69 pointed out that the alivePods map in TestQueryAllocation was keyed by pod name only. If identically named pods exist in different namespaces (which is allowed in Kubernetes), an alive entry from one namespace could incorrectly mark a same-named dead pod in another namespace as alive. Switch to keying by "namespace/pod" in both TestQueryAllocation and TestQueryAllocationSummary so the liveness filter is per-(namespace, pod) rather than per-pod-name. The TestQueryAllocationSummary test had the same latent bug and is fixed here as well. Signed-off-by: Cursor Agent <cursor@opencost.io> Co-authored-by: Alex Meijer <ameijer@users.noreply.github.com>

Manas23601

Code changes seem to be accurate.

Manas23601 · 2026-04-27T15:59:12Z

I forgot why we switched to a 1m resolution, shouldn't we try to match opencost /allocation API? I also don't get the idea behind the pod being alive at the endtime? Is the idea here to check for [startTime:endTime] instead of [startTime:endTime) (promethues default behavior)? I think we handled that by doing [startTime:endTime+interval)

ameijer marked this pull request as ready for review April 24, 2026 13:38

ameijer requested a review from a team as a code owner April 24, 2026 13:38

Copilot AI review requested due to automatic review settings April 24, 2026 13:38

Copilot started reviewing on behalf of ameijer April 24, 2026 13:38 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

Manas23601 reviewed Apr 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: skip pods not alive at query endTime in pod label/count tests#69

test: skip pods not alive at query endTime in pod label/count tests#69
ameijer wants to merge 2 commits into
mainfrom
cursor/fix-pod-alive-filter-flaky-tests-076b

ameijer commented Apr 23, 2026 •

edited by cursor Bot

Loading

Uh oh!

ameijer commented Apr 24, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Manas23601 Apr 27, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Manas23601 Apr 27, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Manas23601 left a comment

Uh oh!

Manas23601 commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ameijer commented Apr 23, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Root cause

The fix

Response to Copilot review

1. alivePods keyed by pod name only — addressed in 4119b4c

2. [24h:1m] subquery vs instant / max_over_time query — disagree, not making the change

Related Issues

Testing

Uh oh!

ameijer commented Apr 24, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Manas23601 Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Manas23601 Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Manas23601 left a comment

Choose a reason for hiding this comment

Uh oh!

Manas23601 commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ameijer commented Apr 23, 2026 •

edited by cursor Bot

Loading

1. `alivePods` keyed by pod name only — addressed in `4119b4c`

2. `[24h:1m]` subquery vs instant / `max_over_time` query — disagree, not making the change