test: skip pods not alive at query endTime in pod label/count tests#69
test: skip pods not alive at query endTime in pod label/count tests#69ameijer wants to merge 2 commits into
Conversation
Several integration tests continue to flake on the opencost test-stack
merge-queue runs (e.g. runs 24686624556 and 24689201144), with the same
four tests consistently failing:
- TestPodLabels/Today
- TestPodAnnotations/Today, TestPodAnnotations/Last_Two_Days
- TestQueryAllocation/Yesterday
- TestQueryAllocationSummary/Yesterday
Root cause (confirmed for coredns-74d8fcf7c8-r8m5c):
* The pod appears in Prometheus kube_pod_container_status_running,
kube_pod_labels and kube_pod_annotations with non-zero values over
a 24h window.
* The pod is absent from /allocation (and /allocation/summary).
* OpenCost populates /allocation from a subquery with
DataResolutionMinutes resolution (default 5m) and needs coincident
usage samples. A pod briefly running inside the 24h window can
appear in Prometheus avg_over_time and in a 1m-resolution subquery
but not in OpenCost's aggregated allocation data. That mismatch is
a query-window race, not a bug in label/annotation propagation or
pod counting.
PR #68 added an "alive at endTime" filter (1m-resolution subquery on
kube_pod_container_status_running) to TestPodAnnotations, but the same
pattern was missing in TestPodLabels and the two pod-count tests, and
even the annotations test only filtered on the Prometheus side, so a
pod alive at endTime but still missing from /allocation produced false
failures.
Changes:
* pod_labels_test.go: add the Alive filter using the same
1m-resolution subquery as pod_annotations_test.go, and skip the
comparison when the pod is not present in the /allocation
response.
* pod_annotations_test.go: in addition to the existing Alive filter,
skip pods that are not present in the /allocation response.
* allocation_running_pods_test.go,
allocations_summary_running_pods_test.go: add the same
1m-resolution alive-at-endTime filter on the Prometheus side, so
pod counts are compared against the set /allocation is actually
able to report.
Based on the investigation in opencost/opencost#3753.
Signed-off-by: Cursor Agent <cursor@opencost.io>
Co-authored-by: Alex Meijer <ameijer@users.noreply.github.com>
|
@Manas23601 could you take a look at this? I have take a quick review and think it does address some flake issues |
There was a problem hiding this comment.
Pull request overview
Addresses recurring flakiness in integration tests by aligning Prometheus-side pod sets with what /allocation and /allocation/summary can reliably report near window boundaries.
Changes:
- Add an “alive at endTime” Prometheus query filter (1m resolution) to pod-count tests to avoid counting short-lived pods that allocations don’t report.
- Update pod labels/annotations tests to skip pods that aren’t alive at
endTimeand to skip pods missing from the/allocationresponse (avoiding false negatives). - Extend allocation/summary pod counting to apply the same “alive at endTime” narrowing.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test/integration/query/count/allocation_running_pods_test.go | Adds an “alive at endTime” filter to reduce false mismatches in namespace pod counts. |
| test/integration/query/count/allocations_summary_running_pods_test.go | Adds the same “alive at endTime” filter for /allocation/summary pod-name comparisons. |
| test/integration/api/allocation/pod_labels_test.go | Adds end-time liveness filtering and skips pods absent from /allocation before comparing labels. |
| test/integration/api/allocation/pod_annotations_test.go | Skips pods absent from /allocation before comparing annotations to avoid window-race false negatives. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| alivePods := make(map[string]bool) | ||
| for _, metric := range promAliveResponse.Data.Result { | ||
| alivePods[metric.Metric.Pod] = true | ||
| } |
There was a problem hiding this comment.
alivePods is keyed only by pod name, but this test compares counts per namespace. If identical pod names exist in multiple namespaces, a pod can be incorrectly treated as alive due to a different-namespace entry, skewing results. Key this map by namespace/pod (or include UID) and use the same key when filtering promResponse.
There was a problem hiding this comment.
Did we knock of UUID field from allocation after the release of the new data model?
| // Narrow the Prometheus pod set to pods alive at the query | ||
| // endTime using a 1m-resolution subquery. Without this, | ||
| // pods that were only very briefly running inside the 24h | ||
| // window show up in Prometheus (as their avg_over_time is | ||
| // non-zero) but are absent from /allocation, which only | ||
| // reports pods with coincident usage samples. That is a | ||
| // window-boundary race, not a pod-count bug. | ||
| promAliveInput := prometheus.PrometheusInput{ | ||
| Metric: "kube_pod_container_status_running", | ||
| MetricNotEqualTo: "0", | ||
| Function: []string{"avg"}, | ||
| AggregateBy: []string{"container", "pod", "namespace", "node"}, | ||
| AggregateWindow: tc.window, | ||
| AggregateResolution: "1m", | ||
| Time: &endTime, |
There was a problem hiding this comment.
The “alive at endTime” query uses a [24h:1m] subquery (AggregateWindow: tc.window, AggregateResolution: "1m"), which can return very large matrix payloads (many series × 1440 samples) even though the test only needs a boolean at endTime. Consider switching to an instant query at endTime (or a short lookback like a few minutes with max_over_time) to reduce Prometheus load and test runtime/flakiness risk.
| // Narrow the Prometheus pod set to pods alive at the query | |
| // endTime using a 1m-resolution subquery. Without this, | |
| // pods that were only very briefly running inside the 24h | |
| // window show up in Prometheus (as their avg_over_time is | |
| // non-zero) but are absent from /allocation, which only | |
| // reports pods with coincident usage samples. That is a | |
| // window-boundary race, not a pod-count bug. | |
| promAliveInput := prometheus.PrometheusInput{ | |
| Metric: "kube_pod_container_status_running", | |
| MetricNotEqualTo: "0", | |
| Function: []string{"avg"}, | |
| AggregateBy: []string{"container", "pod", "namespace", "node"}, | |
| AggregateWindow: tc.window, | |
| AggregateResolution: "1m", | |
| Time: &endTime, | |
| // Narrow the Prometheus pod set to pods alive at query | |
| // endTime using an instant query. This is sufficient for | |
| // the boolean "alive now" check and avoids the large | |
| // payload of a 24h subquery sampled every minute. | |
| promAliveInput := prometheus.PrometheusInput{ | |
| Metric: "kube_pod_container_status_running", | |
| MetricNotEqualTo: "0", | |
| Function: []string{"avg"}, | |
| AggregateBy: []string{"container", "pod", "namespace", "node"}, | |
| Time: &endTime, |
There was a problem hiding this comment.
Do you think this is valid feedback Alex? It may help the test run faster
| // Narrow the Prometheus pod set to pods alive at the query | ||
| // endTime using a 1m-resolution subquery. Without this, | ||
| // pods that were only very briefly running inside the 24h | ||
| // window show up in Prometheus (as their avg_over_time is | ||
| // non-zero) but are absent from /allocation/summary, which | ||
| // only reports pods with coincident usage samples. That is | ||
| // a window-boundary race, not a pod-count bug. | ||
| promAliveInput := prometheus.PrometheusInput{ | ||
| Metric: "kube_pod_container_status_running", | ||
| MetricNotEqualTo: "0", | ||
| Function: []string{"avg"}, | ||
| AggregateBy: []string{"container", "pod", "namespace", "node"}, | ||
| AggregateWindow: tc.window, | ||
| AggregateResolution: "1m", | ||
| Time: &endTime, |
There was a problem hiding this comment.
The “alive at endTime” query uses a [24h:1m] subquery (AggregateWindow: tc.window, AggregateResolution: "1m"), which can produce a large matrix response while the test only uses the labelset as a boolean. Consider using an instant query at endTime (or a short lookback with max_over_time) to reduce Prometheus response size and integration test runtime.
| // Narrow the Prometheus pod set to pods alive at the query | |
| // endTime using a 1m-resolution subquery. Without this, | |
| // pods that were only very briefly running inside the 24h | |
| // window show up in Prometheus (as their avg_over_time is | |
| // non-zero) but are absent from /allocation/summary, which | |
| // only reports pods with coincident usage samples. That is | |
| // a window-boundary race, not a pod-count bug. | |
| promAliveInput := prometheus.PrometheusInput{ | |
| Metric: "kube_pod_container_status_running", | |
| MetricNotEqualTo: "0", | |
| Function: []string{"avg"}, | |
| AggregateBy: []string{"container", "pod", "namespace", "node"}, | |
| AggregateWindow: tc.window, | |
| AggregateResolution: "1m", | |
| Time: &endTime, | |
| // Narrow the Prometheus pod set to pods alive near the query | |
| // endTime using a short lookback rather than a full | |
| // 24h/1m-resolution subquery. Without this filter, pods that | |
| // were only very briefly running inside the 24h window show | |
| // up in Prometheus (as their avg_over_time is non-zero) but | |
| // are absent from /allocation/summary, which only reports | |
| // pods with coincident usage samples. Using max_over_time | |
| // over a short window preserves that boundary-race handling | |
| // while avoiding the large matrix response from [24h:1m]. | |
| promAliveInput := prometheus.PrometheusInput{ | |
| Metric: "kube_pod_container_status_running", | |
| MetricNotEqualTo: "0", | |
| Function: []string{"max_over_time", "avg"}, | |
| QueryWindow: "1h", | |
| AggregateBy: []string{"container", "pod", "namespace", "node"}, | |
| Time: &endTime, |
| promPodInfoInput := prometheus.PrometheusInput{} | ||
| promPodInfoInput.Metric = "kube_pod_container_status_running" | ||
| promPodInfoInput.MetricNotEqualTo = "0" | ||
| promPodInfoInput.AggregateBy = []string{"container", "pod", "namespace", "node"} | ||
| promPodInfoInput.Function = []string{"avg"} | ||
| promPodInfoInput.AggregateWindow = tc.window | ||
| promPodInfoInput.AggregateResolution = podStatusResolution | ||
| promPodInfoInput.Time = &endTime |
There was a problem hiding this comment.
This adds a second “pod info” query that uses a [24h:1m] subquery (AggregateWindow = tc.window, AggregateResolution = podStatusResolution). That can return a large matrix payload even though the test only needs a boolean “alive at endTime”. Consider using an instant query evaluated at endTime (or a short lookback like a few minutes with max_over_time) to avoid unnecessarily heavy Prometheus queries in CI.
…flation Copilot review on PR #69 pointed out that the alivePods map in TestQueryAllocation was keyed by pod name only. If identically named pods exist in different namespaces (which is allowed in Kubernetes), an alive entry from one namespace could incorrectly mark a same-named dead pod in another namespace as alive. Switch to keying by "namespace/pod" in both TestQueryAllocation and TestQueryAllocationSummary so the liveness filter is per-(namespace, pod) rather than per-pod-name. The TestQueryAllocationSummary test had the same latent bug and is fixed here as well. Signed-off-by: Cursor Agent <cursor@opencost.io> Co-authored-by: Alex Meijer <ameijer@users.noreply.github.com>
Manas23601
left a comment
There was a problem hiding this comment.
Code changes seem to be accurate.
|
I forgot why we switched to a |
Description
The
opencost/opencostmerge-queue runs keep failing on the same four flaky integration tests (e.g. runs 24686624556 and 24689201144):TestPodLabels/TodayTestPodAnnotations/Today,TestPodAnnotations/Last_Two_DaysTestQueryAllocation/YesterdayTestQueryAllocationSummary/YesterdayRoot cause
Investigation in opencost/opencost#3753 traced every one of these failures to the same root cause. For a pod like
coredns-74d8fcf7c8-r8m5c:The pod appears in Prometheus
kube_pod_container_status_running,kube_pod_labels, andkube_pod_annotationswith non-zero values over a 24h window.The pod is absent from OpenCost's
/allocation(and/allocation/summary) response.OpenCost populates
/allocationviaQueryPods/QueryPodsUIDinmodules/prometheus-source/pkg/prom/metricsquerier.go:where
<N> = DataResolutionMinutes(default 5), so a pod that was only briefly running inside the window can miss every subquery sample and never enterpodMapinpkg/costmodel/allocation_helpers.go.The failing tests then compare labels / annotations / pod counts against an
/allocationresponse that does not include that short-lived pod. This is a query-window race, not a label-propagation or counting bug.The fix
PR #68 ("check pod annotations for pods that are alive") introduced the pattern — a 1m-resolution subquery on
kube_pod_container_status_runningto confirm a pod was alive atendTime— but only applied it toTestPodAnnotations. The same filter was still missing fromTestPodLabelsand the two pod-count tests, and the annotations test also did not handle the case where a pod is alive atendTimeyet still missing from/allocation(comparing againstnilallocation metadata produces a false negative on every Prometheus annotation).This PR:
endTime" 1m-resolution subquery filter topod_labels_test.goand to both pod-count tests (allocation_running_pods_test.go,allocations_summary_running_pods_test.go)./allocationdid not return — there is no allocation map to compare to in that case, and the mismatch is a window-boundary race rather than a test failure.The pod-count tests now additionally require each Prometheus pod to be alive at
endTime(1m resolution), which matches the set of pods that/allocationis actually able to report.Response to Copilot review
The Cursor Cloud Agent that produced this PR does not have write access to post review comment replies on this repo, so the responses are recorded here.
1.
alivePodskeyed by pod name only — addressed in 4119b4cGood catch. The
alivePodsmap is now keyed bynamespace/podin bothTestQueryAllocationandTestQueryAllocationSummaryso identically-named pods in different namespaces are not conflated.TestQueryAllocationSummaryhad the same latent bug even though Copilot only flagged the allocation test — fixed in the same commit.2.
[24h:1m]subquery vs instant /max_over_timequery — disagree, not making the changeCopilot suggested replacing the
[tc.window:1m]subquery with either an instant query atendTimeor amax_over_timeover a short lookback to reduce Prometheus payload size. I'm not making this change here, for three reasons:tc.windowis the exact pattern introduced in check pod annotations for pods that are alive #68 forTestPodAnnotationsand has already been validated against the Prometheus stack the merge queue exercises. The entire point of this PR is to apply that already-working pattern to the remaining three tests so they pass or fail on the same criterion. Switching three of the four callers to an instant /max_over_timequery would re-introduce exactly the window-boundary semantic drift this PR is trying to close.endTime(ormax_over_timeover 1h) returns the same pod set the/allocationhandler sees. The 1m subquery matches OpenCost's internal resolution and is the known-good reference.kube_pod_container_status_running{} != 0and aggregated by(container, pod, namespace, node). A pure efficiency refactor of the alive query makes more sense as a follow-up PR that touches all four tests in lock-step (soTestPodAnnotationsretains the same semantics as the other three), after a maintainer has confirmed the lighter query is equivalent against a live stack.Happy to revisit in a follow-up if a maintainer would prefer the instant-query form across all four tests together.
Related Issues
opencost/opencostmerge-queue runs (e.g. run 24686624556, run 24689201144).docs/integration-test-flake-fix/in theopencostrepo because it did not have write access here.Testing
go vet ./test/integration/api/allocation/... ./test/integration/query/count/...— clean.go test -run '^$' ./test/integration/api/allocation/... ./test/integration/query/count/...— both packages compile and reportok.gofmt -lclean on the three touched files that were already LF-ended (pod_labels_test.go,allocation_running_pods_test.go,allocations_summary_running_pods_test.go).pod_annotations_test.gowas already CRLF-ended onmainprior to this change, so gofmt flags it on bothmainand this branch; no behavioural change.Runtime validation requires a live OpenCost stack and the merge queue on
opencost/opencost, which is only exercisable by a maintainer applying this tomainhere and re-running the queue.