Skip to content

test: skip pods not alive at query endTime in pod label/count tests#69

Open
ameijer wants to merge 2 commits into
mainfrom
cursor/fix-pod-alive-filter-flaky-tests-076b
Open

test: skip pods not alive at query endTime in pod label/count tests#69
ameijer wants to merge 2 commits into
mainfrom
cursor/fix-pod-alive-filter-flaky-tests-076b

Conversation

@ameijer
Copy link
Copy Markdown
Member

@ameijer ameijer commented Apr 23, 2026

Description

The opencost/opencost merge-queue runs keep failing on the same four flaky integration tests (e.g. runs 24686624556 and 24689201144):

  • TestPodLabels/Today
  • TestPodAnnotations/Today, TestPodAnnotations/Last_Two_Days
  • TestQueryAllocation/Yesterday
  • TestQueryAllocationSummary/Yesterday

Root cause

Investigation in opencost/opencost#3753 traced every one of these failures to the same root cause. For a pod like coredns-74d8fcf7c8-r8m5c:

  • The pod appears in Prometheus kube_pod_container_status_running, kube_pod_labels, and kube_pod_annotations with non-zero values over a 24h window.

  • The pod is absent from OpenCost's /allocation (and /allocation/summary) response.

  • OpenCost populates /allocation via QueryPods / QueryPodsUID in modules/prometheus-source/pkg/prom/metricsquerier.go:

    avg(kube_pod_container_status_running{} != 0) by (pod, ns, uid, ...)[24h:<N>m]
    

    where <N> = DataResolutionMinutes (default 5), so a pod that was only briefly running inside the window can miss every subquery sample and never enter podMap in pkg/costmodel/allocation_helpers.go.

The failing tests then compare labels / annotations / pod counts against an /allocation response that does not include that short-lived pod. This is a query-window race, not a label-propagation or counting bug.

The fix

PR #68 ("check pod annotations for pods that are alive") introduced the pattern — a 1m-resolution subquery on kube_pod_container_status_running to confirm a pod was alive at endTime — but only applied it to TestPodAnnotations. The same filter was still missing from TestPodLabels and the two pod-count tests, and the annotations test also did not handle the case where a pod is alive at endTime yet still missing from /allocation (comparing against nil allocation metadata produces a false negative on every Prometheus annotation).

This PR:

  1. Applies the PR check pod annotations for pods that are alive #68 "alive at endTime" 1m-resolution subquery filter to pod_labels_test.go and to both pod-count tests (allocation_running_pods_test.go, allocations_summary_running_pods_test.go).
  2. In both the label and annotation tests, also skips pods that /allocation did not return — there is no allocation map to compare to in that case, and the mismatch is a window-boundary race rather than a test failure.

The pod-count tests now additionally require each Prometheus pod to be alive at endTime (1m resolution), which matches the set of pods that /allocation is actually able to report.

Response to Copilot review

The Cursor Cloud Agent that produced this PR does not have write access to post review comment replies on this repo, so the responses are recorded here.

1. alivePods keyed by pod name only — addressed in 4119b4c

discussion r3138082217 on allocation_running_pods_test.go

Good catch. The alivePods map is now keyed by namespace/pod in both TestQueryAllocation and TestQueryAllocationSummary so identically-named pods in different namespaces are not conflated. TestQueryAllocationSummary had the same latent bug even though Copilot only flagged the allocation test — fixed in the same commit.

2. [24h:1m] subquery vs instant / max_over_time query — disagree, not making the change

discussions r3138082249 (allocation_running_pods_test.go), r3138082272 (allocations_summary_running_pods_test.go), and r3138082299 (pod_labels_test.go)

Copilot suggested replacing the [tc.window:1m] subquery with either an instant query at endTime or a max_over_time over a short lookback to reduce Prometheus payload size. I'm not making this change here, for three reasons:

  • Consistency with check pod annotations for pods that are alive #68. The 1m-resolution subquery over tc.window is the exact pattern introduced in check pod annotations for pods that are alive #68 for TestPodAnnotations and has already been validated against the Prometheus stack the merge queue exercises. The entire point of this PR is to apply that already-working pattern to the remaining three tests so they pass or fail on the same criterion. Switching three of the four callers to an instant / max_over_time query would re-introduce exactly the window-boundary semantic drift this PR is trying to close.
  • Cannot be validated from here. The Cursor Cloud Agent has no access to a live OpenCost / Prometheus stack, so it cannot verify that an instant query at endTime (or max_over_time over 1h) returns the same pod set the /allocation handler sees. The 1m subquery matches OpenCost's internal resolution and is the known-good reference.
  • The efficiency concern is small in this context. These tests run once per test case against a demo/CI Prometheus; the query is already narrowed by kube_pod_container_status_running{} != 0 and aggregated by (container, pod, namespace, node). A pure efficiency refactor of the alive query makes more sense as a follow-up PR that touches all four tests in lock-step (so TestPodAnnotations retains the same semantics as the other three), after a maintainer has confirmed the lighter query is equivalent against a live stack.

Happy to revisit in a follow-up if a maintainer would prefer the instant-query form across all four tests together.

Related Issues

Testing

  • go vet ./test/integration/api/allocation/... ./test/integration/query/count/... — clean.
  • go test -run '^$' ./test/integration/api/allocation/... ./test/integration/query/count/... — both packages compile and report ok.
  • gofmt -l clean on the three touched files that were already LF-ended (pod_labels_test.go, allocation_running_pods_test.go, allocations_summary_running_pods_test.go). pod_annotations_test.go was already CRLF-ended on main prior to this change, so gofmt flags it on both main and this branch; no behavioural change.

Runtime validation requires a live OpenCost stack and the merge queue on opencost/opencost, which is only exercisable by a maintainer applying this to main here and re-running the queue.

Open in Web Open in Cursor 

Several integration tests continue to flake on the opencost test-stack
merge-queue runs (e.g. runs 24686624556 and 24689201144), with the same
four tests consistently failing:

  - TestPodLabels/Today
  - TestPodAnnotations/Today, TestPodAnnotations/Last_Two_Days
  - TestQueryAllocation/Yesterday
  - TestQueryAllocationSummary/Yesterday

Root cause (confirmed for coredns-74d8fcf7c8-r8m5c):

  * The pod appears in Prometheus kube_pod_container_status_running,
    kube_pod_labels and kube_pod_annotations with non-zero values over
    a 24h window.
  * The pod is absent from /allocation (and /allocation/summary).
  * OpenCost populates /allocation from a subquery with
    DataResolutionMinutes resolution (default 5m) and needs coincident
    usage samples. A pod briefly running inside the 24h window can
    appear in Prometheus avg_over_time and in a 1m-resolution subquery
    but not in OpenCost's aggregated allocation data. That mismatch is
    a query-window race, not a bug in label/annotation propagation or
    pod counting.

PR #68 added an "alive at endTime" filter (1m-resolution subquery on
kube_pod_container_status_running) to TestPodAnnotations, but the same
pattern was missing in TestPodLabels and the two pod-count tests, and
even the annotations test only filtered on the Prometheus side, so a
pod alive at endTime but still missing from /allocation produced false
failures.

Changes:

  * pod_labels_test.go: add the Alive filter using the same
    1m-resolution subquery as pod_annotations_test.go, and skip the
    comparison when the pod is not present in the /allocation
    response.
  * pod_annotations_test.go: in addition to the existing Alive filter,
    skip pods that are not present in the /allocation response.
  * allocation_running_pods_test.go,
    allocations_summary_running_pods_test.go: add the same
    1m-resolution alive-at-endTime filter on the Prometheus side, so
    pod counts are compared against the set /allocation is actually
    able to report.

Based on the investigation in opencost/opencost#3753.

Signed-off-by: Cursor Agent <cursor@opencost.io>

Co-authored-by: Alex Meijer <ameijer@users.noreply.github.com>
@ameijer ameijer marked this pull request as ready for review April 24, 2026 13:38
@ameijer ameijer requested a review from a team as a code owner April 24, 2026 13:38
Copilot AI review requested due to automatic review settings April 24, 2026 13:38
@ameijer
Copy link
Copy Markdown
Member Author

ameijer commented Apr 24, 2026

@Manas23601 could you take a look at this? I have take a quick review and think it does address some flake issues

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Addresses recurring flakiness in integration tests by aligning Prometheus-side pod sets with what /allocation and /allocation/summary can reliably report near window boundaries.

Changes:

  • Add an “alive at endTime” Prometheus query filter (1m resolution) to pod-count tests to avoid counting short-lived pods that allocations don’t report.
  • Update pod labels/annotations tests to skip pods that aren’t alive at endTime and to skip pods missing from the /allocation response (avoiding false negatives).
  • Extend allocation/summary pod counting to apply the same “alive at endTime” narrowing.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
test/integration/query/count/allocation_running_pods_test.go Adds an “alive at endTime” filter to reduce false mismatches in namespace pod counts.
test/integration/query/count/allocations_summary_running_pods_test.go Adds the same “alive at endTime” filter for /allocation/summary pod-name comparisons.
test/integration/api/allocation/pod_labels_test.go Adds end-time liveness filtering and skips pods absent from /allocation before comparing labels.
test/integration/api/allocation/pod_annotations_test.go Skips pods absent from /allocation before comparing annotations to avoid window-race false negatives.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +99 to +102
alivePods := make(map[string]bool)
for _, metric := range promAliveResponse.Data.Result {
alivePods[metric.Metric.Pod] = true
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alivePods is keyed only by pod name, but this test compares counts per namespace. If identical pod names exist in multiple namespaces, a pod can be incorrectly treated as alive due to a different-namespace entry, skewing results. Key this map by namespace/pod (or include UID) and use the same key when filtering promResponse.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we knock of UUID field from allocation after the release of the new data model?

Comment on lines +77 to +91
// Narrow the Prometheus pod set to pods alive at the query
// endTime using a 1m-resolution subquery. Without this,
// pods that were only very briefly running inside the 24h
// window show up in Prometheus (as their avg_over_time is
// non-zero) but are absent from /allocation, which only
// reports pods with coincident usage samples. That is a
// window-boundary race, not a pod-count bug.
promAliveInput := prometheus.PrometheusInput{
Metric: "kube_pod_container_status_running",
MetricNotEqualTo: "0",
Function: []string{"avg"},
AggregateBy: []string{"container", "pod", "namespace", "node"},
AggregateWindow: tc.window,
AggregateResolution: "1m",
Time: &endTime,
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The “alive at endTime” query uses a [24h:1m] subquery (AggregateWindow: tc.window, AggregateResolution: "1m"), which can return very large matrix payloads (many series × 1440 samples) even though the test only needs a boolean at endTime. Consider switching to an instant query at endTime (or a short lookback like a few minutes with max_over_time) to reduce Prometheus load and test runtime/flakiness risk.

Suggested change
// Narrow the Prometheus pod set to pods alive at the query
// endTime using a 1m-resolution subquery. Without this,
// pods that were only very briefly running inside the 24h
// window show up in Prometheus (as their avg_over_time is
// non-zero) but are absent from /allocation, which only
// reports pods with coincident usage samples. That is a
// window-boundary race, not a pod-count bug.
promAliveInput := prometheus.PrometheusInput{
Metric: "kube_pod_container_status_running",
MetricNotEqualTo: "0",
Function: []string{"avg"},
AggregateBy: []string{"container", "pod", "namespace", "node"},
AggregateWindow: tc.window,
AggregateResolution: "1m",
Time: &endTime,
// Narrow the Prometheus pod set to pods alive at query
// endTime using an instant query. This is sufficient for
// the boolean "alive now" check and avoids the large
// payload of a 24h subquery sampled every minute.
promAliveInput := prometheus.PrometheusInput{
Metric: "kube_pod_container_status_running",
MetricNotEqualTo: "0",
Function: []string{"avg"},
AggregateBy: []string{"container", "pod", "namespace", "node"},
Time: &endTime,

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this is valid feedback Alex? It may help the test run faster

Comment on lines +77 to +91
// Narrow the Prometheus pod set to pods alive at the query
// endTime using a 1m-resolution subquery. Without this,
// pods that were only very briefly running inside the 24h
// window show up in Prometheus (as their avg_over_time is
// non-zero) but are absent from /allocation/summary, which
// only reports pods with coincident usage samples. That is
// a window-boundary race, not a pod-count bug.
promAliveInput := prometheus.PrometheusInput{
Metric: "kube_pod_container_status_running",
MetricNotEqualTo: "0",
Function: []string{"avg"},
AggregateBy: []string{"container", "pod", "namespace", "node"},
AggregateWindow: tc.window,
AggregateResolution: "1m",
Time: &endTime,
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The “alive at endTime” query uses a [24h:1m] subquery (AggregateWindow: tc.window, AggregateResolution: "1m"), which can produce a large matrix response while the test only uses the labelset as a boolean. Consider using an instant query at endTime (or a short lookback with max_over_time) to reduce Prometheus response size and integration test runtime.

Suggested change
// Narrow the Prometheus pod set to pods alive at the query
// endTime using a 1m-resolution subquery. Without this,
// pods that were only very briefly running inside the 24h
// window show up in Prometheus (as their avg_over_time is
// non-zero) but are absent from /allocation/summary, which
// only reports pods with coincident usage samples. That is
// a window-boundary race, not a pod-count bug.
promAliveInput := prometheus.PrometheusInput{
Metric: "kube_pod_container_status_running",
MetricNotEqualTo: "0",
Function: []string{"avg"},
AggregateBy: []string{"container", "pod", "namespace", "node"},
AggregateWindow: tc.window,
AggregateResolution: "1m",
Time: &endTime,
// Narrow the Prometheus pod set to pods alive near the query
// endTime using a short lookback rather than a full
// 24h/1m-resolution subquery. Without this filter, pods that
// were only very briefly running inside the 24h window show
// up in Prometheus (as their avg_over_time is non-zero) but
// are absent from /allocation/summary, which only reports
// pods with coincident usage samples. Using max_over_time
// over a short window preserves that boundary-race handling
// while avoiding the large matrix response from [24h:1m].
promAliveInput := prometheus.PrometheusInput{
Metric: "kube_pod_container_status_running",
MetricNotEqualTo: "0",
Function: []string{"max_over_time", "avg"},
QueryWindow: "1h",
AggregateBy: []string{"container", "pod", "namespace", "node"},
Time: &endTime,

Copilot uses AI. Check for mistakes.
Comment on lines +76 to +83
promPodInfoInput := prometheus.PrometheusInput{}
promPodInfoInput.Metric = "kube_pod_container_status_running"
promPodInfoInput.MetricNotEqualTo = "0"
promPodInfoInput.AggregateBy = []string{"container", "pod", "namespace", "node"}
promPodInfoInput.Function = []string{"avg"}
promPodInfoInput.AggregateWindow = tc.window
promPodInfoInput.AggregateResolution = podStatusResolution
promPodInfoInput.Time = &endTime
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds a second “pod info” query that uses a [24h:1m] subquery (AggregateWindow = tc.window, AggregateResolution = podStatusResolution). That can return a large matrix payload even though the test only needs a boolean “alive at endTime”. Consider using an instant query evaluated at endTime (or a short lookback like a few minutes with max_over_time) to avoid unnecessarily heavy Prometheus queries in CI.

Copilot uses AI. Check for mistakes.
…flation

Copilot review on PR #69 pointed out that the alivePods map in
TestQueryAllocation was keyed by pod name only. If identically named
pods exist in different namespaces (which is allowed in Kubernetes),
an alive entry from one namespace could incorrectly mark a same-named
dead pod in another namespace as alive.

Switch to keying by "namespace/pod" in both TestQueryAllocation and
TestQueryAllocationSummary so the liveness filter is per-(namespace, pod)
rather than per-pod-name. The TestQueryAllocationSummary test had the
same latent bug and is fixed here as well.

Signed-off-by: Cursor Agent <cursor@opencost.io>

Co-authored-by: Alex Meijer <ameijer@users.noreply.github.com>
Copy link
Copy Markdown
Collaborator

@Manas23601 Manas23601 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes seem to be accurate.

@Manas23601
Copy link
Copy Markdown
Collaborator

I forgot why we switched to a 1m resolution, shouldn't we try to match opencost /allocation API? I also don't get the idea behind the pod being alive at the endtime? Is the idea here to check for [startTime:endTime] instead of [startTime:endTime) (promethues default behavior)? I think we handled that by doing [startTime:endTime+interval)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants