@chinmay3012
Contributor

Which problem is this PR solving?

Resolves #7617

Description of the changes

  • Update scripts/e2e/compare_metrics.py to support GLOBAL transient label patterns.
  • Add global suppression for otel_scope_version (normalizing it to the fixed string "version") to prevent spurious diffs when OpenTelemetry dependencies are upgraded.
  • Add global suppression for namespace and k8s_namespace_name (normalizing both to "namespace") to handle randomized namespaces in e2e tests.
  • Fix the logic in suppress_transient_labels so that it correctly applies these global patterns.
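
For readers who have not opened the script, the global suppression described above might look roughly like the sketch below. This is an illustration under assumptions, not the code in this PR: the layout of TRANSIENT_LABEL_PATTERNS (a dict keyed by metric name, with a special "GLOBAL" key, mapping label names to placeholder strings) and the function signature are hypothetical. Only the names suppress_transient_labels, TRANSIENT_LABEL_PATTERNS, and GLOBAL, and the placeholder values, come from this PR.

```python
# Hypothetical layout: the real TRANSIENT_LABEL_PATTERNS in
# scripts/e2e/compare_metrics.py may be structured differently.
TRANSIENT_LABEL_PATTERNS = {
    # Applied to every metric, regardless of its name.
    "GLOBAL": {
        "otel_scope_version": "version",
        "namespace": "namespace",
        "k8s_namespace_name": "namespace",
    },
}


def suppress_transient_labels(metric_name, labels):
    """Return a copy of labels with transient values replaced by placeholders."""
    labels_copy = labels.copy()

    # Apply global patterns first, so they cover all metrics.
    if "GLOBAL" in TRANSIENT_LABEL_PATTERNS:
        for label, placeholder in TRANSIENT_LABEL_PATTERNS["GLOBAL"].items():
            if label in labels_copy:
                labels_copy[label] = placeholder

    # Then apply any metric-specific patterns (assumed to be keyed by metric name).
    for label, placeholder in TRANSIENT_LABEL_PATTERNS.get(metric_name, {}).items():
        if label in labels_copy:
            labels_copy[label] = placeholder

    return labels_copy
```

Working on a copy keeps the caller's label dict untouched, and replacing only labels that are already present avoids adding keys that the original series did not have.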

How was this change tested?

Manually created a checking script with dummy metric files containing different otel_scope_version values (e.g., 0.63.0 vs 0.64.0) and confirmed that they are now reported as identical.
Verified that files with actual differences (e.g. different metric values or label keys) are still correctly flagged as different.
Verified that randomized namespace labels are correctly normalized and ignored in comparisons.
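
A check along the lines of the three points above could be sketched as follows. It is a self-contained illustration, not the checking script used for this PR: normalize_series and diff_series are hypothetical helpers, the series strings are made-up dummy data, and the regex deliberately ignores escaped quotes inside label values.

```python
import re

# Labels treated as transient, with the placeholder each one is normalized to.
_TRANSIENT = {
    "otel_scope_version": "version",
    "namespace": "namespace",
    "k8s_namespace_name": "namespace",
}


def normalize_series(line):
    """Replace the values of transient labels in one series line."""
    for label, placeholder in _TRANSIENT.items():
        # The lookbehind stops "namespace" from matching inside a longer label name.
        pattern = r'(?<![A-Za-z0-9_])' + re.escape(label) + r'="[^"]*"'
        line = re.sub(pattern, f'{label}="{placeholder}"', line)
    return line


def diff_series(old_lines, new_lines):
    """Return the series present in only one of the two inputs after normalization."""
    old = {normalize_series(line) for line in old_lines}
    new = {normalize_series(line) for line in new_lines}
    return old ^ new


old = ['m{otel_scope_version="0.63.0",server_port="13133"}']
new = ['m{otel_scope_version="0.64.0",server_port="13133"}']
changed = ['m{otel_scope_version="0.64.0",server_port="14269"}']

print(diff_series(old, new))           # version-only change: empty set
print(len(diff_series(old, changed)))  # real label change: both series reported
```

The second call still reports a difference because server_port is not in the transient set, which mirrors the requirement that genuine label changes remain visible.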

@chinmay3012 chinmay3012 requested a review from a team as a code owner December 9, 2025 18:35
@chinmay3012 chinmay3012 requested a review from jkowall December 9, 2025 18:35
@dosubot dosubot bot added the enhancement label Dec 9, 2025
@chinmay3012 chinmay3012 force-pushed the fix-flaky-metrics-7617 branch from 148ebf0 to ccd62cc Compare December 9, 2025 18:36
@codecov

codecov bot commented Dec 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.59%. Comparing base (ca5482e) to head (300ac34).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7720      +/-   ##
==========================================
- Coverage   95.60%   95.59%   -0.02%     
==========================================
  Files         311      311              
  Lines       15511    15511              
==========================================
- Hits        14829    14827       -2     
- Misses        534      535       +1     
- Partials      148      149       +1     
Flag Coverage Δ
badger_v1 9.90% <ø> (ø)
badger_v2 2.07% <ø> (ø)
cassandra-4.x-v1-manual 14.05% <ø> (ø)
cassandra-4.x-v2-auto 2.06% <ø> (ø)
cassandra-4.x-v2-manual 2.06% <ø> (ø)
cassandra-5.x-v1-manual 14.05% <ø> (ø)
cassandra-5.x-v2-auto 2.06% <ø> (ø)
cassandra-5.x-v2-manual 2.06% <ø> (ø)
clickhouse 1.98% <ø> (ø)
elasticsearch-6.x-v1 18.80% <ø> (ø)
elasticsearch-7.x-v1 18.83% <ø> (ø)
elasticsearch-8.x-v1 19.00% <ø> (ø)
elasticsearch-8.x-v2 2.07% <ø> (ø)
elasticsearch-9.x-v2 2.07% <ø> (ø)
grpc_v1 9.69% <ø> (ø)
grpc_v2 2.07% <ø> (ø)
kafka-3.x-v2 2.07% <ø> (ø)
memory_v2 2.07% <ø> (ø)
opensearch-1.x-v1 18.88% <ø> (ø)
opensearch-2.x-v1 18.88% <ø> (ø)
opensearch-2.x-v2 2.07% <ø> (ø)
opensearch-3.x-v2 2.07% <ø> (ø)
query 2.07% <ø> (ø)
tailsampling-processor 0.59% <ø> (ø)
unittests 94.15% <ø> (-0.02%) ⬇️


@github-actions

github-actions bot commented Dec 9, 2025

Metrics Comparison Summary

Total changes across all snapshots: 53

Detailed changes per snapshot

summary_metrics_snapshot_cassandra

📊 Metrics Diff Summary

Total Changes: 53

  • 🆕 Added: 0 metrics
  • ❌ Removed: 53 metrics
  • 🔄 Modified: 0 metrics

❌ Removed Metrics

  • http_server_request_body_size_bytes (18 variants)
View diff sample
-http_server_request_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="+Inf",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_request_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="0",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_request_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="10",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_request_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="100",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_request_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="1000",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_request_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="10000",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_request_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="25",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
...
  • http_server_request_duration_seconds (17 variants)
View diff sample
-http_server_request_duration_seconds{http_request_method="GET",http_response_status_code="503",le="+Inf",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_request_duration_seconds{http_request_method="GET",http_response_status_code="503",le="0.005",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_request_duration_seconds{http_request_method="GET",http_response_status_code="503",le="0.01",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_request_duration_seconds{http_request_method="GET",http_response_status_code="503",le="0.025",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_request_duration_seconds{http_request_method="GET",http_response_status_code="503",le="0.05",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_request_duration_seconds{http_request_method="GET",http_response_status_code="503",le="0.075",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_request_duration_seconds{http_request_method="GET",http_response_status_code="503",le="0.1",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
...
  • http_server_response_body_size_bytes (18 variants)
View diff sample
-http_server_response_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="+Inf",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_response_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="0",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_response_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="10",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_response_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="100",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_response_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="1000",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_response_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="10000",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
-http_server_response_body_size_bytes{http_request_method="GET",http_response_status_code="503",le="25",network_protocol_name="http",network_protocol_version="1.1",otel_scope_name="go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp",otel_scope_schema_url="",otel_scope_version="version",server_address="localhost",server_port="13133",url_scheme="http"}
...


return nil, err
}

if val, ok := tags[otelsemconv.OtelStatusCode]; ok {
Member

Is this a merge error? It seems unrelated to your PR. When merging main, either do it via the GitHub button or by running git rebase main from your branch (preferred).

labels_copy = labels.copy()

# Apply global patterns first
if 'GLOBAL' in TRANSIENT_LABEL_PATTERNS:
Member

So if the PR is meant to fix the metrics divergence, should we expect zero diffs reported in the PR? This one still has 106 differences.

Contributor Author

Yes, you should expect zero diffs now. I just pushed a fix that addresses the label value duplication bug I found during testing. The previous version was incorrectly normalizing labels which caused the 106 differences.

@chinmay3012 chinmay3012 force-pushed the fix-flaky-metrics-7617 branch from 64b510e to a2b20b2 Compare December 10, 2025 04:32
@chinmay3012 chinmay3012 force-pushed the fix-flaky-metrics-7617 branch from a2b20b2 to 3aadd9c Compare December 10, 2025 04:37

Development

Successfully merging this pull request may close these issues.

[ci]: Metrics comparison is flaky

2 participants