docs: Document all OPA metrics definitions #7929
base: main
✅ Deploy Preview for openpolicyagent ready!
charlieegan3
left a comment
Hey, I have left a few more comments here for you to have a think about. I noticed some metrics like counter_rego_builtin_regex_interquery_value_cache_hits seem to be missing.
While I appreciate the effort, it might be best to work on a smaller change set here. Perhaps we could look to focus on regex and http.send metrics since they are likely some of the more common ones used? Wdyt?
- Fix misleading 'aggregated' terminology - use 'instance-level' instead
- Remove per-query metrics section from monitoring.md, add cross-references
- Focus metrics documentation on commonly used regex and http.send built-ins
- Add missing counter_rego_builtin_regex_interquery_value_cache_hits metric
- Move admonition to after example in REST API documentation
- Simplify and reduce scope of metrics documentation per reviewer guidance

Signed-off-by: Anivar A Aravind <[email protected]>

Per reviewer feedback, removing blank line changes that were unintentionally included from merging PR open-policy-agent#7929.

Signed-off-by: Anivar A Aravind <[email protected]>
- Move metrics overview into Prometheus section for better flow
- Add explicit /metrics path mention in Prometheus intro
- Add links to Status API and Decision Logs documentation
- Fix CLI tools to include proper documentation links
- Clarify which metrics are enabled with instrument=true parameter
- Remove inaccurate 'subset' terminology for Status API

Addresses review comments from charlieegan3 on September 25, 2025

Signed-off-by: Anivar A Aravind <[email protected]>
charlieegan3
left a comment
Hi @anivar, I've left two comments about the built-in specific metrics that I think would be good to document, and where I think they are best documented. I am not sure we're really going in the right direction with the rest of the PR, so if it's ok with you, I think we can keep these notes on the suggested pages and get this in. We don't need to update monitoring and policy-performance this time around; for now, let's just focus on the metrics for the specific built-ins you've documented in here.
@charlieegan3 Updated per your feedback:
Ready for review.
This pull request has been automatically marked as stale because it has not had any activity in the last 30 days.

@charlieegan3 Review please
Generate and document all OPA metrics in a central registry. Add operational metrics sections to monitoring docs. Fixes: open-policy-agent#6730 Signed-off-by: Anivar A Aravind <[email protected]>
Based on PR feedback, this commit:
- Clearly distinguishes between /metrics endpoint (system-wide) and ?metrics=true (per-query)
- Removes duplicate metrics listings to avoid maintenance burden
- Adds cross-references between monitoring and REST API docs
- Simplifies the documentation structure without automation

Addresses review feedback from @charlieegan3

Signed-off-by: Anivar A Aravind <[email protected]>

As requested by @charlieegan3:
- Remove auto-generation tooling (cmd/metrics-docs/main.go)
- Remove auto-generated metrics registry file
- Remove Makefile target for metrics generation

The reviewer indicated metrics don't change frequently enough to warrant automation, and prefers avoiding duplicate lists.

Signed-off-by: Anivar A Aravind <[email protected]>
Following final review guidance to document only builtin-specific metrics in their respective reference pages.

Changes:
- http.mdx: Document http.send timer and cache metrics
- regex.mdx: Document regex cache hit metric
- glob.mdx: Document glob cache hit metric
- rest-api.md: Clarify available instrumentation metrics

Removed broader metrics documentation from monitoring.md and policy-performance.md per reviewer request to keep changes focused.

Fixes open-policy-agent#6730

Signed-off-by: Anivar A Aravind <[email protected]>
OPA provides two ways to access performance metrics:

1. **System-wide metrics** via the `/metrics` Prometheus endpoint - Instance-level metrics across all OPA operations
2. **Per-query metrics** via API responses with `?metrics=true` - Metrics for individual query executions
I would put this in an admonition instead, since it's related but not 100% on topic for the Prometheus section. This section is just about `/metrics`, but `?metrics=true` is important further reading.
The Prometheus `/metrics` endpoint exposes the following instance-level metrics:

- **URL**: `http://localhost:8181/metrics` (default configuration)
- **Method**: HTTP GET
- **Method**: HTTP GET
This is not needed as it's the default.

- **URL**: `http://localhost:8181/metrics` (default configuration)
- **Method**: HTTP GET
- **Format**: Prometheus text format
- **Format**: Prometheus text format
This is not needed as it's expected to be in that format.

- **URL**: `http://localhost:8181/metrics` (default configuration)
- **Method**: HTTP GET
- **Format**: Prometheus text format
- **Contents**: Instance-level counters, timers, histograms, Go runtime metrics
Original: - **Contents**: Instance-level counters, timers, histograms, Go runtime metrics
Suggested: - **Data**: HTTP request metrics (counters, latencies, status codes), Go runtime internals (memory allocation, garbage collection, goroutines etc.)

## Available Metrics

The Prometheus `/metrics` endpoint exposes the following instance-level metrics:
This no longer really introduces the content in this section.
- **Method**: HTTP GET
- **Format**: Prometheus text format
- **Contents**: Instance-level counters, timers, histograms, Go runtime metrics
- **Use case**: Monitoring dashboards, alerting, performance trends
- **Use case**: Monitoring dashboards, alerting, performance trends
Not really needed, as I think users understand how to use the metrics if they're looking for which ones are available.

- **Format**: Prometheus text format
- **Contents**: Instance-level counters, timers, histograms, Go runtime metrics
- **Use case**: Monitoring dashboards, alerting, performance trends
Please list
- http_request_duration_seconds - Request latency histogram by endpoint, method, and status code
- then link to https://pkg.go.dev/runtime/metrics#hdr-Supported_metrics for the go ones and leave them to document that
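As a rough illustration of how the `http_request_duration_seconds` histogram could be consumed, the Python sketch below parses a few lines of Prometheus text format and derives mean latency per handler, method, and status code. The sample text, its values, and the label names `handler`, `method`, and `code` are assumptions made for illustration; check the output of your own `/metrics` endpoint for the actual labels.

```python
import re
from collections import defaultdict

# Illustrative sample only: the label names and all values are assumptions,
# not output captured from a real OPA instance.
SAMPLE = '''\
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_sum{code="200",handler="v1/data",method="post"} 0.5
http_request_duration_seconds_count{code="200",handler="v1/data",method="post"} 10
http_request_duration_seconds_sum{code="200",handler="health",method="get"} 0.125
http_request_duration_seconds_count{code="200",handler="health",method="get"} 5
'''

# Matches `name{labels} value`; comment lines starting with '#' do not match.
LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def mean_latency_by_handler(text):
    """Return {(handler, method, code): mean seconds} from _sum and _count samples."""
    sums = defaultdict(float)
    counts = defaultdict(float)
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        labels = dict(LABEL.findall(m.group("labels")))
        key = (labels.get("handler"), labels.get("method"), labels.get("code"))
        if m.group("name") == "http_request_duration_seconds_sum":
            sums[key] += float(m.group("value"))
        elif m.group("name") == "http_request_duration_seconds_count":
            counts[key] += float(m.group("value"))
    # Skip series with no observations to avoid dividing by zero.
    return {k: sums[k] / counts[k] for k in counts if counts[k] > 0}
```

Dividing `_sum` by `_count` gives the mean over the lifetime of the process; a monitoring system would normally compute rates over a time window instead.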

Users are recommended to do performance testing to determine the optimal configuration for their use case.

## Performance Metrics
We still seem to have the per-built-in metrics in this doc as well as in the built-in docs themselves. I think they're better in the built-in docs only.
for the compilation stages. They follow the format of `timer_compile_stage_*_ns`
and `timer_query_compile_stage_*_ns` for the query and module compilation stages.
When query instrumentation is enabled (`instrument=true`), the following additional detailed evaluation metrics are included:
- **timer_eval_op_***: Various evaluation operation timers (e.g., `timer_eval_op_plug_ns`, `timer_eval_op_resolve_ns`)
Would be good to explain what these are rather than just 'various operation timers'
and `timer_query_compile_stage_*_ns` for the query and module compilation stages.
When query instrumentation is enabled (`instrument=true`), the following additional detailed evaluation metrics are included:
- **timer_eval_op_***: Various evaluation operation timers (e.g., `timer_eval_op_plug_ns`, `timer_eval_op_resolve_ns`)
- **histogram_eval_op_***: Histograms tracking evaluation operation time distributions
Original: - **histogram_eval_op_***: Histograms tracking evaluation operation time distributions
Suggested: - **histogram_eval_op_***: Histograms tracking time distributions of individual eval operations.
`histogram_eval_op_builtin_call` is one; it'd be good to give examples of them all.
When instrumentation is enabled there are several additional performance metrics
for the compilation stages. They follow the format of `timer_compile_stage_*_ns`
and `timer_query_compile_stage_*_ns` for the query and module compilation stages.
When query instrumentation is enabled (`instrument=true`), the following additional detailed evaluation metrics are included:
here are some examples of how to learn what the different metrics are:
query
…/opa main ➜ curl --silent 'localhost:8181/v1/query?metrics' -X POST -H "Content-Type: application/json" -d @body.json | jq
{
"metrics": {
"timer_rego_query_compile_ns": 47708,
"timer_rego_query_eval_ns": 6750,
"timer_server_handler_ns": 271875
},
"result": [
{}
]
}
…/opa main ➜ curl --silent 'localhost:8181/v1/query?metrics&instrument=true' -X POST -H "Content-Type: application/json" -d @body.json | jq
{
"metrics": {
"histogram_eval_op_builtin_call": {
"75%": 1791,
"90%": 1791,
"95%": 1791,
"99%": 1791,
"99.9%": 1791,
"99.99%": 1791,
"count": 2,
"max": 1791,
"mean": 937.5,
"median": 937.5,
"min": 84,
"stddev": 853.5
},
"histogram_eval_op_plug": {
"75%": 739.25,
"90%": 833,
"95%": 833,
"99%": 833,
"99.9%": 833,
"99.99%": 833,
"count": 4,
"max": 833,
"mean": 406,
"median": 333,
"min": 125,
"stddev": 275.3170899163363
},
"timer_eval_op_builtin_call_ns": 1875,
"timer_eval_op_plug_ns": 1624,
"timer_query_compile_stage_build_comprehension_index_ns": 1458,
"timer_query_compile_stage_check_deprecated_builtins_ns": 42,
"timer_query_compile_stage_check_keyword_overrides_ns": 833,
"timer_query_compile_stage_check_safety_ns": 9708,
"timer_query_compile_stage_check_types_ns": 7000,
"timer_query_compile_stage_check_undefined_funcs_ns": 1333,
"timer_query_compile_stage_check_unsafe_builtins_ns": 708,
"timer_query_compile_stage_check_void_calls_ns": 791,
"timer_query_compile_stage_resolve_refs_ns": 5250,
"timer_query_compile_stage_rewrite_comprehension_terms_ns": 2792,
"timer_query_compile_stage_rewrite_dynamic_terms_ns": 1500,
"timer_query_compile_stage_rewrite_expr_terms_ns": 1958,
"timer_query_compile_stage_rewrite_local_vars_ns": 7542,
"timer_query_compile_stage_rewrite_print_calls_ns": 1542,
"timer_query_compile_stage_rewrite_to_capture_value_ns": 7334,
"timer_query_compile_stage_rewrite_with_values_ns": 833,
"timer_rego_query_compile_ns": 64542,
"timer_rego_query_eval_ns": 17167,
"timer_server_handler_ns": 222000
},
"result": [
{}
]
}
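One way to read the instrumented output above: in this particular response, each `timer_eval_op_*_ns` value equals `count × mean` of the matching `histogram_eval_op_*` entry. The Python sketch below checks that relationship using only numbers copied from the response. That the timer is the sum of the histogram's observations is an inference from this one sample, not something the docs state, and the helper name is hypothetical.

```python
# Values copied from the instrumented /v1/query response shown above.
metrics = {
    "histogram_eval_op_builtin_call": {"count": 2, "mean": 937.5},
    "histogram_eval_op_plug": {"count": 4, "mean": 406},
    "timer_eval_op_builtin_call_ns": 1875,
    "timer_eval_op_plug_ns": 1624,
}


def compare_timers_to_histograms(metrics):
    """Return {op: (count * mean, timer_ns)} for every eval-op histogram."""
    prefix = "histogram_eval_op_"
    rows = {}
    for name, value in metrics.items():
        if name.startswith(prefix) and isinstance(value, dict):
            op = name[len(prefix):]
            total = value["count"] * value["mean"]
            # The matching timer may be absent; .get() then yields None.
            rows[op] = (total, metrics.get(f"timer_eval_op_{op}_ns"))
    return rows


for op, (total, timer) in compare_timers_to_histograms(metrics).items():
    print(f"{op}: histogram total={total} ns, timer={timer} ns")
```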
v1/data
…/opa main ➜ curl --silent 'localhost:8181/v1/data?metrics&instrument=true' -H "Content-Type: application/json" | jq
{
"metrics": {
"counter_eval_op_base_cache_miss": 1,
"counter_server_query_cache_hit": 0,
"histogram_eval_op_plug": {
"75%": 625,
"90%": 625,
"95%": 625,
"99%": 625,
"99.9%": 625,
"99.99%": 625,
"count": 1,
"max": 625,
"mean": 625,
"median": 625,
"min": 625,
"stddev": 0
},
"histogram_eval_op_resolve": {
"75%": 3291,
"90%": 3291,
"95%": 3291,
"99%": 3291,
"99.9%": 3291,
"99.99%": 3291,
"count": 1,
"max": 3291,
"mean": 3291,
"median": 3291,
"min": 3291,
"stddev": 0
},
"timer_eval_op_plug_ns": 625,
"timer_eval_op_resolve_ns": 3291,
"timer_query_compile_stage_build_comprehension_index_ns": 1333,
"timer_query_compile_stage_check_deprecated_builtins_ns": 42,
"timer_query_compile_stage_check_keyword_overrides_ns": 542,
"timer_query_compile_stage_check_safety_ns": 7958,
"timer_query_compile_stage_check_types_ns": 6625,
"timer_query_compile_stage_check_undefined_funcs_ns": 1292,
"timer_query_compile_stage_check_unsafe_builtins_ns": 666,
"timer_query_compile_stage_check_void_calls_ns": 708,
"timer_query_compile_stage_resolve_refs_ns": 2625,
"timer_query_compile_stage_rewrite_comprehension_terms_ns": 2458,
"timer_query_compile_stage_rewrite_dynamic_terms_ns": 1291,
"timer_query_compile_stage_rewrite_expr_terms_ns": 1916,
"timer_query_compile_stage_rewrite_local_vars_ns": 7500,
"timer_query_compile_stage_rewrite_print_calls_ns": 917,
"timer_query_compile_stage_rewrite_to_capture_value_ns": 4042,
"timer_query_compile_stage_rewrite_with_values_ns": 625,
"timer_rego_external_resolve_ns": 83,
"timer_rego_input_parse_ns": 708,
"timer_rego_query_compile_ns": 55291,
"timer_rego_query_eval_ns": 18750,
"timer_server_handler_ns": 183459
},
"result": {}
}
…/opa main ➜ curl --silent 'localhost:8181/v1/data?metrics' -H "Content-Type: application/json" | jq
{
"metrics": {
"counter_server_query_cache_hit": 1,
"timer_rego_external_resolve_ns": 125,
"timer_rego_input_parse_ns": 500,
"timer_rego_query_eval_ns": 34583,
"timer_server_handler_ns": 54375
},
"result": {}
}
This is not an exhaustive list. If you can, it'd be best to run some example queries for each endpoint so you can learn what the different metrics are. Also note that the metrics returned will depend on the data you post, the built-in functions used, etc.
If you want to document this section, some significant research will be needed in order to gather what is available and what the metrics mean.
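To see where compile time goes in an instrumented response, the per-stage timers can be summed and ranked. The Python sketch below does this with the `timer_query_compile_stage_*_ns` values from the `/v1/query` example above; the helper name `compile_stage_breakdown` is hypothetical.

```python
# Stage timers copied from the instrumented /v1/query response above.
query_metrics = {
    "timer_query_compile_stage_build_comprehension_index_ns": 1458,
    "timer_query_compile_stage_check_deprecated_builtins_ns": 42,
    "timer_query_compile_stage_check_keyword_overrides_ns": 833,
    "timer_query_compile_stage_check_safety_ns": 9708,
    "timer_query_compile_stage_check_types_ns": 7000,
    "timer_query_compile_stage_check_undefined_funcs_ns": 1333,
    "timer_query_compile_stage_check_unsafe_builtins_ns": 708,
    "timer_query_compile_stage_check_void_calls_ns": 791,
    "timer_query_compile_stage_resolve_refs_ns": 5250,
    "timer_query_compile_stage_rewrite_comprehension_terms_ns": 2792,
    "timer_query_compile_stage_rewrite_dynamic_terms_ns": 1500,
    "timer_query_compile_stage_rewrite_expr_terms_ns": 1958,
    "timer_query_compile_stage_rewrite_local_vars_ns": 7542,
    "timer_query_compile_stage_rewrite_print_calls_ns": 1542,
    "timer_query_compile_stage_rewrite_to_capture_value_ns": 7334,
    "timer_query_compile_stage_rewrite_with_values_ns": 833,
    "timer_rego_query_compile_ns": 64542,
}


def compile_stage_breakdown(metrics, top=3):
    """Return (sum of stage timers in ns, the `top` slowest stages as (name, ns))."""
    prefix, suffix = "timer_query_compile_stage_", "_ns"
    stages = {
        name[len(prefix):-len(suffix)]: ns
        for name, ns in metrics.items()
        if name.startswith(prefix) and name.endswith(suffix)
    }
    ranked = sorted(stages.items(), key=lambda kv: kv[1], reverse=True)
    return sum(stages.values()), ranked[:top]


total_ns, slowest = compile_stage_breakdown(query_metrics)
print(total_ns, slowest)
```

For this sample the stages sum to 50624 ns of the 64542 ns reported by `timer_rego_query_compile_ns`, with `check_safety`, `rewrite_local_vars`, and `rewrite_to_capture_value` as the three largest.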
| ------ | ----------- |
| `counter_rego_builtin_regex_interquery_value_cache_hits` | Number of regex cache hits for compiled patterns |

Effective regex caching improves performance when the same patterns are used repeatedly. High cache hit ratios indicate that regex compilation overhead is being minimized through caching.
Original: Effective regex caching improves performance when the same patterns are used repeatedly. High cache hit ratios indicate that regex compilation overhead is being minimized through caching.
Suggested: Caching of parsed regular expressions improves performance when the same patterns are used repeatedly. High cache hit ratios indicate that regex compilation overhead is being minimized through caching.
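To make the idea behind a cache-hit counter concrete, here is a toy Python cache of compiled patterns that counts hits and misses. It is a sketch of the general technique only, not OPA's implementation; the class name and the miss counter are hypothetical, and the diff above documents only a hits counter.

```python
import re


class CountingRegexCache:
    """Toy cache of compiled patterns that counts hits and misses (illustrative only)."""

    def __init__(self):
        self._compiled = {}
        self.hits = 0
        self.misses = 0

    def match(self, pattern, value):
        compiled = self._compiled.get(pattern)
        if compiled is None:
            # First use of this pattern: pay the compilation cost once.
            self.misses += 1
            compiled = re.compile(pattern)
            self._compiled[pattern] = compiled
        else:
            # Pattern already compiled: this is what a hit counter records.
            self.hits += 1
        return compiled.search(value) is not None

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = CountingRegexCache()
for user in ["alice@example.com", "bob@example.com", "carol"]:
    cache.match(r"@example\.com$", user)
print(cache.hits, cache.misses, cache.hit_ratio())
```

Reusing one pattern across many inputs drives the ratio up; generating a distinct pattern per input keeps it near zero.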
| ------ | ----------- |
| `counter_rego_builtin_glob_interquery_value_cache_hits` | Number of inter-query cache hits for compiled glob patterns |

Effective glob pattern caching improves performance when the same patterns are used repeatedly across queries. High cache hit ratios indicate that glob compilation overhead is being minimized through caching.
Original: Effective glob pattern caching improves performance when the same patterns are used repeatedly across queries. High cache hit ratios indicate that glob compilation overhead is being minimized through caching.
Suggested: Caching of parsed glob patterns improves performance when the same patterns are used repeatedly across queries. High cache hit ratios indicate that glob compilation overhead is being minimized through caching.
Fixes #6730
Users were struggling to understand OPA metrics, especially builtin function metrics like `http.send()`. The existing docs didn't explain what these metrics capture or how they behave with caching.

This PR creates a comprehensive metrics registry that documents all currently discovered OPA metrics. Each metric now has a clear description and units.

Building on PR #7851 (which added the http.send network request counter), this completes the documentation for all http.send metrics:

- `timer_rego_builtin_http_send_ns` captures total time across ALL http.send() calls
- `counter_rego_builtin_http_send_interquery_cache_hits` tracks cache hits
- `counter_rego_builtin_http_send_network_requests` counts actual network requests (added in feat: Add counter metric for http.send network requests #7851)

Beyond fixing the immediate documentation gap, I've added a generator tool in `cmd/metrics-docs/` to keep the registry maintainable. Run `make generate-metrics-docs` to regenerate when new metrics are added to OPA. The generator works from a manually curated list to ensure accurate descriptions.

Also enhanced the existing monitoring and policy-performance docs with operational metrics sections and fixed a broken link in the REST API docs.

Files changed:

- `cmd/metrics-docs/main.go` and `README.md` - Generator tool
- `docs/docs/metrics-registry.md` - The complete registry (generated)
- `docs/docs/monitoring.md` - Added operational metrics sections
- `docs/docs/policy-performance.md` - Enhanced performance metrics
- `docs/docs/rest-api.md` - Fixed broken reference

This should help users interpret metrics without diving into source code.
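
As a hedged illustration of how the three http.send metrics named in the description might be read together, the Python sketch below derives an approximate inter-query cache hit ratio and the total time spent in `http.send`. The sample values are made up, the function name is hypothetical, and treating "cache hits plus network requests" as the total number of lookups is an assumption; other outcomes, such as intra-query cache reuse, may not be reflected in either counter.

```python
def http_send_summary(metrics):
    """Summarize http.send metrics from a per-query `metrics` object."""
    hits = metrics.get("counter_rego_builtin_http_send_interquery_cache_hits", 0)
    requests = metrics.get("counter_rego_builtin_http_send_network_requests", 0)
    total_ns = metrics.get("timer_rego_builtin_http_send_ns", 0)
    # Assumption: each lookup is either an inter-query cache hit or a network request.
    lookups = hits + requests
    return {
        "cache_hit_ratio": hits / lookups if lookups else None,
        "total_ms": total_ns / 1_000_000,  # the timer is reported in nanoseconds
    }


# Hypothetical values, not taken from a real OPA response.
sample = {
    "counter_rego_builtin_http_send_interquery_cache_hits": 3,
    "counter_rego_builtin_http_send_network_requests": 1,
    "timer_rego_builtin_http_send_ns": 12_000_000,
}
print(http_send_summary(sample))
```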