CP-311148/CP-311150: add a quicktest for XAPI memory overhead calculations by edwintorok · Pull Request #6858 · xapi-project/xen-api

edwintorok · 2026-01-22T18:57:26Z

Test that we can fill a host with 1 VM, with N VMs, based on maximise_memory/compute_memory_overhead.

Check that the constant factors used in XAPI are correct, e.g. amount of memory used/vcpu.
Can be used to validate these PRs:
#6855
#6854

There is also a pagetable overhead calculation, but something weird is going on there:

[2026-01-22T18:40:49.342348481-00:00|0000000000000000]  pagetables,memory_overhead_pages,coeff,vms
[2026-01-22T18:40:49.342333285-00:00|0000000000000000]  64,793,12.3906,9223372036854775807
[2026-01-22T18:40:49.342335974-00:00|0000000000000000]  192,1305,6.79688,9223372036854775807
[2026-01-22T18:40:49.342337658-00:00|0000000000000000]  448,2329,5.19866,9223372036854775807
[2026-01-22T18:40:49.342339751-00:00|0000000000000000]  962,4377,4.5499,9223372036854775807
[2026-01-22T18:40:49.342341392-00:00|0000000000000000]  263102,1048827,3.98639,9223372036854775807
[2026-01-22T18:40:49.342343128-00:00|0000000000000000]  526273,2097403,3.98539,9223372036854775807
[2026-01-22T18:40:49.342345071-00:00|0000000000000000]  708913,2825211,3.98527,9223372036854775807

That should be ~4; I don't know why it would be 13. It used to be reliably 4 previously, so this could be a bug in the test.
That will need further investigation (there is also enough free memory on the host that this underestimate doesn't actually cause a failure, which is also unexpected).
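To make the table above easier to reason about, here is a minimal sketch that recomputes the `coeff` column and also prints how far each row is from an assumed `4 * pagetables`. The sample pairs are copied from the log; the `residual` helper and the factor of 4 as the "expected" value are illustrative assumptions, not code from the test.

```ocaml
(* (pagetables, memory_overhead_pages) pairs copied from the log above *)
let samples =
  [ (64, 793); (192, 1305); (448, 2329); (962, 4377)
  ; (263102, 1048827); (526273, 2097403); (708913, 2825211) ]

(* Ratio reported as "coeff" in the log *)
let coeff (pagetables, overhead) =
  float_of_int overhead /. float_of_int pagetables

(* Assumption for illustration: the expected overhead is 4 * pagetables.
   The residual is how many pages the measurement differs from that. *)
let residual (pagetables, overhead) = overhead - (4 * pagetables)

let () =
  print_endline "pagetables,memory_overhead_pages,coeff,residual" ;
  List.iter
    (fun ((pagetables, overhead) as sample) ->
      Printf.printf "%d,%d,%.5f,%d\n" pagetables overhead (coeff sample)
        (residual sample) )
    samples
```

The residuals come out as 537, 537, 537, 529, -3581, -7689 and -10441 pages. The first three rows differ from 4x by the same 537 pages, which may hint at a fixed offset dominating the ratio at small sizes, though nothing in this thread confirms that.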

edwintorok · 2026-01-22T21:59:51Z

Something is broken in XAPI now though (not sure whether it is a race condition, or a new bug inherited from another branch or master):

2026-01-22T21:52:22.281822457-00:00|4d6cefa9be6757ca] Dune__exe__Quicktest_vm_calibrate.host_mem_leak ERROR Server_error(INTERNAL_ERROR, [ VM not in expected power state after completing operation: OpaqueRef:f9a1c2b5-e33a-8670-4071-143bf46012dc, paused, halted ]) traceparent: 00-b5a629b3ed0330b97ff74ee25b342a3c-4d6cefa9be6757ca-00, traceparent: 00-b5a629b3ed0330b97ff74ee25b342a3c-4d6cefa9be6757ca-00, traceparent: 00-b5a629b3ed0330b97ff74ee25b342a3c-4d6cefa9be6757ca-00
[2026-01-22T21:52:22.370054464-00:00|4d6cefa9be6757ca]  assert_can_boot_here succeeded
[2026-01-22T21:52:22.703489012-00:00|4d6cefa9be6757ca] [duration:  +0.421667s]
[2026-01-22T21:52:22.703487921-00:00|4d6cefa9be6757ca]  error backtrace:
Raised at Client.server_failure in file "ocaml/xapi-client/client.ml", line 7, characters 31-75
Called from Client.ClientF.rpc_wrapper.(fun) in file "ocaml/xapi-client/client.ml", line 19, characters 55-110
Called from Client.ClientF.VM.start_on in file "ocaml/xapi-client/client.ml", line 7937, characters 6-47
Called from Client.ClientF.call in file "ocaml/xapi-client/client.ml" (inlined), line 24, characters 33-51
Called from Quicktest_trace_api__Api.Object.with_call.(fun) in file "ocaml/quicktest/trace/api/api.ml", line 161, characters 8-19
Re-raised at Quicktest_trace_api__Api.Object.with_call.(fun) in file "ocaml/quicktest/trace/api/api.ml", line 167, characters 6-40
Called from Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 96, characters 14-19
Re-raised at Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 101, characters 4-11
Called from Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 96, characters 14-19
Re-raised at Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 101, characters 4-11
Called from Quicktest_trace__Trace.with_ in file "ocaml/quicktest/trace/trace.ml", line 97, characters 12-20
Re-raised at Quicktest_trace__Trace.with_ in file "ocaml/quicktest/trace/trace.ml", line 102, characters 4-40
Called from Dune__exe__Quicktest_vm_calibrate.host_mem_leak.(fun).loop in file "ocaml/quicktest/quicktest_vm_calibrate.ml", line 115, characters 4-97
Called from Xapi_stdext_pervasives__Pervasiveext.finally in file "ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml", line 24, characters 8-14
Re-raised at Xapi_stdext_pervasives__Pervasiveext.finally in file "ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml", line 39, characters 6-15
Called from Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 96, characters 14-19
Re-raised at Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 101, characters 4-11
Called from Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 96, characters 14-19
Re-raised at Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 101, characters 4-11
Called from Quicktest_trace__Trace.with_ in file "ocaml/quicktest/trace/trace.ml", line 97, characters 12-20

I'll look at this next week.

edwintorok · 2026-01-22T22:01:31Z

It also looks like this now calls something too early, which breaks the CI: it calls some xenctrl function that is not implemented when run outside of Xen. It does work in koji.

ocaml/quicktest/quicktest_api_helpers.ml

edwintorok · 2026-01-23T10:11:16Z

> Something is broken though in XAPI now (not sure whether a race condition, or a new bug inherited from another branch or master)

This is a pre-existing bug on XAPI master: when Xen is missing support for RRD4 (domain info NUMA pages), we get an ENOSYS exception and fail to boot the VM.
We should instead handle that error, allow the VM to boot, and report 'unknown' for the NUMA info field.
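A minimal sketch of the proposed handling, under stated assumptions: `Not_implemented` is a hypothetical stand-in for however the ENOSYS failure actually surfaces from the xenctrl bindings, and `query` is a hypothetical function returning per-node page counts. Neither name comes from XAPI.

```ocaml
(* Hypothetical stand-in for the ENOSYS failure; the real exception raised
   by the xenctrl bindings is not shown in this thread. *)
exception Not_implemented of string

type numa_pages = Unknown | Pages_per_node of int list

(* [query] is a hypothetical function returning per-node page counts.
   If the hypervisor lacks the call we degrade to [Unknown] instead of
   letting the exception abort the VM boot. *)
let numa_pages_of ~query domid =
  try Pages_per_node (query domid) with Not_implemented _ -> Unknown

let to_string = function
  | Unknown ->
      "unknown"
  | Pages_per_node pages ->
      String.concat "," (List.map string_of_int pages)
```

Only the specific "not implemented" failure is caught here, so other errors would still propagate and remain visible.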

(even if you installed the updated hypervisor package you still need a full host reboot for this to take effect)

lindig

This is a lot of code at once. Given that it is test code and not production code, I am not too worried about it, and I assume you have already used it quite a bit.

edwintorok · 2026-01-26T09:19:01Z

> That should be ~4, don't know why it'd be 13, it used to be reliably 4 previously,

There is also a rounding bug in XAPI: maximise_memory rounds to 1MiB, not 2MiB. Overall the memory used appears to be as if the rounding were to 2MiB, but that needs a bit more investigation.
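To illustrate why the rounding granularity matters, here is a small sketch comparing 1MiB and 2MiB alignment on the same value. It assumes rounding down; the comment above does not say which direction maximise_memory rounds in, and `round_down` is a hypothetical helper, not an XAPI function.

```ocaml
let mib = Int64.mul 1024L 1024L

(* Round [bytes] down to a multiple of [align].
   Assumes non-negative inputs; the direction of rounding is an assumption. *)
let round_down ~align bytes = Int64.mul (Int64.div bytes align) align

let () =
  let three_mib = Int64.mul 3L mib in
  Printf.printf "1MiB granularity: %Ld bytes\n"
    (round_down ~align:mib three_mib) ;
  Printf.printf "2MiB granularity: %Ld bytes\n"
    (round_down ~align:(Int64.mul 2L mib) three_mib)
```

With these assumptions a 3MiB request stays at 3MiB under 1MiB granularity but becomes 2MiB under 2MiB granularity, so the two conventions can disagree by up to 1MiB per VM.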

Writing code that calls XAPI functions is quite tedious, because you have to repeat `~rpc ~session_id` every time. It saves quite a lot of typing to write in this style instead:

```
open Client.Client
...
let value = call t @@ VM.maximise_memory ~self ~approximate:false ~total in
call t @@ VM.set_memory ~self ~value
```

You still need to repeat `call t @@`, but it is at the beginning and doesn't hinder readability.

Add new types and `val call` to Client.Client. The type is called `client` instead of `t` because it isn't used uniformly by other functions in this module.

No functional change to the product.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
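For readers unfamiliar with the pattern, here is a self-contained sketch of how a `call` helper of this shape can work. The `rpc` type, the record fields and the mock `VM` module are assumptions made so the example compiles on its own; the real definitions in Client.Client may differ.

```ocaml
(* Simplified stand-ins; the real rpc and session types are different *)
type rpc = string -> string

type client = {rpc: rpc; session_id: string}

(* Supply the two repeated labelled arguments in one place *)
let call t (f : rpc:rpc -> session_id:string -> 'a) : 'a =
  f ~rpc:t.rpc ~session_id:t.session_id

(* Mock API with the same labelled-argument shape as generated client code *)
module VM = struct
  let last_set = ref 0L

  (* Pretend half of [total] is usable; purely illustrative *)
  let maximise_memory ~rpc:_ ~session_id:_ ~self:_ ~total ~approximate:_ =
    Int64.div total 2L

  let set_memory ~rpc:_ ~session_id:_ ~self:_ ~value = last_set := value
end

let demo t ~self ~total =
  let value = call t @@ VM.maximise_memory ~self ~approximate:false ~total in
  call t @@ VM.set_memory ~self ~value ;
  value
```

This works because partially applying a labelled function leaves a function that still expects `~rpc` and `~session_id`, which is exactly what `call` then provides.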

This uses a previously unused field in the log message format to log the Trace Context. This includes the Trace ID (common for the entire tree of activities) and the parent Span ID (unique to this instance of the remote caller). We don't log the local span/parent ID, since this will keep changing.

Logging the traceparent could make it easier to group log messages belonging to the same high-level activity. When an external Trace Context is not available (the default), the log messages are unchanged.

Another alternative would be to explicitly pass a scope/context to the logging functions, but this would require some automated rewriting of the codebase to plumb through the required parameters. With the ambient context the change is much smaller, and we can still plumb through an explicit context later if needed.

To avoid a dependency cycle this is not using Threadext, but Ambient_context directly.

The first user of this will be the new quicktest.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
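As an illustration of what the logged `traceparent` value contains, the sketch below splits a W3C Trace Context header into its four dash-separated fields (version, trace ID, parent ID, flags). The `parse` function and record type are hypothetical helpers for grouping log lines, not part of XAPI, and only field lengths are checked, not hex validity.

```ocaml
type traceparent =
  {version: string; trace_id: string; parent_id: string; flags: string}

(* Split "version-traceid-parentid-flags" and check the field lengths
   defined by the W3C Trace Context format (2, 32, 16 and 2 hex digits). *)
let parse s =
  match String.split_on_char '-' s with
  | [version; trace_id; parent_id; flags]
    when String.length version = 2
         && String.length trace_id = 32
         && String.length parent_id = 16
         && String.length flags = 2 ->
      Some {version; trace_id; parent_id; flags}
  | _ ->
      None

let () =
  (* Value taken from the log lines earlier in this thread *)
  match parse "00-b5a629b3ed0330b97ff74ee25b342a3c-4d6cefa9be6757ca-00" with
  | Some tp ->
      Printf.printf "trace=%s parent=%s\n" tp.trace_id tp.parent_id
  | None ->
      print_endline "not a traceparent"
```

Grouping log lines by `trace_id` would collect everything belonging to one high-level activity, which is the use case described above.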

This will build upon the upstream Opentelemetry library, so we can gradually move the existing Tracing library over. The upstream library supports Logs and Metrics too, not just Traces. For now this lives inside quicktest, eventually it should be moved into our tracing library. No functional change. Signed-off-by: Edwin Török <edwin.torok@citrix.com>

Some quicktests may run for a long time, and we don't want to run out of memory if they keep creating events/logs/metrics on the same span. This uses a Queue internally, so that we can drop the oldest element when full. Could've used a ringbuffer, but that would've increased per-span memory usage a lot. No functional change. Signed-off-by: Edwin Török <edwin.torok@citrix.com>
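A minimal sketch of the "drop the oldest element when full" idea using the standard library `Queue`. The `bounded` type and function names are illustrative assumptions and not the code from this commit.

```ocaml
type 'a bounded = {capacity: int; items: 'a Queue.t}

let create capacity =
  if capacity <= 0 then invalid_arg "capacity must be positive" ;
  {capacity; items= Queue.create ()}

(* When the queue is full, discard the oldest element before adding *)
let push t x =
  if Queue.length t.items >= t.capacity then ignore (Queue.pop t.items) ;
  Queue.push x t.items

let to_list t = List.of_seq (Queue.to_seq t.items)

let () =
  let q = create 3 in
  List.iter (push q) [1; 2; 3; 4; 5] ;
  (* Only the 3 most recent elements remain *)
  List.iter (Printf.printf "%d ") (to_list q) ;
  print_newline ()
```

Unlike a preallocated ring buffer, the queue only holds as many cells as there are elements, so spans that record few events stay small.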

The backend is very simple, and may block the caller if the background thread is slow due to I/O. This is not suitable for production use, just for testing (eventually we should use the atomic queue we have in Tracing_export).

No functional change.

Can be imported into a local Jaeger instance like this:

```
curl -v localhost:4318/v1/traces --data-binary @trace.trace.otel -H 'Content-Type: application/x-protobuf' -o x
```

Logs and Metrics are not supported by Jaeger though, so those would have to be imported into another tool.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>

This is a parent based sampler: if the parent is sampled, then so is the current span, otherwise it defaults to recording if a backend is registered. This will allow implementing a tail based span processor that changes the sampling decision when a span fails. For now we have only 1 hardcoded sampler, eventually we might make this configurable. No functional change. Signed-off-by: Edwin Török <edwin.torok@citrix.com>
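One possible reading of this rule as code, for illustration only. The three decision names mirror the OpenTelemetry sampling decisions (drop, record only, record and sample); the types and the `decide` function are assumptions and not the sampler from this commit.

```ocaml
type decision = Drop | Record_only | Record_and_sample

type parent = No_parent | Parent of {sampled: bool}

(* Assumed reading: a sampled parent forces sampling; in every other case
   we record (without sampling) only if some backend is registered. *)
let decide ~backend_registered = function
  | Parent {sampled= true} ->
      Record_and_sample
  | Parent {sampled= false} | No_parent ->
      if backend_registered then Record_only else Drop
```

Keeping a `Record_only` state is what would let a later processor upgrade the decision, for example when the span ends in failure.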

We may want to emit Opentelemetry items to multiple destinations (console, disk, etc.). Implement a Collector.BACKEND functor that forwards all calls to 2 other backends. No functional change. Signed-off-by: Edwin Török <edwin.torok@citrix.com>
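The forwarding idea can be sketched with a functor over a deliberately simplified backend signature. The one-function `BACKEND` below is an assumption for illustration; the real `Collector.BACKEND` signature in the upstream library has more members.

```ocaml
(* Simplified stand-in for a backend signature *)
module type BACKEND = sig
  val send : string -> unit
end

(* Forward every call to both backends, in order *)
module Tee (A : BACKEND) (B : BACKEND) : BACKEND = struct
  let send item = A.send item ; B.send item
end

let console_log : string list ref = ref []

let disk_log : string list ref = ref []

module Console = struct
  let send s = console_log := s :: !console_log
end

module Disk = struct
  let send s = disk_log := s :: !disk_log
end

module Both = Tee (Console) (Disk)

let () = Both.send "span-1"
```

Because `Tee` itself satisfies `BACKEND`, it can be nested to fan out to more than two destinations.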

This is waiting on #6867, and then we should be ready to merge the fixes. The testing PRs/code is still outstanding. It would be good to merge that as well, but I need to fix the CI on that one: #6858. There is also a 2nd quicktest (not yet finished) that would test this a bit more thoroughly.

cli_progress_bar is used by `xe --progress`, and I've reused it in my test code in #6858. However >90% of my test runs failed on various machines due to a `String.blit` exception from `cli_progress_bar`.

There are 2 possible reasons, not sure which one caused the failure, but I've fixed both, and now I have a lot more green tests (and the failures are due to actual bugs in the product, not bugs in the progress bar):
* if the ETA printed would be >99h (even just temporarily) then we'd overflow the buffer's size and raise an exception. `%02d` means at least 2 digits, not at most!
* if time goes backwards then we'd get a negative ETA and try to print a `-` and overflow the buffer size again and raise an exception. Replaced it with monotonic time.

This also contains an improvement I've made on the other PR to print total time in `ms` (to avoid having to solve rebase conflicts twice in the 2 PRs). This avoids printing awkward looking lines like `Total time 00:00:00` when it actually took maybe 0.9s.
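The `%02d` pitfall is easy to reproduce in isolation. The sketch below shows the width growing past 2 characters, and one hypothetical way to keep an `HH:MM:SS` field at a fixed 8 characters by clamping. `format_eta` is an illustrative helper, not the fix that was actually applied to cli_progress_bar.

```ocaml
(* "%02d" pads to at least 2 characters but never truncates *)
let three_chars = Printf.sprintf "%02d" 100

(* Hypothetical fixed-width formatter: clamp negative values to 0 and
   cap the hours at 99 so the result is always 8 characters long *)
let format_eta seconds =
  let s = max 0 seconds in
  let hours = min 99 (s / 3600) in
  let minutes = s mod 3600 / 60 in
  let secs = s mod 60 in
  Printf.sprintf "%02d:%02d:%02d" hours minutes secs

let () =
  Printf.printf "%s has %d characters\n" three_chars
    (String.length three_chars) ;
  List.iter
    (fun s -> print_endline (format_eta s))
    [-5; 59; 3661; 100 * 3600]
```

Clamping hides the true value for very long ETAs, so a real implementation might prefer to size the buffer dynamically instead.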

edwintorok force-pushed the private/edvint/memorytest6 branch from 4ab6428 to df23901 · January 22, 2026 21:32

last-genius reviewed Jan 23, 2026

ocaml/quicktest/quicktest_api_helpers.ml (outdated)

lindig reviewed Jan 23, 2026

edwintorok force-pushed the private/edvint/memorytest6 branch from df23901 to 020d082 · January 27, 2026 18:08

edwintorok mentioned this pull request Jan 28, 2026

Merge NUMA improvements to master #6869

Merged

edwintorok force-pushed the private/edvint/memorytest6 branch 2 times, most recently from e27c7fc to c07ff10 · January 28, 2026 17:21

edwintorok added 9 commits January 28, 2026 17:30

CP-311150: add span_status

453115c

Until we can upgrade to a newer version of opentelemetry which includes it. No functional change. Signed-off-by: Edwin Török <edwin.torok@citrix.com>

CP-311150: add Scope wrapper

2cea380

Extends upstream Opentelemetry with convenience functions to record logs and metrics associated with spans. Implements sampling decisions. No functional change. Signed-off-by: Edwin Török <edwin.torok@citrix.com>

CP-311150: introduce a SpanProcessor

2b618dc

This is a Tail-based Sampling Processor. See https://opentelemetry.io/docs/languages/dotnet/traces/tail-based-sampling/ https://opentelemetry.io/docs/concepts/sampling/#tail-sampling Signed-off-by: Edwin Török <edwin.torok@citrix.com>
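A rough sketch of the tail-based idea: the export decision is made when a span ends, so a failed span can be exported even if it was not sampled up front. The `span` record, `status` type and `on_end` hook are simplified assumptions; the processor in this commit may buffer and decide differently.

```ocaml
type status = Succeeded | Failed of string

type span = {name: string; sampled: bool; status: status}

(* Decide at the end of the span: failures are always kept,
   successes only if they were already sampled. *)
let should_export span =
  match span.status with Failed _ -> true | Succeeded -> span.sampled

(* [export] is a hypothetical callback that ships the span to a backend *)
let on_end ~export span = if should_export span then export span

let () =
  let exported = ref [] in
  let export s = exported := s.name :: !exported in
  List.iter (on_end ~export)
    [ {name= "ok-unsampled"; sampled= false; status= Succeeded}
    ; {name= "failed-unsampled"; sampled= false; status= Failed "boom"}
    ; {name= "ok-sampled"; sampled= true; status= Succeeded} ] ;
  List.iter print_endline (List.rev !exported)
```

A fuller implementation would typically hold back the whole trace until its root span ends, so that the decision applies to related spans as well.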

CP-311150: introduce a Trace module

49f2e98

Wrapper around upstream Trace module using our Scope, and with support for [result]. Signed-off-by: Edwin Török <edwin.torok@citrix.com>

edwintorok force-pushed the private/edvint/memorytest6 branch from c07ff10 to f3d7a64 · January 28, 2026 17:30

edwintorok marked this pull request as ready for review January 28, 2026 17:31

edwintorok added 7 commits January 28, 2026 17:40

CP-311150: a backend that prints a simplified trace to the console

e5402c3

Signed-off-by: Edwin Török <edwin.torok@citrix.com>

CP-311150: add opentelemetry wrappers for XAPI client RPC calls

59f816d

Signed-off-by: Edwin Török <edwin.torok@citrix.com>

CP-311150: test code for new library

b9a0cd2

Currently useful for debugging what the output looks like. Signed-off-by: Edwin Török <edwin.torok@citrix.com>

CP-311150: introduce wait_for_all_with_progress

08d7d9c

Useful for quicktest_trace. Signed-off-by: Edwin Török <edwin.torok@citrix.com>

CP-311150: show ms in progress bar summary

e1fdb7d

Sometimes tasks take <1s, but it is still useful to see whether that was 0.1s or 0.9s. Signed-off-by: Edwin Török <edwin.torok@citrix.com>

CP-311150: forward Opentelemetry W3C TraceContext headers for RPC cal…

3999de6

…ls to XAPI Signed-off-by: Edwin Török <edwin.torok@citrix.com>

edwintorok added 3 commits January 28, 2026 17:40

CP-311150: wrappers for XAPI objects that print the object on failure

5d34dfa

Signed-off-by: Edwin Török <edwin.torok@citrix.com>

CP-311148: quicktest helper functions for filling memory with VMs

4ec06d5

Signed-off-by: Edwin Török <edwin.torok@citrix.com>

CP-311148: calibrate VM memory overhead measurements

146f2c3

Signed-off-by: Edwin Török <edwin.torok@citrix.com>

edwintorok force-pushed the private/edvint/memorytest6 branch from f3d7a64 to 146f2c3 · January 28, 2026 17:40

edwintorok mentioned this pull request Feb 4, 2026

CA-423576: fix cli_progress_bar crashes #6892

Merged

edwintorok marked this pull request as draft February 5, 2026 10:56

edwintorok wants to merge 20 commits into xapi-project:feature/numa-xs9 from edwintorok:private/edvint/memorytest6

3 participants
