CP-311148/CP-311150: add a quicktest for XAPI memory overhead calculations#6858

Draft
edwintorok wants to merge 20 commits intoxapi-project:feature/numa-xs9from
edwintorok:private/edvint/memorytest6

Conversation

@edwintorok
Member

Test that we can fill a host with 1 VM, or with N VMs, sized using maximise_memory/compute_memory_overhead.

Check that the constant factors used in XAPI are correct, e.g. the amount of memory used per vCPU.
Can be used to validate these PRs:
#6855
#6854

There is also a pagetable overhead calculation, but something weird is going on there:

[2026-01-22T18:40:49.342348481-00:00|0000000000000000]  pagetables,memory_overhead_pages,coeff,vms
[2026-01-22T18:40:49.342333285-00:00|0000000000000000]  64,793,12.3906,9223372036854775807
[2026-01-22T18:40:49.342335974-00:00|0000000000000000]  192,1305,6.79688,9223372036854775807
[2026-01-22T18:40:49.342337658-00:00|0000000000000000]  448,2329,5.19866,9223372036854775807
[2026-01-22T18:40:49.342339751-00:00|0000000000000000]  962,4377,4.5499,9223372036854775807
[2026-01-22T18:40:49.342341392-00:00|0000000000000000]  263102,1048827,3.98639,9223372036854775807
[2026-01-22T18:40:49.342343128-00:00|0000000000000000]  526273,2097403,3.98539,9223372036854775807
[2026-01-22T18:40:49.342345071-00:00|0000000000000000]  708913,2825211,3.98527,9223372036854775807

The coefficient should be ~4. I don't know why it would be ~12.4 in the first row; it used to be reliably 4, so this could be a bug in the test.
That needs further investigation. Also, there is enough free memory on the host that this underestimate doesn't actually cause a failure, which is also unexpected.
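
One way to cross-check the table above is to recompute the `coeff` column and subtract the expected 4 pages per pagetable page. The sketch below does that with the values copied from the log; the names `samples`, `coeff` and `residual` are made up for illustration and are not part of the quicktest.

```ocaml
(* (pagetables, memory_overhead_pages) pairs copied from the log above *)
let samples =
  [ (64, 793)
  ; (192, 1305)
  ; (448, 2329)
  ; (962, 4377)
  ; (263102, 1048827)
  ; (526273, 2097403)
  ; (708913, 2825211) ]

(* Ratio as printed in the "coeff" column *)
let coeff (pagetables, overhead) =
  float_of_int overhead /. float_of_int pagetables

(* What is left after subtracting the expected 4 pages per pagetable page *)
let residual (pagetables, overhead) = overhead - (4 * pagetables)

let () =
  List.iter
    (fun ((pagetables, overhead) as sample) ->
      Printf.printf "%d,%d,%g,%d\n" pagetables overhead (coeff sample)
        (residual sample) )
    samples
```

For the first three rows the residual is a constant 537 pages, so the large ratio there may simply be a fixed offset dominating a small denominator rather than a wrong per-page factor. That is only one reading of the data, not a confirmed diagnosis.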

@edwintorok edwintorok force-pushed the private/edvint/memorytest6 branch from 4ab6428 to df23901 Compare January 22, 2026 21:32
@edwintorok
Member Author

edwintorok commented Jan 22, 2026

Something is broken in XAPI now, though (not sure whether it is a race condition, or a new bug inherited from another branch or master):

[2026-01-22T21:52:22.281822457-00:00|4d6cefa9be6757ca] Dune__exe__Quicktest_vm_calibrate.host_mem_leak ERROR Server_error(INTERNAL_ERROR, [ VM not in expected power state after completing operation: OpaqueRef:f9a1c2b5-e33a-8670-4071-143bf46012dc, paused, halted ]) traceparent: 00-b5a629b3ed0330b97ff74ee25b342a3c-4d6cefa9be6757ca-00, traceparent: 00-b5a629b3ed0330b97ff74ee25b342a3c-4d6cefa9be6757ca-00, traceparent: 00-b5a629b3ed0330b97ff74ee25b342a3c-4d6cefa9be6757ca-00
[2026-01-22T21:52:22.370054464-00:00|4d6cefa9be6757ca]  assert_can_boot_here succeeded
[2026-01-22T21:52:22.703489012-00:00|4d6cefa9be6757ca] [duration:  +0.421667s]
[2026-01-22T21:52:22.703487921-00:00|4d6cefa9be6757ca]  error backtrace: Raised at Client.server_failure in file "ocaml/xapi-client/client.ml", line 7, characters 31-75 Called from Client.ClientF.rpc_wrapper.(fun) in file "ocaml/xapi-client/client.ml", line 19, characters 55-110 Called from Client.ClientF.VM.start_on in file "ocaml/xapi-client/client.ml", line 7937, characters 6-47 Called from Client.ClientF.call in file "ocaml/xapi-client/client.ml" (inlined), line 24, characters 33-51 Called from Quicktest_trace_api__Api.Object.with_call.(fun) in file "ocaml/quicktest/trace/api/api.ml", line 161, characters 8-19 Re-raised at Quicktest_trace_api__Api.Object.with_call.(fun) in file "ocaml/quicktest/trace/api/api.ml", line 167, characters 6-40 Called from Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 96, characters 14-19 Re-raised at Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 101, characters 4-11 Called from Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 96, characters 14-19 Re-raised at Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 101, characters 4-11 Called from Quicktest_trace__Trace.with_ in file "ocaml/quicktest/trace/trace.ml", line 97, characters 12-20 Re-raised at Quicktest_trace__Trace.with_ in file "ocaml/quicktest/trace/trace.ml", line 102, characters 4-40 Called from Dune__exe__Quicktest_vm_calibrate.host_mem_leak.(fun).loop in file "ocaml/quicktest/quicktest_vm_calibrate.ml", line 115, characters 4-97 Called from Xapi_stdext_pervasives__Pervasiveext.finally in file "ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml", line 24, characters 8-14 Re-raised at Xapi_stdext_pervasives__Pervasiveext.finally in file "ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml", line 39, characters 6-15 Called from 
Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 96, characters 14-19 Re-raised at Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 101, characters 4-11 Called from Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 96, characters 14-19 Re-raised at Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 101, characters 4-11 Called from Quicktest_trace__Trace.with_ in file "ocaml/quicktest/trace/trace.ml", line 97, characters 12-20

I'll look at this next week.

@edwintorok
Member Author

edwintorok commented Jan 22, 2026

It also looks like this now calls something too early, which breaks the CI: it calls a xenctrl function that is not implemented when run outside of Xen. It does work in koji.

@edwintorok
Member Author

Something is broken in XAPI now, though (not sure whether it is a race condition, or a new bug inherited from another branch or master):

[2026-01-22T21:52:22.281822457-00:00|4d6cefa9be6757ca] Dune__exe__Quicktest_vm_calibrate.host_mem_leak ERROR Server_error(INTERNAL_ERROR, [ VM not in expected power state after completing operation: OpaqueRef:f9a1c2b5-e33a-8670-4071-143bf46012dc, paused, halted ]) traceparent: 00-b5a629b3ed0330b97ff74ee25b342a3c-4d6cefa9be6757ca-00, traceparent: 00-b5a629b3ed0330b97ff74ee25b342a3c-4d6cefa9be6757ca-00, traceparent: 00-b5a629b3ed0330b97ff74ee25b342a3c-4d6cefa9be6757ca-00
[2026-01-22T21:52:22.370054464-00:00|4d6cefa9be6757ca]  assert_can_boot_here succeeded
[2026-01-22T21:52:22.703489012-00:00|4d6cefa9be6757ca] [duration:  +0.421667s]
[2026-01-22T21:52:22.703487921-00:00|4d6cefa9be6757ca]  error backtrace: Raised at Client.server_failure in file "ocaml/xapi-client/client.ml", line 7, characters 31-75 Called from Client.ClientF.rpc_wrapper.(fun) in file "ocaml/xapi-client/client.ml", line 19, characters 55-110 Called from Client.ClientF.VM.start_on in file "ocaml/xapi-client/client.ml", line 7937, characters 6-47 Called from Client.ClientF.call in file "ocaml/xapi-client/client.ml" (inlined), line 24, characters 33-51 Called from Quicktest_trace_api__Api.Object.with_call.(fun) in file "ocaml/quicktest/trace/api/api.ml", line 161, characters 8-19 Re-raised at Quicktest_trace_api__Api.Object.with_call.(fun) in file "ocaml/quicktest/trace/api/api.ml", line 167, characters 6-40 Called from Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 96, characters 14-19 Re-raised at Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 101, characters 4-11 Called from Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 96, characters 14-19 Re-raised at Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 101, characters 4-11 Called from Quicktest_trace__Trace.with_ in file "ocaml/quicktest/trace/trace.ml", line 97, characters 12-20 Re-raised at Quicktest_trace__Trace.with_ in file "ocaml/quicktest/trace/trace.ml", line 102, characters 4-40 Called from Dune__exe__Quicktest_vm_calibrate.host_mem_leak.(fun).loop in file "ocaml/quicktest/quicktest_vm_calibrate.ml", line 115, characters 4-97 Called from Xapi_stdext_pervasives__Pervasiveext.finally in file "ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml", line 24, characters 8-14 Re-raised at Xapi_stdext_pervasives__Pervasiveext.finally in file "ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml", line 39, characters 6-15 Called from 
Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 96, characters 14-19 Re-raised at Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 101, characters 4-11 Called from Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 96, characters 14-19 Re-raised at Ambient_context_thread_local__Thread_local.with_ in file "vendor/thread_local/thread_local.ml", line 101, characters 4-11 Called from Quicktest_trace__Trace.with_ in file "ocaml/quicktest/trace/trace.ml", line 97, characters 12-20

I'll look at this next week.

This is a pre-existing bug on XAPI master: when Xen lacks support for RRD4 (domain info NUMA pages), we get an ENOSYS exception and fail to boot the VM.
We should instead handle that error, allow the VM to boot, and report 'unknown' for the NUMA info field.

(Even if you have installed the updated hypervisor package, you still need a full host reboot for this to take effect.)

Contributor

@lindig lindig left a comment


This is a lot of code at once. Given that it is test code and not production code, I am not too worried about it, and I assume you have already used it quite a bit.

@edwintorok
Copy link
Member Author

That should be ~4, don't know why it'd be 13, it used to be reliably 4 previously,

There is also a rounding bug in XAPI: maximise_memory rounds to 1MiB, not 2MiB. Overall memory usage appears to behave as if the rounding were to 2MiB, but that needs a bit more investigation.

Writing code that calls XAPI functions is quite tedious, because you have to
repeat `~rpc ~session_id` every time.

It saves quite a lot of typing to write in this style instead:

```
open Client.Client

...
  let value = call t @@ VM.maximise_memory ~self ~approximate:false ~total in
  call t @@ VM.set_memory ~self ~value
```

You still need to repeat `call t @@`, but it is at the beginning and doesn't
hinder readability.

Add new types and `val call` to Client.Client.
The type is called `client` instead of `t` because it isn't used uniformly by
other functions in this module.
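
To make the idea concrete, here is a minimal, self-contained sketch of how such a `call` helper could work. The record fields of `client`, the `rpc` and `session_id` type aliases, and the mock `VM` module are assumptions for illustration only; the real definitions live in `Client.Client` and may differ.

```ocaml
(* Hypothetical stand-ins for the real RPC and session types. *)
type rpc = string -> string

type session_id = string

(* Assumed shape of [client]: it bundles the two repeated arguments. *)
type client = {rpc: rpc; session_id: session_id}

(* [call t f] supplies the bundled arguments to a partially applied call. *)
let call (t : client) f = f ~rpc:t.rpc ~session_id:t.session_id

(* Mock API module using the same labelled-argument convention. *)
module VM = struct
  let memory : (string, int64) Hashtbl.t = Hashtbl.create 4

  (* Mock: pretends the VM can use the whole of [total]. *)
  let maximise_memory ~(rpc : rpc) ~(session_id : session_id) ~(self : string)
      ~(total : int64) ~(approximate : bool) =
    ignore (rpc, session_id, self, approximate) ;
    total

  let set_memory ~(rpc : rpc) ~(session_id : session_id) ~self ~value =
    ignore (rpc, session_id) ;
    Hashtbl.replace memory self value
end

let resize t ~self ~total =
  let value = call t @@ VM.maximise_memory ~self ~approximate:false ~total in
  call t @@ VM.set_memory ~self ~value

let () =
  let t = {rpc= (fun request -> request); session_id= "mock-session"} in
  resize t ~self:"vm1" ~total:1024L
```

This relies on OCaml allowing labelled arguments to be applied in any order: the partial application leaves a function that still expects `~rpc` and `~session_id`, which `call` then provides.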

No functional change to the product.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
@edwintorok edwintorok force-pushed the private/edvint/memorytest6 branch from df23901 to 020d082 Compare January 27, 2026 18:08
@edwintorok edwintorok force-pushed the private/edvint/memorytest6 branch 2 times, most recently from e27c7fc to c07ff10 Compare January 28, 2026 17:21
This uses a previously unused field in the log message format to log
the Trace Context.
This includes the Trace ID (common for the entire tree of activities),
and parent Span ID (unique to this instance of the remote caller).
We don't log the local span/parent ID, since this will keep changing.
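
For reference, the `traceparent` values that show up in the log lines follow the W3C Trace Context layout `version-traceid-parentid-flags`. A minimal sketch of splitting one apart, assuming that layout; the `traceparent` record and the `parse` function are hypothetical names, not the code in this PR:

```ocaml
type traceparent =
  {version: string; trace_id: string; parent_id: string; flags: string}

(* Split on '-' and check the fixed field widths (2, 32, 16, 2 hex chars).
   This sketch does not validate that the characters are hexadecimal. *)
let parse s =
  match String.split_on_char '-' s with
  | [version; trace_id; parent_id; flags]
    when String.length version = 2
         && String.length trace_id = 32
         && String.length parent_id = 16
         && String.length flags = 2 ->
      Some {version; trace_id; parent_id; flags}
  | _ ->
      None

(* Value taken from the error log earlier in this conversation *)
let example =
  parse "00-b5a629b3ed0330b97ff74ee25b342a3c-4d6cefa9be6757ca-00"
```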

Logging the traceparent could make it easier to group log messages belonging
to the same high level activity.

When an external Trace Context is not available (the default) then the log
messages are unchanged.

Another alternative would be to explicitly pass a scope/context to the logging
functions, but this would require some automated rewriting of the codebase to
plumb through the required parameters.
With the ambient context the change is much smaller, and we can still plumb
through an explicit context later if needed.

To avoid a dependency cycle this is not using Threadext, but Ambient_context
directly.

The first user of this will be the new quicktest.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
This will build upon the upstream Opentelemetry library,
so we can gradually move the existing Tracing library over.
The upstream library supports Logs and Metrics too, not just Traces.

For now this lives inside quicktest, eventually it should be moved
into our tracing library.

No functional change.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Some quicktests may run for a long time, and we don't want to run out of memory
if they keep creating events/logs/metrics on the same span.

This uses a Queue internally, so that we can drop the oldest element when full.
Could've used a ringbuffer, but that would've increased per-span memory usage a
lot.
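
The drop-oldest behaviour described above can be sketched with the standard library `Queue`. This is an illustration under assumptions, not the code from the commit: the `Bounded` module and its function names are hypothetical.

```ocaml
module Bounded = struct
  type 'a t = {capacity: int; items: 'a Queue.t}

  let create capacity =
    if capacity <= 0 then invalid_arg "Bounded.create: capacity must be > 0" ;
    {capacity; items= Queue.create ()}

  (* When full, discard the oldest element before adding the new one. *)
  let push t x =
    if Queue.length t.items >= t.capacity then ignore (Queue.pop t.items) ;
    Queue.push x t.items

  (* Oldest element first *)
  let to_list t = List.of_seq (Queue.to_seq t.items)
end

let events = Bounded.create 3

let () = List.iter (Bounded.push events) [1; 2; 3; 4; 5]
```

Because `Queue` allocates per element, an empty or nearly empty span stays small, whereas a preallocated ring buffer would reserve its full capacity up front.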

No functional change.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
The backend is very simple, and may block the caller if the background thread
is slow due to I/O.
This is not suitable for production use, just for testing
(eventually we should use the atomic queue we have in Tracing_export *)

No functional change.

Can be imported into a local Jaeger instance like this:

```
curl -v localhost:4318/v1/traces --data-binary @trace.trace.otel -H 'Content-Type: application/x-protobuf' -o x
```

Logs and Metrics are not supported by Jaeger though, so those would have to be
imported into another tool.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Until we can upgrade to a newer version of opentelemetry which includes it.

No functional change.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Extends upstream Opentelemetry with convenience functions
to record logs and metrics associated with spans.

Implements sampling decisions.

No functional change.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
This is a parent based sampler: if the parent is sampled, then so is the
current span, otherwise it defaults to recording if a backend is registered.

This will allow implementing a tail based span processor that changes the
sampling decision when a span fails.

For now we have only 1 hardcoded sampler, eventually we might make this
configurable.
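
A rough sketch of the decision logic described above. The `decision` type and the `parent_based` function are hypothetical, and mapping "defaults to recording" to a record-only decision (as opposed to record-and-sample) is an assumption on my part:

```ocaml
type decision = Drop | Record_only | Record_and_sample

(* [parent_sampled] is [None] when there is no parent span. *)
let parent_based ~backend_registered ~parent_sampled =
  match parent_sampled with
  | Some true ->
      (* Parent is sampled, so the current span is sampled too *)
      Record_and_sample
  | Some false | None ->
      (* Otherwise record only if some backend would receive the data *)
      if backend_registered then Record_only else Drop
```

Keeping unsampled spans recorded is what would let a tail-based processor upgrade the decision later, for example when a span fails.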

No functional change.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Wrapper around upstream Trace module using our Scope,
and with support for [result].

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
@edwintorok edwintorok force-pushed the private/edvint/memorytest6 branch from c07ff10 to f3d7a64 Compare January 28, 2026 17:30
@edwintorok edwintorok marked this pull request as ready for review January 28, 2026 17:31
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
We may want to emit Opentelemetry items to multiple destinations
(console, disk, etc.).
Implement a Collector.BACKEND functor that forwards all calls to 2 other backends.
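
The forwarding pattern can be sketched as below. The `BACKEND` signature here is a deliberately simplified, hypothetical one with a single `send` function; the real `Collector.BACKEND` signature has more members, each of which would be forwarded in the same way.

```ocaml
(* Simplified stand-in for the real backend signature *)
module type BACKEND = sig
  val send : string -> unit
end

(* Forward every call to both backends, in order *)
module Tee (A : BACKEND) (B : BACKEND) : BACKEND = struct
  let send item = A.send item ; B.send item
end

(* Two toy backends that collect into buffers *)
let console = Buffer.create 16

let disk = Buffer.create 16

module Console = struct
  let send s = Buffer.add_string console s
end

module Disk = struct
  let send s = Buffer.add_string disk s
end

module Both = Tee (Console) (Disk)

let () = Both.send "span-1"
```

More than two destinations can be reached by nesting, e.g. applying `Tee` to the result of another `Tee`.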

No functional change.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Currently useful for debugging what the output looks like.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Useful for quicktest_trace.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Sometimes tasks take <1s, but it is still useful to see whether that was 0.1s
or 0.9s.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
…ls to XAPI

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
@edwintorok edwintorok force-pushed the private/edvint/memorytest6 branch from f3d7a64 to 146f2c3 Compare January 28, 2026 17:40
github-merge-queue bot pushed a commit that referenced this pull request Jan 30, 2026
This is waiting on #6867,
and then we should be ready to merge the fixes.

The testing PRs/code is still outstanding, would be good to merge that
as well, need to fix the CI on that one:
#6858

There is also a 2nd quicktest (yet to be finished writing) that would
test this a bit more thoroughly.
@edwintorok edwintorok marked this pull request as draft February 5, 2026 10:56
github-merge-queue bot pushed a commit that referenced this pull request Feb 5, 2026
cli_progress_bar is used by `xe --progress`, and I've reused it in my
test code in #6858.
However >90% of my test runs failed on various machines due to a
`String.blit` exception from `cli_progress_bar`.

There are 2 possible reasons, not sure which one caused the failure, but
I've fixed both, and now I have a lot more green tests (and the failures
are due to actual bugs in the product, not bugs in the progress bar):
* if the ETA printed would be >99h (even just temporarily) then we'd
overflow the buffer's size and raise an exception. `%02d` means at least
2 digits, not at most!
* if time goes backwards then we'd get a negative ETA and try to print a
`-` and overflow the buffer size again and raise an exception. The clock
was replaced with monotonic time.
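
The `%02d` pitfall is easy to reproduce: the width is a minimum, so a 3-digit value produces 3 characters. The sketch below shows one way to keep an `HH:MM:SS` string at a fixed width by clamping; `format_eta` and `max_eta` are hypothetical names, and clamping is an illustration rather than the fix actually applied in that commit.

```ocaml
(* Largest value that still fits in "HH:MM:SS" *)
let max_eta = (99 * 3600) + (59 * 60) + 59

(* Clamp to [0, max_eta] so the result is always exactly 8 characters *)
let format_eta seconds =
  let s = max 0 (min seconds max_eta) in
  Printf.sprintf "%02d:%02d:%02d" (s / 3600) (s / 60 mod 60) (s mod 60)

let () =
  (* Width 2 is a minimum: this prints 3 characters *)
  print_endline (Printf.sprintf "%02d" 100) ;
  print_endline (format_eta 1_000_000) ;
  print_endline (format_eta (-5))
```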

This also contains an improvement I've made on the other PR to print
total time in `ms` (to avoid having to solve rebase conflicts twice in
the 2 PRs). This avoids printing awkward-looking lines like `Total time
00:00:00` when the run actually took, say, 0.9s.