CP-311148/CP-311150: add a quicktest for XAPI memory overhead calculations#6858
edwintorok wants to merge 20 commits into xapi-project:feature/numa-xs9
Something is broken in XAPI now, though (I'm not sure whether it is a race condition or a new bug inherited from another branch or master): I'll look at this next week.
It also looks like this now calls something too early, which breaks the CI: it calls a xenctrl function that is not implemented when run outside of Xen. It does work in koji.
This is a pre-existing bug on XAPI master: when Xen is missing support for RRD4 (domain info numa pages), we get an ENOSYS exception and fail to boot the VM. (Even if you have installed the updated hypervisor package, you still need a full host reboot for this to take effect.)
lindig left a comment:
This is a lot of code at once. Given that it is test code and not production code I am not too worried about it and I assume you have used it already quite a bit.
There is also a rounding bug in XAPI: maximise_memory rounds to 1MiB, not 2MiB. Overall, the memory used appears to be as if the rounding were to 2MiB, but that needs a bit more investigation.
Writing code that calls XAPI functions is quite tedious, because you have to repeat `~rpc ~session_id` every time.
It saves quite a lot of typing to write in this style instead:
```
open Client.Client
...
let value = call t @@ VM.maximise_memory ~self ~approximate:false ~total in
call t @@ VM.set_memory ~self ~value
```
You still need to repeat `call t @@`, but it is at the beginning and doesn't hinder readability.

Add new types and `val call` to Client.Client.
The type is called `client` instead of `t` because it isn't used uniformly by other functions in this module.

No functional change to the product.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
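To make the calling convention above concrete, here is a minimal sketch of how such a `call` helper could work. The `client` record fields, the body of `call`, and the stub `VM` module are assumptions made for illustration; they are not the definitions added to Client.Client by this commit.

```ocaml
(* Sketch only: [client], [call] and the stub [VM] module are hypothetical
   stand-ins, not the real Client.Client definitions. *)
type client = {rpc: string -> string; session_id: string}

(* Supply the connection details that a partially applied API call still needs. *)
let call t f = f ~rpc:t.rpc ~session_id:t.session_id

module VM = struct
  (* Stub: pretend the VM can use all of [total], rounded down to 1 MiB. *)
  let maximise_memory ~self:_ ~approximate:_ ~total ~rpc:_ ~session_id:_ =
    Int64.(mul (div total 1048576L) 1048576L)

  (* Stub: return the value that would have been set. *)
  let set_memory ~self:_ ~value ~rpc:_ ~session_id:_ = value
end

let example t ~self ~total =
  let value = call t @@ VM.maximise_memory ~self ~approximate:false ~total in
  call t @@ VM.set_memory ~self ~value
```

The point of the design is that `~rpc` and `~session_id` are supplied in exactly one place, so each call site only spells out the arguments specific to that API call.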
This uses a previously unused field in the log message format to log the Trace Context.
This includes the Trace ID (common to the entire tree of activities) and the parent Span ID (unique to this instance of the remote caller).
We don't log the local span/parent ID, since this will keep changing.

Logging the traceparent could make it easier to group log messages belonging to the same high-level activity.
When an external Trace Context is not available (the default), the log messages are unchanged.

An alternative would be to explicitly pass a scope/context to the logging functions, but this would require some automated rewriting of the codebase to plumb through the required parameters.
With the ambient context the change is much smaller, and we can still plumb through an explicit context later if needed.

To avoid a dependency cycle this is not using Threadext, but Ambient_context directly.

The first user of this will be the new quicktest.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
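For readers unfamiliar with the traceparent, the sketch below splits one into its Trace ID and parent Span ID. It assumes the W3C Trace Context header layout (`version-traceid-parentid-flags`); whether the log field uses exactly this layout is an assumption, and `parse_traceparent` is a hypothetical helper, not code from this commit. It checks field lengths only, not that the IDs are valid hex.

```ocaml
(* Sketch only: assumes the W3C traceparent layout
   "00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>". *)
type trace_context = {trace_id: string; parent_id: string; sampled: bool}

let parse_traceparent s =
  match String.split_on_char '-' s with
  | ["00"; trace_id; parent_id; flags]
    when String.length trace_id = 32
         && String.length parent_id = 16
         && String.length flags = 2 ->
      (* Bit 0 of the flags byte is the "sampled" flag. *)
      Option.map
        (fun f -> {trace_id; parent_id; sampled= (f land 1 = 1)})
        (int_of_string_opt ("0x" ^ flags))
  | _ ->
      None
```

Grouping log lines by `trace_id` then collects everything belonging to one high-level activity.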
This will build upon the upstream Opentelemetry library, so we can gradually move the existing Tracing library over. The upstream library supports Logs and Metrics too, not just Traces. For now this lives inside quicktest, eventually it should be moved into our tracing library. No functional change. Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Some quicktests may run for a long time, and we don't want to run out of memory if they keep creating events/logs/metrics on the same span. This uses a Queue internally, so that we can drop the oldest element when full. Could've used a ringbuffer, but that would've increased per-span memory usage a lot. No functional change. Signed-off-by: Edwin Török <edwin.torok@citrix.com>
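The drop-oldest behaviour described above can be sketched with the standard library `Queue`. The module name `Bounded` and its functions are hypothetical and only illustrate the idea; they are not the implementation in this commit.

```ocaml
(* Sketch only: [Bounded] is a hypothetical name, not the module from this commit. *)
module Bounded = struct
  type 'a t = {capacity: int; items: 'a Queue.t; mutable dropped: int}

  let create capacity =
    if capacity <= 0 then invalid_arg "Bounded.create" ;
    {capacity; items= Queue.create (); dropped= 0}

  (* Add [x]; if the queue is already full, discard the oldest element first. *)
  let push t x =
    if Queue.length t.items >= t.capacity then begin
      ignore (Queue.pop t.items) ;
      t.dropped <- t.dropped + 1
    end ;
    Queue.push x t.items

  (* Elements from oldest to newest. *)
  let to_list t = List.of_seq (Queue.to_seq t.items)

  let dropped t = t.dropped
end
```

Unlike a preallocated ring buffer, the queue only uses memory for the elements it actually holds, which matters when most spans carry few events.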
The backend is very simple, and may block the caller if the background thread is slow due to I/O.
This is not suitable for production use, just for testing (eventually we should use the atomic queue we have in Tracing_export).

No functional change.

Can be imported into a local Jaeger instance like this:
```
curl -v localhost:4318/v1/traces --data-binary @trace.trace.otel -H 'Content-Type: application/x-protobuf' -o x
```

Logs and Metrics are not supported by Jaeger though, so those would have to be imported into another tool.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Until we can upgrade to a newer version of opentelemetry which includes it. No functional change. Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Extends upstream Opentelemetry with convenience functions to record logs and metrics associated with spans. Implements sampling decisions. No functional change. Signed-off-by: Edwin Török <edwin.torok@citrix.com>
This is a parent based sampler: if the parent is sampled, then so is the current span, otherwise it defaults to recording if a backend is registered. This will allow implementing a tail based span processor that changes the sampling decision when a span fails. For now we have only 1 hardcoded sampler, eventually we might make this configurable. No functional change. Signed-off-by: Edwin Török <edwin.torok@citrix.com>
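One possible reading of this rule, as a sketch: the three-way decision mirrors the OpenTelemetry sampling results (drop, record only, record and sample), but the type, the function name, and the interpretation that "recording" means record-only are assumptions, not the code in this commit.

```ocaml
(* Sketch only: hypothetical types, not the upstream Opentelemetry API. *)
type decision = Drop | Record_only | Record_and_sample

(* [parent_sampled] is false both when there is no parent and when the
   parent was not sampled. *)
let decide ~parent_sampled ~backend_registered =
  if parent_sampled then
    Record_and_sample
  else if backend_registered then
    Record_only (* kept locally, so a later processor can still promote it *)
  else
    Drop
```

Keeping unsampled spans in a record-only state is what would let a tail-based processor change the decision after the fact.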
This is a Tail-based Sampling Processor. See https://opentelemetry.io/docs/languages/dotnet/traces/tail-based-sampling/ https://opentelemetry.io/docs/concepts/sampling/#tail-sampling Signed-off-by: Edwin Török <edwin.torok@citrix.com>
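The general idea of tail-based sampling is to decide after the spans have finished, when the outcome is known. A minimal sketch, assuming a "keep the trace if any span failed" policy; the `span` type and `tail_sample` are hypothetical and much simpler than a real span processor.

```ocaml
(* Sketch only: hypothetical types, not the processor added by this commit. *)
type span = {name: string; failed: bool}

(* Decide once the whole trace is known: keep every span of a trace in which
   at least one span failed, and drop the trace otherwise. *)
let tail_sample spans =
  if List.exists (fun s -> s.failed) spans then spans else []
```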
Wrapper around upstream Trace module using our Scope, and with support for [result]. Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
We may want to emit Opentelemetry items to multiple destinations (console, disk, etc.). Implement a Collector.BACKEND functor that forwards all calls to 2 other backends. No functional change. Signed-off-by: Edwin Török <edwin.torok@citrix.com>
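The forwarding functor can be sketched as follows. The `BACKEND` signature here is a deliberately tiny, hypothetical stand-in (the real Collector.BACKEND has a different and larger interface), and `Tee`, `Mem`, and `Both` are illustrative names.

```ocaml
(* Sketch only: this BACKEND signature is a hypothetical simplification,
   not the real Collector.BACKEND. *)
module type BACKEND = sig
  val send_trace : string -> unit

  val send_log : string -> unit
end

(* Forward every call to both backends, [A] first and then [B]. *)
module Tee (A : BACKEND) (B : BACKEND) : BACKEND = struct
  let send_trace x = A.send_trace x ; B.send_trace x

  let send_log x = A.send_log x ; B.send_log x
end

(* In-memory backends, used here only to observe the forwarding order. *)
let seen : (string * string * string) list ref = ref []

module Mem (N : sig
  val name : string
end) : BACKEND = struct
  let send_trace x = seen := (N.name, "trace", x) :: !seen

  let send_log x = seen := (N.name, "log", x) :: !seen
end

module Both =
  Tee
    (Mem (struct let name = "console" end))
    (Mem (struct let name = "disk" end))
```

Because `Tee` itself satisfies `BACKEND`, more than two destinations can be combined by nesting applications of the functor.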
Currently useful for debugging what the output looks like. Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Useful for quicktest_trace. Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Sometimes tasks take <1s, but it is still useful to see whether that was 0.1s or 0.9s. Signed-off-by: Edwin Török <edwin.torok@citrix.com>
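A small sketch of the idea, assuming durations are held as float seconds; `format_duration` is a hypothetical helper, not the function changed by this commit.

```ocaml
(* Sketch only: hypothetical helper showing sub-second precision. *)
let format_duration seconds =
  if seconds < 1.0 then
    Printf.sprintf "%.0fms" (seconds *. 1000.)
  else
    Printf.sprintf "%.1fs" seconds
```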
…ls to XAPI Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
cli_progress_bar is used by `xe --progress`, and I've reused it in my test code in #6858.
However, >90% of my test runs failed on various machines due to a `String.blit` exception from `cli_progress_bar`.

There are 2 possible causes. I'm not sure which one triggered the failures, but I've fixed both, and now I have a lot more green tests (the remaining failures are due to actual bugs in the product, not bugs in the progress bar):
* if the ETA to be printed was >99h (even just temporarily), we'd overflow the buffer and raise an exception. `%02d` means at least 2 digits, not at most!
* if time went backwards, we'd get a negative ETA, try to print a `-`, overflow the buffer again and raise an exception. This now uses monotonic time instead.

This also contains an improvement I've made on the other PR to print total time in `ms` (to avoid having to resolve rebase conflicts twice in the 2 PRs).
This avoids printing awkward-looking lines like `Total time 00:00:00` when it actually took, say, 0.9s.
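The `%02d` pitfall is easy to reproduce. The sketch below shows it, together with one defensive way to keep an `HH:MM:SS` string at a fixed width by clamping. Clamping is a different technique from the monotonic-time fix described above, and `format_eta` is a hypothetical helper, not the cli_progress_bar code.

```ocaml
(* "%02d" pads to a minimum width of 2; it never truncates. *)
let three_digits = Printf.sprintf "%02d" 123 (* 3 characters, not 2 *)

(* Sketch only: [format_eta] is a hypothetical helper. *)
let format_eta seconds =
  (* Clamp into [0, 99h59m59s] so the result is always exactly 8 characters. *)
  let max_eta = (99 * 3600) + (59 * 60) + 59 in
  let s = max 0 (min seconds max_eta) in
  Printf.sprintf "%02d:%02d:%02d" (s / 3600) (s / 60 mod 60) (s mod 60)
```

With the input clamped, neither a temporarily huge ETA nor a negative one can produce a string longer than the fixed-size buffer expects.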
Test that we can fill a host with 1 VM, or with N VMs, based on maximise_memory/compute_memory_overhead.
Check that the constant factors used in XAPI are correct, e.g. the amount of memory used per vCPU.
Can be used to validate these PRs:
#6855
#6854
There is also a pagetable overhead calculation, but something weird is going on there:
That should be ~4; I don't know why it would be 13. It used to be reliably 4, so this could be a bug in the test.
That will need further investigation (there is also enough free memory on the host that this underestimate doesn't actually cause a failure, which is also unexpected).