
Merge NUMA improvements to master#6869

Merged
edwintorok merged 11 commits into master from feature/numa-xs9
Jan 30, 2026

Conversation

@edwintorok
Member

@edwintorok edwintorok commented Jan 28, 2026

This is waiting on #6867, and then we should be ready to merge the fixes.

The testing PRs/code are still outstanding; it would be good to merge those as well, but the CI on that one needs fixing first: #6858

There is also a second quicktest (still being written) that would test this a bit more thoroughly.

edwintorok and others added 11 commits January 27, 2026 11:23
The only Xen command-line option related to this is `low_mem_virq_limit`,
which is 64MiB.

A new quicktest has shown that we are sometimes off by ~10MiB or more,
and get failures booting VMs even after `assert_can_boot_here` said yes.
Sometimes the error messages can be quite ugly: internal xenguest/xenopsd
errors instead of HOST_NOT_ENOUGH_FREE_MEMORY.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
The only Xen command-line option related to this is `low_mem_virq_limit`,
which is 64MiB.

A new quicktest has shown that we are sometimes off by ~10MiB (between
`Host.compute_free_memory` and the actual free memory as measured by a call
to Xenctrl physinfo) or more, and get failures booting VMs even after
`assert_can_boot_here` said yes. Sometimes the error messages can be
quite ugly: internal xenguest/xenopsd errors instead of
HOST_NOT_ENOUGH_FREE_MEMORY.

After this change (together with
#6854) the new quicktest
doesn't fail anymore.

PR to feature branch because this will need testing together with all
the other NUMA changes; it may expose latent bugs elsewhere.


The new testcase will get its own PR because it is quite large.
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Measured the actual increase in host memory usage when increasing the number of
vCPUs on a VM from 1 to 64:

```
vcpu,memory_overhead_pages,coeff
1,264,264
2,558,279
3,776,258.667
4,1032,258
5,1350,270
6,1614,269
7,1878,268.286
8,2056,257
9,2406,267.333
10,2670,267
11,2934,266.727
12,3198,266.5
13,3462,266.308
14,3726,266.143
15,3990,266
16,4254,265.875
17,4518,265.765
18,4782,265.667
19,5046,265.579
20,5310,265.5
21,5574,265.429
22,5838,265.364
23,6102,265.304
24,6366,265.25
25,6630,265.2
26,6894,265.154
27,7158,265.111
28,7422,265.071
29,7686,265.034
30,7952,265.067
31,8216,265.032
32,8480,265
33,8744,264.97
34,9009,264.971
35,9276,265.029
36,9543,265.083
37,9810,265.135
38,10076,265.158
39,10340,265.128
40,10604,265.1
41,10869,265.098
42,11133,265.071
43,11397,265.047
44,11662,265.045
45,11925,265
46,12191,265.022
47,12454,264.979
0,30,0
1,294,294
2,558,279
3,822,274
4,1086,271.5
5,1350,270
6,1614,269
7,1878,268.286
8,2142,267.75
9,2406,267.333
10,2670,267
11,2934,266.727
12,3198,266.5
13,3462,266.308
14,3726,266.143
15,3990,266
16,4254,265.875
17,4518,265.765
18,4782,265.667
19,5046,265.579
20,5310,265.5
21,5574,265.429
22,5838,265.364
23,6102,265.304
24,6366,265.25
25,6630,265.2
26,6894,265.154
27,7158,265.111
28,7422,265.071
29,7686,265.034
30,7952,265.067
31,8216,265.032
32,8480,265
33,8744,264.97
34,9011,265.029
35,9278,265.086
36,9546,265.167
37,9811,265.162
38,10076,265.158
39,10340,265.128
40,10603,265.075
41,10869,265.098
42,11132,265.048
43,11397,265.047
44,11663,265.068
45,11925,265
46,12191,265.022
47,12456,265.021
[INFO]VM memory_overhead_pages = ... + vcpu * 294 =~ ... + vcpu * 294
```

We already allocate 256 pages/vCPU as part of shadow, so we need an extra 294 - 256 = 38 pages/vCPU.
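The arithmetic above can be sketched as follows (a minimal illustration of the numbers in this message; the helper names are hypothetical, not xenopsd code):

```python
# Sketch of the per-vCPU overhead arithmetic from the measurements above.
# 294 pages/vCPU is the measured worst-case slope; 256 pages/vCPU are
# already accounted for by the shadow allocation, so 38 extra pages/vCPU
# must be reserved on top.
PAGE_SIZE_KIB = 4                 # x86 page size: 4 KiB
MEASURED_PAGES_PER_VCPU = 294     # worst-case slope measured above
SHADOW_PAGES_PER_VCPU = 256       # already reserved via shadow

def extra_pages_per_vcpu() -> int:
    """Pages that must be reserved on top of shadow, per vCPU."""
    return MEASURED_PAGES_PER_VCPU - SHADOW_PAGES_PER_VCPU

def extra_overhead_kib(vcpus: int) -> int:
    """Additional reservation (KiB) for a VM with the given vCPU count."""
    return vcpus * extra_pages_per_vcpu() * PAGE_SIZE_KIB

print(extra_pages_per_vcpu())   # 38 pages/vCPU
print(extra_overhead_kib(64))   # 9728 KiB (9.5 MiB) for a 64-vCPU VM
```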

This can lead to internal errors raised by xenguest, or NOT_ENOUGH_FREE_MEMORY
errors raised by xenopsd, after `assert_can_boot_here` has already replied yes,
even when booting VMs sequentially.
It could also lead XAPI to choose the wrong host to evacuate a VM to, which
could cause RPU migration failures.

This is a pre-existing bug, affecting the versions of Xen in both XS8 and XS9.

We cannot allocate this from shadow, because then the memory usage would
never converge (Xen doesn't allocate these pages from shadow).

On another host the measured overhead is lower; take the maximum for now:
```
[INFO]VM memory_overhead_pages = ... + vcpu * 264.067 =~ ... + vcpu * 265
```

Also, the amount of shadow memory reserved is nearly twice what is needed,
especially since shadow is compiled out of Xen, but overestimates are OK,
and we might fix that separately.

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Measured the actual increase in host memory usage when increasing the
number of vCPUs on a VM from 1 to 64:

```
delta vcpu,delta memory_overhead_pages,coeff
1,264,264
3,724,241.333
7,1848,264
15,3960,264
31,8186,264.065
63,16635,264.048
```

Ran the test on both an AMD and Intel host and got similar results.

Currently XAPI uses 256*vcpu, which is an underestimate.
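A quick sketch of how far the 256 pages/vCPU estimate falls short of the measured slope (illustrative only; the data points are taken from the delta table above):

```python
# Compare XAPI's current 256 pages/vCPU estimate against the slope
# measured above (~264 pages/vCPU from the delta measurements).
deltas = [(1, 264), (3, 724), (7, 1848), (15, 3960), (31, 8186), (63, 16635)]

# Per-vCPU coefficient for each measurement: delta_pages / delta_vcpus.
coeffs = [pages / vcpus for vcpus, pages in deltas]
print([round(c, 3) for c in coeffs])  # ~264 pages/vCPU throughout

CURRENT_ESTIMATE = 256  # pages/vCPU currently assumed by XAPI
worst = max(coeffs)
shortfall_pages = (worst - CURRENT_ESTIMATE) * 64  # for a 64-vCPU VM
print(round(shortfall_pages * 4 / 1024, 2))  # shortfall in MiB
```

Roughly 8 pages/vCPU of shortfall, i.e. about 2 MiB for a 64-vCPU VM, which matches the ~10MiB-scale discrepancies seen by the quicktest once several VMs are involved.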

This can lead to internal errors raised by xenguest, or
NOT_ENOUGH_FREE_MEMORY errors raised by xenopsd, after
`assert_can_boot_here` has already replied yes, even when booting VMs
sequentially.
It could also lead XAPI to choose the wrong host to evacuate a VM to,
which could cause RPU migration failures.

This is a pre-existing bug, affecting the versions of Xen in both XS8
and XS9.

PR to feature branch because this will need testing together with all
the other NUMA changes; it may expose latent bugs elsewhere.

The new testcase will get its own PR because it is quite large.
Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Do not mix using claims with not using claims:
Xen cannot currently guarantee that it will honour a VM's memory claim
unless all other VMs also use claims.

Global claims have existed in Xen for a long time,
so this should be safe to do on both XS8 and XS9.

Safer defaults for global claims:

Xen may have already allocated some memory for the domain, and the overhead is
only an estimate.
A global claim failing is a hard failure, so instead use a more conservative
estimate: `memory.build_start_mib`.
This is similar to `required_host_free_mib`, but doesn't take overhead into
account.

Eventually we'd want another argument to the create hypercall that
tells it which NUMA node(s) to use, and then we can include all the overhead
there too.

For the single-node claim, keep the amount as it was; it is only a best-effort
claim.

Do not claim `shadow_mib`; it has already been allocated:

When rebooting lots of VMs in parallel we might run out of memory
and fail to boot all the VMs again.
This is because we overestimate the amount of memory required and claim too
much. That memory is released when the domain build finishes, but when building
domains in parallel it will temporarily result in an out-of-memory error.

Instead try to claim only what is left to be allocated: the p2m map and shadow
map have already been allocated by this point, i.e. claim just the bare
minimum.
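The claim sizing described above can be sketched as follows (an illustrative sketch, not the actual xenopsd OCaml code; the function name and signature are hypothetical):

```python
# Illustrative sketch of the claim sizing described above: claim only
# what is still left to allocate, not memory that has already been
# allocated (p2m map, shadow map) nor the overhead estimate.
def claim_pages_for_build(build_start_mib: int,
                          shadow_mib: int,
                          overhead_mib: int) -> int:
    """Pages to claim before building a domain.

    The p2m and shadow maps are already allocated by this point, and the
    overhead is only an estimate; claiming those as well would over-claim
    and can cause spurious out-of-memory failures when many domains are
    built in parallel.  So claim just the bare minimum.
    """
    pages_per_mib = 1024 // 4  # 4 KiB pages, 256 pages per MiB
    # Deliberately exclude shadow_mib and the overhead estimate.
    return build_start_mib * pages_per_mib

print(claim_pages_for_build(4096, 33, 10))  # claim for a 4 GiB build
```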

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
Also always print memory free statistics when `wait_xen_free_mem` is called.
It turns out `scrub_pages` is always 0, since it was never implemented in Xen
(it is hardcoded to 0).

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
We need to reserve the exact same number of pages on the destination as we had
on the source.
The amount we reserved when initially booting the domain was only an estimate,
but when migrating we know the exact amount.
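The reservation choice described above can be sketched as follows (names are hypothetical; a sketch of the logic, not the actual code):

```python
# Sketch of the reservation choice above: on migration the source knows
# the domain's exact page count, so use that on the destination; only
# fall back to the estimate when booting a fresh domain.
from typing import Optional

def reservation_pages(estimate_pages: int,
                      source_pages: Optional[int]) -> int:
    """Pages to reserve on the destination host."""
    if source_pages is not None:
        return source_pages        # migrating: exact amount is known
    return estimate_pages          # fresh boot: only an estimate

print(reservation_pages(1_050_000, 1_048_576))  # migration: exact count
print(reservation_pages(1_050_000, None))       # fresh boot: estimate
```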

Signed-off-by: Edwin Török <edwin.torok@citrix.com>
See individual commits.

Draft PR, because this is still being tested, together with the Xen side
changes to make the allocator more reliable.
@edwintorok edwintorok marked this pull request as ready for review January 30, 2026 09:52
@edwintorok edwintorok enabled auto-merge January 30, 2026 09:53
@edwintorok
Member Author

This is waiting on #6867, and then we should be ready to merge the fixes.

This will take a while, so it may be best to merge this now; it is at least an improvement over the NUMA feature in master.

The testing PRs/code is still outstanding, would be good to merge that as well, need to fix the CI on that one: #6858

Let's merge this separately. Although that test is passing in my limited testing, I need to do wider testing to see how stable the test and the product are on a wider range of hardware.

@edwintorok edwintorok requested a review from mg12 January 30, 2026 10:09
@edwintorok edwintorok added this pull request to the merge queue Jan 30, 2026
Merged via the queue into master with commit 1c354a4 Jan 30, 2026
53 checks passed