Background:
https://wiki.xenproject.org/wiki/Xen_on_NUMA_Machines (upstream toolstack)
https://xapi-project.github.io/new-docs/toolstack/features/NUMA/index.html (XAPI)
Problems:
vCPUs can get stuck outside of their soft-affinity mask. This is allowed, but undesirable in the long term. It can happen on credit1 because scheduling decisions are taken locally on each pCPU, without a global view.
Memory allocation is static, done at boot time, and cannot react to runtime bursts of activity. It might be useful to be able to migrate pages from one NUMA node to another without doing a full VM migration, i.e. a guest-unaware page migration.
There is no API to query or set which node a VM's pages have ended up on; we can only query per-node host free memory, and that is racy when booting VMs in parallel. Taking a lock around VM boots is undesirable for performance reasons, although it could be used as a short-term workaround.
Handling out-of-memory scenarios on a node: it might be useful to specify a 2nd or 3rd node to choose when the preferred one runs out (e.g. choose another node that is closest), or perhaps to return a hard failure instead (some users might prefer that); see the node-selection sketch below, after this list.
If we had working vNUMA, we could hard-split a host with >= 2 NUMA nodes into two halves, give each guest 2 vNUMA nodes, and expose this topology to the guest so it can make better scheduling decisions; see the example config below, after this list.
What about Dom0? Last time I tried, enabling NUMA in the Dom0 kernel's Kconfig resulted in a kernel that failed to boot.
What about PV backends in Dom0? It might be useful to run these on the NUMA node matching the guest (frontend).
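
To illustrate the "query per-node free memory and fall back to the closest node" idea, here is a minimal sketch against libxl's libxl_get_numainfo(), which reports per-node size, free memory and the distance table. choose_node() is a hypothetical helper of mine, the 4 GiB request is made up, and it assumes the free field is in bytes:

    /* Sketch: pick a NUMA node for a new VM.  Prefer 'wanted', otherwise fall
     * back to the node closest to it that still has enough free memory;
     * return -1 if nothing fits (hard failure).
     * Build with something like: gcc choose_node.c -lxenlight -lxentoollog */
    #include <stdio.h>
    #include <stdint.h>
    #include <libxl.h>
    #include <xentoollog.h>

    static int choose_node(libxl_ctx *ctx, int wanted, uint64_t need_bytes)
    {
        int nr = 0, best = -1;
        uint32_t best_dist = UINT32_MAX;
        libxl_numainfo *info = libxl_get_numainfo(ctx, &nr);

        if (!info)
            return -1;
        if (wanted < 0 || wanted >= nr)
            goto out;

        if (info[wanted].free >= need_bytes) {
            best = wanted;                          /* preferred node fits */
            goto out;
        }
        for (int n = 0; n < nr && n < info[wanted].num_dists; n++) {
            if (n == wanted || info[n].free < need_bytes)
                continue;                           /* node full, skip it */
            if (info[wanted].dists[n] < best_dist) {
                best_dist = info[wanted].dists[n];  /* keep the closest node */
                best = n;
            }
        }
    out:
        libxl_numainfo_list_free(info, nr);
        return best;
    }

    int main(void)
    {
        xentoollog_logger_stdiostream *lg =
            xtl_createlogger_stdiostream(stderr, XTL_ERROR, 0);
        libxl_ctx *ctx = NULL;

        if (libxl_ctx_alloc(&ctx, LIBXL_VERSION, 0, (xentoollog_logger *)lg))
            return 1;

        /* 4 GiB VM, preferred node 0 -- numbers made up for the example */
        int node = choose_node(ctx, 0, 4ULL << 30);
        printf("chosen node: %d\n", node);

        libxl_ctx_free(ctx);
        xtl_logger_destroy((xentoollog_logger *)lg);
        return node < 0;
    }

Whether to fall back at all, or return a hard failure instead, could then be a per-VM policy rather than hard-coded.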
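For the vNUMA split mentioned above: upstream xl already accepts a vnuma key in the domain config (this is separate from anything XAPI drives today). A rough fragment for a guest split across two physical nodes might look like the following; all sizes, vCPU ranges and distances are made up:

    # xl.cfg fragment (illustrative values only): expose two virtual NUMA
    # nodes, each backed by one physical node, to an 8-vCPU / 8 GiB guest
    vcpus  = 8
    memory = 8192
    vnuma  = [ [ "pnode=0", "size=4096", "vcpus=0-3", "vdistances=10,20" ],
               [ "pnode=1", "size=4096", "vcpus=4-7", "vdistances=20,10" ] ]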
The XAPI toolstack would need more information about which NUMA node each VM runs on, and perhaps the ability to reserve memory on a node, so that it can schedule VMs onto NUMA nodes more reliably. The soft-affinity mask is only a hint and may not be honoured, so it would be good to at least "fix up" the soft-affinity mask based on where the memory actually got allocated.
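
A minimal sketch of what such a fix-up could look like via libxl, assuming the toolstack already knows (or guesses) which NUMA node the VM's memory is on, since there is no API to query that (see above). fixup_soft_affinity() is a hypothetical helper; it only adjusts the soft hint and does not move any memory:

    /* Sketch: point a domain's soft affinity at the pCPUs of one NUMA node.
     * Build with something like: gcc fixup_affinity.c -lxenlight -lxentoollog */
    #include <stdio.h>
    #include <stdlib.h>
    #include <libxl.h>
    #include <libxl_utils.h>
    #include <xentoollog.h>

    static int fixup_soft_affinity(libxl_ctx *ctx, uint32_t domid, int node)
    {
        libxl_dominfo dominfo;
        libxl_bitmap cpumap;
        int rc;

        libxl_dominfo_init(&dominfo);
        libxl_bitmap_init(&cpumap);

        rc = libxl_domain_info(ctx, &dominfo, domid);
        if (rc) goto out;

        /* cpumap <- all pCPUs belonging to 'node' */
        rc = libxl_cpu_bitmap_alloc(ctx, &cpumap, 0);
        if (rc) goto out;
        rc = libxl_node_to_cpumap(ctx, node, &cpumap);
        if (rc) goto out;

        /* Only the soft affinity is changed (NULL hard map = leave it alone),
         * so this stays a placement hint, not a hard pin. */
        rc = libxl_set_vcpuaffinity_all(ctx, domid, dominfo.vcpu_max_id + 1,
                                        NULL /* hard */, &cpumap /* soft */);
    out:
        libxl_bitmap_dispose(&cpumap);
        libxl_dominfo_dispose(&dominfo);
        return rc;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <domid> <node>\n", argv[0]);
            return 1;
        }

        xentoollog_logger_stdiostream *lg =
            xtl_createlogger_stdiostream(stderr, XTL_ERROR, 0);
        libxl_ctx *ctx = NULL;
        if (libxl_ctx_alloc(&ctx, LIBXL_VERSION, 0, (xentoollog_logger *)lg))
            return 1;

        int rc = fixup_soft_affinity(ctx, atoi(argv[1]), atoi(argv[2]));

        libxl_ctx_free(ctx);
        xtl_logger_destroy((xentoollog_logger *)lg);
        return rc != 0;
    }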
There are also some potential issues with the soft-affinity support in the Xen scheduler that need a more detailed investigation, based on a user report: if all vCPUs end up running outside of their soft-affinity mask, do they ever get back to the correct place without rebooting the host? This has now been reproduced with 3 guests A (soft affinity 0-27), B (28-55) and C (28-55) across 2 NUMA nodes: pause A, and B or C will have some of its vCPUs running on pCPUs 0-27; then unpause A and pause B. Now C keeps running some of its vCPUs in the wrong place, and so does A, even though the problem might go away if some of those vCPUs were swapped. This is with credit1; it is believed that credit2 might not have this issue, but it might have other scalability issues.
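
To spot the situation described above, something like the following libxl sketch could list which vCPUs are currently running on a pCPU outside their soft-affinity mask (report_strays() is a hypothetical helper, not an existing call):

    /* Sketch: report vCPUs of a domain currently running on a pCPU outside
     * their soft-affinity mask.
     * Build with something like: gcc soft_check.c -lxenlight -lxentoollog */
    #include <stdio.h>
    #include <stdlib.h>
    #include <libxl.h>
    #include <libxl_utils.h>
    #include <xentoollog.h>

    static void report_strays(libxl_ctx *ctx, uint32_t domid)
    {
        int nr_vcpus = 0, nr_cpus = 0;
        libxl_vcpuinfo *vinfo = libxl_list_vcpu(ctx, domid, &nr_vcpus, &nr_cpus);

        if (!vinfo)
            return;

        for (int i = 0; i < nr_vcpus; i++) {
            /* vinfo[i].cpu is the pCPU the vCPU was last seen running on;
             * cpumap_soft is its soft-affinity mask. */
            if (vinfo[i].online &&
                !libxl_bitmap_test(&vinfo[i].cpumap_soft, vinfo[i].cpu))
                printf("dom %u vcpu %u on pcpu %u, outside its soft affinity\n",
                       domid, vinfo[i].vcpuid, vinfo[i].cpu);
        }
        libxl_vcpuinfo_list_free(vinfo, nr_vcpus);
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <domid>\n", argv[0]);
            return 1;
        }

        xentoollog_logger_stdiostream *lg =
            xtl_createlogger_stdiostream(stderr, XTL_ERROR, 0);
        libxl_ctx *ctx = NULL;
        if (libxl_ctx_alloc(&ctx, LIBXL_VERSION, 0, (xentoollog_logger *)lg))
            return 1;

        report_strays(ctx, (uint32_t)atoi(argv[1]));

        libxl_ctx_free(ctx);
        xtl_logger_destroy((xentoollog_logger *)lg);
        return 0;
    }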
Some dynamic scheduling of NUMA workloads might also be useful: a static placement only works well if the VM workloads are very similar, whereas asymmetric or bursty workloads can end up overloading one NUMA node while other nodes sit mostly idle. Support for migrating memory between nodes would help here, without doing a full VM migration, which can lead to long downtimes.
Eventually, NUMA support in Dom0 might be useful too, so that backends can be aligned with the NUMA node of their frontends, but last time I tried, a NUMA-enabled Dom0 kernel wouldn't even boot.