NUMA (non-uniform memory access) is a multiprocessing memory architecture in which memory access time depends on the location of the memory relative to the processor: a CPU can access its own local memory faster than non-local memory. NUMA's advantages are confined to particular workloads, notably on servers where data is frequently associated with specific processes or users.
NUMA systems are multi-system-bus, high-performance server solutions. They can combine many processors into a single system image, yielding attractive price-performance ratios.
When implemented in a virtualized environment (VMware ESXi, Microsoft Hyper-V, etc.), NUMA provides significant benefits as long as a guest VM does not consume more resources than a single NUMA node provides.
The NUMA architecture consists of numerous CPU modules (nodes), each with multiple CPUs, its own local memory, I/O slots, and other resources. Because nodes can connect and exchange information through an interconnection module (for example, a crossbar switch), every CPU has access to the entire system's memory. Accessing local memory is, of course, much faster than accessing remote memory (memory belonging to other nodes in the system), which is exactly why memory access is non-uniform.
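To make the latency asymmetry concrete, here is a minimal Python sketch of the cost model. The nanosecond figures are illustrative assumptions of mine, not measurements of any real machine:

```python
# Illustrative model of NUMA access cost; the latency numbers are
# made-up assumptions, not measurements of real hardware.
LOCAL_NS = 80    # assumed latency for a local-node memory access
REMOTE_NS = 140  # assumed latency when crossing the interconnect

def average_latency_ns(local_fraction: float) -> float:
    """Average access latency for a given fraction of local accesses."""
    return local_fraction * LOCAL_NS + (1 - local_fraction) * REMOTE_NS

# A CPU that keeps 95% of its accesses local pays far less on average
# than one whose pages are spread evenly across two nodes.
print(average_latency_ns(0.95))  # 83.0
print(average_latency_ns(0.50))  # 110.0
```

Even this toy model shows why the rest of this article keeps pushing accesses back onto the local node.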
NUMA and MPP are architecturally comparable in many ways: both are made up of multiple nodes, each node has its own processors, memory, and I/O, and nodes share information through a node-interconnection mechanism. However, there are significant differences:
Node Interconnection Mechanism: NUMA nodes are interconnected inside the same physical server, whereas MPP nodes are independent servers connected over a network. When a NUMA CPU needs to access remote memory, it must wait; this is the primary reason a NUMA server cannot scale performance linearly as more CPUs are added.
Memory Access Mechanism: Inside a NUMA server, any CPU can access the entire system's memory, local or remote. In an MPP system, each node accesses only its own local memory, and nodes exchange data by passing messages across the interconnect.
NUMA is visible to the higher layers, and allocating resources within a single node can considerably increase performance for data-intensive and specialized workloads, as well as user-facing business servers.
Memory access between NUMA regions must travel through both a memory bus and an inter-region bus. On the one hand, this lengthens latency; on the other, secondary contention on the QPI bus is possible. Intel's high-speed interconnect, QPI (and its successor UPI), is proprietary, patented technology. Although the interconnect is slower than the memory bus, on Intel x86 CPUs it does not usually create a significant bottleneck.
As a result, NUMA affinity should be configured for VMs that need good memory-access efficiency, particularly on ARM CPUs. For NUMA affinity you can choose a NUMA topology that limits the NUMA regions in which the CPUs assigned to a VM are placed, as well as the number of CPUs and the amount of RAM allotted in each region.
Memory use, memory ballooning, and memory swapping of ESXi hosts in the VMware vSphere system are all monitored by CNIL Metrics and Logs.
NUMA locality is the percentage of RAM that is accessed locally. The lower this value, the more likely NUMA locality is to blame for performance issues. If the value drops below 80%, you should be concerned.
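The 80% guideline is easy to apply mechanically. A small Python sketch (the function names and the MB-based inputs are my own, not part of any monitoring product's API):

```python
def numa_locality_pct(local_mb: float, total_mb: float) -> float:
    """Percentage of a VM's memory that is accessed on its local node."""
    return 100.0 * local_mb / total_mb

def needs_attention(local_mb: float, total_mb: float,
                    threshold: float = 80.0) -> bool:
    """Flag VMs whose NUMA locality falls below the 80% guideline."""
    return numa_locality_pct(local_mb, total_mb) < threshold

print(needs_attention(7000, 8000))  # False: 87.5% local
print(needs_attention(5000, 8000))  # True: 62.5% local, worth investigating
```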
When NUMA affinity is configured, CPU access to other NUMA regions can be minimized, increasing efficiency compared with CPUs and memory being randomly distributed. In the strictest case, the number of NUMA regions containing the CPUs assigned to the VM is set to 1. The CPUs and RAM assigned to the VM are then in the same NUMA region, so there is no cross-NUMA memory access at all. However, this can also waste resources. For instance, if NUMA node 0 has two free CPUs and NUMA node 1 has two free CPUs, a four-vCPU VM cannot be deployed when its NUMA region count is limited to 1.
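The placement trade-off above can be sketched as a feasibility check in Python (a simplified greedy model of my own, not the actual hypervisor scheduler):

```python
def can_place(vcpus: int, free_cores_per_node: list[int],
              max_nodes: int) -> bool:
    """Check whether a VM's vCPUs fit into at most `max_nodes` NUMA nodes.

    Greedy sketch: try the nodes with the most free cores first.
    """
    best = sorted(free_cores_per_node, reverse=True)[:max_nodes]
    return sum(best) >= vcpus

# The example from the text: two nodes with 2 free cores each,
# and a four-vCPU VM.
print(can_place(4, [2, 2], max_nodes=1))  # False: cannot be deployed
print(can_place(4, [2, 2], max_nodes=2))  # True: spanning two nodes works
```

Relaxing the region limit makes the VM deployable, at the cost of cross-NUMA memory access.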
NUMA affinity means that a VM's resources are confined to a single NUMA node: each VM's CPUs and memory must be located on the same node, sharing that node's memory, CPU, and PCI resources.
Detect NUMA Performance Issues with CNIL Metrics and Logs
The NUMA scheduler assigns a home node to each virtual machine it manages. A home node is a NUMA node in a system that has processors and local memory.
The ESXi host will always try to allocate memory from the home node when allocating memory to a virtual machine. To maximize memory locality, the virtual machine’s virtual CPUs are limited to executing on the home node.
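The home-node preference can be sketched as a tiny allocator. These structures are hypothetical simplifications of mine, not the VMkernel's actual data structures: memory comes from the home node while it has free pages, and only then spills to other nodes:

```python
def allocate(pages_needed: int, free_pages: dict[int, int],
             home: int) -> dict[int, int]:
    """Allocate pages, preferring the home node, spilling over only when full.

    free_pages maps node id -> free page count; returns pages taken per node.
    """
    taken: dict[int, int] = {}
    # Home node first, then the remaining nodes in id order.
    for node in [home] + [n for n in sorted(free_pages) if n != home]:
        grab = min(pages_needed, free_pages[node])
        if grab:
            taken[node] = grab
            pages_needed -= grab
        if pages_needed == 0:
            break
    return taken

# With enough room on the home node, everything stays local.
print(allocate(100, {0: 500, 1: 500}, home=1))  # {1: 100}
# When the home node is short, the remainder lands on a remote node.
print(allocate(100, {0: 500, 1: 60}, home=1))   # {1: 60, 0: 40}
```

The second call illustrates the remote spanning described later: once the home node is exhausted, part of the VM's memory inevitably becomes remote.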
The VMkernel NUMA scheduler can dynamically change a VM's home node in response to shifts in system load when needed or practical. However, the VMkernel is constrained by physical and technical limits, and misconfigurations can cause performance issues, so you cannot rely solely on the VMkernel to balance your VMs efficiently.
As a result, you should begin by assessing your present NUMA condition.
Software that identifies and tracks NUMA KPIs, as well as your optimization efforts, is required to back up your approach.
CNIL Metrics and Logs gathers and visualizes all critical NUMA metrics for all your ESXi hosts and VMs over long periods, so NUMA faults become obvious rather than staying buried.
Optimize VMware Infrastructure for NUMA
When dealing with NUMA, the following are some crucial factors to keep in mind:
- In the BIOS of your server, disable Node interleaving!
- Order or configure physical server hardware so that each NUMA node has the same amount of RAM.
- Assign a VM no more vCPUs than the number of physical cores in a single CPU socket (stay within 1 NUMA node). Hyperthreaded logical cores don't count!
- Examine your virtual architecture, in general, to ensure that it is optimized for your servers’ physical NUMA node restrictions. Keep an eye out for Monster-VMs!
- Avoid having a single VM consume more vCPUs than a single NUMA node, since this may cause memory access deterioration if it is scheduled over many NUMA nodes.
- If one or several VMs consume more RAM than a single NUMA node offers, the VMkernel will place part of the memory content on a remote NUMA node, resulting in lower performance.
- For VMs with 9 or more vCPUs, vNUMA (virtual NUMA) is enabled by default. Caution! When you enable “hot add CPU/memory” or configure CPU affinity, vNUMA is immediately disabled.
- The VMkernel NUMA rebalancer runs every 2 seconds.
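Several of the sizing rules above can be checked mechanically. Here is a hedged Python sketch that applies the vCPU, RAM, and vNUMA-threshold guidelines; the host/VM descriptions are simplified parameters of my own, not a VMware API:

```python
VNUMA_DEFAULT_MIN_VCPUS = 9  # vSphere enables vNUMA at 9+ vCPUs by default

def check_vm_sizing(vcpus: int, ram_gb: int,
                    cores_per_socket: int, ram_per_node_gb: int) -> list[str]:
    """Return warnings for a VM checked against per-NUMA-node limits."""
    warnings = []
    if vcpus > cores_per_socket:
        warnings.append("vCPUs exceed one NUMA node's physical cores")
    if ram_gb > ram_per_node_gb:
        warnings.append("RAM exceeds one NUMA node; remote spanning likely")
    if vcpus >= VNUMA_DEFAULT_MIN_VCPUS:
        warnings.append("vNUMA will be enabled by default for this VM")
    return warnings

# A 4-vCPU / 64 GB VM on a host with 10-core sockets and 192 GB per node
# passes cleanly:
print(check_vm_sizing(4, 64, cores_per_socket=10, ram_per_node_gb=192))   # []
# A monster VM trips all three checks:
print(check_vm_sizing(12, 256, cores_per_socket=10, ram_per_node_gb=192))
```

A check like this is a planning aid, not a substitute for watching the actual locality metrics on the running hosts.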