Friday, June 3, 2011

Clarifications on Linux's NUMA stats

After reading the excellent post on The MySQL “swap insanity” problem and the effects of the NUMA architecture, I remembered the existence of /sys/devices/system/node/node*/numastat and decided to add these numbers to a collector for OpenTSDB. But whenever I add a collector that reads metrics from /proc or /sys, I always need to go read the Linux kernel's source code, because most metrics tend to be misleading and under-documented (when they're documented at all).
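
Reading these files is the easy part: each node directory contains a numastat file made of simple "counter value" lines. Here's a minimal sketch of the kind of collector I mean, in Python (the sysfs paths are real; the metric name and the printed "metric timestamp value tags" format are only illustrative, not necessarily what my actual collector emits):

#!/usr/bin/env python
# Minimal sketch: dump per-node NUMA counters from sysfs.
import glob
import os
import time

def read_numastat():
    """Yield (node, counter, value) for every NUMA node."""
    for path in glob.glob("/sys/devices/system/node/node*/numastat"):
        node = os.path.basename(os.path.dirname(path))[len("node"):]
        with open(path) as f:
            for line in f:
                name, value = line.split()
                yield node, name, int(value)

if __name__ == "__main__":
    ts = int(time.time())
    for node, name, value in read_numastat():
        # Illustrative OpenTSDB-style data point, one per counter per node.
        print("sys.numa.%s %d %d node=%s" % (name, ts, value, node))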

In this case, if you RTFM, you'll get this:
Numa policy hit/miss statistics

/sys/devices/system/node/node*/numastat

All units are pages. Hugepages have separate counters.

numa_hit      A process wanted to allocate memory from this node, and succeeded.
numa_miss     A process wanted to allocate memory from another node, but ended up with memory from this node.
numa_foreign  A process wanted to allocate on this node, but ended up with memory from another one.
local_node    A process ran on this node and got memory from it.
other_node    A process ran on this node and got memory from another node.
interleave_hit  Interleaving wanted to allocate from this node and succeeded.
I was very confused about the last one (interleave_hit), about the exact difference between numa_miss and numa_foreign, and about how the first three metrics differ from local_node and other_node.

After RTFSC, the relevant part of the code appeared to be in mm/vmstat.c:
void zone_statistics(struct zone *preferred_zone, struct zone *z, gfp_t flags)
{
        if (z->zone_pgdat == preferred_zone->zone_pgdat) {
                __inc_zone_state(z, NUMA_HIT);
        } else {
                __inc_zone_state(z, NUMA_MISS);
                __inc_zone_state(preferred_zone, NUMA_FOREIGN);
        }
        if (z->node == ((flags & __GFP_OTHER_NODE) ?
                        preferred_zone->node : numa_node_id()))
                __inc_zone_state(z, NUMA_LOCAL);
        else
                __inc_zone_state(z, NUMA_OTHER);
}

So here's what it all really means (a small sketch illustrating this logic follows the list):
  • numa_hit: Number of pages allocated from the node the process wanted.
  • numa_miss: Number of pages allocated from this node, but the process preferred another node.
  • numa_foreign: Number of pages allocated on another node, but the process preferred this node.
  • local_node: Number of pages allocated from this node while the process was running locally.
  • other_node: Number of pages allocated from this node while the process was running remotely (on another node).
  • interleave_hit: Number of pages allocated successfully with the interleave strategy.
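
To make the distinction concrete, here is a small Python sketch that mirrors the accounting done by zone_statistics above (it ignores the __GFP_OTHER_NODE special case, and the node IDs and counter dictionary are of course made up for illustration; the kernel does this per zone, in C):

from collections import Counter

def zone_statistics(counters, preferred_node, allocated_node, running_node):
    """Bump the per-node counters for a single page allocation.

    counters:       dict mapping node id -> Counter of numastat fields
    preferred_node: the node the process asked for
    allocated_node: the node the page actually came from
    running_node:   the node of the CPU the process was running on
    """
    if allocated_node == preferred_node:
        counters[allocated_node]["numa_hit"] += 1
    else:
        # The node that served the page records a miss...
        counters[allocated_node]["numa_miss"] += 1
        # ...and the node the process asked for records a foreign allocation.
        counters[preferred_node]["numa_foreign"] += 1

    # local_node/other_node only depend on where the process was running.
    if allocated_node == running_node:
        counters[allocated_node]["local_node"] += 1
    else:
        counters[allocated_node]["other_node"] += 1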

I was originally confused about numa_foreign but this metric can actually be useful to see what happens when a node runs out of free pages. If a process attempts to get a page from its local node, but this node is out of free pages, then the numa_foreign counter of that node will be incremented (indicating that the node couldn't satisfy a local allocation) and another node will accommodate the process's request, which increments that other node's numa_miss. So a high numa_foreign value on a particular node indicates that this node is frequently running out of memory and having its allocations served elsewhere, while a high numa_miss value indicates that a node's memory is comparatively under-utilized, since it keeps accommodating allocation requests that failed on other nodes. In other words, to know which nodes are "lending memory" to the out-of-memory node, look at numa_miss.
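
Plugging that scenario into the sketch above (with two made-up nodes): a process running on node 0 asks for a page from node 0, which is out of free pages, and the page ends up coming from node 1.

from collections import Counter

# Two NUMA nodes, all counters starting at zero.
counters = {0: Counter(), 1: Counter()}

# Node 0 is out of free pages: a process running on node 0 wanted a page
# from node 0 but got one from node 1 instead.
zone_statistics(counters, preferred_node=0, allocated_node=1, running_node=0)

print(dict(counters[0]))  # {'numa_foreign': 1}                <- the starved node
print(dict(counters[1]))  # {'numa_miss': 1, 'other_node': 1}  <- the node lending memory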