By default, your operating system lets software threads float between all available processors. This means that when your system’s scheduler decides to run some kernel routine on core x, it might preempt a thread running on that core and move it to core y. While this might seem like a good idea (after all, the thread can continue running somewhere else while the kernel is busy), it can actually be the cause of performance issues.
One of the major problems with moving threads between physical cores is related to caches. Assume a thread is running on core x and has filled the core’s 32 kB of L1 cache with data. Now, the thread is moved to core y. The thread continues to work on its data, which is no longer available in that core’s caches. Recent Intel architectures (e.g. Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Skylake) have inclusive last-level caches. This means that in the best-case scenario (where the data in core x’s L1 cache was not modified) the data can be brought in from the shared L3 cache, which due to inclusiveness holds a copy of all data inside the cache hierarchy. On Ivy Bridge, which provides a bandwidth of 32 bytes/cycle between the L3 and L2 caches as well as between the L2 and L1 caches, it takes over two thousand cycles to move the 32 kB from the shared L3 to the L1 cache of core y. If the data was modified, this penalty is doubled, because the data has to be moved from core x’s L1 cache to the shared L3 cache before it can be brought into core y’s caches, which makes for a worst-case penalty of roughly four thousand cycles.
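The cycle counts above follow from simple bandwidth arithmetic, assuming 32 kB means 32768 bytes and that the data crosses two cache levels (L3 to L2, then L2 to L1) at 32 bytes per cycle each:

\[
\frac{32768\,\text{B}}{32\,\text{B/cycle}} = 1024\ \text{cycles per cache level},
\qquad
2 \times 1024 = 2048\ \text{cycles (clean data)},
\qquad
2 \times 2048 = 4096\ \text{cycles (modified data)}.
\]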
The problem is easily solved by thread pinning. Both Linux and Windows provide
the means to pin threads, e.g. via the pthread_setaffinity_np and
SetThreadAffinityMask calls, respectively. These functions enable programs to
specify an affinity mask (a set of processors on which the calling thread may
be scheduled) for each thread. By assigning each thread its own dedicated
processor, the issue of “cold caches” no longer arises: whenever a thread is
preempted from its core, thread pinning guarantees that the thread will resume
execution on exactly that core.
To give an example of the real-world impact of thread pinning, consider the figure above. It depicts the performance of a 2D Jacobi stencil on a dual-socket Ivy Bridge EP Xeon E5-2650 v2 machine with DDR3-1600 memory for varying dataset sizes. Not only do we get much better performance when employing thread pinning, the performance is also more consistent. The unpinned version behaves erratically, which can be attributed to the fact that thread preemption may happen more often in some runs than in others.