Evolution of Cache Replacement Strategies

To better acquaint my students with the intricacies of caches I let them write an LRU cache simulator which they then use to determine hit rates for a streaming access pattern. As a follow-up exercise they have to compare the results of their simulations to data obtained via performance counters on a real CPU’s (Ivy Bridge-EP) L1 cache. While working out the exercise I took the opportunity and ran some more benchmarks to get a better idea what to expect from L2 and L3 caches of recent Intel microarchitectures—a quest that included a painful trial-and-error search for the hardware events from which to infer these caches’ hit rates!

Methodology

For all following benchmarks I used the simple streaming code shown below which accesses every eighth element of a double precision floating-point array (one cache line on Intel is 64 byte so jumping over eight doubles—8 byte each—results in accessing each cache line once).

LIKWID_MARKER_INIT;
LIKWID_MARKER_START("sum");

// choose T so that runtime is ~1s
for (t=0; t<T; ++t)
#pragma vector aligned
#pragma nounroll
    for (i=0; i<N; i=i+8)
        sum += A[i];

LIKWID_MARKER_STOP("sum");
LIKWID_MARKER_CLOSE;

I used the likwid marker API in combination with likwid-perfctr to access hardware performance counters, which measure total cache accesses and cache hits for a particular cache level. To avoid hardware prefetchers biasing results all hardware prefetchers (DCU IP prefetcher, DCU prefetcher, L2 hardware prefetcher, L2 adjacent cache line prefetcher) were disabled using likwid-features. Each empirical result corresponds the median of one hundred samples. Processors used for evaluations are based on the four most recent Intel Xeon EP microarchitectures: Intel Xeon E5-2680 (Sandy Bridge-EP), Xeon E5-2690 v2 (Ivy Bridge-EP), Xeon E5-2695 v3 (Haswell-EP), and Xeon E5-2697 v4 (Broadwell-EP).

L1 Caches

To start off, the figure above shows hit rates for L1 caches on Sandy Bridge-EP (SNB), Ivy Bridge-EP (IVB), Haswell-EP (HSW), and Broadwell-EP (BDW). You can see that the measurements (black circles) almost perfectly correspond to the prediction of the cache simulator (dashed red line) on all four microarchitectures. The L1 cache uses eight ways on all four microarchitectures, so we observe non-zero hit rates as long as the data set we are streaming over is smaller than 9/8ths of a cache’s 32 kB capacity. Cache hit rates are computed using the MEM_LOAD_UOPS_RETIRED_L1_HIT and MEM_UOPS_RETIRED_LOADS_ALL events.

L2 Caches

Results for the L2 caches—which, according to official documentation implement an LRU replacement strategy as well—are shown in the figure above. Here, we can observe significant differences between SNB and IVB on one side and HSW and BDW on the other. For SNB and IVB, hit rates do not conform to a LRU replacement strategy; also, hit rates are very erratic—hinting at some sort of randomness involved in the replacement strategy. HSW and BDW hit rates perfectly correspond to the simulator and analytic prediction (the L2 cache also uses eight ways so cache hits can be observed as long as the data set is smaller than 9/8ths of the caches' capacities of 256 kB). Total number of L2 accesses were measured indirectly via the L1D_REPLACEMENT event: whenever a cache line in L2 cache is accessed, a cache line from L1 is replaced with that line. L2 cache hits are collected using the MEM_LOAD_UOPS_RETIRED_L2_HIT event.

L3 Caches

According to documentation, the L3 cache uses a pseudo-LRU replacement strategy. The figure above shows a dashed red line for LRU nevertheless, just to get an idea of what the implemented replacement strategies are capable of. Because I wanted to compare replacement strategies I normalized the x-axis to the cache size in percent instead of showing absolute capacity in kB, which of course is increasing with the number of cores and would bias results. The first thing to notice is a change in the replacement strategy when going from SNB to IVB (note that microarchitectural changes are not exclusive to “tocks!”) and from IVB to HSW. Interestingly the change from SNB to IVB results in a lower cache hit rate for our streaming access pattern; it might, however, result in better cache hit rates for other access patterns; also, replacement latency might be affected in one or the other way. When going from IVB to HSW, we find that the cache hit rate has improved significantly for streaming access patterns, observing hit rates of up to 30% for data sets 1.5× the size of the cache!

Selecting the right events to compute the hit rate for the L3 cache proved a little difficult. In general, there’s two ways to go: either using the offcore response counters or using the MEM_LOAD_UOPS_RETIRED.L3_ALL and MEM_LOAD_UOPS_RETIRED.L3_HIT events. The problem with the former is that it misses some occurrences of the events on SNB and IVB and is not reliable at all for measuring the events on HSW and BDW. That’s why I went with the MEM_LOAD_UOPS_RETIRED.L3_* counters. But there’s some pitfalls to this options as well. First, you have to make sure to use SSE loads on SNB and IVB, because on these microarchitectures the event doesn’t count events triggered by AVX loads. Second, you have to work around a bug in the implementations of the counters for SNB: here, you can find a script called latego.py which you can use make the counters work.