To better acquaint my students with the intricacies of caches, I have them write an LRU cache simulator, which they then use to determine hit rates for a streaming access pattern. As a follow-up exercise, they compare the results of their simulations to data obtained via performance counters on a real CPU's (Ivy Bridge-EP) L1 cache. While working out the exercise I took the opportunity to run some more benchmarks to get a better idea of what to expect from the L2 and L3 caches of recent Intel microarchitectures, a quest that included a painful trial-and-error search for the hardware events from which to infer these caches' hit rates!
Methodology
For all of the following benchmarks I used the simple streaming code shown below, which accesses every eighth element of a double-precision floating-point array (a cache line on Intel CPUs is 64 bytes, so a stride of eight doubles of 8 bytes each touches every cache line exactly once).
LIKWID_MARKER_INIT;
LIKWID_MARKER_START("sum");
// choose T so that runtime is ~1s
for (t=0; t<T; ++t)
#pragma vector aligned
#pragma nounroll
for (i=0; i<N; i=i+8)
sum += A[i];
LIKWID_MARKER_STOP("sum");
LIKWID_MARKER_CLOSE;
I used the likwid marker API in combination with likwid-perfctr to access hardware performance counters, which measure total cache accesses and cache hits for a particular cache level. To prevent hardware prefetchers from biasing the results, all hardware prefetchers (the DCU IP prefetcher, the DCU prefetcher, the L2 hardware prefetcher, and the L2 adjacent cache line prefetcher) were disabled using likwid-features. Each empirical result corresponds to the median of one hundred samples. The processors used for the evaluation are based on the four most recent Intel Xeon EP microarchitectures: the Intel Xeon E5-2680 (Sandy Bridge-EP), Xeon E5-2690 v2 (Ivy Bridge-EP), Xeon E5-2695 v3 (Haswell-EP), and Xeon E5-2697 v4 (Broadwell-EP).
L1 Caches
To start off, the figure above shows hit rates for the L1 caches on Sandy Bridge-EP (SNB), Ivy Bridge-EP (IVB), Haswell-EP (HSW), and Broadwell-EP (BDW). You can see that the measurements (black circles) correspond almost perfectly to the prediction of the cache simulator (dashed red line) on all four microarchitectures. The L1 cache uses eight ways on all four microarchitectures, so we observe non-zero hit rates as long as the data set we are streaming over is smaller than 9/8ths of the cache's 32 kB capacity. Cache hit rates are computed using the MEM_LOAD_UOPS_RETIRED_L1_HIT and MEM_UOPS_RETIRED_LOADS_ALL events.
L2 Caches
Results for the L2 caches, which according to official documentation implement an LRU replacement strategy as well, are shown in the figure above. Here we can observe significant differences between SNB and IVB on one side and HSW and BDW on the other. For SNB and IVB, hit rates do not conform to an LRU replacement strategy; they are also very erratic, hinting at some sort of randomness involved in the replacement strategy. HSW and BDW hit rates correspond perfectly to the simulator and the analytic prediction (the L2 cache also uses eight ways, so cache hits can be observed as long as the data set is smaller than 9/8ths of the caches' 256 kB capacity). The total number of L2 accesses was measured indirectly via the L1D_REPLACEMENT event: whenever a cache line is delivered from the L2 cache, it replaces a cache line in the L1. L2 cache hits are collected using the MEM_LOAD_UOPS_RETIRED_L2_HIT event.
L3 Caches
According to the documentation, the L3 cache uses a pseudo-LRU replacement strategy. The figure above nevertheless shows a dashed red line for LRU, just to give an idea of what the implemented replacement strategies are capable of. Because I wanted to compare replacement strategies, I normalized the x-axis to the cache size in percent instead of showing absolute capacity in kB, which of course increases with the number of cores and would bias the results.

The first thing to notice is a change in the replacement strategy when going from SNB to IVB (note that microarchitectural changes are not exclusive to "tocks"!) and again from IVB to HSW. Interestingly, the change from SNB to IVB results in a lower cache hit rate for our streaming access pattern; it might, however, result in better hit rates for other access patterns, and replacement latency might be affected one way or the other. Going from IVB to HSW, the cache hit rate improves significantly for streaming access patterns, with hit rates of up to 30% for data sets 1.5× the size of the cache!
Selecting the right events to compute the hit rate for the L3 cache proved a little difficult. In general, there are two ways to go: either using the offcore response counters or using the MEM_LOAD_UOPS_RETIRED.L3_ALL and MEM_LOAD_UOPS_RETIRED.L3_HIT events. The problem with the former is that it misses some occurrences of the events on SNB and IVB and is not reliable at all for measuring the events on HSW and BDW. That is why I went with the MEM_LOAD_UOPS_RETIRED.L3_* counters. But there are some pitfalls to this option as well. First, you have to make sure to use SSE loads on SNB and IVB, because on these microarchitectures the events do not count occurrences triggered by AVX loads. Second, you have to work around a bug in the implementation of the counters on SNB: here, you can find a script called latego.py which you can use to make the counters work.