While measuring the performance or runtime of a code may seem trivial, there is a lot than can go wrong. In this post I discuss some of most frequent problems my students encounter and how to work around them.
The issue that probably arises most is related to Turbo mode found in modern
processors. Typically, CPU cores run at the lowest possible CPU frequency when
there is no work to do in order to preserve energy—which is a good thing. As
soon as demand for computing power arises, the cores get clocked up. This
makes perfect sense for most use cases. If, however, you want to verify the
runtime of your code against a performance model you do not want your clock
frequency to change. A varying clock rate during measurements will leave you
clueless about what the actual frequency was during your measurement: Your CPU
will probably start in the lowest frequency state. After your code has been
running for some milliseconds (the timeframe a frequency change takes) it will
clock up. However, the core may not adjust to the maximum frequency in just
one recalibration. It may go through some or all of the possible CPU
frequencies until it reaches the maximum frequency. Running your code for a
really long time to guarantee that most of the time the CPU was using its
maximum Turbo frequency is also not a good idea, because it is not trivial to
determine the maximum CPU frequency. For example, the maximum Turbo frequency
specified by Intel is the guaranteed frequency when using a single core that
does not use AVX code. The actual frequency may be higher than the Turbo
frequency guaranteed by the spec sheet. The reason for this lies in the nature
of the fabrication process, which causes the quality of chips to vary. As a
consequence, some chips of a given CPU model can achieve higher clock rates
than others. Starting with Haswell, the situation gets even more complicated,
because different guaranteed Turbo frequencies exist depending on whether your
codes uses AVX instructions or not. The solution to this problems is to fix
the CPU clock frequency. This can comfortably be done with various tools such
as likwid-setFrequencies from the likwid
tool suite or cpufreq-set
from the cpufrequtils package available for most
Linux distributions.
Another problem I’ve seen students suffer with is the question of how long their code should run in order for them to obtain significant measurement results. We will examine this question using an example:
for (T=1; runtime<min_time; T*=2) {
start = get_time();
for (t=0; t<T; ++t)
ddot(A, B, N);
runtime = get_time() - start;
}
T /= 2;
printf("MUp/s: %f", (N*T)/runtime);
The ddot(A, B, N)
function shown above computes the
dot product
for arrays A
and B
, both of length N
. The question is, how many
iterations of the code should we run in order to get a significant result. To
answer this question, we set the number of times t
the code is run based on a
minimum runtime min_time
that we will vary. In the plot below, the x-axis
shows different minimum measurement times. For each of them, one hundred
samples were taken on a single core of an Ivy Bridge-EP Xeon E5-2660 v2
chip. The y-axis shows the coefficient of variation for the hundred samples
taken for a given minimum measurent time. The coefficient of variation is
calculated as the standard deviation σ of the distrubition divided by its
mean μ. It is therefore a measure of how scattered the samples are. A high
scattering indicates that a single sample may deviate a lot from the mean. We
find that using a runtime of at least 10 ms is required to keep the
variation of samples at around 1%. Note that in accord with the previous
paragraph about Turbo mode, the black line which corresponds to measurements
taken without frequency fixing exhibits higher variance than the samples that
were taken with the frequency fixed at the nominal CPU clock of 2.2 GHz.
The last point I would like mention and that was already hinted at in the plot above is that of thread pinning. As always, thread pinning is recommended to increase reproducability of results. More information about thread pinning can be found in one of my previous posts.