Correctly measuring runtimes

While measuring the performance or runtime of a code may seem trivial, there is a lot than can go wrong. In this post I discuss some of most frequent problems my students encounter and how to work around them.

The issue that probably arises most is related to Turbo mode found in modern processors. Typically, CPU cores run at the lowest possible CPU frequency when there is no work to do in order to preserve energy—which is a good thing. As soon as demand for computing power arises, the cores get clocked up. This makes perfect sense for most use cases. If, however, you want to verify the runtime of your code against a performance model you do not want your clock frequency to change. A varying clock rate during measurements will leave you clueless about what the actual frequency was during your measurement: Your CPU will probably start in the lowest frequency state. After your code has been running for some milliseconds (the timeframe a frequency change takes) it will clock up. However, the core may not adjust to the maximum frequency in just one recalibration. It may go through some or all of the possible CPU frequencies until it reaches the maximum frequency. Running your code for a really long time to guarantee that most of the time the CPU was using its maximum Turbo frequency is also not a good idea, because it is not trivial to determine the maximum CPU frequency. For example, the maximum Turbo frequency specified by Intel is the guaranteed frequency when using a single core that does not use AVX code. The actual frequency may be higher than the Turbo frequency guaranteed by the spec sheet. The reason for this lies in the nature of the fabrication process, which causes the quality of chips to vary. As a consequence, some chips of a given CPU model can achieve higher clock rates than others. Starting with Haswell, the situation gets even more complicated, because different guaranteed Turbo frequencies exist depending on whether your codes uses AVX instructions or not. The solution to this problems is to fix the CPU clock frequency. This can comfortably be done with various tools such as likwid-setFrequencies from the likwid tool suite or cpufreq-set from the cpufrequtils package available for most Linux distributions.

Another problem I’ve seen students suffer with is the question of how long their code should run in order for them to obtain significant measurement results. We will examine this question using an example:

for (T=1; runtime<min_time; T*=2) {
  start = get_time();

  for (t=0; t<T; ++t)
    ddot(A, B, N);

  runtime = get_time() - start;
}

T /= 2;
printf("MUp/s: %f", (N*T)/runtime);

The ddot(A, B, N) function shown above computes the dot product for arrays A and B, both of length N. The question is, how many iterations of the code should we run in order to get a significant result. To answer this question, we set the number of times t the code is run based on a minimum runtime min_time that we will vary. In the plot below, the x-axis shows different minimum measurement times. For each of them, one hundred samples were taken on a single core of an Ivy Bridge-EP Xeon E5-2660 v2 chip. The y-axis shows the coefficient of variation for the hundred samples taken for a given minimum measurent time. The coefficient of variation is calculated as the standard deviation σ of the distrubition divided by its mean μ. It is therefore a measure of how scattered the samples are. A high scattering indicates that a single sample may deviate a lot from the mean. We find that using a runtime of at least 10 ms is required to keep the variation of samples at around 1%. Note that in accord with the previous paragraph about Turbo mode, the black line which corresponds to measurements taken without frequency fixing exhibits higher variance than the samples that were taken with the frequency fixed at the nominal CPU clock of 2.2 GHz.

The last point I would like mention and that was already hinted at in the plot above is that of thread pinning. As always, thread pinning is recommended to increase reproducability of results. More information about thread pinning can be found in one of my previous posts.