A First Glimpse at Skylake

As a “tock” in Intel’s tick-tock model, Skylake-EP introduces major improvements over the previous Haswell-EP and Broadwell-EP microarchitectures. Some of the advertised enhancements include the introduction of 512-bit SIMD (AVX 3.2/AVX-512F), three instead of two memory controllers per chip (this alone should increase memory bandwidth by 50%, not taking into account further bandwidth improvements obtained by raising the DDR memory clock speed), and the somewhat mysterious appearance of an FPGA. Unfortunately, Skylake-EP is still far away; Intel hasn’t even released the previous Broadwell-EP chips, which were originally scheduled for Q4/15 but are now expected in Q1/16. However, Skylake mobile and desktop chips have been available for quite some time, and it is about time we gave them a test-drive.

In this post I will examine an Intel Core i5-6500 chip using micro-benchmarks to investigate some of the improvements that have already found their way into the desktop chips of the Skylake architecture. According to the Intel 64 and IA-32 Architectures Optimization Reference Manual, these enhancements include improved front-end throughput, deeper out-of-order execution, higher cache bandwidths, improved divider latency, lower power consumption, improved SMT performance, and balanced floating-point add, multiply, and fused multiply-add (FMA) instruction throughput and latency. From these candidates I picked those that are, in my opinion, most relevant to my work in high performance computing: micro-op throughput in the front end, new functional units and deeper out-of-order execution in the back end, caches, and instruction latencies.

Front and Back End Improvements

The Skylake front end has seen multiple improvements regarding micro-op throughput. The legacy decode pipeline, which decodes CISC instructions to RISC-like micro-ops, can now deliver up to five (previously: four) micro-ops per cycle to the decoded micro-op queue. The decoded micro-op queue has been renamed to instruction decode queue (IDQ). The decoded instruction cache, which caches up to 1536 decoded micro-ops, was renamed as well and is now known as decode stream buffer (DSB). The DSB bandwidth to the IDQ has been increased from four to six micro-ops per cycle.

In the back end, the out-of-order window was increased by over 16%, from 192 micro-ops (Haswell) to 224. Before Haswell, there was only one floating-point add and one floating-point multiply unit. Haswell then introduced two FMA units and doubled the number of multiply units. Skylake now also doubles the number of add units, resulting in a total of two fused multiply-add, two multiply, and two add units. We verified this using a vector reduction micro-benchmark: adding up the contents of one cache line (64 byte) requires two AVX load instructions (loading 32 byte each) and two AVX add instructions (processing 32 byte of inputs each). With sufficient loop unrolling to work around the instruction retirement limit of the core, these four instructions can be processed in a single clock cycle on Skylake. Before Skylake, processing the two AVX add instructions took two cycles, because there was only one add unit. These results are shown in the figure below.
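For reference, here is a minimal sketch of such a reduction kernel using AVX intrinsics. The function name, the unrolling factor, and the alignment requirement are my own choices and not taken from the original benchmark; in practice, more independent accumulators may be needed to fully hide the four-cycle add latency.

```c
#include <immintrin.h>
#include <stddef.h>

/* Sum n doubles; a must be 32-byte aligned and n a multiple of 16.
 * Each loop iteration covers two cache lines: four 32-byte AVX loads
 * paired with four AVX adds. The four independent accumulators keep
 * both load ports and (on Skylake) both add-capable ports busy. */
double reduce(const double *a, size_t n)
{
    __m256d s0 = _mm256_setzero_pd();
    __m256d s1 = _mm256_setzero_pd();
    __m256d s2 = _mm256_setzero_pd();
    __m256d s3 = _mm256_setzero_pd();

    for (size_t i = 0; i < n; i += 16) {
        s0 = _mm256_add_pd(s0, _mm256_load_pd(&a[i +  0]));
        s1 = _mm256_add_pd(s1, _mm256_load_pd(&a[i +  4]));
        s2 = _mm256_add_pd(s2, _mm256_load_pd(&a[i +  8]));
        s3 = _mm256_add_pd(s3, _mm256_load_pd(&a[i + 12]));
    }

    /* horizontal reduction of the four accumulators */
    __m256d s = _mm256_add_pd(_mm256_add_pd(s0, s1), _mm256_add_pd(s2, s3));
    double tmp[4] __attribute__((aligned(32)));
    _mm256_store_pd(tmp, s);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```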

Cache Improvements

As advertised, cache performance is also significantly better across all cache levels on Skylake. The L1-L2 cache bandwidth is advertised as 64 byte per cycle; the L2-L3 cache bandwidth as 32 byte per cycle. In the figure above, we can see that the time it takes to process one cache line (64 byte) of data increases from one to two cycles when dropping out of the L1 cache and into the L2 cache. This corresponds exactly to the delay caused by transferring the 64-byte cache line from L2 to L1 at the advertised speed of 64 byte per cycle. When dropping out of the L2 cache and into the L3 cache, we see performance degrade from two to four cycles per cache line: again, exactly the delay we expect when transferring the cache line from L3 to L2 at a bandwidth of 32 byte per cycle. The improved in-memory performance of Skylake can be attributed to both a higher memory clock and a lower Uncore latency penalty, because the Skylake chip is only a desktop chip with four cores, compared to the 14 cores of the Haswell-EP chip.
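The cycles-per-cache-line numbers can be reproduced with a simple sweep over working-set sizes. The harness below is a minimal sketch under two assumptions worth making explicit: it reuses the hypothetical reduce() kernel sketched above, and it assumes Turbo is disabled (or the frequency is pinned) so that TSC cycles correspond to core cycles.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>          /* __rdtsc() */

double reduce(const double *a, size_t n);   /* kernel sketched above */

int main(void)
{
    /* Sweep from 16 kB (L1-resident) to 32 MB (memory-resident)
       and report cycles per 64-byte cache line. */
    for (size_t bytes = 16 * 1024; bytes <= 32 * 1024 * 1024; bytes *= 2) {
        size_t n = bytes / sizeof(double);
        double *a = aligned_alloc(64, bytes);
        for (size_t i = 0; i < n; ++i)
            a[i] = 1.0;

        volatile double sink = reduce(a, n);   /* warm up the caches */

        const int reps = 100;
        uint64_t t0 = __rdtsc();
        for (int r = 0; r < reps; ++r)
            sink = reduce(a, n);               /* volatile sink keeps calls alive */
        uint64_t t1 = __rdtsc();

        printf("%8zu kB: %.2f cycles per cache line\n", bytes / 1024,
               (double)(t1 - t0) / reps / (bytes / 64.0));
        free(a);
    }
    return 0;
}
```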

Core cache capacities remain unchanged at 32 kB L1 and 256 kB L2 per core. However, the L2 cache is no longer 8-way associative; according to Intel, the decrease to 4-way associativity was motivated mainly by reduced power consumption. The i5-6500 has 1.5 MB of L3 cache per slice, for a total of 6 MB across its four slices.

Instruction Latencies

As far as instruction latencies are concerned, there are two important things to note. First, the marketing of instruction latencies as “balanced” should be taken with a grain of salt. While most floating-point instructions now have a latency of four clock cycles, most of these latencies have increased compared to previous generations, making the positive connotation of “balanced” perhaps a bit misleading. Take floating-point multiplication, for example: its latency was decreased from five cycles (Haswell) to three cycles (Broadwell), and now (Skylake) it is four cycles. Floating-point addition, which for several generations had a latency of three cycles, has increased to four cycles.

My second remark on the subject of instruction latencies is that most latencies appear to be documented incorrectly in the current Optimization Reference Manual (Order Number: 248966-031; September 2015). The table below summarizes the measured instruction latencies; values marked with an asterisk indicate deviations between my measurements and the values given in the Optimization Reference Manual.

Instruction                                Measured   Documented   Table in Optimization Manual
-----------------------------------------------------------------------------------------------
vaddps/vaddpd (AVX)                            4           4       C-8
addps/addpd (SSE)                              5*          4       C-15, C-16
addss/addsd (scalar)                           5*          4       C-15, C-16
vmulps/vmulpd (AVX)                            4           4       C-8
mulsd/mulpd (scalar/SSE)                       4*          3       C-15
mulss/mulps (scalar/SSE)                       4           4       C-15
vfmaddxxxps (AVX)                              4           4       (balanced latency)
vfmaddxxxpd (AVX)                              5*          4       (balanced latency)
vfmaddxxxps/vfmaddxxxpd (SSE)                  5*          4       (balanced latency)
vfmaddxxxss/vfmaddxxxsd (scalar)               5*          4       (balanced latency)
vdivpd, divpd, divsd (AVX, SSE, scalar)       13*         14       C-8, C-15
vdivps, divps, divss (AVX, SSE, scalar)       11          11       C-8, C-16

All latencies are given in clock cycles.
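These latencies can be measured with a serial dependency chain: each instruction consumes the previous instruction's result, so the loop advances by exactly one instruction latency per iteration while the loop overhead executes in parallel on other ports. Below is a minimal sketch for vaddpd; the iteration count is arbitrary, and it again assumes a pinned core frequency so that TSC cycles correspond to core cycles.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* __rdtsc() */

int main(void)
{
    const uint64_t iters = 100000000ull;
    __m256d x = _mm256_set1_pd(1.0);
    __m256d y = _mm256_set1_pd(1e-9);

    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; ++i)
        x = _mm256_add_pd(x, y);    /* each add depends on the previous one */
    uint64_t t1 = __rdtsc();

    /* Print the result so the compiler cannot remove the chain. */
    double tmp[4] __attribute__((aligned(32)));
    _mm256_store_pd(tmp, x);
    printf("sum = %f, vaddpd latency: %.2f cycles\n",
           tmp[0], (double)(t1 - t0) / iters);
    return 0;
}
```

Substituting the instruction under test (e.g. a scalar addsd, or an FMA intrinsic compiled with -mfma) yields the other rows of the table.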

Most interesting to me is the different latency of AVX and SSE/scalar floating-point add instructions. While my measurements for the AVX instruction fit the documented latency of four cycles, they indicate that the SSE and scalar instructions have a latency of five instead of the documented four cycles. This suggests that there are dedicated hardware units for scalar/SSE and AVX add instructions. While this may seem unlikely at first, it makes perfect sense when taking the new Turbo mode into account: starting with Haswell, a core is clocked either at the advertised Turbo frequency or at a slightly lower frequency (called AVX Turbo), depending on whether the code uses only scalar and SSE instructions or AVX instructions as well. This means that the scalar/SSE floating-point units have to be clocked higher than the AVX units, which can be achieved by increasing the pipeline depth, e.g. from four to five stages, corresponding to an increase in latency from four to five cycles.

For AVX multiplication (vmulps/vmulpd), the measured latency of four clock cycles fits the documentation. A discrepancy arises for the scalar and SSE double-precision multiplication instructions. FMA latencies are not explicitly documented in the Optimization Reference Manual, but the “balanced” instruction latency implies a latency of four cycles. Measurements confirm this only for the vfmaddxxxps instruction; all other FMA instructions have a latency of five clock cycles. The measured latencies of the divide instructions are 13 cycles for double precision (vdivpd, divpd, divsd) and 11 cycles for single precision (vdivps, divps, divss); the documented latencies are 14 cycles for double precision and 11 cycles for single precision. The major improvement in division lies in the fact that the latencies of the AVX instructions match those of the SSE instructions, indicating full-width AVX divider units in contrast to previous architectures.

Conclusion

We verified that the Core i5-6500 Skylake chip already has some of the features of Skylake-EP, such as the second add pipeline, better cache bandwidths than Haswell-EP, and “balanced” instruction latencies. Nevertheless, we eagerly await the release of the full-blown Skylake-EP microarchitecture, which will bring additional changes such as the introduction of AVX-512.