INTEL® 64 AND IA-32 ARCHITECTURES

— Improved prefetching.
— High bandwidth low latency LLC architecture.
— High bandwidth ring architecture of on-die interconnect.

For additional information on Intel® Advanced Vector Extensions (AVX), see Section 5.13, “Intel® Advanced Vector Extensions (Intel® AVX)” and Chapter 14, “Programming with AVX, FMA and AVX2” in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.

2.2.7 SIMD Instructions

Beginning with the Pentium II and Pentium with Intel MMX technology processor families, six extensions have been 
introduced into the Intel 64 and IA-32 architectures to perform single-instruction multiple-data (SIMD) operations. 
These extensions include the MMX technology, SSE extensions, SSE2 extensions, SSE3 extensions, Supplemental 
Streaming SIMD Extensions 3, and SSE4. Each of these extensions provides a group of instructions that perform 
SIMD operations on packed integer and/or packed floating-point data elements. 
SIMD integer operations can use the 64-bit MMX or the 128-bit XMM registers. SIMD floating-point operations use 
128-bit XMM registers. Figure 2-4 shows a summary of the various SIMD extensions (MMX technology, SSE, SSE2, 
SSE3, SSSE3, and SSE4), the data types they operate on, and how the data types are packed into MMX and XMM 
registers.
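
The packing described above follows from simple arithmetic: the number of elements per register is the register width divided by the element width. The sketch below is illustrative only; the names `lanes`, `MMX_BITS`, `XMM_BITS`, and `print_packing_summary` are hypothetical and do not come from the manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Register widths named in the text (hypothetical constant names). */
enum { MMX_BITS = 64, XMM_BITS = 128 };

/* Number of equally sized elements that pack into one register. */
unsigned lanes(unsigned register_bits, unsigned element_bits)
{
    assert(element_bits > 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}

/* Print how many elements of each width fit in an MMX and an XMM register. */
void print_packing_summary(void)
{
    static const unsigned element_bits[] = { 8, 16, 32, 64 };
    for (size_t i = 0; i < sizeof element_bits / sizeof element_bits[0]; ++i) {
        printf("%2u-bit elements: %2u per MMX register, %2u per XMM register\n",
               element_bits[i],
               lanes(MMX_BITS, element_bits[i]),
               lanes(XMM_BITS, element_bits[i]));
    }
}
```

For instance, a 128-bit XMM register holds four 32-bit single-precision values or two 64-bit double-precision values.
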
The Intel MMX technology was introduced in the Pentium II and Pentium with MMX technology processor families. 
MMX instructions perform SIMD operations on packed byte, word, or doubleword integers located in MMX registers. 
These instructions are useful in applications that operate on integer arrays and streams of integer data that lend 
themselves to SIMD processing.
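
To illustrate what a packed integer operation means, the following sketch emulates a wraparound packed add on a 64-bit value in portable C. It is not MMX code and does not show how the hardware implements the operation; the function name `packed_add` and its `lane_bits` parameter are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Wraparound add of packed lanes held in one 64-bit value.
   lane_bits selects byte (8), word (16), or doubleword (32) lanes. */
uint64_t packed_add(uint64_t a, uint64_t b, unsigned lane_bits)
{
    assert(lane_bits == 8 || lane_bits == 16 || lane_bits == 32);
    const uint64_t lane_mask = (UINT64_C(1) << lane_bits) - 1;
    uint64_t result = 0;
    for (unsigned shift = 0; shift < 64; shift += lane_bits) {
        uint64_t x = (a >> shift) & lane_mask;
        uint64_t y = (b >> shift) & lane_mask;
        /* Mask the sum so a carry never spills into the next lane. */
        result |= ((x + y) & lane_mask) << shift;
    }
    return result;
}
```

The same two operands give different results depending on the lane width, because carries stop at each lane boundary: a byte lane holding 0xFF wraps to 0x00 when 0x01 is added, without disturbing its neighbor.
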
SSE extensions were introduced in the Pentium III processor family. SSE instructions operate on packed single-
precision floating-point values contained in XMM registers and on packed integers contained in MMX registers. 
Several SSE instructions provide state management, cache control, and memory ordering operations. Other SSE 
instructions are targeted at applications that operate on arrays of single-precision floating-point data elements (3-D geometry, 3-D rendering, and video encoding and decoding applications).
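
As a minimal sketch of operating on an array of single-precision values, the function below adds two float arrays four elements at a time using the compiler intrinsics from `<xmmintrin.h>`. It assumes a compiler that defines `__SSE__` when SSE code generation is enabled (as GCC and Clang do) and falls back to scalar code otherwise; the function name `add_floats` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* out[i] = a[i] + b[i] for n single-precision values. */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
    assert(a && b && out);
    size_t i = 0;
#if defined(__SSE__)
    /* Four single-precision values fit in one 128-bit XMM register. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail; also the whole loop on builds without SSE. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The unaligned load and store intrinsics are used so that callers need not guarantee 16-byte alignment of their arrays.
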
SSE2 extensions were introduced in Pentium 4 and Intel Xeon processors. SSE2 instructions operate on packed 
double-precision floating-point values contained in XMM registers and on packed integers contained in MMX and 
XMM registers. SSE2 integer instructions extend IA-32 SIMD operations by adding new 128-bit SIMD integer operations and by expanding existing 64-bit SIMD integer operations to 128-bit XMM capability. SSE2 instructions also 
provide new cache control and memory ordering operations.
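
The 128-bit integer capability can be sketched with the intrinsics from `<emmintrin.h>`: the function below adds two arrays of 32-bit integers four elements per XMM register. It assumes a compiler that defines `__SSE2__` when SSE2 code generation is enabled and uses scalar code otherwise; the function name `add_u32` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_loadu_si128, _mm_add_epi32, _mm_storeu_si128 */
#endif

/* out[i] = a[i] + b[i] (modulo 2^32) for n 32-bit integers. */
void add_u32(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n)
{
    assert(a && b && out);
    size_t i = 0;
#if defined(__SSE2__)
    /* Four doubleword integers fit in one 128-bit XMM register. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
#endif
    /* Scalar tail; also the whole loop on builds without SSE2. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Both paths wrap on overflow, so the vector and scalar results agree element by element.
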
SSE3 extensions were introduced with the Pentium 4 processor supporting Hyper-Threading Technology (built on 
90 nm process technology). SSE3 offers 13 instructions that accelerate performance of Streaming SIMD Extensions technology, Streaming SIMD Extensions 2 technology, and x87-FP math capabilities.
SSSE3 extensions were introduced with the Intel Xeon processor 5100 series and Intel Core 2 processor family. 
SSSE3 offers 32 instructions to accelerate processing of SIMD integer data.
SSE4 extensions offer 54 instructions. Of these, 47 are referred to as SSE4.1 instructions; SSE4.1 was introduced with the Intel Xeon processor 5400 series and the Intel Core 2 Extreme processor QX9650. The other 7 SSE4 instructions are referred to as SSE4.2 instructions.
AESNI and PCLMULQDQ introduce 7 new instructions. Six of them, referred to as AESNI, are primitives for accelerating algorithms based on the AES encryption/decryption standard.
The PCLMULQDQ instruction performs carry-less multiplication of two binary numbers up to 64 bits wide, which accelerates general-purpose block encryption.
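
Carry-less multiplication is ordinary shift-and-add multiplication with XOR in place of addition, so no carries propagate between bit positions. The portable sketch below computes the same mathematical product in software, one bit at a time; it is an illustration of the operation, not the instruction itself, and the function name `clmul64` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Carry-less product of two 64-bit operands.
   The result (up to 127 bits) is returned as high and low 64-bit halves. */
void clmul64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    assert(hi && lo);
    uint64_t h = 0, l = 0;
    for (unsigned i = 0; i < 64; ++i) {
        if ((b >> i) & 1) {
            /* XOR in (a << i); XOR replaces addition, so nothing carries. */
            l ^= a << i;
            if (i != 0)
                h ^= a >> (64 - i);  /* bits of a shifted past bit 63 */
        }
    }
    *hi = h;
    *lo = l;
}
```

For example, the carry-less product of 3 and 3 is 5 rather than 9: in binary, 11 XOR 110 is 101, whereas ordinary addition of the same partial products would carry.
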
Intel 64 architecture allows four generations of 128-bit SIMD extensions to access up to 16 XMM registers. IA-32 
architecture provides 8 XMM registers.
Intel® Advanced Vector Extensions offers comprehensive architectural enhancements over previous generations of Streaming SIMD Extensions. Intel AVX introduces the following architectural enhancements:

Support for 256-bit wide vectors and SIMD register set.

256-bit floating-point instruction set enhancement with up to 2X performance gain relative to 128-bit 
Streaming SIMD extensions.