PROGRAMMING WITH INTEL® SSE3, SSSE3, INTEL® SSE4 AND INTEL® AESNI
increase support for packed dword computation. The technology also provides a hint that can improve memory
throughput when reading from uncacheable write-combining (WC) memory.
The 47 SSE4.1 instructions include:
• Two instructions perform packed dword multiplies.
• Two instructions perform floating-point dot products with input/output selects.
• One instruction performs a load with a streaming hint.
• Six instructions simplify packed blending.
• Eight instructions expand support for packed integer MIN/MAX.
• Four instructions support floating-point round with selectable rounding mode and precision exception override.
• Seven instructions improve data insertion and extractions from XMM registers.
• Twelve instructions improve packed integer format conversions (sign and zero extensions).
• One instruction improves SAD (sum absolute difference) generation for small block sizes.
• One instruction aids horizontal searching operations.
• One instruction improves masked comparisons.
• One instruction adds qword packed equality comparisons.
• One instruction adds dword packing with unsigned saturation.
The SSE4.2 instructions operating on XMM registers improve performance in the following areas:
• String and text processing that can take advantage of single-instruction multiple-data programming
  techniques.
• A SIMD integer instruction that enhances the 128-bit integer SIMD capability in SSE4.1.
12.10 SSE4.1 INSTRUCTION SET
12.10.1 Dword Multiply Instructions
SSE4.1 adds two dword multiply instructions that aid vectorization. PMULLD performs four simultaneous 32-bit by
32-bit multiplies and returns the low 32 bits of each result. PMULDQ multiplies the first and third signed dwords
of each operand and returns two 64-bit signed results. These represent the most common integer multiply
operations. See Table 12-2.
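The difference between the two instructions can be illustrated with a scalar reference model. The sketch below is not the hardware definition; the function names pmulld_model and pmuldq_model are hypothetical, and arrays of four dwords stand in for an XMM register (element 0 is the lowest dword). In compiled code these operations are normally reached through the _mm_mullo_epi32 and _mm_mul_epi32 intrinsics declared in <smmintrin.h>.

```c
#include <stdint.h>

/* Scalar model of PMULLD (hypothetical helper, for illustration only):
 * four independent 32-bit by 32-bit multiplies, keeping the low 32 bits
 * of each product. The low 32 bits are the same for signed and unsigned
 * operands, so the multiply is done on unsigned values to avoid signed
 * overflow. The final cast back to int32_t wraps on mainstream compilers. */
static void pmulld_model(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        out[i] = (int32_t)((uint32_t)a[i] * (uint32_t)b[i]);
    }
}

/* Scalar model of PMULDQ (hypothetical helper, for illustration only):
 * only dwords 0 and 2 of each operand take part; each pair is
 * sign-extended and multiplied to give a full 64-bit signed product. */
static void pmuldq_model(const int32_t a[4], const int32_t b[4], int64_t out[2])
{
    out[0] = (int64_t)a[0] * (int64_t)b[0];
    out[1] = (int64_t)a[2] * (int64_t)b[2];
}
```

With a = {100000, -3, 7, 65536} and b = {100000, 5, -2, 65536}, the PMULLD model truncates the first product (10000000000) to 1410065408 and the last product (2^32) to 0, while the PMULDQ model keeps the first product in full and ignores dwords 1 and 3.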
12.10.2 Floating-Point Dot Product Instructions
SSE4.1 adds two instructions for double-precision (for up to 2 elements; DPPD) and single-precision dot products
(for up to 4 elements; DPPS).
These dot-product instructions include source select and destination broadcast, which generally improves
flexibility. For example, a single DPPS instruction can be used for a 2, 3, or 4 element dot product.
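The source select and destination broadcast can be sketched with a scalar model of DPPS. This is an illustration under stated assumptions, not the architectural definition: the function name dpps_model is hypothetical, and arrays of four floats stand in for an XMM register. The model assumes the usual encoding of the immediate byte, in which bits 7:4 select the element products that enter the sum and bits 3:0 select the destination elements that receive it (unselected destinations are zeroed). The corresponding compiler intrinsic is _mm_dp_ps, declared in <smmintrin.h>.

```c
#include <stdint.h>

/* Scalar model of DPPS (hypothetical helper, for illustration only).
 * imm8 bits 7:4: product i is included in the sum when bit (4 + i) is set.
 * imm8 bits 3:0: destination element i receives the sum when bit i is set,
 *                otherwise it is written with 0.0f. */
static void dpps_model(const float a[4], const float b[4], uint8_t imm8,
                       float out[4])
{
    float p[4];

    /* Source select: unselected products contribute 0.0f. */
    for (int i = 0; i < 4; i++) {
        p[i] = (imm8 & (1u << (4 + i))) ? a[i] * b[i] : 0.0f;
    }

    /* Pairwise summation: low pair plus high pair. */
    float sum = (p[0] + p[1]) + (p[2] + p[3]);

    /* Destination broadcast. */
    for (int i = 0; i < 4; i++) {
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f;
    }
}
```

Under this model, an immediate of 0x71 gives a 3-element dot product in element 0 only, 0x33 gives a 2-element dot product in elements 0 and 1, and 0xFF gives a 4-element dot product broadcast to all four elements.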
Table 12-2. Enhanced 32-bit SIMD Multiply Supported by SSE4.1

                       32 bit Integer Operation
Result          unsigned x unsigned    signed x signed
Low 32-bit      (not available)        PMULLD
High 32-bit     (not available)        (not available)
64-bit          PMULUDQ*               PMULDQ

NOTE:
* Available prior to SSE4.1.