10-12 Vol. 1

PROGRAMMING WITH INTEL® STREAMING SIMD EXTENSIONS (INTEL® SSE)

The PMULHUW (multiply packed unsigned word integers and store high result) instruction performs a SIMD 
unsigned multiply of the words in the two source operands and returns the high word of each result to an MMX 
register.
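
The per-word arithmetic described above can be sketched in portable C. This is a scalar model of the instruction's result, not the instruction itself; the function name `pmulhuw_model` and the array-based representation of an MMX register as four 16-bit lanes are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models PMULHUW on four unsigned 16-bit lanes.
 * Each lane pair is multiplied to a 32-bit product and only the
 * high 16 bits of that product are kept. */
void pmulhuw_model(const uint16_t a[4], const uint16_t b[4], uint16_t dst[4])
{
    assert(a && b && dst);
    for (int i = 0; i < 4; i++) {
        uint32_t product = (uint32_t)a[i] * (uint32_t)b[i];
        dst[i] = (uint16_t)(product >> 16);   /* high word of the result */
    }
}
```

For instance, multiplying 0xFFFF by 0xFFFF gives the 32-bit product 0xFFFE0001, so the lane result in this model is 0xFFFE.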
The PSADBW (compute sum of absolute differences) instruction computes the SIMD absolute differences of the 
corresponding unsigned byte integers in two source operands, sums the differences, and stores the sum in the low 
word of the destination operand.
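
As a rough illustration of the computation, the following portable C sketch models the 64-bit form of PSADBW on eight byte pairs. The name `psadbw_model` is hypothetical, and the function returns only the sum that the instruction would place in the low word of the destination.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models PSADBW on eight unsigned bytes.
 * Returns the sum of absolute differences; the largest possible
 * value is 8 * 255 = 2040, which fits in a 16-bit word. */
uint16_t psadbw_model(const uint8_t a[8], const uint8_t b[8])
{
    assert(a && b);
    uint16_t sum = 0;
    for (int i = 0; i < 8; i++) {
        /* absolute difference of two unsigned bytes */
        uint8_t diff = (a[i] > b[i]) ? (uint8_t)(a[i] - b[i])
                                     : (uint8_t)(b[i] - a[i]);
        sum = (uint16_t)(sum + diff);
    }
    return sum;
}
```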
The PSHUFW (shuffle packed word integers) instruction shuffles the words in the source operand according to the 
order specified by an 8-bit immediate operand and returns the result to the destination operand.
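
The shuffle can be modeled in portable C as shown below. The sketch assumes the usual encoding of the immediate, which is not spelled out in this section: bits 2i+1:2i of the 8-bit immediate select which source word is copied to destination word i. The name `pshufw_model` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models PSHUFW on four 16-bit words.
 * Assumption: each 2-bit field of imm8 selects the source word
 * for the destination word at the same field position. */
void pshufw_model(const uint16_t src[4], uint8_t imm8, uint16_t dst[4])
{
    assert(src && dst);
    uint16_t tmp[4];
    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm8 >> (2 * i)) & 0x3u;  /* 2-bit selector for word i */
        tmp[i] = src[sel];
    }
    /* copy through a temporary so dst may be the same array as src */
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}
```

Under this assumption, an immediate of 0x1B reverses the order of the four words, and an immediate of 0x00 copies word 0 to every position.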

10.4.5   MXCSR State Management Instructions

The MXCSR state management instructions (LDMXCSR and STMXCSR) load and save the state of the MXCSR 
register, respectively. The LDMXCSR instruction loads the MXCSR register from memory, while the STMXCSR 
instruction stores the contents of the register to memory.
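
A common use of this pair is a save, modify, restore sequence. The sketch below uses the C intrinsics `_mm_getcsr` and `_mm_setcsr` from `<xmmintrin.h>`, which compilers typically implement with STMXCSR and LDMXCSR. It assumes an x86 target with SSE enabled, and it assumes that bit 15 of MXCSR is the flush-to-zero control bit, a detail that is not covered in this section. The helper names are hypothetical.

```c
#include <assert.h>
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr (x86 with SSE only) */

/* Assumption: bit 15 of MXCSR is the flush-to-zero (FTZ) control bit. */
#define MXCSR_FTZ 0x8000u

/* Hypothetical helper: saves the current MXCSR value, turns FTZ on,
 * and returns the saved value so the caller can restore it later. */
unsigned int mxcsr_enable_ftz(void)
{
    unsigned int saved = _mm_getcsr();      /* store MXCSR (STMXCSR) */
    _mm_setcsr(saved | MXCSR_FTZ);          /* load MXCSR (LDMXCSR)  */
    assert((_mm_getcsr() & MXCSR_FTZ) != 0);
    return saved;
}

/* Hypothetical helper: writes a previously saved value back to MXCSR. */
void mxcsr_restore(unsigned int saved)
{
    _mm_setcsr(saved);
}
```

Restoring the saved value rather than clearing individual bits returns the register to exactly the state the caller started with.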

10.4.6   Cacheability Control, Prefetch, and Memory Ordering Instructions

SSE extensions introduce several new instructions to give programs more control over the caching of data. They 
also introduce the PREFETCHh instructions, which provide the ability to prefetch data to a specified cache level, 
and the SFENCE instruction, which enforces program ordering on stores. These instructions are described in the 
following sections.

10.4.6.1   Cacheability Control Instructions

The following three instructions enable data from the MMX and XMM registers to be stored to memory using a non-
temporal hint. The non-temporal hint directs the processor to store the data to memory without writing the data 
into the cache hierarchy. See Section 10.4.6.2, “Caching of Temporal vs. Non-Temporal Data,” for information 
about non-temporal stores and hints.
The MOVNTQ (store quadword using non-temporal hint) instruction stores packed integer data from an MMX 
register to memory, using a non-temporal hint.
The MOVNTPS (store packed single-precision floating-point values using non-temporal hint) instruction stores 
packed floating-point data from an XMM register to memory, using a non-temporal hint.
The MASKMOVQ (store selected bytes of quadword) instruction stores selected byte integers from an MMX register 
to memory, using a byte mask to selectively write the individual bytes. This instruction also uses a non-temporal 
hint.
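
The byte-selection behavior of MASKMOVQ can be sketched in portable C. This model assumes that the most significant bit of each mask byte decides whether the corresponding data byte is written, which is not stated in this section. It captures only which bytes reach memory; the non-temporal hint has no equivalent in plain C. The name `maskmovq_model` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models the byte selection of MASKMOVQ.
 * Assumption: a source byte is written only when bit 7 of the
 * corresponding mask byte is set; other destination bytes are
 * left unchanged. */
void maskmovq_model(const uint8_t src[8], const uint8_t mask[8], uint8_t *dst)
{
    assert(src && mask && dst);
    for (int i = 0; i < 8; i++) {
        if (mask[i] & 0x80u)
            dst[i] = src[i];
    }
}
```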

10.4.6.2   Caching of Temporal vs. Non-Temporal Data

Data referenced by a program can be temporal (data will be used again) or non-temporal (data will be referenced 
once and not reused in the immediate future). For example, program code is generally temporal, whereas 
multimedia data, such as the display list in a 3-D graphics application, is often non-temporal. To make efficient use of 
the processor’s caches, it is generally desirable to cache temporal data and not cache non-temporal data. 
Overloading the processor’s caches with non-temporal data is sometimes referred to as “polluting the caches.” The SSE 
and SSE2 cacheability control instructions enable a program to write non-temporal data to memory in a manner 
that minimizes pollution of caches. 
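
As an illustration of writing data that will not be reused soon, the following sketch fills a buffer with the `_mm_stream_ps` intrinsic, which compilers typically implement with MOVNTPS, and then issues `_mm_sfence`. It assumes an x86 target with SSE enabled, a 16-byte-aligned destination, and an element count that is a multiple of four. The function name `fill_nontemporal` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_set1_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: writes `value` to n floats using non-temporal
 * stores. Assumptions: dst is 16-byte aligned and n is a multiple of 4. */
void fill_nontemporal(float *dst, size_t n, float value)
{
    assert(dst != NULL);
    assert(((uintptr_t)dst % 16) == 0);
    assert(n % 4 == 0);

    __m128 v = _mm_set1_ps(value);      /* four copies of value */
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, v);      /* non-temporal 16-byte store */

    /* order the streamed stores before any later stores */
    _mm_sfence();
}
```

Whether streaming stores help depends on the access pattern; they are a reasonable candidate mainly for large buffers that are written once and not read back soon.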
These SSE and SSE2 non-temporal store instructions minimize cache pollution by treating the memory being 
accessed as the write combining (WC) type. If a program specifies a non-temporal store with one of these 
instructions and the memory type of the destination region is write back (WB), write through (WT), or write combining 
(WC), the processor will do the following:

• If the memory location being written to is present in the cache hierarchy, the data in the caches is evicted.¹

1. Some older CPU implementations (e.g., Pentium M) allowed addresses being written with a non-temporal store instruction to be 
updated in-place if the memory type was not WC and the line was already in the cache.