

PROGRAMMING WITH INTEL® STREAMING SIMD EXTENSIONS 2 (INTEL® SSE2)

Use the LDMXCSR and STMXCSR instructions to save and restore, respectively, the contents of the MXCSR register 
on a procedure call and return.

11.6.10.3   Caller-Save Recommendation for Procedure and Function Calls

When making procedure (or function) calls from SSE or SSE2 code, a caller-save convention is recommended for 
saving the state of the calling procedure. Using this convention, any register whose content must survive intact 
across a procedure call must be stored in memory by the calling procedure prior to executing the call. 
The primary reason for using the caller-save convention is to prevent performance degradation. XMM registers can 
contain packed or scalar double-precision floating-point, packed single-precision floating-point, or 128-bit 
packed integer data types. The called procedure has no way of knowing which data types the XMM registers hold 
following a call, so it is unlikely to use the correctly typed move instruction to store the contents of the XMM 
registers to memory or to restore them from memory. 
As described in Section 11.6.9, “Mixing Packed and Scalar Floating-Point and 128-Bit SIMD Integer Instructions 
and Data,”
 executing a move instruction that does not match the type of the data being moved to or from XMM registers 
will be carried out correctly, but can lead to greater instruction latency.

11.6.11  Updating Existing MMX Technology Routines Using 128-Bit SIMD Integer Instructions

SSE2 extensions extend all 64-bit MMX SIMD integer instructions to operate on 128-bit SIMD integers using XMM 
registers. The extended 128-bit SIMD integer instructions operate like the 64-bit SIMD integer instructions, which 
simplifies the porting of MMX technology applications. However, there are several considerations:

•  To take advantage of the wider 128-bit SIMD integer instructions, MMX technology code must be recompiled to 
reference the XMM registers instead of the MMX registers.

•  Computation instructions that reference memory operands that are not aligned on 16-byte boundaries should 
be replaced with an unaligned 128-bit load (MOVDQU instruction) followed by a version of the same 
computation operation that uses register instead of memory operands. Use of 128-bit packed integer 
computation instructions with memory operands that are not 16-byte aligned results in a general-protection 
exception (#GP).

•  Extension of the PSHUFW instruction (shuffle word across a 64-bit integer operand) to a full 128-bit operand 
is emulated by a combination of the PSHUFHW, PSHUFLW, and PSHUFD instructions.

•  Use of the 64-bit shift-by-bits instructions (PSRLQ, PSLLQ) can be extended to 128 bits in either of two ways:
— Use PSRLQ and PSLLQ, along with masking logic operations.
— Rewrite the code sequence to use PSRLDQ and PSLLDQ (shift double quadword operand by bytes).

•  Loop counters need to be updated, since each 128-bit SIMD integer instruction operates on twice the amount 
of data as its 64-bit SIMD integer counterpart.

11.6.12  Branching on Arithmetic Operations

There are no condition codes in SSE or SSE2 states. A packed-data comparison instruction generates a mask which 
can then be transferred to an integer register. The following code sequence provides an example of how to perform 
a conditional branch, based on the result of an SSE2 arithmetic operation. 

cmppd    XMM0, XMM1, 0   ; EQ predicate generates a mask in XMM0
movmskpd EAX, XMM0       ; moves the 2-bit mask to EAX
test     EAX, 3          ; test the two mask bits
jne      BRANCH_TARGET   ; branch if either element compared equal

The COMISD and UCOMISD instructions update the EFLAGS register with the result of a scalar comparison. A 
conditional branch can then be scheduled immediately following COMISD/UCOMISD.