12-12 Vol. 1
PROGRAMMING WITH INTEL® SSE3, SSSE3, INTEL® SSE4 AND INTEL® AESNI
12.10.3 Streaming Load Hint Instruction
Historically, CPU read accesses of WC memory type regions have significantly lower throughput than accesses to
cacheable memory.
The streaming load instruction in SSE4.1, MOVNTDQA, provides a non-temporal hint that can cause adjacent 16-
byte items within an aligned 64-byte region of WC memory type (a streaming line) to be fetched and held in a small
set of temporary buffers (“streaming load buffers”). Subsequent streaming loads to other aligned 16-byte items in
the same streaming line may be satisfied from the streaming load buffer and can improve throughput.
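The access pattern described above can be sketched in C using the SSE4.1 intrinsic _mm_stream_load_si128, which compilers map to MOVNTDQA. This is a minimal illustration, not code from this manual: the helper name read_streaming_line is hypothetical, and the GCC/Clang target attribute is assumed so that the block builds without a global -msse4.1 flag. On memory that is not of WC type the hint may be ignored and the loads behave as ordinary 16-byte loads, so the sketch runs on normal memory but shows a throughput benefit only on a WC mapping.

```c
#include <assert.h>
#include <stdint.h>
#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */

/* Hypothetical helper: read one aligned 64-byte streaming line with four
 * back-to-back streaming loads and copy it into an ordinary buffer. */
__attribute__((target("sse4.1")))
void read_streaming_line(const void *line, uint8_t dst[64])
{
    /* A streaming line is an aligned 64-byte region. */
    assert(((uintptr_t)line & 63u) == 0);

    __m128i *src = (__m128i *)line;

    /* The four adjacent 16-byte items of the line are read once each and
     * kept close together in time, so that later items can be satisfied
     * from the streaming load buffer filled by the first load. */
    __m128i a = _mm_stream_load_si128(src + 0);
    __m128i b = _mm_stream_load_si128(src + 1);
    __m128i c = _mm_stream_load_si128(src + 2);
    __m128i d = _mm_stream_load_si128(src + 3);

    /* The destination is assumed to be ordinary cacheable memory. */
    _mm_storeu_si128((__m128i *)(dst +  0), a);
    _mm_storeu_si128((__m128i *)(dst + 16), b);
    _mm_storeu_si128((__m128i *)(dst + 32), c);
    _mm_storeu_si128((__m128i *)(dst + 48), d);
}
```

A caller would pass a pointer to the start of a 64-byte-aligned line and a 64-byte destination buffer; how a WC mapping is obtained depends on the operating system and is outside the scope of this sketch.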
Programmers are advised to use the following practices to improve the efficiency of MOVNTDQA streaming loads
from WC memory:
• Streaming loads must be 16-byte aligned.
• Temporally group streaming loads of the same streaming line to make effective use of the small number of
streaming load buffers. If loads to the same streaming line are spaced too far apart, the streaming line may
be re-fetched from memory.
• Temporally group streaming loads from at most a few streaming lines together. The number of streaming load
buffers is small; grouping a modest number of streams avoids running out of streaming load buffers and the
resultant re-fetching of streaming lines from memory.
• Avoid writing to a streaming line until all 16-byte-aligned reads from the streaming line have occurred. Reading
a 16-byte item from a streaming line that has been written may cause the streaming line to be re-fetched.
• Avoid reading a given 16-byte item within a streaming line more than once; repeated loads of a particular
16-byte item are likely to cause the streaming line to be re-fetched.
• The streaming load buffers, reflecting the WC memory type characteristics, are not required to be snooped by
operations from other agents. Software should not rely upon such coherency actions to provide any data
coherency with respect to other logical processors or bus agents. Rather, software must ensure the consistency
of WC memory accesses between producers and consumers.
• Streaming loads may be weakly ordered and may appear to software to execute out of order with respect to
other memory operations. Software must explicitly use MFENCE if it needs to preserve order among streaming
loads or between streaming loads and other memory operations.
• Streaming loads must not be used to reference memory addresses that are mapped to I/O devices having side
effects or when reads to these devices are destructive. This is because MOVNTDQA is speculative in nature.
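Several of the practices listed above can be combined in a consumer-side copy loop. The following C sketch is illustrative only and is not taken from this manual: the function name stream_copy_lines is hypothetical, the GCC/Clang target attribute is assumed, and the placement of the two fences assumes a protocol in which the caller has just observed a "data ready" indication and will afterwards hand the buffer back to the producer. It does not address how the producer and consumer agree on that indication.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */

/* Hypothetical helper: copy nbytes from a (presumed WC) source region to an
 * ordinary destination buffer, one 64-byte streaming line at a time. */
__attribute__((target("sse4.1")))
void stream_copy_lines(uint8_t *dst, const void *wc_src, size_t nbytes)
{
    /* 64-byte alignment of the source implies the required 16-byte
     * alignment of every streaming load. */
    assert(((uintptr_t)wc_src & 63u) == 0);
    assert(nbytes % 64 == 0);

    /* Assumed protocol: order the streaming loads after earlier memory
     * operations, such as the read that observed the ready indication. */
    _mm_mfence();

    for (size_t off = 0; off < nbytes; off += 64) {
        __m128i *line = (__m128i *)((const uint8_t *)wc_src + off);

        /* All four items of one line are loaded together, each exactly
         * once, before the loop moves on to the next line. The source
         * line is never written here. */
        __m128i a = _mm_stream_load_si128(line + 0);
        __m128i b = _mm_stream_load_si128(line + 1);
        __m128i c = _mm_stream_load_si128(line + 2);
        __m128i d = _mm_stream_load_si128(line + 3);

        _mm_storeu_si128((__m128i *)(dst + off +  0), a);
        _mm_storeu_si128((__m128i *)(dst + off + 16), b);
        _mm_storeu_si128((__m128i *)(dst + off + 32), c);
        _mm_storeu_si128((__m128i *)(dst + off + 48), d);
    }

    /* Assumed protocol: order later memory operations, such as a store
     * that releases the buffer to the producer, after the streaming loads. */
    _mm_mfence();
}
```

Working through one line at a time keeps the number of streaming lines in flight at one, which is the most conservative choice given that the number of streaming load buffers is small. Such a routine should only be pointed at memory without read side effects.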
Example 12-1 provides a sketch of the basic assembly sequences that illustrate the principles of using MOVNTDQA
in a situation with a producer-consumer accessing a WC memory region.