8-16 Vol. 3A

MULTIPLE-PROCESSOR MANAGEMENT

Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. 
Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to ensure that a 
read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O 
operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory 
(see Section 8.1.2, “Bus Locking”).
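
As a rough illustration of the locked read-modify-write described above, the sketch below builds a small spinlock from C11 atomics. The names demo_spinlock, demo_lock, demo_try_lock, demo_unlock, and demo_fetch_inc are hypothetical and do not come from this manual. On x86 targets, compilers commonly lower atomic_exchange to XCHG and atomic_fetch_add to a LOCK-prefixed instruction, but the exact encoding is a compiler decision and is assumed here, not guaranteed.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical spinlock for illustration: 0 = free, 1 = held. */
typedef struct {
    atomic_int held;
} demo_spinlock;

/* Try once to take the lock. The exchange is a single atomic
 * read-modify-write: it stores 1 and returns the previous value. */
int demo_try_lock(demo_spinlock *l)
{
    return atomic_exchange(&l->held, 1) == 0;
}

/* Spin until the exchange observes that the lock was free. */
void demo_lock(demo_spinlock *l)
{
    while (!demo_try_lock(l)) {
        /* busy-wait; a production lock would pause or yield here */
    }
}

/* Release the lock. Releasing a lock that is not held is a caller bug. */
void demo_unlock(demo_spinlock *l)
{
    assert(atomic_load(&l->held) == 1);
    atomic_store(&l->held, 0);
}

/* Atomic increment; returns the value seen before the increment. */
int demo_fetch_inc(atomic_int *counter)
{
    return atomic_fetch_add(counter, 1);
}
```

A caller would bracket its critical region with demo_lock and demo_unlock. The default sequentially consistent ordering of these atomics is stronger than strictly required for a lock, and is kept here for simplicity.
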
Program synchronization can also be carried out with serializing instructions (see Section 8.3). These instructions 
are typically used at critical procedure or task boundaries to force completion of all previous instructions before a 
jump to a new section of code or a context switch occurs. Like the I/O and locking instructions, the processor waits 
until all previous instructions have been completed and all buffered writes have been drained to memory before 
executing the serializing instruction.
The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store 
memory ordering between routines that produce weakly-ordered results and routines that consume that data. The 
functions of these instructions are as follows:

SFENCE — Serializes all store (write) operations that occurred prior to the SFENCE instruction in the program 
instruction stream, but does not affect load operations.

LFENCE — Serializes all load (read) operations that occurred prior to the LFENCE instruction in the program 
instruction stream, but does not affect store operations.²

MFENCE — Serializes all store and load operations that occurred prior to the MFENCE instruction in the 
program instruction stream.

Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory 
ordering than the CPUID instruction.
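
The producer/consumer pattern that these fences serve can be sketched as follows. This is a minimal illustration under stated assumptions, not code from this manual: the buffer, the ready flag, and the names produce, consume, STORE_FENCE, LOAD_FENCE, and FULL_FENCE are all hypothetical. When the compiler targets SSE2, the sketch uses the _mm_stream_si32, _mm_sfence, _mm_lfence, and _mm_mfence intrinsics from emmintrin.h; on other targets it falls back to C11 fences, which only approximate the x86 behavior.

```c
#include <assert.h>
#include <stdatomic.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#define STORE_FENCE() _mm_sfence()
#define LOAD_FENCE()  _mm_lfence()
#define FULL_FENCE()  _mm_mfence()
#else
/* Assumption: portable stand-ins for targets without SSE2. */
#define STORE_FENCE() atomic_thread_fence(memory_order_release)
#define LOAD_FENCE()  atomic_thread_fence(memory_order_acquire)
#define FULL_FENCE()  atomic_thread_fence(memory_order_seq_cst)
#endif

#define DEMO_N 4

static int buffer[DEMO_N];   /* data written with weakly ordered stores */
static atomic_int ready;     /* 0 = no data yet, 1 = buffer is published */

/* Producer: write the data, fence the stores, then publish the flag. */
void produce(const int *src)
{
    assert(src != 0);
    for (int i = 0; i < DEMO_N; i++) {
#if defined(__SSE2__)
        _mm_stream_si32(&buffer[i], src[i]);  /* non-temporal store */
#else
        buffer[i] = src[i];
#endif
    }
    STORE_FENCE();  /* order the data stores before the flag store */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: returns 1 and copies the data if it has been published. */
int consume(int *dst)
{
    assert(dst != 0);
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return 0;
    /* Shown to illustrate LFENCE placement; the acquire load above
     * already orders ordinary loads in the C11 model. */
    LOAD_FENCE();
    for (int i = 0; i < DEMO_N; i++)
        dst[i] = buffer[i];
    return 1;
}
```

The store fence matters here because a release store on x86 is usually compiled to a plain store, which by itself does not order earlier non-temporal stores. FULL_FENCE is defined only to show the MFENCE counterpart and is not needed by this pattern.
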
The MTRRs were introduced in the P6 family processors to define the cache characteristics for specified areas of 
physical memory. The following are two examples of how memory types set up with MTRRs can be used to strengthen 
or weaken memory ordering for the Pentium 4, Intel Xeon, and P6 family processors:

The strong uncached (UC) memory type forces a strong-ordering model on memory accesses. Here, all reads 
and writes to the UC memory region appear on the bus and out-of-order or speculative accesses are not 
performed. This memory type can be applied to an address range dedicated to memory mapped I/O devices to 
force strong memory ordering.

For areas of memory where weak ordering is acceptable, the write back (WB) memory type can be chosen. 
Here, reads can be performed speculatively and writes can be buffered and combined. For this type of memory, 
cache locking is performed on atomic (locked) operations that do not split across cache lines, which helps to 
reduce the performance penalty associated with the use of the typical synchronization instructions, such as 
XCHG, that lock the bus during the entire read-modify-write operation. With the WB memory type, the XCHG 
instruction locks the cache instead of the bus if the memory access is contained within a cache line.
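
Because cache locking applies only when the locked operand does not split across cache lines, software usually keeps lock words naturally aligned. The sketch below shows one way to check for a split and to declare an aligned lock word. It assumes a 64-byte cache line, which is common but is not stated in this section and should be confirmed for the target processor. DEMO_CACHE_LINE, splits_cache_line, and demo_aligned_lock are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; query the processor to be sure. */
#define DEMO_CACHE_LINE 64u

/* Returns 1 if the byte range [addr, addr + size) spans two or more
 * cache lines, 0 if it is contained within a single line. */
int splits_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0);
    uintptr_t first_line = addr / DEMO_CACHE_LINE;
    uintptr_t last_line  = (addr + size - 1) / DEMO_CACHE_LINE;
    return first_line != last_line;
}

/* A lock word aligned to a cache-line boundary, so a locked access
 * to it is always contained within one line. */
typedef struct {
    _Alignas(DEMO_CACHE_LINE) int32_t word;
} demo_aligned_lock;
```

For example, a 4-byte operand at offset 60 within a line stays inside that line, while the same operand at offset 62 spills into the next line and would not qualify for cache locking.
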

The PAT was introduced in the Pentium III processor to enhance the caching characteristics that can be assigned to 
pages or groups of pages. The PAT mechanism is typically used to strengthen caching characteristics at the page level 
with respect to the caching characteristics established by the MTRRs. Table 11-7 shows the interaction of the PAT 
with the MTRRs.
Intel recommends that software written to run on Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, Intel 
Xeon, and P6 family processors assume the processor-ordering model or a weaker memory-ordering model. The 
Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, Intel Xeon, and P6 family processors do not implement a 
strong memory-ordering model, except when using the UC memory type. Despite the fact that Pentium 4, Intel 
Xeon, and P6 family processors support processor ordering, Intel does not guarantee that future processors will 
support this model. To make software portable to future processors, it is recommended that operating systems 
provide critical region and resource control constructs and APIs (application program interfaces), based on I/O, 
locking, and/or serializing instructions, that can be used to synchronize access to shared areas of memory in 
multiple-processor systems. Also, software should not depend on processor ordering in situations where the system 
hardware does not support this memory-ordering model.
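
One way to follow this recommendation is to route every access to shared data through a locking API supplied by the operating system or its threading library, so that the application never relies on the ordering behavior of a particular processor. The sketch below uses a POSIX mutex as an example of such an API. The choice of POSIX threads is an assumption made for illustration, and shared_add, shared_read, and demo_worker are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;   /* accessed only while counter_lock is held */

/* Add delta to the shared counter inside the critical region. */
void shared_add(long delta)
{
    int rc = pthread_mutex_lock(&counter_lock);
    assert(rc == 0);
    shared_counter += delta;
    rc = pthread_mutex_unlock(&counter_lock);
    assert(rc == 0);
    (void)rc;   /* silences unused-variable warnings when NDEBUG is set */
}

/* Read the shared counter inside the critical region. */
long shared_read(void)
{
    int rc = pthread_mutex_lock(&counter_lock);
    assert(rc == 0);
    long value = shared_counter;
    rc = pthread_mutex_unlock(&counter_lock);
    assert(rc == 0);
    (void)rc;
    return value;
}

/* Thread entry point: arg points to a long holding the iteration count. */
void *demo_worker(void *arg)
{
    long n = *(const long *)arg;
    for (long i = 0; i < n; i++)
        shared_add(1);
    return NULL;
}
```

demo_worker has the signature expected by pthread_create, so several threads can run it concurrently. The mutex implementation, not the caller, is then responsible for issuing whatever locking or fencing instructions the underlying hardware requires.
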

2. Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction 
begins execution until LFENCE completes. As a result, an instruction that loads from memory and that precedes an 
LFENCE receives data from memory prior to completion of the LFENCE. An LFENCE that follows an instruction that 
stores to memory might complete before the data being stored have become globally visible. Instructions following 
an LFENCE may be fetched from memory before the LFENCE, but they will not execute until the LFENCE completes.