11-8 Vol. 3A

MEMORY CACHE CONTROL

buffer. The WC buffer is not snooped and thus does not provide data coherency. Buffering of writes to WC memory 
is done to allow software a small window of time to supply more modified data to the WC buffer while remaining as 
non-intrusive to software as possible. The buffering of writes to WC memory also causes data to be collapsed; that 
is, multiple writes to the same memory location will leave the last data written in the location and the other writes 
will be lost.
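
The collapsing behavior described above can be illustrated with a toy software model. This is only a sketch: the class name `WCBufferModel`, its methods, and the default 64-byte size are assumptions made for illustration and do not correspond to any architectural interface.

```python
class WCBufferModel:
    """Toy model of one write-combining buffer (illustrative, not real hardware)."""

    def __init__(self, size=64):
        self.size = size
        self.data = bytearray(size)
        self.valid = [False] * size  # one valid flag per byte

    def write(self, offset, payload):
        """Record a store; a later store to the same byte replaces the earlier one."""
        if offset < 0 or offset + len(payload) > self.size:
            raise ValueError("write falls outside the buffer")
        for i, value in enumerate(payload):
            self.data[offset + i] = value
            self.valid[offset + i] = True

    def evict(self):
        """Return {offset: byte} for every valid byte, then empty the buffer."""
        flushed = {i: self.data[i] for i in range(self.size) if self.valid[i]}
        self.data = bytearray(self.size)
        self.valid = [False] * self.size
        return flushed


buf = WCBufferModel()
buf.write(0, b"\x11\x11")  # bytes 0 and 1
buf.write(0, b"\x22")      # byte 0 again, before any eviction
flushed = buf.evict()      # only the last value written to byte 0 survives
```

In this model, memory never observes the first value written to byte 0, which is the sense in which the earlier write is lost.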
The size and structure of the WC buffer are not architecturally defined. For the Intel Core 2 Duo, Intel Atom, Intel 
Core Duo, Pentium M, Pentium 4, and Intel Xeon processors, the WC buffer is made up of several 64-byte WC 
buffers. For the P6 family processors, the WC buffer is made up of several 32-byte WC buffers. 
When software begins writing to WC memory, the processor begins filling the WC buffers one at a time. When one 
or more WC buffers have been filled, the processor has the option of evicting the buffers to system memory. The 
protocol for evicting the WC buffers is implementation dependent and should not be relied on by software for 
system memory coherency. When using the WC memory type, software must be sensitive to the fact that the 
writing of data to system memory is being delayed and must deliberately empty the WC buffers when system 
memory coherency is required.
Once the processor has started to evict data from the WC buffer into system memory, it will make a bus-transaction 
style decision based on how much of the buffer contains valid data. If the buffer is full (that is, all bytes are 
valid), the processor will execute a burst-write transaction on the bus. This results in all 32 bytes (P6 family 
processors) or 64 bytes (Pentium 4 and more recent processors) being transmitted on the data bus in a single burst 
transaction. If one or more of the WC buffer’s bytes are invalid (that is, have not been written by software), the 
processor will transmit the data to memory using “partial write” transactions (one chunk at a time, where a “chunk” 
is 8 bytes). This will result in a maximum of 4 partial write transactions (for P6 family processors) or 8 partial 
write transactions (for the Pentium 4 and more recent processors) for one WC buffer of data sent to memory. 
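
The transaction counts above follow from simple arithmetic on the buffer's valid bytes, which the following sketch works through. The function name `eviction_transactions` and the per-byte list of valid flags are hypothetical. The sketch also assumes that a chunk with no valid bytes generates no transaction, which the text above does not state; it only gives the maximum.

```python
CHUNK_BYTES = 8  # chunk size given in the text


def eviction_transactions(valid, chunk_bytes=CHUNK_BYTES):
    """Classify the eviction of one WC buffer.

    valid: list of booleans, one per byte of the buffer (32 or 64 entries).
    Returns ("burst", 1) for a full buffer, otherwise ("partial", n), where n
    is the number of chunks holding at least one valid byte (an assumption).
    """
    if len(valid) % chunk_bytes != 0:
        raise ValueError("buffer size must be a multiple of the chunk size")
    if all(valid):
        return ("burst", 1)
    chunks = [valid[i:i + chunk_bytes] for i in range(0, len(valid), chunk_bytes)]
    return ("partial", sum(1 for chunk in chunks if any(chunk)))


full_64 = [True] * 64
one_byte_missing = [True] * 63 + [False]   # 64-byte buffer, last byte never written
sparse_32 = [False] * 32
sparse_32[0] = True                        # chunk 0
sparse_32[9] = True                        # chunk 1
```

Under these assumptions, a single unwritten byte in a 64-byte buffer turns one burst into 8 partial writes, which is why filling whole buffers matters for bus efficiency.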
The WC memory type is weakly ordered by definition. Once the eviction of a WC buffer has started, the data is 
subject to the weak ordering semantics of its definition. Ordering is not maintained between the successive 
allocation/deallocation of WC buffers (for example, writes to WC buffer 1 followed by writes to WC buffer 2 may appear 
as buffer 2 followed by buffer 1 on the system bus). When a WC buffer is evicted to memory as partial writes, there 
is no guaranteed ordering between successive partial writes (for example, a partial write for chunk 2 may appear 
on the bus before the partial write for chunk 1 or vice versa). 
The only elements of WC propagation to the system bus that are guaranteed are those provided by transaction 
atomicity. For example, with a P6 family processor, a completely full WC buffer will always be propagated as a 
single 32-byte burst transaction using any chunk order. In a WC buffer eviction where data will be evicted as partials, 
all data contained in the same chunk (0 mod 8 aligned) will be propagated simultaneously. Likewise, for more 
recent processors starting with those based on Intel NetBurst microarchitecture, a full WC buffer will always be 
propagated as a single burst transaction, using any chunk order within a transaction. For partial buffer 
propagations, all data contained in the same chunk will be propagated simultaneously.
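
One way to read the guarantee for partial evictions is as a check on an observed sequence of bus transactions: the valid bytes of each 8-byte-aligned chunk must arrive together, while the order of the chunks is unconstrained. The sketch below encodes that reading. The function `is_legal_partial_eviction` and its representation of a transaction as a set of byte offsets are hypothetical, and the model ignores chunks with no valid bytes.

```python
def is_legal_partial_eviction(valid_offsets, observed, chunk_bytes=8):
    """Check an observed partial-write sequence against the atomicity rule.

    valid_offsets: iterable of byte offsets written by software.
    observed: list of sets of byte offsets, one set per partial write on the bus.
    """
    expected = {}
    for offset in valid_offsets:
        # Bytes sharing offset // chunk_bytes belong to the same aligned chunk.
        expected.setdefault(offset // chunk_bytes, set()).add(offset)
    # Chunk order is not guaranteed, so compare as unordered collections.
    seen = sorted(sorted(transaction) for transaction in observed)
    wanted = sorted(sorted(chunk) for chunk in expected.values())
    return seen == wanted


written = {0, 1, 8, 9}                    # two bytes in chunk 0, two in chunk 1
in_order = [{0, 1}, {8, 9}]
reordered = [{8, 9}, {0, 1}]              # chunk 1 before chunk 0: still legal
split_chunk = [{0}, {1}, {8, 9}]          # chunk 0 split in two: violates atomicity
```

Software that needs a stronger ordering than this cannot obtain it from the eviction protocol itself.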

11.3.2 Choosing a Memory Type

The simplest system memory model does not use memory-mapped I/O with read or write side effects, does not 
include a frame buffer, and uses the write-back memory type for all memory. An I/O agent can perform direct 
memory access (DMA) to write-back memory and the cache protocol maintains cache coherency.
A system can use strong uncacheable memory for other memory-mapped I/O, and should always use strong 
uncacheable memory for memory-mapped I/O with read side effects.
Dual-ported memory can be considered a write side effect, making relatively prompt writes desirable, because 
those writes cannot be observed at the other port until they reach the memory agent. A system can use strong 
uncacheable, uncacheable, write-through, or write-combining memory for frame buffers or dual-ported memory 
that contains pixel values displayed on a screen. Frame buffer memory is typically large (a few megabytes) and is 
usually written more than it is read by the processor. Using strong uncacheable memory for a frame buffer 
generates very large amounts of bus traffic, because operations on the entire buffer are implemented using partial writes 
rather than line writes. Using write-through memory for a frame buffer can displace almost all other useful cached 
lines in the processor's L2 and L3 caches and L1 data cache. Therefore, systems should use write-combining 
memory for frame buffers whenever possible.
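
The guidance in this section can be condensed into a small decision helper. This is a sketch only: the function `suggest_memory_type` and its boolean parameters are hypothetical, the return values are the usual abbreviations for the memory types (UC for strong uncacheable, WC for write combining, WB for write back), and the order of the checks is an assumption about how to resolve overlapping cases.

```python
def suggest_memory_type(is_mmio=False, read_side_effects=False, is_frame_buffer=False):
    """Return a memory type abbreviation following the guidance in this section."""
    if is_mmio and read_side_effects:
        return "UC"  # memory-mapped I/O with read side effects: always strong uncacheable
    if is_frame_buffer:
        return "WC"  # frame buffers: write combining whenever possible
    if is_mmio:
        return "UC"  # other memory-mapped I/O: strong uncacheable can be used
    return "WB"      # ordinary system memory, including DMA targets


device_registers = suggest_memory_type(is_mmio=True, read_side_effects=True)
frame_buffer = suggest_memory_type(is_frame_buffer=True)
ordinary_ram = suggest_memory_type()
```

The helper leaves out dual-ported memory, for which the text allows several memory types without preferring one.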