
Vol. 3B 19-119

PERFORMANCE-MONITORING EVENTS

• The load is from bytes written by the preceding store, the store is misaligned, and the load is not aligned on the beginning of the store.
• The load is split over an eight-byte boundary (excluding 16-byte loads).
• The load and store have the same offset relative to the beginning of different 4-KByte pages. This case is also called 4-KByte aliasing.
In all these cases the load is blocked until after the blocking store retires and the stored data is committed to the cache hierarchy.
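The 4-KByte aliasing condition can be illustrated with a short address check. This is a minimal sketch, not a model of the hardware: the function name `is_4k_alias` is hypothetical, and it compares only the starting addresses of the load and the store, ignoring access sizes.

```python
PAGE_SIZE = 4096  # 4-KByte page, as named in the text


def is_4k_alias(load_addr: int, store_addr: int) -> bool:
    """Return True when the two addresses have the same offset within
    different 4-KByte pages (hypothetical helper; sizes are ignored)."""
    same_offset = (load_addr % PAGE_SIZE) == (store_addr % PAGE_SIZE)
    different_page = (load_addr // PAGE_SIZE) != (store_addr // PAGE_SIZE)
    return same_offset and different_page


# Same offset (0x040) in two different pages: aliasing.
print(is_4k_alias(0x1040, 0x5040))
# Same page and same offset: this is the same address, not aliasing.
print(is_4k_alias(0x1040, 0x1040))
```

A check like this may help when choosing buffer offsets so that a hot load and a nearby store do not fall exactly a multiple of 4 KBytes apart.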

Event Num: 03H
Umask Value: 10H
Event Name: LOAD_BLOCK.UNTIL_RETIRE
Definition: Loads blocked until retirement.
Description and Comment: This event indicates that load operations were blocked until retirement. The number of events is greater than or equal to the number of load operations that were blocked. This includes mainly uncacheable loads and split loads (loads that cross the cache line boundary) but may include other cases where loads are blocked until retirement.
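Whether a given load is a split load in the sense used here (it crosses a cache line boundary) can be checked from its address and size. The sketch below assumes a 64-byte cache line, which the text does not state; the helper name `is_split_load` and its parameters are illustrative only.

```python
CACHE_LINE_BYTES = 64  # assumption: the line size is not given in the text


def is_split_load(addr: int, size: int, boundary: int = CACHE_LINE_BYTES) -> bool:
    """Return True when the byte range [addr, addr + size) spans more than
    one `boundary`-aligned block (hypothetical helper)."""
    if size <= 0:
        raise ValueError("size must be positive")
    first_block = addr // boundary
    last_block = (addr + size - 1) // boundary
    return first_block != last_block


# An 8-byte load at offset 60 touches bytes 60..67 and crosses the line.
print(is_split_load(0x3C, 8))
# An 8-byte load at offset 56 touches bytes 56..63 and stays in one line.
print(is_split_load(0x38, 8))
```

Passing `boundary=8` to the same helper checks a load against an eight-byte boundary instead of a cache line.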

Event Num: 03H
Umask Value: 20H
Event Name: LOAD_BLOCK.L1D
Definition: Loads blocked by the L1 data cache.
Description and Comment: This event indicates that loads are blocked due to one or more reasons. Some triggers for this event are:
• The number of L1 data cache misses exceeds the maximum number of outstanding misses supported by the processor. This includes misses generated as a result of demand fetches, software prefetches, or hardware prefetches.
• Cache line split loads.
• Partial reads, such as reads to uncacheable memory, I/O instructions, and more.
• A locked load operation is in progress.
The number of events is greater than or equal to the number of load operations that were blocked.

Event Num: 04H
Umask Value: 01H
Event Name: SB_DRAIN_CYCLES
Definition: Cycles while stores are blocked due to store buffer drain.
Description and Comment: This event counts every cycle during which the store buffer is draining. This includes:
• Serializing operations such as CPUID
• Synchronizing operations such as XCHG
• Interrupt acknowledgment
• Other conditions, such as cache flushing

Event Num: 04H
Umask Value: 02H
Event Name: STORE_BLOCK.ORDER
Definition: Cycles while store is waiting for a preceding store to be globally observed.
Description and Comment: This event counts the total duration, in cycles, during which stores are waiting for a preceding stored cache line to be observed by other cores.
This situation happens as a result of the strong store ordering behavior, as defined in “Memory Ordering,” Chapter 8, Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
The stall may occur and be noticeable if there are many cases when a store either misses the L1 data cache or hits a cache line in the Shared state. If the store requires a bus transaction to read the cache line, then the stall ends when the snoop response for the bus transaction arrives.

Event Num: 04H
Umask Value: 08H
Event Name: STORE_BLOCK.SNOOP
Definition: A store is blocked due to a conflict with an external or internal snoop.
Description and Comment: This event counts the number of cycles the store port was used for snooping the L1 data cache and a store was stalled by the snoop. The store is typically resubmitted one cycle later.

Table 19-23.  Non-Architectural Performance Events in Processors Based on Intel® Core™ Microarchitecture (Contd.)

Event Num | Umask Value | Event Name | Definition | Description and Comment
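The Event Num and Umask Value columns are usually combined into a single raw event code when a counter is programmed. The sketch below assumes the architectural IA32_PERFEVTSELx layout (event select in bits 7:0, unit mask in bits 15:8) and the `rNNNN` raw-event notation accepted by Linux `perf`; the dictionary and function names are hypothetical, and the event values are copied from the rows of this table.

```python
# (event number, unit mask) pairs taken from the table rows in this section.
EVENTS = {
    "LOAD_BLOCK.UNTIL_RETIRE": (0x03, 0x10),
    "LOAD_BLOCK.L1D": (0x03, 0x20),
    "SB_DRAIN_CYCLES": (0x04, 0x01),
    "STORE_BLOCK.ORDER": (0x04, 0x02),
    "STORE_BLOCK.SNOOP": (0x04, 0x08),
}


def raw_event_code(event_num: int, umask: int) -> int:
    """Pack the event number into bits 7:0 and the unit mask into bits 15:8."""
    if not (0 <= event_num <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event number and umask must each fit in 8 bits")
    return (umask << 8) | event_num


def perf_raw_name(name: str) -> str:
    """Format an event from EVENTS as a raw descriptor such as 'r2003'."""
    event_num, umask = EVENTS[name]
    return f"r{raw_event_code(event_num, umask):04x}"


for event_name in EVENTS:
    print(event_name, perf_raw_name(event_name))
```

On a processor that implements these events, a descriptor produced this way could be passed to a tool such as `perf stat -e`. Whether a given event exists on a particular processor should be confirmed against the table for that microarchitecture.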