background image

Vol. 3B 15-13

MACHINE-CHECK ARCHITECTURE

15.4 

ENHANCED CACHE ERROR REPORTING

Starting with Intel Core Duo processors, cache error reporting was enhanced. In earlier Intel processors, cache 
status was based on the number of correction events that occurred in a cache. In the new paradigm, called 
“threshold-based error status”, cache status is based on the number of lines (ECC blocks) in a cache that incur 
repeated corrections. The threshold is chosen by Intel, based on various factors. If a processor supports threshold-
based error status, it sets IA32_MCG_CAP[11] (MCG_TES_P) to 1; if not, to 0. 
A processor that supports enhanced cache error reporting contains hardware that tracks the operating status of 
certain caches and provides an indicator of their “health”. The hardware reports a “green” status when the number 
of lines that incur repeated corrections is at or below a pre-defined threshold, and a “yellow” status when the 
number of affected lines exceeds the threshold. Yellow status means that the cache reporting the event is oper-
ating correctly, but you should schedule the system for servicing within a few weeks.
Intel recommends that you rely on this mechanism for structures supported by threshold-base error reporting. 
The CPU/system/platform response to a yellow event should be less severe than its response to an uncorrected 
error. An uncorrected error means that a serious error has actually occurred, whereas the yellow condition is a 
warning that the number of affected lines has exceeded the threshold but is not, in itself, a serious event: the error 
was corrected and system state was not compromised. 
The green/yellow status indicator is not a foolproof early warning for an uncorrected error resulting from the failure 
of two bits in the same ECC block. Such a failure can occur and cause an uncorrected error before the yellow 
threshold is reached. However, the chance of an uncorrected error increases as the number of affected lines 
increases. 

15.5 

CORRECTED MACHINE CHECK ERROR INTERRUPT

Corrected machine-check error interrupt (CMCI) is an architectural enhancement to the machine-check architec-
ture. It provides capabilities beyond those of threshold-based error reporting (Section 15.4). With threshold-based 
error reporting, software is limited to use periodic polling to query the status of hardware corrected MC errors. 
CMCI provides a signaling mechanism to deliver a local interrupt based on threshold values that software can 
program using the IA32_MCi_CTL2 MSRs. 
CMCI is disabled by default. System software is required to enable CMCI for each IA32_MCi bank that support the 
reporting of hardware corrected errors if IA32_MCG_CAP[10] = 1.
System software use IA32_MCi_CTL2 MSR to enable/disable the CMCI capability for each bank and program 
threshold values into IA32_MCi_CTL2 MSR. CMCI is not affected by the CR4.MCE bit, and it is not affected by the 
IA32_MCi_CTL MSRs.
To detect the existence of thresholding for a given bank, software writes only bits 14:0 with the threshold value. If 
the bits persist, then thresholding is available (and CMCI is available). If the bits are all 0's, then no thresholding 
exists. To detect that CMCI signaling exists, software writes a 1 to bit 30 of the MCi_CTL2 register. Upon subse-
quent read, if bit 30 = 0, no CMCI is available for this bank and no corrected or UCNA errors will be reported on this 
bank. If bit 30 = 1, then CMCI is available and enabled.

15.5.1 

CMCI Local APIC Interface

The operation of CMCI is depicted in Figure 15-10.