
15-30 Vol. 3B

MACHINE-CHECK ARCHITECTURE

can use external bus related model-specific information provided with the error report to localize the source of the 
error within the system and determine the appropriate recovery strategy. 

15.10.4   Machine-Check Software Handler Guidelines for Error Recovery

15.10.4.1   Machine-Check Exception Handler for Error Recovery

When writing a machine-check exception (MCE) handler to support software recovery from Uncorrected Recoverable 
(UCR) errors, consider the following: 

When IA32_MCG_CAP[24] is zero, no recoverable errors are supported and all machine-check exceptions are fatal. 
Logging status and error information is therefore a baseline implementation requirement. 

When IA32_MCG_CAP[24] is 1, certain uncorrected errors, called uncorrected recoverable (UCR) errors, may be 
software recoverable. The handler can analyze the reported error information and, in some cases, attempt to 
recover from the uncorrected error and continue execution.
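
The capability check described above can be sketched as a small pure function. This is only an illustration: the caller is assumed to have already read IA32_MCG_CAP into a 64-bit value, and the names mce_policy and mce_select_policy are hypothetical, not part of any architecture or operating-system interface.

```c
#include <stdint.h>

/* Bit 24 of IA32_MCG_CAP: software error recovery is supported. */
#define MCG_CAP_RECOVERY_BIT (UINT64_C(1) << 24)

/* Hypothetical policy names, used only for this illustration. */
enum mce_policy {
    MCE_POLICY_LOG_AND_FATAL,  /* log, then treat every machine check as fatal */
    MCE_POLICY_TRY_RECOVERY    /* log, then analyze the error for possible recovery */
};

/* Select the handler policy from a previously read IA32_MCG_CAP value. */
static enum mce_policy mce_select_policy(uint64_t mcg_cap)
{
    return (mcg_cap & MCG_CAP_RECOVERY_BIT) ? MCE_POLICY_TRY_RECOVERY
                                            : MCE_POLICY_LOG_AND_FATAL;
}
```

Logging happens under both policies; the bit only decides whether the handler may go on to attempt recovery.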

For processors on which CPUID reports DisplayFamily_DisplayModel as 06H_0EH and onward, an MCA signal is 
broadcast to all logical processors in the system (see CPUID instruction in Chapter 3, “Instruction Set 
Reference, A-L” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A). Due to the 
potentially shared machine check MSR resources among the logical processors on the same package/core, the 
MCE handler may be required to synchronize with the other processors that received a machine check error and 
serialize access to the machine check registers when analyzing, logging and clearing the information in the 
machine check registers.
— On processors that indicate ability for local machine-check exception (MCG_LMCE_P), hardware can choose 
to report the error to only a single logical processor if system software has enabled LMCE by setting 
IA32_MCG_EXT_CTL[LMCE_EN] = 1 as outlined in Section 15.3.1.5.
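
As a rough sketch of how a handler might decide whether it has to rendezvous with the other logical processors, the function below works on register values the caller has already read. The bit positions (MCG_LMCE_P as bit 27 of IA32_MCG_CAP, LMCE_EN as bit 0 of IA32_MCG_EXT_CTL, and the LMCE_S "this exception was local" flag as bit 3 of IA32_MCG_STATUS) come from the register definitions elsewhere in the manual rather than from this section, and the function names are hypothetical. The sketch also assumes a processor on which machine checks are otherwise broadcast.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCG_CAP_LMCE_P      (UINT64_C(1) << 27) /* IA32_MCG_CAP: LMCE supported */
#define MCG_EXT_CTL_LMCE_EN (UINT64_C(1) << 0)  /* IA32_MCG_EXT_CTL: LMCE enabled */
#define MCG_STATUS_LMCE_S   (UINT64_C(1) << 3)  /* IA32_MCG_STATUS: this MCE was local */

/* True when the current machine check was delivered to this logical processor only. */
static bool mce_is_local(uint64_t mcg_cap, uint64_t mcg_ext_ctl, uint64_t mcg_status)
{
    return (mcg_cap & MCG_CAP_LMCE_P) != 0 &&
           (mcg_ext_ctl & MCG_EXT_CTL_LMCE_EN) != 0 &&
           (mcg_status & MCG_STATUS_LMCE_S) != 0;
}

/* A broadcast machine check requires synchronizing with the other processors. */
static bool mce_needs_rendezvous(uint64_t mcg_cap, uint64_t mcg_ext_ctl,
                                 uint64_t mcg_status)
{
    return !mce_is_local(mcg_cap, mcg_ext_ctl, mcg_status);
}
```

Even with LMCE enabled, hardware may still broadcast a given error, so the decision is made per exception rather than once at boot.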

The VAL (valid) flag in each IA32_MCi_STATUS register indicates whether the error information in the register 
is valid. If this flag is clear, the registers in that bank do not contain valid error information and should not be 
checked.

The MCE handler is primarily responsible for processing uncorrected errors. The UC flag in each 
IA32_MCi_STATUS register indicates whether the reported error was corrected (UC=0) or uncorrected (UC=1). 
The MCE handler can optionally log and clear the corrected errors in the MC banks, provided it implements a 
software algorithm that avoids race conditions with the CMCI or CMC polling handler.
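
The VAL and UC checks can be illustrated with a minimal classifier over IA32_MCi_STATUS values. The sketch assumes the caller has already copied each bank's status register into an array; the bit positions (VAL is bit 63, UC is bit 61) are taken from the register layout defined elsewhere in the manual, and the names mci_kind, mci_classify, and mci_count_uncorrected are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MCI_STATUS_VAL (UINT64_C(1) << 63) /* register contains valid error information */
#define MCI_STATUS_UC  (UINT64_C(1) << 61) /* error was not corrected by hardware */

enum mci_kind { MCI_INVALID, MCI_CORRECTED, MCI_UNCORRECTED };

/* Classify one IA32_MCi_STATUS snapshot. Invalid banks must not be examined further. */
static enum mci_kind mci_classify(uint64_t status)
{
    if (!(status & MCI_STATUS_VAL))
        return MCI_INVALID;
    return (status & MCI_STATUS_UC) ? MCI_UNCORRECTED : MCI_CORRECTED;
}

/* Count the banks that report a valid, uncorrected error. */
static size_t mci_count_uncorrected(const uint64_t *status, size_t nbanks)
{
    size_t count = 0;
    for (size_t i = 0; i < nbanks; i++) {
        if (mci_classify(status[i]) == MCI_UNCORRECTED)
            count++;
    }
    return count;
}
```

Banks classified as MCI_CORRECTED are the ones the MCE handler may leave to the CMCI or polling handler.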

For uncorrectable errors, the EIPV flag in the IA32_MCG_STATUS register indicates (when set) that the 
instruction pointed to by the instruction pointer pushed onto the stack when the machine-check exception is 
generated is directly associated with the error. When this flag is cleared, the instruction pointed to may not be 
associated with the error. 

The MCIP flag in the IA32_MCG_STATUS register indicates whether a machine-check exception was generated. 
When a machine check exception is generated, it is expected that the MCIP flag in the IA32_MCG_STATUS 
register is set to 1. If it is not set, this machine check was generated by either an INT 18 instruction or some 
piece of hardware signaling an interrupt with vector 18. 
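
A minimal sketch of the two IA32_MCG_STATUS checks just described, operating on a value the caller has already read from the register. The bit positions (EIPV is bit 1, MCIP is bit 2) follow the register definition elsewhere in the manual; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCG_STATUS_EIPV (UINT64_C(1) << 1) /* saved instruction pointer is tied to the error */
#define MCG_STATUS_MCIP (UINT64_C(1) << 2) /* a machine-check exception is in progress */

/* True when the instruction at the saved instruction pointer is directly
 * associated with the error. When false, it may or may not be related. */
static bool mce_ip_is_error_source(uint64_t mcg_status)
{
    return (mcg_status & MCG_STATUS_EIPV) != 0;
}

/* True when vector 18 was raised by a genuine machine-check exception.
 * False suggests an INT 18 instruction or a hardware interrupt on vector 18. */
static bool mce_is_genuine(uint64_t mcg_status)
{
    return (mcg_status & MCG_STATUS_MCIP) != 0;
}
```

A handler would typically test mce_is_genuine first, since the other status flags are only meaningful for a real machine-check exception.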

When IA32_MCG_CAP[24] is 1, the following rules apply when writing a machine-check exception (MCE) 
handler to support software recovery: 

The PCC flag in each IA32_MCi_STATUS register indicates whether recovery from the error is possible for 
uncorrected errors (UC=1). If the PCC flag is set for enabled uncorrected errors (UC=1 and EN=1), recovery is 
not possible. When recovery is not possible, the MCE handler typically records the error information and signals 
the operating system to reset the system. 

The RIPV flag in the IA32_MCG_STATUS register indicates whether program execution can be restarted from the 
instruction pointer saved on the stack for the machine-check exception. When the RIPV flag is set and recovery 
is possible, program execution can be restarted reliably. If the RIPV flag is not set, program execution cannot 
be restarted reliably; in this case, when recovery is possible, the recovery algorithm may involve terminating 
the current program execution and resuming an alternate thread of execution upon return from the machine-check 
handler. When recovery is not possible, the MCE handler signals the operating system to reset the system.
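
The PCC and RIPV rules above can be combined into a small decision function. This is a sketch covering only these two rules, for a single bank and previously read register values; a real handler applies further checks and must consider every bank. The bit positions follow the register definitions elsewhere in the manual (RIPV is bit 0 of IA32_MCG_STATUS; VAL, UC, EN, and PCC are bits 63, 61, 60, and 57 of IA32_MCi_STATUS), and the names mce_action and mce_decide are hypothetical.

```c
#include <stdint.h>

#define MCG_STATUS_RIPV (UINT64_C(1) << 0)  /* restart at the saved instruction pointer is valid */
#define MCI_STATUS_VAL  (UINT64_C(1) << 63) /* bank holds valid error information */
#define MCI_STATUS_UC   (UINT64_C(1) << 61) /* uncorrected error */
#define MCI_STATUS_EN   (UINT64_C(1) << 60) /* error reporting was enabled */
#define MCI_STATUS_PCC  (UINT64_C(1) << 57) /* processor context corrupt */

enum mce_action {
    MCE_ACTION_NOT_APPLICABLE,  /* not a valid, enabled, uncorrected error */
    MCE_ACTION_RESET,           /* record the error and ask the OS to reset */
    MCE_ACTION_RESUME,          /* return to the interrupted program */
    MCE_ACTION_SWITCH_CONTEXT   /* terminate current execution, resume another thread */
};

/* Apply the PCC rule first, then the RIPV rule, to one bank's status. */
static enum mce_action mce_decide(uint64_t mcg_status, uint64_t mci_status)
{
    const uint64_t required = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN;

    if ((mci_status & required) != required)
        return MCE_ACTION_NOT_APPLICABLE;
    if (mci_status & MCI_STATUS_PCC)
        return MCE_ACTION_RESET;            /* recovery is not possible */
    if (mcg_status & MCG_STATUS_RIPV)
        return MCE_ACTION_RESUME;           /* restart is reliable */
    return MCE_ACTION_SWITCH_CONTEXT;       /* restart is not reliable */
}
```

PCC is tested before RIPV because a corrupted processor context rules out recovery regardless of whether the saved instruction pointer is restartable.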