15-26 Vol. 3B

MACHINE-CHECK ARCHITECTURE

In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this 
machine check exception and restarting execution with the interrupted context is not possible. System 
software may take the following recovery actions for the affected logical processor: 

The currently executing thread cannot be continued. System software must terminate the interrupted 
stream of execution and provide a new stream of execution on return from the machine-check handler 
for the affected logical processor.

SRAR Error And Non-Affected Logical Processors

Logical processors that observed but were not affected by an SRAR error should find that the RIPV flag in the 
IA32_MCG_STATUS register is set and the EIPV flag in the IA32_MCG_STATUS register is cleared. This indicates that 
it is safe to restart execution at the instruction pointer saved on the stack for the machine-check exception on 
these processors once system software has successfully taken the recovery action. 
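
The flag test described above can be sketched in C as follows. This is a minimal illustration rather than a definitive implementation: the helper name srar_not_affected is hypothetical, and the value of IA32_MCG_STATUS is assumed to have been read already (for example with RDMSR at privilege level 0). The bit positions used are the architectural ones (RIPV is bit 0, EIPV is bit 1).

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural flag positions in IA32_MCG_STATUS. */
#define MCG_STATUS_RIPV (1ULL << 0) /* restart IP valid */
#define MCG_STATUS_EIPV (1ULL << 1) /* error IP valid   */

/*
 * Hypothetical helper: returns true when the flags match the pattern
 * for a logical processor that observed an SRAR error but is not
 * affected by it (RIPV = 1, EIPV = 0).
 */
static bool srar_not_affected(uint64_t mcg_status)
{
    return (mcg_status & MCG_STATUS_RIPV) != 0 &&
           (mcg_status & MCG_STATUS_EIPV) == 0;
}
```

A handler would call this helper only after it has identified an SRAR error in the MC banks; the flags alone do not identify the error type.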

15.9.4  Multiple MCA Errors

When multiple MCA errors are detected within a certain detection window, the processor may aggregate the 
reporting of these errors together as a single event, that is, a single machine-check exception condition.  If this occurs, 
system software may find multiple MCA errors logged in different MC banks on one logical processor or find multiple 
MCA errors logged across different processors for a single machine check broadcast event.  In order to handle 
multiple UCR errors reported from a single machine check event and possibly recover from multiple errors, system 
software may consider the following: 

Whether system software can recover from multiple errors is determined by the most severe error reported on the system.  If 
the most severe error is found to be an unrecoverable error (VAL=1, UC=1, PCC=1 and EN=1) after system 
software examines the MC banks of all processors to which the MCA signal is broadcast, recovery from the 
multiple errors is not possible and system software needs to reset the system. 

When multiple recoverable errors are reported and no other fatal condition (e.g., an overflow condition for an SRAR 
error) is found for the reported recoverable errors, it is possible for system software to recover from the 
multiple recoverable errors by taking the necessary recovery action for each individual recoverable error. However, 
system software can no longer expect a one-to-one relationship between the error information recorded in the 
IA32_MCi_STATUS register and the states of the RIPV and EIPV flags in the IA32_MCG_STATUS register, because the 
states of the RIPV and EIPV flags in the IA32_MCG_STATUS register may reflect the information for the 
most severe error recorded on the processor. System software is required to use the RIPV flag indication in the 
IA32_MCG_STATUS register to make a final decision on the recoverability of the errors and to determine the restartability 
requirement after examining the error information in each IA32_MCi_STATUS register in the MC banks. 

In certain cases where system software observes more than one SRAR error logged for a single logical 
processor, it can no longer rely on affected threads as specified in Table 15-20 above. It is recommended that 
system software reset the system if this condition is observed. 
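
The considerations above can be sketched as a per-processor decision routine. This is an illustrative sketch under stated assumptions, not a definitive handler: the names classify_mc_event and enum mc_verdict are hypothetical, the bank status values are assumed to have been read into an array beforehand, and the per-error recovery actions are assumed to be performed elsewhere. The flag positions are the architectural ones, and an SRAR error is identified by UC=1, EN=1, PCC=0, S=1, AR=1. A complete handler would also combine the results from every logical processor that received the broadcast, since one unrecoverable error anywhere forces a reset.

```c
#include <stddef.h>
#include <stdint.h>

/* Architectural flag positions in IA32_MCi_STATUS and IA32_MCG_STATUS. */
#define MCI_STATUS_VAL  (1ULL << 63)
#define MCI_STATUS_OVER (1ULL << 62)
#define MCI_STATUS_UC   (1ULL << 61)
#define MCI_STATUS_EN   (1ULL << 60)
#define MCI_STATUS_PCC  (1ULL << 57)
#define MCI_STATUS_S    (1ULL << 56)
#define MCI_STATUS_AR   (1ULL << 55)
#define MCG_STATUS_RIPV (1ULL << 0)

/* Hypothetical verdicts for one logical processor. */
enum mc_verdict {
    MC_RESUME,     /* restart at the IP saved on the stack            */
    MC_NEW_STREAM, /* terminate the interrupted stream, run another   */
    MC_RESET       /* recovery is not possible, reset the system      */
};

static enum mc_verdict classify_mc_event(const uint64_t *bank_status,
                                         size_t bank_count,
                                         uint64_t mcg_status)
{
    const uint64_t fatal = MCI_STATUS_UC | MCI_STATUS_PCC | MCI_STATUS_EN;
    const uint64_t srar_mask = MCI_STATUS_UC | MCI_STATUS_EN |
                               MCI_STATUS_PCC | MCI_STATUS_S | MCI_STATUS_AR;
    const uint64_t srar_bits = MCI_STATUS_UC | MCI_STATUS_EN |
                               MCI_STATUS_S | MCI_STATUS_AR;
    size_t srar_count = 0;

    for (size_t i = 0; i < bank_count; i++) {
        uint64_t s = bank_status[i];

        if (!(s & MCI_STATUS_VAL))
            continue;                 /* nothing logged in this bank */
        if ((s & fatal) == fatal)
            return MC_RESET;          /* VAL=1, UC=1, PCC=1, EN=1 */
        if ((s & srar_mask) == srar_bits) {
            if (s & MCI_STATUS_OVER)
                return MC_RESET;      /* overflowed SRAR error is fatal */
            srar_count++;
        }
    }

    if (srar_count > 1)
        return MC_RESET;              /* recommended when >1 SRAR is logged */

    /* RIPV makes the final restartability decision. */
    return (mcg_status & MCG_STATUS_RIPV) ? MC_RESUME : MC_NEW_STREAM;
}
```

Returning early on the first fatal condition reflects the rule that the most severe error decides the outcome; the remaining banks would still need to be logged before the reset in practice.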

15.9.5  Machine-Check Error Codes Interpretation

Chapter 16, “Interpreting Machine-Check Error Codes,” provides information on interpreting the MCA error code, 
model-specific error code, and other information error code fields. For P6 family processors, information has been 
included on decoding external bus errors. For Pentium 4 and Intel Xeon processors, information is included on 
external bus, internal timer, and cache hierarchy errors.
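
As a small illustration of where two of these fields sit, the following sketch extracts the MCA error code (bits 15:0) and the model-specific error code (bits 31:16) from an IA32_MCi_STATUS value. The helper names are hypothetical, and decoding the extracted values still requires the tables in Chapter 16.

```c
#include <stdint.h>

/* Hypothetical helper: MCA error code, bits 15:0 of IA32_MCi_STATUS. */
static uint16_t mca_error_code(uint64_t mci_status)
{
    return (uint16_t)(mci_status & 0xFFFFu);
}

/* Hypothetical helper: model-specific error code, bits 31:16. */
static uint16_t model_specific_error_code(uint64_t mci_status)
{
    return (uint16_t)((mci_status >> 16) & 0xFFFFu);
}
```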

15.10  GUIDELINES FOR WRITING MACHINE-CHECK SOFTWARE

The machine-check architecture and error logging can be used in three different ways:

To detect machine errors during normal instruction execution, using the machine-check exception (#MC).

To periodically check and log machine errors.

To examine recoverable UCR errors, determine software recoverability and perform recovery actions via a 
machine-check exception handler or a corrected machine-check interrupt handler.
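
The second usage, periodic checking and logging, can be sketched as follows. This is a simplified illustration under several assumptions: the array bank_status stands in for the IA32_MCi_STATUS registers (real system software would access them with RDMSR and WRMSR at privilege level 0), the names poll_mc_banks and struct mc_record are hypothetical, and the poller leaves uncorrected errors in place for the machine-check exception handler, which is a design choice of this sketch rather than a requirement stated here.

```c
#include <stddef.h>
#include <stdint.h>

/* Architectural flag positions in IA32_MCi_STATUS. */
#define MCI_STATUS_VAL (1ULL << 63)
#define MCI_STATUS_UC  (1ULL << 61)

/* Hypothetical log record for one reported error. */
struct mc_record {
    unsigned bank;
    uint64_t status;
};

/*
 * Scan every bank once. Each valid, corrected error is copied into
 * the log and its status is cleared by writing zero. Returns the
 * number of records written (at most log_cap).
 */
static size_t poll_mc_banks(uint64_t *bank_status, size_t bank_count,
                            struct mc_record *log, size_t log_cap)
{
    size_t logged = 0;

    for (size_t i = 0; i < bank_count && logged < log_cap; i++) {
        uint64_t status = bank_status[i];

        if (!(status & MCI_STATUS_VAL))
            continue;               /* no error logged in this bank */
        if (status & MCI_STATUS_UC)
            continue;               /* assumption: left for the #MC handler */

        log[logged].bank = (unsigned)i;
        log[logged].status = status;
        logged++;

        bank_status[i] = 0;         /* clear the bank for the next error */
    }
    return logged;
}
```

Clearing each bank after logging matters because a bank that still holds a valid entry may report overflow when the next error arrives.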