RAS Feature | Description | Standard RAS SKU | Advanced RAS SKU

Error Reporting (MCA, AER) - core, uncore, and IIO
Description: Error reporting includes error logging and signaling within the core, uncore, and IIO sub-systems. It covers the following domains of error reporting:
1. Machine Check Architecture (MCA)
2. PCIe Advanced Error Reporting (AER) and additional IIO error reporting through the Integrated Error Handler (IEH)
3. Platform-specific Intel UPI error reporting
4. Platform-specific Integrated Voltage Regulator (IVR) error reporting
5. Platform-specific memory error reporting
Standard RAS SKU: Yes. Advanced RAS SKU: Yes.

Memory Corrected Error Reporting
Description: Provides per-rank corrected error counters with leaky-bucket logic. The counters can trigger SMI, NMI, or ERROR_N[0] error signaling, for platform use only. Firmware can use them to invoke the various memory device sparing and mirroring features.
Standard RAS SKU: Yes. Advanced RAS SKU: Yes.

Error Reporting via IOMCA
Description: IOMCA extends the 'Legacy IA-32 MCA' error reporting to the IIO sub-system. The processor adds a dedicated Machine Check Bank within the UBOX for reporting IIO uncorrected errors.
Standard RAS SKU: Yes. Advanced RAS SKU: Yes.

First Corrected Error (FCERR) Mode of Reporting
Description: FCERR latches the error log information specific to the first corrected error and prevents subsequent errors from overwriting the error logging registers.
Standard RAS SKU: Yes. Advanced RAS SKU: Yes.

Error Reporting through MCA 2.0 (EMCA Gen2)
Description: EMCA Gen2 is an enhancement to the 'Legacy IA-32 MCA' that supports the Firmware First Model (FFM) of error reporting. All detected errors are signaled via a System Management Interrupt (SMI) first. The SMM handler is then allowed to signal the error to SW/OS according to the platform RAS policies. (On the Advanced RAS SKU, the SMM handler can set bits to 1 in the MCA banks.)
Standard RAS SKU: Yes. Advanced RAS SKU: Yes.

PCIe Corrected Error Reporting (error counters and leaky-bucket)
Description: PCIe corrected error counter and threshold setting. Also supports leaky-bucket logic to periodically deplete the count.
Standard RAS SKU: Yes. Advanced RAS SKU: Yes.

Thresholding for Corrected Errors (Uncore MCA banks)
Description: Threshold support for CSMI generation from all uncore MCA banks. It allows corrected error events to be signaled once a threshold is reached.
Standard RAS SKU: Yes. Advanced RAS SKU: Yes.

MCA Bank Error Control (aka 'cloaking')
Description: Gives UEFI FW and PECI visibility into corrected and UCNA errors, and masks the signaling of corrected and UCNA errors to OS/SW. (OEM-specific application.)
Standard RAS SKU: Yes. Advanced RAS SKU: Yes.

CSR Error Log Cloaking
Description: DEVHIDE enables UEFI FW to fully manage platform error logs and prevents non-UEFI FW or PECI agents from accessing configuration CSRs.
Standard RAS SKU: Yes. Advanced RAS SKU: Yes.

Enhanced SMM (ESMM)
Description: Enhancements to the existing SMM mode used for platform-specific error reporting. Key attributes of this feature are:
1. Thread in Long Flow/Blocked indicators
2. Targeted SMI (not supported)
3. Detection of execution outside the SMRR region
4. SMM dump state storage into internal MSRs
5. Spurious SMI handling
6. 32-bit protected mode SMM entry
Standard RAS SKU: Yes. Advanced RAS SKU: Yes.

NOTES:

  1. RAS features may not be supported on all SKUs of a processor type.
  2. A two-socket workstation SKU supports the standard RAS SKU.
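
The table above mentions corrected error counters with leaky-bucket logic and a threshold (for memory ranks and for PCIe). The following is a minimal software sketch of that idea, not the hardware implementation: the class name, the one-count-per-interval drain rate, and the time units are all assumptions made for illustration, since the table does not specify them.

```python
from dataclasses import dataclass


@dataclass
class LeakyBucketCounter:
    """Hypothetical model of a corrected-error counter with leaky-bucket decay."""

    threshold: int         # count at which an error signal would be raised
    leak_interval: float   # assumed: one count drains per interval (seconds)
    count: int = 0
    last_leak: float = 0.0

    def _leak(self, now: float) -> None:
        # Drain one count for every full interval elapsed since the last drain.
        intervals = int((now - self.last_leak) // self.leak_interval)
        if intervals > 0:
            self.count = max(0, self.count - intervals)
            self.last_leak += intervals * self.leak_interval

    def record_error(self, now: float) -> bool:
        """Record one corrected error; return True once the threshold is reached."""
        self._leak(now)
        self.count += 1
        return self.count >= self.threshold
```

A per-rank arrangement could keep one such counter per rank, for example `counters = {rank: LeakyBucketCounter(threshold=3, leak_interval=10.0) for rank in range(4)}`. Sparse errors drain away before the threshold is reached, while a burst of errors on one rank trips it.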

11.2.1 Error Detection and Correction

The processor implements extensive error detection and correction capability within its internal modules to maintain data integrity and the target level of processor reliability.
This feature covers fault detection and correction across the entire processor. It provides data protection and data integrity through error detection within the core and uncore. It includes enhanced cache error reporting, Data Path Parity Protection (DPPP), and Address Path Parity Protection (APPP) within the processor interconnect hierarchy.
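
As a rough illustration of what parity protection on a data or address path detects, the sketch below attaches one even-parity bit to a word and checks it on receipt. The text does not describe the actual DPPP/APPP encoding, so the single even-parity bit and the function names here are assumptions.

```python
def even_parity_bit(word: int) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def protect(word: int):
    """Sender side: return the word together with its parity bit."""
    return word, even_parity_bit(word)


def check(word: int, parity: int) -> bool:
    """Receiver side: True if the word still matches its parity bit."""
    return even_parity_bit(word) == parity
```

A single flipped bit changes the parity and is detected. Two flipped bits cancel out and pass the check, which is why a single parity bit detects errors but cannot correct them.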
Detected errors are errors that the error handling logic within the processor has detected and reported. A detected error can be corrected through the embedded ECC in the IMC module or through a retried transaction. An error that cannot be corrected is called an Uncorrected Error (UC error).
When the processor is configured in legacy IA-32 MCA mode, UC errors are reported as fatal and lead to an MCE (Machine Check Exception/error), resulting in a system reset. When the processor is configured in corrupt data containment (poison or poison viral) mode, certain types of UC data errors result in a hardware poison of the data and allow the software (OS/VMware) to manage the error condition (recover or crash the system). Such errors are further classified as UCNA, SRAO, or SRAR, as described next and shown in Figure 27 on page 226.
  • Uncorrected No Action (UCNA) - Uncorrectable data is logged in the MCA bank with a unique signature ( ). The error containment bit (also known as the poison bit) is attached to the bad data and forwarded to the requesting agent. This error classification can trigger a CSMI in eMCA gen2 mode.
  • Software Recoverable Action Optional (SRAO) - The platform does not support the SRAO error type. SRAO-type events are reverted to UCNA. The OS/software action remains the same for error types that used to signal SRAO and now signal UCNA.
  • Software Recoverable Action Required (SRAR) - Bad data or a bad instruction in the core execution path can result in an SRAR error in the MCA bank status with a unique signature ( ). Examples are Data Cache Unit ( ) load/store and Instruction Fetch Unit (IFU) load operations on poisoned data. This error classification can trigger an MSMI in eMCA gen2 mode.
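
The classification described above can be sketched as a decoder over an IA32_MCi_STATUS value. The signatures are elided in the text, so this sketch assumes the general bit layout documented in the Intel Software Developer's Manual (VAL bit 63, UC bit 61, PCC bit 57, S bit 56, AR bit 55). The enum and function names are hypothetical, and the platform's actual signatures may differ.

```python
from enum import Enum

# Assumed IA32_MCi_STATUS bit positions, taken from the Intel SDM, not this document.
VAL = 1 << 63   # register contains a valid error
UC = 1 << 61    # error was not corrected
PCC = 1 << 57   # processor context corrupt
S = 1 << 56     # signaled via machine check exception
AR = 1 << 55    # software recovery action required


class ErrorClass(Enum):
    NONE = "no valid error"
    CORRECTED = "corrected"
    UCNA = "uncorrected, no action"
    SRAO = "software recoverable, action optional"
    SRAR = "software recoverable, action required"
    FATAL = "uncorrected, context corrupt"


def classify(status: int) -> ErrorClass:
    """Classify a raw status value under the assumed SDM bit layout."""
    if not status & VAL:
        return ErrorClass.NONE
    if not status & UC:
        return ErrorClass.CORRECTED
    if status & PCC:
        return ErrorClass.FATAL
    if not status & S:
        # Assumption: any S=0 value is treated as UCNA; AR is not examined here.
        return ErrorClass.UCNA
    return ErrorClass.SRAR if status & AR else ErrorClass.SRAO
```

Because the platform reverts SRAO events to UCNA, a handler built on this sketch would not expect to see the SRAO result in practice. It is kept only so that the decoder covers every combination of the assumed bits.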
Figure 27. Error Classification
  • Machine check architecture: MCA
  • Advanced error reporting (excluding UCR): AER
  • Detected but uncorrectable error: DUE
  • Uncorrected recoverable: UCR
  • Uncorrected no action required IA32_MCi_STATUS; ( : UCNA
  • SRAO IA32_MCi_STATUS;