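That rather impressive server with 256GB of RAM appears to run fine on the surface, but on closer inspection there are telltale signs of a hidden ailment. The output of dmesg contains quite a few entries like the following: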
[799565.759238] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 or CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 page:0x3245599 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:255)
[810416.341174] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 or CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 page:0x3245599 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:255)
[821265.915541] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 or CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 page:0x3245599 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:255)
[832099.455869] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 or CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 page:0x3245599 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:255)
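When the kernel log holds many such entries, it can help to tally them per DIMM label instead of reading them one by one. The following is only a minimal sketch: count_edac_ce is a hypothetical helper name, and it assumes the messages have the same "... CE ... on <labels> (..." shape as the lines above. It counts messages, not the error count reported inside each message.

```shell
# Tally EDAC corrected-error (CE) messages per DIMM label.
# Reads dmesg-style text on stdin, prints "<messages> <label(s)>" lines.
# count_edac_ce is a hypothetical helper name, not part of any tool.
count_edac_ce() {
    # 1. keep only EDAC CE lines
    # 2. cut out the text between the last " on " and the first " ("
    # 3. count identical label strings
    grep 'EDAC MC[0-9]*: .* CE ' |
        sed -e 's/^.* CE .* on //' -e 's/ (.*$//' |
        sort | uniq -c | sed 's/^ *//'
}
```

A typical invocation would be dmesg | count_edac_ce, which prints one line per distinct label string, prefixed by the number of messages seen for it.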
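These dmesg messages report corrected errors (CE) found during memory scrubbing. Since I had not yet fully worked out how the identifiers in the messages map to the slots on the motherboard, I opened the case and swapped out the second DIMM as a test, but the problem persisted. Last night You'er Ge (又耳哥) sent me a command, grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count, which produced the following: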
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch2_ce_count:9
/sys/devices/system/edac/mc/mc0/csrow1/ch3_ce_count:0
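In a list of counters like this, a single non-zero value is easy to overlook, especially on a small screen. As a rough sketch, the loop below prints only the counters greater than zero. The name nonzero_ce_counts and its optional root argument are made up for illustration; the default path is the one used in the grep command above.

```shell
# Print only the EDAC corrected-error counters that are non-zero.
# nonzero_ce_counts is a hypothetical helper; the optional first argument
# (default: the sysfs path used above) lets it run against a test tree.
nonzero_ce_counts() {
    root=${1:-/sys/devices/system/edac/mc}
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue          # glob matched nothing, or unreadable
        n=$(cat "$f")
        case $n in
            ''|*[!0-9]*) continue ;;     # skip anything that is not a plain number
        esac
        [ "$n" -gt 0 ] && echo "$f:$n"
    done
    return 0
}
```

Called without arguments it scans the live sysfs tree; if it prints nothing, every counter it could read was zero.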
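I was on the subway at the time and ran it remotely from my phone. At first I thought the last number on every line was 0 and that nothing was wrong. I copied the output to You'er Ge, whose sharp eyes caught that the number on the second-to-last line is 9 rather than 0, so that is probably the faulty module. At noon today he threw two more commands my way: yum install edac-utils -y and edac-util -v. The results are below: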
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 9 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
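Sure enough, that ghostly 9 shows up again on the second-to-last line of the edac-util output. The current inference is that the seventh DIMM is the faulty one; I will swap it out another day and see. It feels like I am closing in on the truth.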
Update (2019-11-29):
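According to what I found online, in the output of the first query mc* denotes the Nth CPU (someone specifically cautioned that it apparently does not map one-to-one onto the physical CPUs on the motherboard), csrow* denotes the memory channel, and ch* denotes which DIMM within that channel. Note, however, that the edac-util output above pairs csrow1 and the non-zero ch2 counter with the label Chan#2_DIMM#1, which suggests that on this machine ch* follows the channel number and csrow* the DIMM slot within the channel. This mapping should therefore be treated as unverified.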
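One way to avoid guessing the mapping is to read the label the driver exports next to each counter. The sketch below assumes the legacy csrow sysfs layout seen in the first query and that a ch*_dimm_label file sits beside each ch*_ce_count file; whether those label files are present depends on the kernel and driver. The name ce_counts_with_labels and its root argument are hypothetical.

```shell
# Pair every ch*_ce_count file with the DIMM label exported next to it.
# ce_counts_with_labels is a hypothetical helper; the root argument only
# exists so the function can be run against a copy or a test tree.
ce_counts_with_labels() {
    root=${1:-/sys/devices/system/edac/mc}
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue
        label_file=${f%_ce_count}_dimm_label   # ch2_ce_count -> ch2_dimm_label
        if [ -r "$label_file" ]; then
            label=$(cat "$label_file")
        else
            label='(no label file)'            # assumption did not hold here
        fi
        rel=${f#"$root"/}                       # e.g. mc0/csrow1/ch2_ce_count
        echo "${rel%_ce_count} $label $(cat "$f")"
    done
    return 0
}
```

Each output line has the form "mcN/csrowN/chN label count", so the sysfs position and the label for the same counter can be compared side by side.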