Kunpeng 920 servers repeatedly hit "Internal error: Oops: 96000004 [#1] SMP" (NULL pointer dereference, kdump)

honda2022 · July 21, 2024, 03:38

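Several Kunpeng 920 servers hit a NULL pointer dereference during FIO stress testing, which triggers kdump. The crash shows up on CPU1 every time (#cpu1). After binding FIO to other CPUs (anything except CPU1), the test runs normally. I can't tell whether this is a problem in the RAID module or a hardware problem with the machines.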
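For anyone trying to reproduce the workaround, here is a minimal sketch of building an fio command line that keeps the job off one CPU via fio's `cpus_allowed` option. The device path, job name, CPU count, and all I/O parameters are hypothetical placeholders, not the settings used in the original test.

```python
def cpus_allowed_excluding(total_cpus, excluded):
    """Return a comma-separated CPU list that skips the excluded CPUs."""
    return ",".join(str(c) for c in range(total_cpus) if c not in excluded)


def build_fio_cmd(device, cpus):
    """Build an fio argument list pinned to the given CPU list."""
    # Every job parameter here is an illustrative assumption.
    return [
        "fio",
        "--name=raid-stress",
        f"--filename={device}",
        "--rw=randread",
        "--bs=4k",
        "--ioengine=libaio",
        "--iodepth=32",
        "--direct=1",
        "--time_based",
        "--runtime=600",
        f"--cpus_allowed={cpus}",
    ]


# Example: an 8-CPU machine with CPU 1 excluded (both values are assumptions).
cmd = build_fio_cmd("/dev/sdX", cpus_allowed_excluding(8, {1}))
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run` would start the job; it is only printed here so that nothing touches a real device.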

[ 3086.276313] Unable to handle kernel NULL pointer dereference at virtual address 00000000000001b8
[ 3086.287480] Mem abort info:
[ 3086.293571]   ESR = 0x96000004
[ 3086.300236]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 3086.308656]   SET = 0, FnV = 0
[ 3086.320182]   EA = 0, S1PTW = 0
[ 3086.324025] Data abort info:
[ 3086.327608]   ISV = 0, ISS = 0x00000004
[ 3086.332144]   CM = 0, WnR = 0
[ 3086.335814] user pgtable: 4k pages, 48-bit VAs, pgdp=000020206e378000
[ 3086.342961] [00000000000001b8] pgd=0000000000000000, p4d=0000000000000000
[ 3086.350462] Internal error: Oops: 96000004 [#1] SMP
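The fields the kernel prints under "Mem abort info" can be re-derived from the raw ESR value. The following is a minimal sketch, assuming the standard ESR_EL1 layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and for data aborts WnR in ISS bit 6 and the fault status code in ISS bits 5:0). The function name and the small status-code table are my own and cover only translation faults.

```python
# Only translation-fault status codes are listed; anything else maps to "other".
DFSC_NAMES = {
    0x04: "translation fault, level 0",
    0x05: "translation fault, level 1",
    0x06: "translation fault, level 2",
    0x07: "translation fault, level 3",
}


def decode_esr(esr):
    """Split an ESR_EL1 value for a data abort into its main fields."""
    iss = esr & 0x1FFFFFF              # ISS: bits [24:0]
    return {
        "ec": (esr >> 26) & 0x3F,      # exception class: bits [31:26]
        "il": (esr >> 25) & 0x1,       # instruction length bit
        "iss": iss,
        "wnr": (iss >> 6) & 0x1,       # 0 = read, 1 = write
        "dfsc": DFSC_NAMES.get(iss & 0x3F, "other"),
    }


info = decode_esr(0x96000004)
print(info)
```

For this oops the result is EC = 0x25 (data abort at the current EL), WnR = 0 (a read), and a level 0 translation fault, which agrees with the `pgd=0000000000000000` line above: the address 0x1b8 is simply unmapped.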

[ 3086.496490] CPU: 32 PID: 174 Comm: ksoftirqd/32 Kdump: loaded Not tainted 5.10.0-60.18.0.50.oe2203.aarch64 #1
[ 3086.508077] Hardware name: Huawei Hengshan TS02F-F30/BC82AMDDHA, BIOS 6.70.K 04/03/2024
[ 3086.517801] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
[ 3086.525043] pc : complete_cmd_fusion+0x1b8/0x58c [megaraid_sas]
[ 3086.532187] lr : megasas_irqpoll+0xbc/0xe0 [megaraid_sas]
[ 3086.538805] sp : ffff800011c13c30
[ 3086.543336] x29: ffff800011c13c30 x28: ffff62b1c1069002 
[ 3086.549854] x27: ffff42923d21a6f0 x26: ffff42b25611f000 
[ 3086.556364] x25: ffff62b1c1069000 x24: 0000000000000000 
[ 3086.562861] x23: 0000000000000000 x22: ffff62b1c1069008 
[ 3086.569347] x21: ffff42b255192c08 x20: ffff42b255190b10 
[ 3086.575824] x19: ffff42b2551929a4 x18: 000000000000000e 
[ 3086.582288] x17: 0000000000000001 x16: ffffb4ea39856110 
[ 3086.588745] x15: 0000000000000033 x14: 000000000000004c 
[ 3086.595189] x13: 0000000000000068 x12: 0000000000000040 
[ 3086.601619] x11: 0000000000000000 x10: ffff42b24abd16d2 
[ 3086.608036] x9 : ffffb4e9e877d79c x8 : ffff42b24abd66d0 
[ 3086.614440] x7 : 0000000000000001 x6 : 0000000000000000 
[ 3086.620834] x5 : ffff42923ce6d200 x4 : ffff42923e028e60 
[ 3086.627218] x3 : 0000000000000000 x2 : 0000000000000000 
[ 3086.633592] x1 : 0000000000000000 x0 : ffff42b255300000 
[ 3086.639952] Call trace:
[ 3086.643445]  complete_cmd_fusion+0x1b8/0x58c [megaraid_sas]
[ 3086.650050]  megasas_irqpoll+0xbc/0xe0 [megaraid_sas]
[ 3086.656138]  irq_poll_softirq+0xa0/0x1a0
[ 3086.661086]  __do_softirq+0x130/0x358
[ 3086.665764]  run_ksoftirqd+0x68/0x90
[ 3086.670343]  smpboot_thread_fn+0x15c/0x1a0
[ 3086.675421]  kthread+0x108/0x13c
[ 3086.679627]  ret_from_fork+0x10/0x18
[ 3086.684162] Code: 54000948 35fff846 f9455f20 b4000080 (b941b965) 
[ 3086.691291] SMP: stopping secondary CPUs
[ 3086.698215] Starting crashdump kernel
[ 3086.703106] Bye!
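The word in parentheses on the `Code:` line, `b941b965`, is the instruction at the faulting pc. Below is a rough sketch of decoding it by hand, assuming the A64 "load/store register (unsigned immediate)" encoding and handling only 32-bit and 64-bit integer `ldr`/`str`. The function name is hypothetical, and a real disassembler (for example `objdump -d` on the module) is the more reliable tool.

```python
def decode_ldr_str_uimm(insn):
    """Decode an A64 integer LDR/STR with an unsigned immediate offset."""
    # Fixed bits for this class: [29:27] = 0b111, [26] (V) = 0, [25:24] = 0b01.
    if (insn >> 24) & 0x3F != 0b111001:
        raise ValueError("not an integer load/store (unsigned immediate)")
    size = (insn >> 30) & 0x3
    opc = (insn >> 22) & 0x3
    if size not in (2, 3) or opc not in (0, 1):
        raise ValueError("only 32/64-bit ldr/str are handled in this sketch")
    imm12 = (insn >> 10) & 0xFFF
    rn = (insn >> 5) & 0x1F
    rt = insn & 0x1F
    if rn == 31 or rt == 31:
        raise ValueError("sp/zr operands are not handled in this sketch")
    offset = imm12 << size                 # the immediate is scaled by the access size
    mnemonic = "ldr" if opc == 1 else "str"
    width = "w" if size == 2 else "x"
    return f"{mnemonic} {width}{rt}, [x{rn}, #{offset:#x}]", rn, offset


text, base_reg, offset = decode_ldr_str_uimm(0xB941B965)
print(text)

# x11 is taken from the register dump in the oops above.
x11 = 0x0000000000000000
fault_addr = x11 + offset
print(hex(fault_addr))
```

The decode gives a 32-bit load from `[x11, #0x1b8]`. With x11 = 0 in the register dump, the effective address is 0x1b8, which matches the reported fault address, so the pointer held in x11 was NULL when it was dereferenced.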

uname: 5.10.0-216.0.0.115.oe2203sp4.aarch64/kernel
megaraid_sas: 07.714.04.00-rc1
lspci : RAID bus controller: Broadcom / LSI MegaRAID 12GSAS/PCIe Secure SAS39xx (9460-8i)


(gdb) l*complete_cmd_fusion+0x1b8
0xe5a4 is in complete_cmd_fusion (drivers/scsi/megaraid/megaraid_sas_fusion.c:3650).
3645                    /*
3646                     * Write to reply post host index register after completing threshold
3647                     * number of reply counts and still there are more replies in reply queue
3648                     * pending to be completed
3649                     */
3650                    if (threshold_reply_count >= instance->threshold_reply_count) {
3651                            if (instance->msix_combined)
3652                                    writel(((MSIxIndex & 0x7) << 24) |
3653                                            fusion->last_reply_idx[MSIxIndex],
3654                                            instance->reply_post_host_index_addr[MSIxIndex/8]);
(gdb)