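Several 920 servers hit a kernel NULL pointer dereference during FIO stress testing, which triggers kdump. The crash always occurs on #cpu1. When FIO is bound to other CPUs (anything except CPU1), the test runs normally. We cannot tell whether this is a problem in the RAID module or a hardware problem with the machines.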
[ 3086.276313] Unable to handle kernel NULL pointer dereference at virtual address 00000000000001b8
[ 3086.287480] Mem abort info:
[ 3086.293571] ESR = 0x96000004
[ 3086.300236] EC = 0x25: DABT (current EL), IL = 32 bits
[ 3086.308656] SET = 0, FnV = 0
[ 3086.320182] EA = 0, S1PTW = 0
[ 3086.324025] Data abort info:
[ 3086.327608] ISV = 0, ISS = 0x00000004
[ 3086.332144] CM = 0, WnR = 0
[ 3086.335814] user pgtable: 4k pages, 48-bit VAs, pgdp=000020206e378000
[ 3086.342961] [00000000000001b8] pgd=0000000000000000, p4d=0000000000000000
[ 3086.350462] Internal error: Oops: 96000004 [#1] SMP
[ 3086.496490] CPU: 32 PID: 174 Comm: ksoftirqd/32 Kdump: loaded Not tainted 5.10.0-60.18.0.50.oe2203.aarch64 #1
[ 3086.508077] Hardware name: Huawei Hengshan TS02F-F30/BC82AMDDHA, BIOS 6.70.K 04/03/2024
[ 3086.517801] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
[ 3086.525043] pc : complete_cmd_fusion+0x1b8/0x58c [megaraid_sas]
[ 3086.532187] lr : megasas_irqpoll+0xbc/0xe0 [megaraid_sas]
[ 3086.538805] sp : ffff800011c13c30
[ 3086.543336] x29: ffff800011c13c30 x28: ffff62b1c1069002
[ 3086.549854] x27: ffff42923d21a6f0 x26: ffff42b25611f000
[ 3086.556364] x25: ffff62b1c1069000 x24: 0000000000000000
[ 3086.562861] x23: 0000000000000000 x22: ffff62b1c1069008
[ 3086.569347] x21: ffff42b255192c08 x20: ffff42b255190b10
[ 3086.575824] x19: ffff42b2551929a4 x18: 000000000000000e
[ 3086.582288] x17: 0000000000000001 x16: ffffb4ea39856110
[ 3086.588745] x15: 0000000000000033 x14: 000000000000004c
[ 3086.595189] x13: 0000000000000068 x12: 0000000000000040
[ 3086.601619] x11: 0000000000000000 x10: ffff42b24abd16d2
[ 3086.608036] x9 : ffffb4e9e877d79c x8 : ffff42b24abd66d0
[ 3086.614440] x7 : 0000000000000001 x6 : 0000000000000000
[ 3086.620834] x5 : ffff42923ce6d200 x4 : ffff42923e028e60
[ 3086.627218] x3 : 0000000000000000 x2 : 0000000000000000
[ 3086.633592] x1 : 0000000000000000 x0 : ffff42b255300000
[ 3086.639952] Call trace:
[ 3086.643445] complete_cmd_fusion+0x1b8/0x58c [megaraid_sas]
[ 3086.650050] megasas_irqpoll+0xbc/0xe0 [megaraid_sas]
[ 3086.656138] irq_poll_softirq+0xa0/0x1a0
[ 3086.661086] __do_softirq+0x130/0x358
[ 3086.665764] run_ksoftirqd+0x68/0x90
[ 3086.670343] smpboot_thread_fn+0x15c/0x1a0
[ 3086.675421] kthread+0x108/0x13c
[ 3086.679627] ret_from_fork+0x10/0x18
[ 3086.684162] Code: 54000948 35fff846 f9455f20 b4000080 (b941b965)
[ 3086.691291] SMP: stopping secondary CPUs
[ 3086.698215] Starting crashdump kernel
[ 3086.703106] Bye!
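
The "Mem abort info" lines in the log above are the kernel's own decoding of the ESR value. As a cross-check, here is a minimal sketch in C that splits an AArch64 ESR into the same fields. The struct and function names (`esr_fields`, `esr_decode`, `dfsc_name`, `esr_print`) are made up for this illustration and are not kernel APIs; only the translation-fault status codes are named.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Fields of an AArch64 ESR_ELx value that matter for a data abort. */
struct esr_fields {
    unsigned ec;   /* exception class, bits [31:26] */
    unsigned il;   /* instruction length flag, bit 25 (1 = 32-bit instruction) */
    unsigned iss;  /* instruction specific syndrome, bits [24:0] */
    unsigned wnr;  /* data abort only: bit 6, 1 = write, 0 = read */
    unsigned dfsc; /* data abort only: fault status code, bits [5:0] */
};

/* Split a raw ESR value into its fields (hypothetical helper). */
static void esr_decode(uint32_t esr, struct esr_fields *out)
{
    assert(out != NULL);
    out->ec = (esr >> 26) & 0x3f;
    out->il = (esr >> 25) & 0x1;
    out->iss = esr & 0x1ffffff;
    out->wnr = (esr >> 6) & 0x1;
    out->dfsc = esr & 0x3f;
}

/* Name only the translation-fault codes; everything else stays generic. */
static const char *dfsc_name(unsigned dfsc)
{
    switch (dfsc) {
    case 0x04: return "translation fault, level 0";
    case 0x05: return "translation fault, level 1";
    case 0x06: return "translation fault, level 2";
    case 0x07: return "translation fault, level 3";
    default:   return "other (see the Arm ARM DFSC table)";
    }
}

/* Print the decoded fields in a compact form. */
static void esr_print(uint32_t esr)
{
    struct esr_fields f;

    esr_decode(esr, &f);
    printf("ESR=0x%08x EC=0x%02x IL=%u ISS=0x%07x %s, %s\n",
           (unsigned)esr, f.ec, f.il, f.iss,
           f.wnr ? "write" : "read", dfsc_name(f.dfsc));
}
```

Calling `esr_print(0x96000004)` reports EC 0x25 (data abort taken at the current EL), ISS 0x4, a read access, and a level 0 translation fault. That agrees with the `pgd=0000000000000000` line in the log: the faulting address 0x1b8 has no mapping at all.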
uname: 5.10.0-216.0.0.115.oe2203sp4.aarch64/kernel
megaraid_sas: 07.714.04.00-rc1
lspci : RAID bus controller: Broadcom / LSI MegaRAID 12GSAS/PCIe Secure SAS39xx (9460-8i)
(gdb) l*complete_cmd_fusion+0x1b8
0xe5a4 is in complete_cmd_fusion (drivers/scsi/megaraid/megaraid_sas_fusion.c:3650).
3645		/*
3646		 * Write to reply post host index register after completing threshold
3647		 * number of reply counts and still there are more replies in reply queue
3648		 * pending to be completed
3649		 */
3650		if (threshold_reply_count >= instance->threshold_reply_count) {
3651			if (instance->msix_combined)
3652				writel(((MSIxIndex & 0x7) << 24) |
3653					fusion->last_reply_idx[MSIxIndex],
3654					instance->reply_post_host_index_addr[MSIxIndex/8]);
(gdb)
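
To tie the gdb listing back to the register dump, it can help to decode the faulting instruction, which is the parenthesized word `b941b965` on the `Code:` line. Below is a minimal sketch in C, assuming the word is an A64 `LDR (immediate, unsigned offset)` of an integer register; any other encoding is rejected. The names `ldr_uimm`, `decode_ldr_uimm` and `format_ldr_uimm` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Decoded form of an A64 "LDR Wt|Xt, [Xn, #imm]" (unsigned offset). */
struct ldr_uimm {
    int is64;        /* 1 for the Xt form, 0 for the Wt form */
    unsigned rt;     /* destination register number */
    unsigned rn;     /* base register number (31 would mean SP) */
    unsigned offset; /* byte offset, imm12 scaled by the access size */
};

/*
 * Returns 0 on success, -1 if insn is not an integer LDR with an
 * unsigned immediate offset. Only the 32-bit and 64-bit forms are
 * handled; bit 30 (the low bit of the size field) is masked out.
 */
static int decode_ldr_uimm(uint32_t insn, struct ldr_uimm *out)
{
    unsigned size;

    assert(out != NULL);
    if ((insn & 0xbfc00000u) != 0xb9400000u)
        return -1;

    size = insn >> 30;                 /* 2 = 4-byte access, 3 = 8-byte access */
    out->is64 = (size == 3);
    out->rt = insn & 0x1f;
    out->rn = (insn >> 5) & 0x1f;
    out->offset = ((insn >> 10) & 0xfff) << size;
    return 0;
}

/* Render the instruction as assembly text into buf. */
static int format_ldr_uimm(uint32_t insn, char *buf, size_t len)
{
    struct ldr_uimm d;

    if (decode_ldr_uimm(insn, &d) != 0)
        return -1;
    snprintf(buf, len, "ldr %c%u, [x%u, #0x%x]",
             d.is64 ? 'x' : 'w', d.rt, d.rn, d.offset);
    return 0;
}
```

For `0xb941b965` this decodes to a 32-bit load into w5 from base register x11 at offset 0x1b8. The register dump shows x11 as 0, so the load address is 0 + 0x1b8, which matches the reported fault address 00000000000001b8. If the gdb symbols match the crashed build, this would mean the pointer dereferenced on line 3650 (`instance`) was NULL at that moment. That mapping is only as reliable as the symbol match: the oops header reports kernel 5.10.0-60.18.0.50.oe2203, while the uname line lists 5.10.0-216.0.0.115.oe2203sp4.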