Unexpected server reboot following DRBD 9.2.5 Update
Hello everyone,

We are using LINSTOR and DRBD. After updating DRBD from 9.2.4 to 9.2.5, one
server began rebooting while the DRBD module was loading (the other servers
were not affected). The following error appears in the logs:

```
Oct 26 11:43:30 oms02-dhw05 kernel: drbd pvc-5fc91592-f3ed-41c3-9af6-6614615c2d70/0 drbd1048 oms02-dhw02: repl( PausedSyncT -> SyncTarget )
Oct 26 11:43:30 oms02-dhw05 kernel: drbd pvc-5fc91592-f3ed-41c3-9af6-6614615c2d70/0 drbd1048 oms02-dhw02: Syncer continues.
Oct 26 11:43:30 oms02-dhw05 kernel: drbd pvc-791eb020-14ea-4f89-ab62-9c7cda434b48/0 drbd1014 oms02-dhw03: Request depends on dagtag from disconnected peer, cancelling
Oct 26 11:43:30 oms02-dhw05 kernel: drbd pvc-791eb020-14ea-4f89-ab62-9c7cda434b48/0 drbd1014: ASSERTION !drbd_interval_empty(i) FAILED in drbd_remove_peer_req_interval
Oct 26 11:43:30 oms02-dhw05 kernel: list_del corruption, ffff9f7e148e0320->next is LIST_POISON1 (dead000000000100)
Oct 26 11:43:30 oms02-dhw05 kernel: ------------[ cut here ]------------
Oct 26 11:43:30 oms02-dhw05 kernel: kernel BUG at lib/list_debug.c:55!
Oct 26 11:43:30 oms02-dhw05 kernel: invalid opcode: 0000 [#1] SMP NOPTI
Oct 26 11:43:30 oms02-dhw05 kernel: CPU: 16 PID: 35038 Comm: drbd_r_pvc-791e Tainted: P OE 5.15.125-1.el7.3.x86_64 #1
Oct 26 11:43:30 oms02-dhw05 kernel: Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE182H-4.10]- 04/19/2023
Oct 26 11:43:30 oms02-dhw05 kernel: RIP: 0010:__list_del_entry_valid.cold.1+0x56/0x69
Oct 26 11:43:30 oms02-dhw05 kernel: Code: e8 c5 31 fe ff 0f 0b 48 89 fe 48 c7 c7 58 83 64 a2 e8 b4 31 fe ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 20 83 64 a2 e8 a0 31 fe ff <0f> 0b 48 89 fe 48 c7 c7 f0 82 64 a2 e8 8f 31 fe ff 0f 0b 48 c7 c7
Oct 26 11:43:30 oms02-dhw05 kernel: RSP: 0018:ffffac26ac177de0 EFLAGS: 00010046
Oct 26 11:43:30 oms02-dhw05 kernel: RAX: 000000000000004e RBX: ffff9f7e148e0300 RCX: 0000000000000027
Oct 26 11:43:30 oms02-dhw05 kernel: RDX: 0000000000000000 RSI: ffff9ffabfc205c0 RDI: ffff9ffabfc205c8
Oct 26 11:43:30 oms02-dhw05 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: c0000000ffff7fff
Oct 26 11:43:30 oms02-dhw05 kernel: R10: 0000000000000001 R11: ffffac26ac177bf8 R12: ffff9ffdaf5a8800
Oct 26 11:43:30 oms02-dhw05 kernel: R13: ffff9ffda6518364 R14: ffff9ffda6e09800 R15: ffff9ffda6518364
Oct 26 11:43:30 oms02-dhw05 kernel: FS: 0000000000000000(0000) GS:ffff9ffabfc00000(0000) knlGS:0000000000000000
Oct 26 11:43:30 oms02-dhw05 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 26 11:43:30 oms02-dhw05 kernel: CR2: 000000000052d320 CR3: 000000822fe0a006 CR4: 00000000007706e0
Oct 26 11:43:30 oms02-dhw05 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 26 11:43:30 oms02-dhw05 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 26 11:43:30 oms02-dhw05 kernel: PKRU: 55555554
Oct 26 11:43:30 oms02-dhw05 kernel: Call Trace:
Oct 26 11:43:30 oms02-dhw05 kernel: <TASK>
Oct 26 11:43:30 oms02-dhw05 kernel: ? __die_body+0x1a/0x60
Oct 26 11:43:30 oms02-dhw05 kernel: ? die+0x2a/0x50
Oct 26 11:43:30 oms02-dhw05 kernel: ? do_trap+0xe2/0x110
Oct 26 11:43:30 oms02-dhw05 kernel: ? __list_del_entry_valid.cold.1+0x56/0x69
Oct 26 11:43:30 oms02-dhw05 kernel: ? do_error_trap+0x64/0xa0
Oct 26 11:43:30 oms02-dhw05 kernel: ? __list_del_entry_valid.cold.1+0x56/0x69
Oct 26 11:43:30 oms02-dhw05 kernel: ? exc_invalid_op+0x4c/0x60
Oct 26 11:43:30 oms02-dhw05 kernel: ? __list_del_entry_valid.cold.1+0x56/0x69
Oct 26 11:43:30 oms02-dhw05 kernel: ? asm_exc_invalid_op+0x16/0x20
Oct 26 11:43:30 oms02-dhw05 kernel: ? __list_del_entry_valid.cold.1+0x56/0x69
Oct 26 11:43:30 oms02-dhw05 kernel: drbd_free_peer_req+0xa9/0x210 [drbd]
Oct 26 11:43:30 oms02-dhw05 kernel: receive_common_data_request+0x298/0x7a0 [drbd]
Oct 26 11:43:30 oms02-dhw05 kernel: ? receive_common_data_request+0x7a0/0x7a0 [drbd]
Oct 26 11:43:30 oms02-dhw05 kernel: drbd_receiver+0x5cd/0x7e0 [drbd]
Oct 26 11:43:30 oms02-dhw05 kernel: drbd_thread_setup+0x76/0x1c0 [drbd]
Oct 26 11:43:30 oms02-dhw05 kernel: ? __drbd_next_peer_device_ref+0x1a0/0x1a0 [drbd]
Oct 26 11:43:30 oms02-dhw05 kernel: kthread+0x118/0x140
Oct 26 11:43:30 oms02-dhw05 kernel: ? set_kthread_struct+0x50/0x50
Oct 26 11:43:30 oms02-dhw05 kernel: ret_from_fork+0x1f/0x30
Oct 26 11:43:30 oms02-dhw05 kernel: </TASK>
```

Output of `modinfo drbd`:

```
modinfo drbd
filename: /lib/modules/5.15.125-1.el7.3.x86_64/updates/drbd.ko
alias: block-major-147-*
license: GPL
version: 9.2.5
description: drbd - Distributed Replicated Block Device v9.2.5
author: Philipp Reisner <phil@linbit.com>, Lars Ellenberg <lars@linbit.com>
srcversion: CB9C7655FDF10CFDFEF8796
depends: lru_cache
retpoline: Y
name: drbd
vermagic: 5.15.125-1.el7.3.x86_64 SMP mod_unload
parm: enable_faults:int
parm: fault_rate:int
parm: fault_count:int
parm: fault_devs:int
parm: disable_sendpage:bool
parm: allow_oos:DONT USE! (bool)
parm: minor_count:Approximate number of drbd devices (1U-255U) (uint)
parm: usermode_helper:string
parm: protocol_version_min:drbd_protocol_version
parm: strict_names:restrict resource and connection names to ascii alnum and a subset of punct (drbd_strict_names)
```

We managed to work around the problem by deactivating the resource
pvc-791eb020-14ea-4f89-ab62-9c7cda434b48 on the server where the reboot
occurred.
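
For anyone triaging a similar crash, the affected resource can be picked out of the kernel log by looking for the failed DRBD assertion. The following is only a minimal sketch: it assumes each journal entry sits on a single, unwrapped line and carries the `kernel: drbd <resource>/<volume> drbd<minor>` prefix seen in the log above. The function name and the regular expression are illustrative and not part of DRBD or LINSTOR.

```python
import re

# Assumed line shape, taken from the log in this report:
#   "... kernel: drbd <resource>/<volume> drbd<minor>[ <peer>]: <message>"
DRBD_LINE = re.compile(
    r"kernel: drbd (?P<resource>\S+)/(?P<volume>\d+) drbd(?P<minor>\d+)"
    r"(?: (?P<peer>\S+))?: (?P<message>.*)"
)


def resources_with_assertions(log_text):
    """Return resource names (first-seen order) that logged a failed ASSERTION."""
    seen = []
    for line in log_text.splitlines():
        match = DRBD_LINE.search(line)
        if match is None:
            continue  # not a per-resource DRBD message
        resource = match.group("resource")
        if "ASSERTION" in match.group("message") and resource not in seen:
            seen.append(resource)
    return seen


# Two entries from the log above, joined onto single lines.
sample = (
    "Oct 26 11:43:30 oms02-dhw05 kernel: drbd "
    "pvc-5fc91592-f3ed-41c3-9af6-6614615c2d70/0 drbd1048 oms02-dhw02: "
    "Syncer continues.\n"
    "Oct 26 11:43:30 oms02-dhw05 kernel: drbd "
    "pvc-791eb020-14ea-4f89-ab62-9c7cda434b48/0 drbd1014: ASSERTION "
    "!drbd_interval_empty(i) FAILED in drbd_remove_peer_req_interval\n"
)

print(resources_with_assertions(sample))
```

On the two sample entries this reports only pvc-791eb020-14ea-4f89-ab62-9c7cda434b48, the resource that was deactivated.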

A more detailed log is attached.

-- Best Regards,
Aleksandr Zimin