Mailing List Archive

md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Linux mdraid people,

we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.

The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.

The triggering event seems to be the mdcheck cron job trying to pause the ongoing check operation in the morning with

echo idle > /sys/devices/virtual/block/md1/md/sync_action

This doesn't complete. Here's /proc/<pid>/stack of this process:

root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
UID        PID  PPID  C STIME TTY          TIME CMD
root     23333 23331  0 02:00 ?        00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
[<0>] kthread_stop+0x6e/0x150
[<0>] md_unregister_thread+0x3e/0x70
[<0>] md_reap_sync_thread+0x1f/0x1e0
[<0>] action_store+0x141/0x2b0
[<0>] md_attr_store+0x71/0xb0
[<0>] kernfs_fop_write+0x113/0x1a0
[<0>] vfs_write+0xbc/0x250
[<0>] ksys_write+0xa1/0xe0
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
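
(For context: if I read drivers/md/md.c correctly, the "idle" case of action_store() interrupts and then reaps the sync thread, and the write can only return after kthread_stop() has waited for the mdX_resync kthread to exit. A rough sketch, not the literal 5.10 source:

    if (cmd_match(page, "idle") || cmd_match(page, "frozen")) {
            /* ... */
            if (mddev->sync_thread) {
                    set_bit(MD_RECOVERY_INTR, &mddev->recovery);
                    md_reap_sync_thread(mddev); /* -> md_unregister_thread() -> kthread_stop() */
            }
            /* ... */
    }

So the blocked `echo idle` just means that the md1_resync thread never terminates.)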

Note that md0 was paused successfully just before.

2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1

2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
[ nothing related following.... ]

After that, we see md1_raid6 in a busy loop:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2376 root     20   0       0      0      0 R 100.0  0.0   1387:38 md1_raid6

Also, all processes doing I/O to the md device block.

This is /proc/mdstat:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[==================>..] check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
bitmap: 0/59 pages [0KB], 65536KB chunk

There doesn't seem to be any further progress.

I've taken a function_graph trace of the looping md1_raid6 process: https://owww.molgen.mpg.de/~buczek/2020-11-27_trace.txt (30 MB)

Maybe this helps to get an idea of what might be going on?

Best
Donald

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 11/28/20 13:25, Donald Buczek wrote:
> Dear Linux mdraid people,
>
> we are using raid6 on several servers. Occasionally we had failures,
> where a mdX_raid6 process seems to go into a busy loop and all I/O to
> the md device blocks. We've seen this on various kernel versions.
>
> The last time this happened (in this case with Linux 5.10.0-rc4), I took
> some data.
>
> The triggering event seems to be the md_check cron job trying to pause
> the ongoing check operation in the morning with
>
>     echo idle > /sys/devices/virtual/block/md1/md/sync_action
>
> This doesn't complete. Here's /proc/stack of this process:
>
>     root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
>     UID        PID  PPID  C STIME TTY          TIME CMD
>     root     23333 23331  0 02:00 ?        00:00:00 /bin/bash
> /usr/bin/mdcheck --continue --duration 06:00
>     root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat
> /proc/23333/stack
>     [<0>] kthread_stop+0x6e/0x150
>     [<0>] md_unregister_thread+0x3e/0x70
>     [<0>] md_reap_sync_thread+0x1f/0x1e0
>     [<0>] action_store+0x141/0x2b0
>     [<0>] md_attr_store+0x71/0xb0
>     [<0>] kernfs_fop_write+0x113/0x1a0
>     [<0>] vfs_write+0xbc/0x250
>     [<0>] ksys_write+0xa1/0xe0
>     [<0>] do_syscall_64+0x33/0x40
>     [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Note, that md0 has been paused successfully just before.

What is the personality of md0? Is it also raid6?

>
>     2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD
> (/usr/bin/mdcheck --continue --duration "06:00")
>     2020-11-27T02:00:01+01:00 done root: mdcheck continue checking
> /dev/md0 from 10623180920
>     2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md:
> data-check of RAID array md0
>     2020-11-27T02:00:01+01:00 done root: mdcheck continue checking
> /dev/md1 from 11582849320
>     2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md:
> data-check of RAID array md1
>     2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md:
> md0: data-check interrupted.
>     2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at
> 13351127680
>     2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md:
> md1: data-check interrupted.
>     [ nothing related following.... ]
>
> After that, we see md1_raid6 in a busy loop:
>
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
> COMMAND
>     2376 root     20   0       0      0      0 R 100.0  0.0   1387:38
> md1_raid6

It seems the path that reaps the sync thread was trying to stop md1_raid6 while md1_raid6 kept being triggered again and again.

>
> Also, all processes doing I/O do the md device block.
>
> This is /proc/mdstat:
>
>     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> [multipath]
>     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11]
> sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           [==================>..]  check = 94.0%
> (7350290348/7813894144) finish=57189.3min speed=135K/sec
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12]
> sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3]
> sdu[2] sdt[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>

So the RECOVERY_CHECK flag should be set. I'm not sure whether this simple change helps, but you may give it a try.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f..e2697d0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev)
                        md_update_sb(mddev, 0);

                if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
-                   !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
+                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
+                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
                        /* resync/recovery still happening */
                        clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
                        goto unlock;
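
For reference, the flags involved mean roughly the following (paraphrasing the comments in drivers/md/md.h; other flags are omitted, so don't rely on bit positions here):

    /* excerpt from enum recovery_flags, drivers/md/md.h (paraphrased) */
    MD_RECOVERY_RUNNING,    /* a sync thread is running, or about to be started */
    MD_RECOVERY_INTR,       /* resync needs to be aborted */
    MD_RECOVERY_DONE,       /* sync thread is done and is waiting to be reaped */
    MD_RECOVERY_NEEDED,     /* we might need to start a resync/recovery */
    MD_RECOVERY_CHECK,      /* user-space request for check-only, no repair */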

Thanks,
Guoqing

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 30.11.20 at 03:06, Guoqing Jiang wrote:
>
>
> On 11/28/20 13:25, Donald Buczek wrote:
>> Dear Linux mdraid people,
>>
>> we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.
>>
>> The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.
>>
>> The triggering event seems to be the md_check cron job trying to pause the ongoing check operation in the morning with
>>
>>      echo idle > /sys/devices/virtual/block/md1/md/sync_action
>>
>> This doesn't complete. Here's /proc/stack of this process:
>>
>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
>>      UID        PID  PPID  C STIME TTY          TIME CMD
>>      root     23333 23331  0 02:00 ?        00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
>>      [<0>] kthread_stop+0x6e/0x150
>>      [<0>] md_unregister_thread+0x3e/0x70
>>      [<0>] md_reap_sync_thread+0x1f/0x1e0
>>      [<0>] action_store+0x141/0x2b0
>>      [<0>] md_attr_store+0x71/0xb0
>>      [<0>] kernfs_fop_write+0x113/0x1a0
>>      [<0>] vfs_write+0xbc/0x250
>>      [<0>] ksys_write+0xa1/0xe0
>>      [<0>] do_syscall_64+0x33/0x40
>>      [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Note, that md0 has been paused successfully just before.
>
> What is the personality of md0? Is it also raid6?

Yes.

>
>>
>>      2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
>>      2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
>>      2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1
>>      2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
>>      2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
>>      2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
>>      [ nothing related following.... ]
>>
>> After that, we see md1_raid6 in a busy loop:
>>
>>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>>      2376 root     20   0       0      0      0 R 100.0  0.0   1387:38 md1_raid6
>
> Seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was triggered again and again.
>
>>
>> Also, all processes doing I/O do the md device block.
>>
>> This is /proc/mdstat:
>>
>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [==================>..]  check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>
>
> So the RECOVERY_CHECK flag should be set, not sure if the simple changes
> helps, but you may give it a try.

Thanks. I've copied the original condition block to execute before the modified one and added some logging to it to see if the change actually triggers. I will pause and unpause the check frequently on a busy machine to get this code executed more often.

Best
Donald

>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 98bac4f..e2697d0 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev)
>                         md_update_sb(mddev, 0);
>
>                 if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> -                   !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
> +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
> +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>                         /* resync/recovery still happening */
>                         clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>                         goto unlock;
>
> Thanks,
> Guoqing

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

unfortunately the patch didn't fix the problem (unless I messed it up with my logging). This is what I used:

--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
                        clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
                        goto unlock;
                }
+               if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
+                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
+                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
+                       /* resync/recovery still happening */
+                       pr_info("md: XXX BUGFIX applied\n");
+                       clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+                       goto unlock;
+               }
                if (mddev->sync_thread) {
                        md_reap_sync_thread(mddev);
                        goto unlock;

By pausing and continuing the check four times an hour, I could trigger the problem after about 48 hours. This time, the other device (md0) locked up on `echo idle > /sys/devices/virtual/block/md0/md/sync_action`, while the check of md1 is still ongoing:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[=>...................] check = 8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[>....................] check = 0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

after 1 minute:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[=>...................] check = 8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
[>....................] check = 0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec
bitmap: 0/59 pages [0KB], 65536KB chunk

A data point I didn't mention in my previous mail is that the mdX_resync thread is not gone when the problem occurs:

buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md
root       134     2  0 Nov30 ?        00:00:00 [md]
root      1321     2 27 Nov30 ?        12:57:14 [md0_raid6]
root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [md

buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
[<0>] md_bitmap_cond_end_sync+0x12d/0x170
[<0>] raid5_sync_request+0x24b/0x390
[<0>] md_do_sync+0xb41/0x1030
[<0>] md_thread+0x122/0x160
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x1f/0x30

I guess md_bitmap_cond_end_sync+0x12d is the `wait_event(bitmap->mddev->recovery_wait, atomic_read(&bitmap->mddev->recovery_active) == 0);` in md-bitmap.c.
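
If that is the right spot, then the counter it waits for is, as far as I can see, only decremented by md_done_sync(), which the personality calls when sync requests have actually completed. Roughly (paraphrased from drivers/md/md.c):

    void md_done_sync(struct mddev *mddev, int blocks, int ok)
    {
            /* another "blocks" sectors have been synced */
            atomic_sub(blocks, &mddev->recovery_active);
            wake_up(&mddev->recovery_wait);
            /* (error case omitted) */
    }

So if md1_raid6 stops completing the outstanding sync stripes for some reason, recovery_active never drops to zero and md1_resync would sit in that wait_event forever.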

Donald


On 01.12.20 10:29, Donald Buczek wrote:
> On 30.11.20 at 03:06, Guoqing Jiang wrote:
>>
>>
>> On 11/28/20 13:25, Donald Buczek wrote:
>>> Dear Linux mdraid people,
>>>
>>> we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.
>>>
>>> The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.
>>>
>>> The triggering event seems to be the md_check cron job trying to pause the ongoing check operation in the morning with
>>>
>>>      echo idle > /sys/devices/virtual/block/md1/md/sync_action
>>>
>>> This doesn't complete. Here's /proc/stack of this process:
>>>
>>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
>>>      UID        PID  PPID  C STIME TTY          TIME CMD
>>>      root     23333 23331  0 02:00 ?        00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
>>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
>>>      [<0>] kthread_stop+0x6e/0x150
>>>      [<0>] md_unregister_thread+0x3e/0x70
>>>      [<0>] md_reap_sync_thread+0x1f/0x1e0
>>>      [<0>] action_store+0x141/0x2b0
>>>      [<0>] md_attr_store+0x71/0xb0
>>>      [<0>] kernfs_fop_write+0x113/0x1a0
>>>      [<0>] vfs_write+0xbc/0x250
>>>      [<0>] ksys_write+0xa1/0xe0
>>>      [<0>] do_syscall_64+0x33/0x40
>>>      [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> Note, that md0 has been paused successfully just before.
>>
>> What is the personality of md0? Is it also raid6?
>
> Yes.
>
>>
>>>
>>>      2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
>>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
>>>      2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
>>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
>>>      2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1
>>>      2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
>>>      2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
>>>      2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
>>>      [ nothing related following.... ]
>>>
>>> After that, we see md1_raid6 in a busy loop:
>>>
>>>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>>>      2376 root     20   0       0      0      0 R 100.0  0.0   1387:38 md1_raid6
>>
>> Seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was triggered again and again.
>>
>>>
>>> Also, all processes doing I/O do the md device block.
>>>
>>> This is /proc/mdstat:
>>>
>>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>>            [==================>..]  check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
>>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>>
>>
>> So the RECOVERY_CHECK flag should be set, not sure if the simple changes
>> helps, but you may give it a try.
>
> Thanks. I've copied the original condition block to execute before the modified one and added some logging to it to see, if the change actually triggers. I will pause and unpause the check frequently on a busy machine to get this code executed more often.
>
> Best
>   Donald
>
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 98bac4f..e2697d0 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev)
>>                          md_update_sb(mddev, 0);
>>
>>                  if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>> -                   !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
>> +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>> +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>>                          /* resync/recovery still happening */
>>                          clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>                          goto unlock;
>>
>> Thanks,
>> Guoqing
>

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald,

On 12/2/20 18:28, Donald Buczek wrote:
> Dear Guoqing,
>
> unfortunately the patch didn't fix the problem (unless I messed it up
> with my logging). This is what I used:
>
>     --- a/drivers/md/md.c
>     +++ b/drivers/md/md.c
>     @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
>                             clear_bit(MD_RECOVERY_NEEDED,
> &mddev->recovery);
>                             goto unlock;
>                     }

I think you can add the check of RECOVERY_CHECK to the above part instead of adding a new part.

>     +               if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>     +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>     +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>     +                       /* resync/recovery still happening */
>     +                       pr_info("md: XXX BUGFIX applied\n");
>     +                       clear_bit(MD_RECOVERY_NEEDED,
> &mddev->recovery);
>     +                       goto unlock;
>     +               }
>                     if (mddev->sync_thread) {
>                             md_reap_sync_thread(mddev);
>                             goto unlock;


>
> With pausing and continuing the check four times an hour, I could
> trigger the problem after about 48 hours. This time, the other device
> (md0) has locked up on `echo idle >
> /sys/devices/virtual/block/md0/md/sync_action` , while the check of md1
> is still ongoing:

Without the patch, md0 was good while md1 was locked. So the patch seems to have switched which of the two arrays locks up, which is a little weird ...

What is the stack of the process? I guess it is the same as the stack of 23333 in your previous mail, but just to confirm.

>
>     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> [multipath]
>     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11]
> sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           [=>...................]  check =  8.5% (666852112/7813894144)
> finish=1271.2min speed=93701K/sec
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12]
> sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3]
> sdu[2] sdt[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           [>....................]  check =  0.2% (19510348/7813894144)
> finish=253779.6min speed=511K/sec
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>
> after 1 minute:
>
>     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> [multipath]
>     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11]
> sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           [=>...................]  check =  8.6% (674914560/7813894144)
> finish=941.1min speed=126418K/sec
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12]
> sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3]
> sdu[2] sdt[1]
>           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>           [>....................]  check =  0.2% (19510348/7813894144)
> finish=256805.0min speed=505K/sec
>           bitmap: 0/59 pages [0KB], 65536KB chunk
>
> A data point, I didn't mention in my previous mail, is that the
> mdX_resync thread is not gone when the problem occurs:
>
>     buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [.md
>     root       134     2  0 Nov30 ?        00:00:00 [md]
>     root      1321     2 27 Nov30 ?        12:57:14 [md0_raid6]
>     root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
>     root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
>     root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
>     buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [.md
>     buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat
> /proc/5845/stack
>     [<0>] md_bitmap_cond_end_sync+0x12d/0x170
>     [<0>] raid5_sync_request+0x24b/0x390
>     [<0>] md_do_sync+0xb41/0x1030
>     [<0>] md_thread+0x122/0x160
>     [<0>] kthread+0x118/0x130
>     [<0>] ret_from_fork+0x1f/0x30
>
> I guess, md_bitmap_cond_end_sync+0x12d is the
> `wait_event(bitmap->mddev->recovery_wait,atomic_read(&bitmap->mddev->recovery_active)
> == 0);` in md-bitmap.c.
>

Could be. If so, then I think md_done_sync was not triggered by the path md0_raid6 -> ... -> handle_stripe.

I'd suggest to compare the stacks between md0 and md1 to find the
difference.

Thanks,
Guoqing

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

On 12/3/20 2:55 AM, Guoqing Jiang wrote:
> Hi Donald,
>
> On 12/2/20 18:28, Donald Buczek wrote:
>> Dear Guoqing,
>>
>> unfortunately the patch didn't fix the problem (unless I messed it up with my logging). This is what I used:
>>
>>      --- a/drivers/md/md.c
>>      +++ b/drivers/md/md.c
>>      @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
>>                              clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>                              goto unlock;
>>                      }
>
> I think you can add the check of RECOVERY_CHECK in above part instead of add a new part.

I understand that. I did it this way to get a log message only when the changed expression actually makes a difference (the original block not taking its goto while the modified block does), and to guard against either of us making a stupid error with the boolean expressions.

>
>>      +               if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>      +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>>      +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>>      +                       /* resync/recovery still happening */
>>      +                       pr_info("md: XXX BUGFIX applied\n");
>>      +                       clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>      +                       goto unlock;
>>      +               }
>>                      if (mddev->sync_thread) {
>>                              md_reap_sync_thread(mddev);
>>                              goto unlock;
>
>
>>
>> With pausing and continuing the check four times an hour, I could trigger the problem after about 48 hours. This time, the other device (md0) has locked up on `echo idle > /sys/devices/virtual/block/md0/md/sync_action` , while the check of md1 is still ongoing:
>
> Without the patch, md0 was good while md1 was locked. So the patch switches the status of the two arrays, a little weird ...

I think it's just random. There is a slight chance of a deadlock, depending on I/O to the device or other system load, when stopping the resync. In the last experiment, both md0 and md1 were switched from "check" to "idle" about 400 times before one of them blocked.

OTOH - and I don't understand this at all - it looks like md0 hung again after a reboot to a 5.4 kernel at the very first "check" to "idle" transition, which seems to contradict the "slight chance" and "once after 400 attempts".

> What is the stack of the process? I guess it is same as the stack of 23333 in your previous mail, but just to confirm.

Of the md0_raid6 thread? Sorry, I didn't capture that data before the reboot this time and concentrated on the non-terminating md0_resync thread.

>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [=>...................]  check =  8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [>....................]  check =  0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>
>> after 1 minute:
>>
>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [=>...................]  check =  8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [>....................]  check =  0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>
>> A data point, I didn't mention in my previous mail, is that the mdX_resync thread is not gone when the problem occurs:
>>
>>      buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [.md
>>      root       134     2  0 Nov30 ?        00:00:00 [md]
>>      root      1321     2 27 Nov30 ?        12:57:14 [md0_raid6]
>>      root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
>>      root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
>>      root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
>>      buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [.md
>>      buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
>>      [<0>] md_bitmap_cond_end_sync+0x12d/0x170
>>      [<0>] raid5_sync_request+0x24b/0x390
>>      [<0>] md_do_sync+0xb41/0x1030
>>      [<0>] md_thread+0x122/0x160
>>      [<0>] kthread+0x118/0x130
>>      [<0>] ret_from_fork+0x1f/0x30
>>
>> I guess, md_bitmap_cond_end_sync+0x12d is the `wait_event(bitmap->mddev->recovery_wait,atomic_read(&bitmap->mddev->recovery_active) == 0);` in md-bitmap.c.
>>
>
> Could be, if so, then I think md_done_sync was not triggered by the path md0_raid6 -> ... -> handle_stripe.

Ok. Maybe I find a way to add some logging there.

> I'd suggest to compare the stacks between md0 and md1 to find the difference.

Compare the stacks of md0_raid6 and md1_raid6?

Okay, but when one md is deadlocked, the other is operating just normally. I/O can be done, and you can even start and stop a resync on it. So I'd expect the stack of the operating raid to vary.

Thanks.

Donald

>
> Thanks,
> Guoqing


--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear Guoqing,

I now think that this is not an issue in md itself. I've driven a system into that situation again and have a clear indication that this is a problem of the member block device driver.

With md0 in the described erroneous state (md0_raid6 busy-looping, `echo idle > .../sync_action` blocked, no progress in mdstat) and md1 operating normally:

root:deadbird:/scratch/local/# for f in /sys/devices/virtual/block/md?/md/rd*/block/inflight;do echo $f: $(cat $f);done
/sys/devices/virtual/block/md0/md/rd0/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd1/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd10/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd11/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd12/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd13/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd14/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd15/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd2/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd3/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd4/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd5/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd6/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd7/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd8/block/inflight: 1 0
/sys/devices/virtual/block/md0/md/rd9/block/inflight: 1 0
/sys/devices/virtual/block/md1/md/rd0/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd1/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd10/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd11/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd12/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd13/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd14/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd15/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd2/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd3/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd4/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd5/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd6/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd7/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd8/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd9/block/inflight: 0 0

Best
Donald

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Dear md-raid people,

I've reported a problem in this thread in December:

"We are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions." It was clear that this is related to "mdcheck" running, because we found that the command which should stop the scrubbing in the morning (`echo idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.

On 12/21/20, I reported that the problem might be caused by a failure of the underlying block device driver, because I had found "inflight" counters of the block devices that were not zero. However, this is not the case: we were able to run into the mdX_raid6 looping condition a few times again, but the non-zero inflight counters have not been observed again.

I was able to collect a lot of additional information from a blocked system.

- The `echo idle > /sys/devices/virtual/block/md1/md/sync_action` command is waiting at kthread_stop to stop the sync thread. [ https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]

- The sync thread ("md1_resync") does not finish, because it's blocked at

[<0>] raid5_get_active_stripe+0x4c4/0x660 # [1]
[<0>] raid5_sync_request+0x364/0x390
[<0>] md_do_sync+0xb41/0x1030
[<0>] md_thread+0x122/0x160
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x22/0x30

[1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
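
From my reading, the wait at [1] is this one (quoted from memory, so the exact condition may differ in detail):

    /* raid5.c, raid5_get_active_stripe(): with R5_INACTIVE_BLOCKED set, the
     * caller waits until stripes are put back onto the inactive lists */
    wait_event_lock_irq(conf->wait_for_stripe,
                        !list_empty(conf->inactive_list + hash) &&
                        (atomic_read(&conf->active_stripes)
                          < (conf->max_nr_stripes * 3 / 4)
                         || !test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)),
                        *(conf->hash_locks + hash));

which would fit the large stripe_cache_active value below.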

- yes, gdb confirms that `conf->cache_state` is 0x03 ( R5_INACTIVE_BLOCKED + R5_ALLOC_MORE )

- We have lots of active stripes:

root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
27534

- handle_stripe() doesn't make progress:

echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control

In dmesg, we see the debug output from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 but never from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:

[171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
, check:4, reconstruct:0
[171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
, check:4, reconstruct:0
[171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
, check:4, reconstruct:0

The sector numbers repeat after some time. We have only the following variants of stripe state and "check":

state=0x2031 cnt=1, check:0 # ACTIVE +INSYNC+REPLACED+IO_STARTED, check_state_idle
state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING +REPLACED+IO_STARTED, check_state_check_result
state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING +IO_STARTED, check_state_idle

- We have MD_SB_CHANGE_PENDING set:

root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/array_state
write-pending

gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)
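
(For decoding: the sb_flags bits are, roughly, from drivers/md/md.h:

    enum mddev_sb_flags {
            MD_SB_CHANGE_DEVS,      /* bit 0: some device status has changed */
            MD_SB_CHANGE_CLEAN,     /* bit 1: transition to or from 'clean' */
            MD_SB_CHANGE_PENDING,   /* bit 2: switch from 'clean' to 'active' in progress */
            MD_SB_NEED_REWRITE,     /* bit 3: metadata write needs to be repeated */
    };

so 6 = MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING.)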

So it can be assumed that handle_stripe breaks out at https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
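
The check there reads roughly like this (paraphrased):

    /* raid5.c, handle_stripe(): while a superblock update is pending,
     * don't handle the stripe, just schedule it to be looked at again */
    if (s.handle_bad_blocks ||
        test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
            set_bit(STRIPE_HANDLE, &sh->state);
            goto finish;
    }

That would fit the busy loop: with MD_SB_CHANGE_PENDING set, the same stripes are taken from the handle list again and again without ever completing, so md_done_sync() is never called for them and the resync thread never gets to finish either.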

- The system can manually be freed from the deadlock:

When `echo active > /sys/block/md1/md/array_state` is used, the scrubbing and other I/O continue, probably because of https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520
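
The "active" case there does (roughly) this, i.e. it clears the very flag that makes handle_stripe bail out:

    /* md.c, array_state_store(), case active (paraphrased) */
    if (mddev->pers) {
            err = restart_array(mddev);
            if (err)
                    break;
            clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
            wake_up(&mddev->sb_wait);
            err = 0;
    }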

I, of course, don't fully understand it yet. Any ideas?

I append some data from a hanging raid (mddev, r5conf and a sample stripe_head from the handle_list with its first disk).

Donald


$L_002: (struct mddev *) 0xffff8898a48e3000 : {
private = (struct r5conf *) 0xffff88988695a800 /* --> $L_004 */
pers = (struct md_personality *) 0xffffffff826ee280 <raid6_personality>
unit = 9437185
md_minor = 1
disks = {next = 0xffff889886ee8c00, prev = 0xffff889891506000}
flags = 0
sb_flags = 6
suspended = 0
active_io = { counter=4 } /* atomic_t */
ro = 0
sysfs_active = 0
gendisk = (struct gendisk *) 0xffff8898bc357800
kobj = {
name = (const char *) 0xffffffff8220cef8 "md"
entry = {next = 0xffff8898a48e3058, prev = 0xffff8898a48e3058} /* empty */
parent = (struct kobject *) 0xffff8898bc357868
kset = 0x0
ktype = (struct kobj_type *) 0xffffffff826ef140 <md_ktype>
sd = (struct kernfs_node *) 0xffff8898b5f65980
kref = {
refcount = {
refs = { counter=17 } /* atomic_t */
}
}
state_initialized = 1
state_in_sysfs = 1
state_add_uevent_sent = 0
state_remove_uevent_sent = 0
uevent_suppress = 0
}
hold_active = 0
major_version = 1
minor_version = 2
patch_version = 0
persistent = 1
external = 0
metadata_type = [
(8)
[0] : 0 '\000'
[1] : 0 '\000'
[2] : 0 '\000'
[3] : 0 '\000'
[4] : 0 '\000'
[5] : 0 '\000'
[6] : 0 '\000'
[7] : 0 '\000'
[8] : 0 '\000'
[9] : 0 '\000'
[10] : 0 '\000'
[11] : 0 '\000'
[12] : 0 '\000'
[13] : 0 '\000'
[14] : 0 '\000'
[15] : 0 '\000'
[16] : 0 '\000'
]
chunk_sectors = 1024
ctime = 1607680540
utime = 1610853337
level = 6
layout = 2
clevel = [
(8)
[0] : 114 'r'
[1] : 97 'a'
[2] : 105 'i'
[3] : 100 'd'
[4] : 54 '6'
[5] : 0 '\000'
[6] : 0 '\000'
[7] : 0 '\000'
[8] : 0 '\000'
[9] : 0 '\000'
[10] : 0 '\000'
[11] : 0 '\000'
[12] : 0 '\000'
[13] : 0 '\000'
[14] : 0 '\000'
[15] : 0 '\000'
]
raid_disks = 16
max_disks = 1920
dev_sectors = 15627788288
array_sectors = 218789036032
external_size = 0
events = 32409
can_decrease_events = 0
uuid = [
(8)
[0] : -49 '\317'
[1] : 69 'E'
[2] : 119 'w'
[3] : 62 '>'
[4] : -113 '\217'
[5] : -107 '\225'
[6] : 9 '\t'
[7] : -90 '\246'
[8] : -2 '\376'
[9] : 84 'T'
[10] : -12 '\364'
[11] : -31 '\341'
[12] : -110 '\222'
[13] : 24 '\030'
[14] : -15 '\361'
[15] : 109 'm'
]
reshape_position = 18446744073709551615
delta_disks = 0
new_level = 6
new_layout = 2
new_chunk_sectors = 1024
reshape_backwards = 0
thread = (struct md_thread *) 0xffff889887bde080
sync_thread = 0x0
last_sync_action = (char *) 0xffffffff8225e246 "check"
curr_resync = 4947256648
curr_resync_completed = 4947016640
resync_mark = 4415850173
resync_mark_cnt = 66402936
curr_mark_cnt = 75262560
resync_max_sectors = 15627788288
resync_mismatches = {
counter = 0
}
suspend_lo = 0
suspend_hi = 0
sync_speed_min = 0
sync_speed_max = 0
parallel_resync = 0
ok_start_degraded = 0
recovery = 203
recovery_disabled = 16
in_sync = 0
open_mutex = {
owner = { counter=0 } /* atomic_t */
wait_lock = {counter=0} /* spinlock_t */
osq = {
tail = { counter=0 } /* atomic_t */
}
wait_list = {next = 0xffff8898a48e31d8, prev = 0xffff8898a48e31d8} /* empty */
}
reconfig_mutex = {
owner = { counter=-131247629823744 } /* atomic_t */
wait_lock = {counter=0} /* spinlock_t */
osq = {
tail = { counter=0 } /* atomic_t */
}
wait_list = {next = 0xffff8898a48e31f8, prev = 0xffff8898a48e31f8} /* empty */
}
active = { counter=2 } /* atomic_t */
openers = { counter=1 } /* atomic_t */
changed = 0
degraded = 0
recovery_active = { counter=181360 } /* atomic_t */
recovery_wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e3228, prev = 0xffff8898a48e3228} /* empty */
}
recovery_cp = 18446744073709551615
resync_min = 4871994088
resync_max = 18446744073709551615
sysfs_state = (struct kernfs_node *) 0xffff8898b5f65380
sysfs_action = (struct kernfs_node *) 0xffff889887d87600
sysfs_completed = (struct kernfs_node *) 0xffff8898ac6f4e80
sysfs_degraded = (struct kernfs_node *) 0xffff8898ac6f4b80
sysfs_level = (struct kernfs_node *) 0xffff8898b5f65700
del_work = {
data = { counter=128 } /* atomic_t */
entry = {next = 0xffff8898a48e3280, prev = 0xffff8898a48e3280} /* empty */
func = (work_func_t) 0xffffffff819ad430 <md_start_sync>
}
lock = {counter=0} /* spinlock_t */
sb_wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffffc9000e4a7a68, prev = 0xffffc9000cf97d30}
}
pending_writes = { counter=0 } /* atomic_t */
safemode = 0
safemode_delay = 201
safemode_timer = {
entry = {
next = (struct hlist_node *) 0xdead000000000122
pprev = 0x0
}
expires = 4415857239
function = (void (*)(struct timer_list *)) 0xffffffff819a6a40 <md_safemode_timeout>
flags = 314572820
}
writes_pending = {
percpu_count_ptr = 105896157983096
data = (struct percpu_ref_data *) 0xffff8898b340fc80
}
sync_checkers = 0
queue = (struct request_queue *) 0xffff8898b65847c8
bitmap = (struct bitmap *) 0xffff889887d84400
bitmap_info = {
file = 0x0
offset = 8
space = 0
default_offset = 2
default_space = 6
mutex = {
owner = { counter=0 } /* atomic_t */
wait_lock = {counter=0} /* spinlock_t */
osq = {
tail = { counter=0 } /* atomic_t */
}
wait_list = {next = 0xffff8898a48e3350, prev = 0xffff8898a48e3350} /* empty */
}
chunksize = 67108864
daemon_sleep = 5000
max_write_behind = 0
external = 0
nodes = 0
cluster_name = [
(8)
[0] : 0 '\000'
[1] : 0 '\000'
[2] : 0 '\000'
[3] : 0 '\000'
[4] : 0 '\000'
[5] : 0 '\000'
[6] : 0 '\000'
[7] : 0 '\000'
[8] : 0 '\000'
[9] : 0 '\000'
[10] : 0 '\000'
[11] : 0 '\000'
[12] : 0 '\000'
[13] : 0 '\000'
[14] : 0 '\000'
[15] : 0 '\000'
[16] : 0 '\000'
[17] : 0 '\000'
[18] : 0 '\000'
[19] : 0 '\000'
[20] : 0 '\000'
[21] : 0 '\000'
[22] : 0 '\000'
[23] : 0 '\000'
[24] : 0 '\000'
[25] : 0 '\000'
[26] : 0 '\000'
[27] : 0 '\000'
[28] : 0 '\000'
[29] : 0 '\000'
[30] : 0 '\000'
[31] : 0 '\000'
[32] : 0 '\000'
[33] : 0 '\000'
[34] : 0 '\000'
[35] : 0 '\000'
[36] : 0 '\000'
[37] : 0 '\000'
[38] : 0 '\000'
[39] : 0 '\000'
[40] : 0 '\000'
[41] : 0 '\000'
[42] : 0 '\000'
[43] : 0 '\000'
[44] : 0 '\000'
[45] : 0 '\000'
[46] : 0 '\000'
[47] : 0 '\000'
[48] : 0 '\000'
[49] : 0 '\000'
[50] : 0 '\000'
[51] : 0 '\000'
[52] : 0 '\000'
[53] : 0 '\000'
[54] : 0 '\000'
[55] : 0 '\000'
[56] : 0 '\000'
[57] : 0 '\000'
[58] : 0 '\000'
[59] : 0 '\000'
[60] : 0 '\000'
[61] : 0 '\000'
[62] : 0 '\000'
[63] : 0 '\000'
]
}
max_corr_read_errors = { counter=20 } /* atomic_t */
all_mddevs = {next = 0xffffffff826eef10, prev = 0xffff88810a551bc8}
to_remove = 0x0
bio_set = {
bio_slab = (struct kmem_cache *) 0xffff889886fbdc00
front_pad = 0
bio_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b4fae300
pool_data = (void *) 0xffff889886fbdc00
alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e3428, prev = 0xffff8898a48e3428} /* empty */
}
}
bvec_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b4fae2e0
pool_data = (void *) 0xffff8881078ecb40
alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e3470, prev = 0xffff8898a48e3470} /* empty */
}
}
bio_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
bvec_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
rescue_lock = {counter=0} /* spinlock_t */
rescue_list = {
head = 0x0
tail = 0x0
}
rescue_work = {
data = { counter=68719476704 } /* atomic_t */
entry = {next = 0xffff8898a48e3530, prev = 0xffff8898a48e3530} /* empty */
func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue>
}
rescue_workqueue = 0x0
}
sync_set = {
bio_slab = (struct kmem_cache *) 0xffff889886fbdc00
front_pad = 0
bio_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b4fae2c0
pool_data = (void *) 0xffff889886fbdc00
alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e3598, prev = 0xffff8898a48e3598} /* empty */
}
}
bvec_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b4fae2a0
pool_data = (void *) 0xffff8881078ecb40
alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e35e0, prev = 0xffff8898a48e35e0} /* empty */
}
}
bio_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
bvec_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
rescue_lock = {counter=0} /* spinlock_t */
rescue_list = {
head = 0x0
tail = 0x0
}
rescue_work = {
data = { counter=68719476704 } /* atomic_t */
entry = {next = 0xffff8898a48e36a0, prev = 0xffff8898a48e36a0} /* empty */
func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue>
}
rescue_workqueue = 0x0
}
md_io_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b236af60
pool_data = (void *) 0x28 <fixed_percpu_data+40>
alloc = (mempool_alloc_t *) 0xffffffff811c8b50 <mempool_kmalloc>
free = (mempool_free_t *) 0xffffffff811c84b0 <mempool_kfree>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff8898a48e36f8, prev = 0xffff8898a48e36f8} /* empty */
}
}
flush_bio = 0x0
flush_pending = { counter=0 } /* atomic_t */
start_flush = 121249416090188
last_flush = 121249416090188
flush_work = {
data = { counter=2176 } /* atomic_t */
entry = {next = 0xffff8898a48e3730, prev = 0xffff8898a48e3730} /* empty */
func = (work_func_t) 0xffffffff819aca80 <md_submit_flush_data>
}
event_work = {
data = { counter=0 } /* atomic_t */
entry = {next = 0x0000000000000000, prev = 0x0000000000000000}
func = 0x0
}
serial_info_pool = 0x0
sync_super = 0x0
cluster_info = 0x0
good_device_nr = 0
noio_flag = 0
has_superblocks = true_21_
fail_last_dev = false_21_
serialize_policy = false_21_
}

$L_004: (struct r5conf *) 0xffff88988695a800 : {
stripe_hashtbl = (struct hlist_head *) 0xffff8898a1cb9000 /* --> $L_006 */
hash_locks = [
(23)
[0] : {counter=0} /* spinlock_t */
[1] : {counter=0} /* spinlock_t */
[2] : {counter=0} /* spinlock_t */
[3] : {counter=0} /* spinlock_t */
[4] : {counter=0} /* spinlock_t */
[5] : {counter=0} /* spinlock_t */
[6] : {counter=0} /* spinlock_t */
[7] : {counter=0} /* spinlock_t */
]
mddev = (struct mddev *) 0xffff8898a48e3000
chunk_sectors = 1024
level = 6
algorithm = 2
rmw_level = 1
max_degraded = 2
raid_disks = 16
max_nr_stripes = 29816
min_nr_stripes = 256
reshape_progress = 18446744073709551615
reshape_safe = 0
previous_raid_disks = 16
prev_chunk_sectors = 1024
prev_algo = 2
generation = 0
gen_lock = {
seqcount = {
sequence = 0
}
}
reshape_checkpoint = 0
min_offset_diff = 0
handle_list = {next = 0xffff888a7893c890, prev = 0xffff8889de3cc7d0}
loprio_list = {next = 0xffff88988695a898, prev = 0xffff88988695a898} /* empty */
hold_list = {next = 0xffff88988695a8a8, prev = 0xffff88988695a8a8} /* empty */
delayed_list = {next = 0xffff88988695a8b8, prev = 0xffff88988695a8b8} /* empty */
bitmap_list = {next = 0xffff88988695a8c8, prev = 0xffff88988695a8c8} /* empty */
retry_read_aligned = 0x0
retry_read_offset = 0
retry_read_aligned_list = 0x0
preread_active_stripes = { counter=0 } /* atomic_t */
active_aligned_reads = { counter=0 } /* atomic_t */
pending_full_writes = { counter=0 } /* atomic_t */
bypass_count = 0
bypass_threshold = 1
skip_copy = 0
last_hold = (struct list_head *) 0xffff88a299ea0010
reshape_stripes = { counter=0 } /* atomic_t */
active_name = 0
cache_name = [
(2)
[0] : [
(8)
[0] : 114 'r'
[1] : 97 'a'
[2] : 105 'i'
[3] : 100 'd'
[4] : 54 '6'
[5] : 45 '-'
[6] : 109 'm'
[7] : 100 'd'
[8] : 49 '1'
[9] : 0 '\000'
[10] : 0 '\000'
[11] : 0 '\000'
[12] : 0 '\000'
[13] : 0 '\000'
[14] : 0 '\000'
[15] : 0 '\000'
[16] : 0 '\000'
[17] : 0 '\000'
[18] : 0 '\000'
[19] : 0 '\000'
[20] : 0 '\000'
[21] : 0 '\000'
[22] : 0 '\000'
[23] : 0 '\000'
[24] : 0 '\000'
[25] : 0 '\000'
[26] : 0 '\000'
[27] : 0 '\000'
[28] : 0 '\000'
[29] : 0 '\000'
[30] : 0 '\000'
[31] : 0 '\000'
]
[1] : [
(8)
[0] : 114 'r'
[1] : 97 'a'
[2] : 105 'i'
[3] : 100 'd'
[4] : 54 '6'
[5] : 45 '-'
[6] : 109 'm'
[7] : 100 'd'
[8] : 49 '1'
[9] : 45 '-'
[10] : 97 'a'
[11] : 108 'l'
[12] : 116 't'
[13] : 0 '\000'
[14] : 0 '\000'
[15] : 0 '\000'
[16] : 0 '\000'
[17] : 0 '\000'
[18] : 0 '\000'
[19] : 0 '\000'
[20] : 0 '\000'
[21] : 0 '\000'
[22] : 0 '\000'
[23] : 0 '\000'
[24] : 0 '\000'
[25] : 0 '\000'
[26] : 0 '\000'
[27] : 0 '\000'
[28] : 0 '\000'
[29] : 0 '\000'
[30] : 0 '\000'
[31] : 0 '\000'
]
]
slab_cache = (struct kmem_cache *) 0xffff8881078eca80
cache_size_mutex = {
owner = { counter=0 } /* atomic_t */
wait_lock = {counter=0} /* spinlock_t */
osq = {
tail = { counter=0 } /* atomic_t */
}
wait_list = {next = 0xffff88988695a970, prev = 0xffff88988695a970} /* empty */
}
seq_flush = 5031591
seq_write = 5031591
quiesce = 0
fullsync = 0
recovery_disabled = 15
percpu = (struct raid5_percpu *) 0x604fdee147e8
scribble_disks = 16
scribble_sectors = 1024
node = {
next = 0x0
pprev = (struct hlist_node **) 0xffff88810a5b89a8
}
active_stripes = { counter=27534 } /* atomic_t */
inactive_list = [
(3)
[0] : {next = 0xffff88990fd76710, prev = 0xffff8889b7f72550}
[1] : {next = 0xffff88959af002d0, prev = 0xffff889958cd4590}
[2] : {next = 0xffff88a60a070150, prev = 0xffff88a37d3d0850}
[3] : {next = 0xffff889e211e0690, prev = 0xffff8893a4000110}
[4] : {next = 0xffff888c2139a350, prev = 0xffff88901a188410}
[5] : {next = 0xffff88a83413e3d0, prev = 0xffff88a2090bc350}
[6] : {next = 0xffff88a720d54310, prev = 0xffff888a0da6c5d0}
[7] : {next = 0xffff888a0412c650, prev = 0xffff889250596550}
]
r5c_cached_full_stripes = { counter=0 } /* atomic_t */
r5c_full_stripe_list = {next = 0xffff88988695aa48, prev = 0xffff88988695aa48} /* empty */
r5c_cached_partial_stripes = { counter=0 } /* atomic_t */
r5c_partial_stripe_list = {next = 0xffff88988695aa60, prev = 0xffff88988695aa60} /* empty */
r5c_flushing_full_stripes = { counter=0 } /* atomic_t */
r5c_flushing_partial_stripes = { counter=0 } /* atomic_t */
empty_inactive_list_nr = { counter=0 } /* atomic_t */
released_stripes = {
first = 0x0
}
wait_for_quiescent = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff88988695aa90, prev = 0xffff88988695aa90} /* empty */
}
wait_for_stripe = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffffc9000e3f7c78, prev = 0xffffc9000e3f7c78}
}
wait_for_overlap = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff88988695aac0, prev = 0xffff88988695aac0} /* empty */
}
cache_state = 3
shrinker = {
count_objects = (long unsigned int (*)(struct shrinker *, struct shrink_control *)) 0xffffffff8198bc60 <raid5_cache_count>
scan_objects = (long unsigned int (*)(struct shrinker *, struct shrink_control *)) 0xffffffff8198de90 <raid5_cache_scan>
batch = 128
seeks = 128
flags = 0
list = {next = 0xffff8898867e8c30, prev = 0xffffffffa0e14520}
id = 0
nr_deferred = (atomic_long_t *) 0xffff88988d8955c0
}
pool_size = 16
device_lock = {counter=0} /* spinlock_t */
disks = (struct disk_info *) 0xffff8898b1df3800
bio_split = {
bio_slab = (struct kmem_cache *) 0xffff889886fbdc00
front_pad = 0
bio_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 2
curr_nr = 2
elements = (void **) 0xffff8898b4fae280
pool_data = (void *) 0xffff889886fbdc00
alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0xffff88988695ab70, prev = 0xffff88988695ab70} /* empty */
}
}
bvec_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
bio_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
bvec_integrity_pool = {
lock = {counter=0} /* spinlock_t */
min_nr = 0
curr_nr = 0
elements = 0x0
pool_data = 0x0
alloc = 0x0
free = 0x0
wait = {
lock = {counter=0} /* spinlock_t */
head = {next = 0x0000000000000000, prev = 0x0000000000000000}
}
}
rescue_lock = {counter=0} /* spinlock_t */
rescue_list = {
head = 0x0
tail = 0x0
}
rescue_work = {
data = { counter=68719476704 } /* atomic_t */
entry = {next = 0xffff88988695ac78, prev = 0xffff88988695ac78} /* empty */
func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue>
}
rescue_workqueue = 0x0
}
thread = 0x0
temp_inactive_list = [
(3)
[0] : {next = 0xffff88988695aca0, prev = 0xffff88988695aca0} /* empty */
[1] : {next = 0xffff88988695acb0, prev = 0xffff88988695acb0} /* empty */
[2] : {next = 0xffff88988695acc0, prev = 0xffff88988695acc0} /* empty */
[3] : {next = 0xffff88988695acd0, prev = 0xffff88988695acd0} /* empty */
[4] : {next = 0xffff88988695ace0, prev = 0xffff88988695ace0} /* empty */
[5] : {next = 0xffff88988695acf0, prev = 0xffff88988695acf0} /* empty */
[6] : {next = 0xffff88988695ad00, prev = 0xffff88988695ad00} /* empty */
[7] : {next = 0xffff88988695ad10, prev = 0xffff88988695ad10} /* empty */
]
worker_groups = 0x0
group_cnt = 0
worker_cnt_per_group = 0
log = 0x0
log_private = 0x0
pending_bios_lock = {counter=0} /* spinlock_t */
batch_bio_dispatch = true_21_
pending_data = (struct r5pending_data *) 0xffff88988fae0000
free_list = {next = 0xffff88988fae4fd8, prev = 0xffff88988fae0000}
pending_list = {next = 0xffff88988695ad60, prev = 0xffff88988695ad60} /* empty */
pending_data_cnt = 0
next_pending_data = 0x0
}


set $sh = container_of ((void *)0xffff88a57f1e6550, struct stripe_head, lru)

(gdb) dump *$sh

$L_010: (struct stripe_head *) 0xffff88a57f1e6540 : {
hash = {
next = (struct hlist_node *) 0xffff888768f7c680
pprev = (struct hlist_node **) 0xffff8892f0520080
}
lru = {next = 0xffff88945e034510, prev = 0xffff8891d8f04210}
release_list = {
next = (struct llist_node *) 0xffff88945e034520
}
raid_conf = (struct r5conf *) 0xffff88988695a800
generation = 0
sector = 4947068032
pd_idx = 14
qd_idx = 15
ddf_layout = 0
hash_lock_index = 0
state = 8242 # 0x2032 HANDLE + INSYNC + REPLACED + STARTED
count = { counter=0 } /* atomic_t */
bm_seq = 5030861
disks = 16
overwrite_disks = 0
check_state = check_state_idle
reconstruct_state = reconstruct_state_idle
stripe_lock = {counter=0} /* spinlock_t */
cpu = 17
group = 0x0
batch_head = 0x0 # no batch
batch_lock = {counter=0} /* spinlock_t */
batch_list = {next = 0xffff88a57f1e65c8, prev = 0xffff88a57f1e65c8} /* empty */
/* anon union*/ {
log_io = 0x0
ppl_io = 0x0
}
log_list = {next = 0xffff88a57f1e65e0, prev = 0xffff88a57f1e65e0} /* empty */
log_start = 18446744073709551615
r5c = {next = 0xffff88a57f1e65f8, prev = 0xffff88a57f1e65f8} /* empty */
ppl_page = 0x0
ops = {
target = 0
target2 = 0
zero_sum_result = 0
}


(gdb) dump $sh->dev[0]

$L_011: (struct r5dev *) 0xffff88a57f1e6620 : {
req = {
bi_next = 0x0
bi_disk = 0x0
bi_opf = 0
bi_flags = 0
bi_ioprio = 0
bi_write_hint = 0
bi_status = 0 '\000'
bi_partno = 0 '\000'
__bi_remaining = { counter=1 } /* atomic_t */
bi_iter = {
bi_sector = 0
bi_size = 0
bi_idx = 0
bi_bvec_done = 0
}
bi_end_io = 0x0
bi_private = 0x0
bi_blkg = 0x0
bi_issue = {
value = 0
}
/* anon union*/ {
bi_integrity = 0x0
}
bi_vcnt = 0
bi_max_vecs = 1
__bi_cnt = { counter=1 } /* atomic_t */
bi_io_vec = (struct bio_vec *) 0xffff88a57f1e6710
bi_pool = 0x0
bi_inline_vecs = [
]
}
rreq = {
bi_next = 0x0
bi_disk = 0x0
bi_opf = 0
bi_flags = 0
bi_ioprio = 0
bi_write_hint = 0
bi_status = 0 '\000'
bi_partno = 0 '\000'
__bi_remaining = { counter=1 } /* atomic_t */
bi_iter = {
bi_sector = 0
bi_size = 0
bi_idx = 0
bi_bvec_done = 0
}
bi_end_io = 0x0
bi_private = 0x0
bi_blkg = 0x0
bi_issue = {
value = 0
}
/* anon union*/ {
bi_integrity = 0x0
}
bi_vcnt = 0
bi_max_vecs = 1
__bi_cnt = { counter=1 } /* atomic_t */
bi_io_vec = (struct bio_vec *) 0xffff88a57f1e6720
bi_pool = 0x0
bi_inline_vecs = [
]
}
vec = {
bv_page = (struct page *) 0xffffea008ffccc80
bv_len = 4096
bv_offset = 0
}
rvec = {
bv_page = 0x0
bv_len = 0
bv_offset = 0
}
page = (struct page *) 0xffffea008ffccc80
orig_page = (struct page *) 0xffffea008ffccc80
offset = 0
toread = 0x0
read = 0x0
towrite = 0x0
written = 0x0
sector = 69258950784
flags = 17 # 0x11 : R5_UPTODATE + R5_INSYNC
log_checksum = 0
write_hint = 0
}

$sh->dev[1] .. $sh->dev[15] omitted. All the same, just different page buffers and sector numbers.

$sh->dev[14] and $sh->dev[15] have sector = 0.
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
Hi Donald,

On 1/19/21 12:30, Donald Buczek wrote:
> Dear md-raid people,
>
> I've reported a problem in this thread in December:
>
> "We are using raid6 on several servers. Occasionally we had failures,
> where a mdX_raid6 process seems to go into a busy loop and all I/O to
> the md device blocks. We've seen this on various kernel versions." It
> was clear, that this is related to "mdcheck" running, because we found,
> that the command, which should stop the scrubbing in the morning (`echo
> idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.
>
> On 12/21/20, I've reported, that the problem might be caused by a
> failure of the underlying block device driver, because I've found
> "inflight" counters of the block devices not being zero. However, this
> is not the case. We were able to run into the mdX_raid6 looping
> condition a few times again, but the non-zero inflight counters have not
> been observed again.
>
> I was able to collect a lot of additional information from a blocked
> system.
>
> - The `cat idle > /sys/devices/virtual/block/md1/md/sync_action` command
> is waiting at kthread_stop to stop the sync thread. [
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
>
> - The sync thread ("md1_resync") does not finish, because its blocked at
>
> [<0>] raid5_get_active_stripe+0x4c4/0x660     # [1]
> [<0>] raid5_sync_request+0x364/0x390
> [<0>] md_do_sync+0xb41/0x1030
> [<0>] md_thread+0x122/0x160
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
>
> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
>
> - yes, gdb confirms that `conf->cache_state` is 0x03 (
> R5_INACTIVE_BLOCKED + R5_ALLOC_MORE )

The resync thread is blocked since it can't get a sh from the inactive list,
so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are set; that is why `echo idle
> /sys/devices/virtual/block/md1/md/sync_action` can't stop the sync thread.
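
For reference, both ends of the hang can be confirmed from userspace via the
kernel stack files (a sketch; the pgrep patterns are only illustrative and
depend on your process names):

```sh
# the process that wrote "idle" to sync_action (mdcheck in your case)
cat /proc/$(pgrep -f mdcheck)/stack
# the md sync thread it is waiting for
cat /proc/$(pgrep md1_resync)/stack
```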

>
> - We have lots of active stripes:
>
> root@deadbird:~/linux_problems/mdX_raid6_looping# cat
> /sys/block/md1/md/stripe_cache_active
> 27534

There are too many active stripes, so this is false:

atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)

so raid5_get_active_stripe has to wait until it becomes true; we either
need to increase max_nr_stripes or decrease active_stripes.

1. Increase max_nr_stripes
since "mdX_raid6 process seems to go into a busy loop" and R5_ALLOC_MORE
is set, if there is enough memory, I suppose mdX_raid6 (raid5d) could
alloc new stripe in grow_one_stripe and increase max_nr_stripes. So
please check the memory usage of your system.

Another thing you can try is to increase the number of stripe heads
manually by writing a new number to stripe_cache_size, if there is
enough memory.
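
A rough sketch of what I mean (paths as in your report; the numbers are only
examples, and note that the value shown by stripe_cache_size may not exactly
equal max_nr_stripes):

```sh
free -g                                           # make sure there is enough free memory first
cat /sys/block/md1/md/stripe_cache_active         # currently 27534 in your dump
cat /sys/block/md1/md/stripe_cache_size           # configured stripe cache size
echo 16384 > /sys/block/md1/md/stripe_cache_size  # try a larger value
cat /sys/block/md1/md/stripe_cache_active         # check whether the array makes progress again
```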

2. Or decrease active_stripes
This is supposed to be done by do_release_stripe, but STRIPE_HANDLE is set

> - handle_stripe() doesn't make progress:
>
> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
>
> In dmesg, we see the debug output from
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925
> but never from
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
>
> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856,
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                 , check:4, reconstruct:0
> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872,
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                 , check:4, reconstruct:0
> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912,
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                 , check:4, reconstruct:0
>
> The sector numbers repeat after some time. We have only the following
> variants of stripe state and "check":
>
> state=0x2031 cnt=1, check:0 # ACTIVE        +INSYNC+REPLACED+IO_STARTED,
> check_state_idle
> state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING       +REPLACED+IO_STARTED,
> check_state_check_result
> state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING                +IO_STARTED,
> check_state_idle
>
> - We have MD_SB_CHANGE_PENDING set:

because of the MD_SB_CHANGE_PENDING flag, so do_release_stripe can't call
atomic_dec(&conf->active_stripes).

Normally, SB_CHANGE_PENDING is cleared by md_update_sb, which could be
called by md_reap_sync_thread. But md_reap_sync_thread is stuck
unregistering the sync_thread (which was blocked in raid5_get_active_stripe).


Still, I don't fully understand why mdX_raid6 is in a busy loop; maybe
raid5d can't break out of the while(1) loop because "if (!batch_size &&
!released)" is false. I would assume released is '0' since
> released_stripes = {
> first = 0x0
> }
And __get_priority_stripe fetches the sh from conf->handle_list and
deletes it from handle_list, handle_stripe marks the sh with
STRIPE_HANDLE, then do_release_stripe puts the sh back onto handle_list.
So batch_size can't be '0', and handle_active_stripes keeps repeating
this cycle in the busy loop. This is my best guess to explain the
busy-loop situation.

>
> root@deadbird:~/linux_problems/mdX_raid6_looping# cat
> /sys/block/md1/md/array_state
> write-pending
>
> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)

Since rdev_set_badblocks could set them, could you check whether there
are bad blocks on the underlying devices (sd*)?
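
Something like this should show them (a sketch; the per-device bad_blocks
attribute and the mdadm option are what I would expect on a recent kernel
and mdadm, please adjust to your setup):

```sh
# bad block lists md keeps per member device
for f in /sys/block/md1/md/dev-*/bad_blocks; do
    echo "== $f"; cat "$f"
done
# or per component device via mdadm
mdadm --examine-badblocks /dev/sdk
```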

>
> So it can be assumed that handle_stripe breaks out at
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
>
> - The system can manually be freed from the deadlock:
>
> When `echo active > /sys/block/md1/md/array_state` is used, the
> scrubbing and other I/O continue. Probably because of
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520

Hmm, it seems clearing SB_CHANGE_PENDING did the trick, so the blocked
process can make progress.

>
> I, of coruse, don't fully understand it yet. Any ideas?
>
> I append some data from a hanging raid... (mddev, r5conf and a sample
> stripe_head from handle_list with it first disks)

These data did help the investigation!

Thanks,
Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
Dear Guoqing,

On 20.01.21 17:33, Guoqing Jiang wrote:
> Hi Donald,
>
> On 1/19/21 12:30, Donald Buczek wrote:
>> Dear md-raid people,
>>
>> I've reported a problem in this thread in December:
>>
>> "We are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions." It was clear, that this is related to "mdcheck" running, because we found, that the command, which should stop the scrubbing in the morning (`echo idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.
>>
>> On 12/21/20, I've reported, that the problem might be caused by a failure of the underlying block device driver, because I've found "inflight" counters of the block devices not being zero. However, this is not the case. We were able to run into the mdX_raid6 looping condition a few times again, but the non-zero inflight counters have not been observed again.
>>
>> I was able to collect a lot of additional information from a blocked system.
>>
>> - The `cat idle > /sys/devices/virtual/block/md1/md/sync_action` command is waiting at kthread_stop to stop the sync thread. [ https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
>>
>> - The sync thread ("md1_resync") does not finish, because its blocked at
>>
>> [<0>] raid5_get_active_stripe+0x4c4/0x660     # [1]
>> [<0>] raid5_sync_request+0x364/0x390
>> [<0>] md_do_sync+0xb41/0x1030
>> [<0>] md_thread+0x122/0x160
>> [<0>] kthread+0x118/0x130
>> [<0>] ret_from_fork+0x22/0x30
>>
>> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
>>
>> - yes, gdb confirms that `conf->cache_state` is 0x03 ( R5_INACTIVE_BLOCKED + R5_ALLOC_MORE )
>
> The resync thread is blocked since it can't get sh from inactive list, so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are set, that is why `echo idle > /sys/devices/virtual/block/md1/md/sync_action` can't stop sync thread.
>
>>
>> - We have lots of active stripes:
>>
>> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
>> 27534
>
> There are too many active stripes, so this is false:
>
> atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)
>
> so raid5_get_active_stripe has to wait till it becomes true, either increase max_nr_stripes or decrease active_stripes.
>
> 1. Increase max_nr_stripes
> since "mdX_raid6 process seems to go into a busy loop" and R5_ALLOC_MORE is set, if there is enough memory, I suppose mdX_raid6 (raid5d) could alloc new stripe in grow_one_stripe and increase max_nr_stripes. So please check the memory usage of your system.
>
> Another thing is you can try to increase the number of sh manually by write new number to stripe_cache_size if there is enough memory.
>
> 2. Or decrease active_stripes
> This is suppose to be done by do_release_stripe, but STRIPE_HANDLE is set
>
>> - handle_stripe() doesn't make progress:
>>
>> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
>>
>> In dmesg, we see the debug output from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 but never from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
>>
>> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>>                  , check:4, reconstruct:0
>> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>>                  , check:4, reconstruct:0
>> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>>                  , check:4, reconstruct:0
>>
>> The sector numbers repeat after some time. We have only the following variants of stripe state and "check":
>>
>> state=0x2031 cnt=1, check:0 # ACTIVE        +INSYNC+REPLACED+IO_STARTED, check_state_idle
>> state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING       +REPLACED+IO_STARTED, check_state_check_result
>> state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING                +IO_STARTED, check_state_idle
>>
>> - We have MD_SB_CHANGE_PENDING set:
>
> because MD_SB_CHANGE_PENDING flag. So do_release_stripe can't call atomic_dec(&conf->active_stripes).
>
> Normally, SB_CHANGE_PENDING is cleared from md_update_sb which could be called by md_reap_sync_thread. But md_reap_sync_thread stuck with unregister sync_thread (it was blocked in raid5_get_active_stripe).
>
>
> Still I don't understand well why mdX_raid6 in a busy loop, maybe raid5d can't break from the while(1) loop because "if (!batch_size && !released)" is false. I would assume released is '0' since
> >      released_stripes = {
> >          first = 0x0
> >      }
> And __get_priority_stripe fetches sh from conf->handle_list and delete
> it from handle_list, handle_stripe marks the state of the sh with STRIPE_HANDLE, then do_release_stripe put the sh back to handle_list.
> So batch_size can't be '0', and handle_active_stripes in the loop
> repeats the process in the busy loop. This is my best guess to explain the busy loop situation.
>
>>
>> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/array_state
>> write-pending
>>
>> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)
>
> since rdev_set_badblocks could set them, could you check if there is badblock of underlying device (sd*)?

The problem is that we are still not able to run a system into the deadlocked state with a repeatable procedure, so I have not yet been able to explicitly check for bad blocks.

However, on the production systems that locked up, and on the test systems we managed to lock up, there was always some file-system activity on the affected devices aside from mdcheck itself. This alone would explain the clean->write-pending transitions and might be a required condition for the problem to happen. I'm currently trying to exercise this path explicitly (with a write/fsync()/sleep loop), with no result yet.
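
(The exercise loop is essentially of this shape; the path and sizes below are placeholders, not our exact script:)

```sh
#! /bin/bash
# force repeated clean -> active -> clean transitions next to the running check
while true; do
    dd if=/dev/zero of=/mnt/raid_md1/testfile bs=1M count=64 status=none
    sync /mnt/raid_md1/testfile
    rm /mnt/raid_md1/testfile
    sleep 5
done
```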

Btw: when you monitor md/stripe_cache_active during an ongoing "check" on an otherwise idle system, you see a variety of values up to a certain maximum, which is visible most of the time. But this maximum seems to increase continuously. Also, when you start the check at higher blocks right away (via md/sync_min), the maximum seems to be higher. Can this be explained? A bigger seek gap between the superblocks and the active data area? But should that increase the number of active stripes?
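
(For reference, that observation comes from a simple polling loop like this; a sketch, with md1 as an example:)

```sh
# sample check progress and stripe cache usage once per second
while true; do
    echo "$(date +%s) $(cat /sys/block/md1/md/sync_completed) $(cat /sys/block/md1/md/stripe_cache_active)"
    sleep 1
done
```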

Best
Donald

>> So it can be assumed that handle_stripe breaks out at https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
>>
>> - The system can manually be freed from the deadlock:
>>
>> When `echo active > /sys/block/md1/md/array_state` is used, the scrubbing and other I/O continue. Probably because of https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520
>
> Hmm, seems clear SB_CHANGE_PENDING made the trick, so the blocked process can make progress.
>
>>
>> I, of coruse, don't fully understand it yet. Any ideas?
>>
>> I append some data from a hanging raid... (mddev, r5conf and a sample stripe_head from handle_list with it first disks)
>
> These data did help for investigation!
>
> Thanks,
> Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
On 25.01.21 09:54, Donald Buczek wrote:
> Dear Guoqing,
>
> a colleague of mine was able to produce the issue inside a vm and were able to find a procedure to run the vm into the issue within minutes (not unreliably after hours on a physical system as before). This of course helped to pinpoint the problem.
>
> My current theory of what is happening is:
>
> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
>
> - Before raid5d gets to write the superblock (its busy processing active stripes because of the sync activity) , userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action`
>
> - action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) Dump of struct mddev of email 1/19/21 confirms reconf_mutex non-zero.
>
> - raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size ( https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729 ). This is the reason, raid5d is busy looping.
>
> - raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock, if a change is pending. However to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Both is not true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.
>
> - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.
>
> Does this make sense?
>
> Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>
> ```
> #! /bin/bash
>
> (
>         while true; do
>                 echo "start check"
>                 echo check > /sys/block/md3/md/sync_action
>                 sleep 10
>                 echo "stop check"
>                 echo idle > /sys/block/md3/md/sync_action
>                 sleep 2
>         done
> ) &
>
> (
>         while true; do
>                 dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
>                 sync /mnt/raid_md3/bigfile
>                 rm /mnt/raid_md3/bigfile
>                 sleep .1
>         done
> ) &
>
> start="$(date +%s)"
> cd /sys/block/md3/md
> wp_count=0
> while true; do
>         array_state=$(cat array_state)
>         if [ "$array_state" = write-pending ]; then
>                 wp_count=$(($wp_count+1))
>         else
>                 wp_count=0
>         fi
>         echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
>         if [ $wp_count -ge 3 ]; then
>                 kill -- -$$
>                 exit
>         fi
>         sleep 1
> done
> ```
>
> The time, this needs to trigger the issue, varies from under a minute to one hour with 5 minute being typical. The output ends like this:
>
>     309 check 6283872 / 8378368 active-idle 4144
>     310 check 6283872 / 8378368 active 1702
>     311 check 6807528 / 8378368 active 4152
>     312 check 7331184 / 8378368 clean 3021
>     stop check
>     313 check 7331184 / 8378368 write-pending 3905
>     314 check 7331184 / 8378368 write-pending 3905
>     315 check 7331184 / 8378368 write-pending 3905
>     Terminated
>
> If I add
>
>     pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);
>
> in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get
>
>     [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>     [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>     [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>     [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>
> If I add
>
>     pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
>
> at the head of md_check_recovery, I get:
>
>     [  789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>     [  789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>     [  789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>     [  789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>     [  789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>
> More debug lines in md_check_recovery confirm the control flow ( `if (mddev_trylock(mddev))` block not taken )
>
> What approach would you suggest to fix this?

I naively tried the following patch and it seems to fix the problem. The test procedure didn't trigger the deadlock in 10 hours.

D.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2d21c298ffa7..f40429843906 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
mddev_lock(mddev) == 0) {
+ set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
flush_workqueue(md_misc_wq);
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
md_reap_sync_thread(mddev);
}
+ clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
mddev_unlock(mddev);
}
} else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
--
2.30.0
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
Hi Donald,

On 1/25/21 22:32, Donald Buczek wrote:
>
>
> On 25.01.21 09:54, Donald Buczek wrote:
>> Dear Guoqing,
>>
>> a colleague of mine was able to produce the issue inside a vm and were
>> able to find a procedure to run the vm into the issue within minutes
>> (not unreliably after hours on a physical system as before). This of
>> course helped to pinpoint the problem.
>>
>> My current theory of what is happening is:
>>
>> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by
>> md_write_start() when file-system I/O wants to do a write and the
>> array transitions from "clean" to "active".
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
>>
>> - Before raid5d gets to write the superblock (its busy processing
>> active stripes because of the sync activity) , userspace wants to
>> pause the check by `echo idle > /sys/block/mdX/md/sync_action`
>>
>> - action_store() takes the reconfig_mutex before trying to stop the
>> sync thread.
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) Dump
>> of struct mddev of email 1/19/21 confirms reconf_mutex non-zero.
>>
>> - raid5d is running in its main loop.
>> raid5d()->handle_active_stripes() returns a positive batch size (
>> https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329
>> ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't
>> process any stripe because of MD_SB_CHANGE_PENDING.
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729
>> ). This is the reason, raid5d is busy looping.
>>
>> - raid5d()->md_check_recovery() is called by the raid5d main loop. One
>> of its duties is to write the superblock, if a change is pending.
>> However to do so, it needs either MD_ALLOW_SB_UPDATE or must be able
>> to take the reconfig_mutex.
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967
>> ,
>> https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006)
>> Both is not true, so the superblock is not written and
>> MD_SB_CHANGE_PENDING is not cleared.
>>
>> - (as discussed previously) the sync thread is waiting for the number
>> of active stripes to go down and doesn't terminate. The userspace
>> thread is waiting for the sync thread to terminate.
>>
>> Does this make sense?

Yes, exactly! That was my thought too, the scenario is twisted.

So the resync thread is blocked because there are too many active stripes;
the active stripes can't be decreased because raid5d is in a busy loop with
SB_CHANGE_PENDING set; and the `echo idle` command can't make progress
because the resync thread is blocked, while the command still holds
reconfig_mutex, which keeps raid5d in the busy loop and prevents it from
clearing the SB_CHANGE_PENDING flag.

And I think raid5 could suffer from the same issue.

>>
>> Just for reference, I add the procedure which triggers the issue on
>> the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>>
>> ```
>> #! /bin/bash
>>
>> (
>>          while true; do
>>                  echo "start check"
>>                  echo check > /sys/block/md3/md/sync_action
>>                  sleep 10
>>                  echo "stop check"
>>                  echo idle > /sys/block/md3/md/sync_action
>>                  sleep 2
>>          done
>> ) &
>>
>> (
>>          while true; do
>>                  dd bs=1k count=$((5*1024*1024)) if=/dev/zero
>> of=/mnt/raid_md3/bigfile status=none
>>                  sync /mnt/raid_md3/bigfile
>>                  rm /mnt/raid_md3/bigfile
>>                  sleep .1
>>          done
>> ) &
>>
>> start="$(date +%s)"
>> cd /sys/block/md3/md
>> wp_count=0
>> while true; do
>>          array_state=$(cat array_state)
>>          if [ "$array_state" = write-pending ]; then
>>                  wp_count=$(($wp_count+1))
>>          else
>>                  wp_count=0
>>          fi
>>          echo $(($(date +%s)-$start)) $(cat sync_action) $(cat
>> sync_completed) $array_state $(cat stripe_cache_active)
>>          if [ $wp_count -ge 3 ]; then
>>                  kill -- -$$
>>                  exit
>>          fi
>>          sleep 1
>> done
>> ```
>>
>> The time, this needs to trigger the issue, varies from under a minute
>> to one hour with 5 minute being typical. The output ends like this:
>>
>>      309 check 6283872 / 8378368 active-idle 4144
>>      310 check 6283872 / 8378368 active 1702
>>      311 check 6807528 / 8378368 active 4152
>>      312 check 7331184 / 8378368 clean 3021
>>      stop check
>>      313 check 7331184 / 8378368 write-pending 3905
>>      314 check 7331184 / 8378368 write-pending 3905
>>      315 check 7331184 / 8378368 write-pending 3905
>>      Terminated
>>
>> If I add
>>
>>      pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n",
>> batch_size, released, mddev->sb_flags);
>>
>> in raid5d after the call to handle_active_stripes and enable the debug
>> location after the deadlock occurred, I get
>>
>>      [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0
>> mdddev->sb_flags 6
>>      [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0
>> mdddev->sb_flags 6
>>      [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0
>> mdddev->sb_flags 6
>>      [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0
>> mdddev->sb_flags 6
>>
>> If I add
>>
>>      pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags
>> %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
>>
>> at the head of md_check_recovery, I get:
>>
>>      [  789.555462] [1191] md_check_recovery:8970: XXX 1
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555477] [1191] md_check_recovery:8970: XXX 1
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555491] [1191] md_check_recovery:8970: XXX 1
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555505] [1191] md_check_recovery:8970: XXX 1
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555520] [1191] md_check_recovery:8970: XXX 1
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>
>> More debug lines in md_check_recovery confirm the control flow ( `if
>> (mddev_trylock(mddev))` block not taken )
>>

That is great that you have a reproducer now!

>> What approach would you suggest to fix this?
>
> I naively tried the following patch and it seems to fix the problem. The
> test procedure didn't trigger the deadlock in 10 hours.
>
> D.
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 2d21c298ffa7..f40429843906 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char
> *page, size_t len)
>              clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>          if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>              mddev_lock(mddev) == 0) {
> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>              flush_workqueue(md_misc_wq);
>              if (mddev->sync_thread) {
>                  set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>                  md_reap_sync_thread(mddev);
>              }
> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>              mddev_unlock(mddev);
>          }
>      } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))

Yes, it could break the deadlock, but I am not sure it is the right way:
we only set ALLOW_SB_UPDATE in suspend, which makes sense since the I/O
will be quiesced, but the "write idle" action can't guarantee anything
similar. I prefer to make the resync thread not wait forever here:

    wait_event_lock_irq(conf->wait_for_stripe,
                        !list_empty(conf->inactive_list + hash) &&
                        (atomic_read(&conf->active_stripes)
                         < (conf->max_nr_stripes * 3 / 4)
                         || !test_bit(R5_INACTIVE_BLOCKED,
                                      &conf->cache_state)),
                        *(conf->hash_locks + hash));

So, could you please try the attached?

Thanks,
Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
Dear Guoqing,

On 26.01.21 01:44, Guoqing Jiang wrote:
> Hi Donald,
>
> On 1/25/21 22:32, Donald Buczek wrote:
>>
>>
>> On 25.01.21 09:54, Donald Buczek wrote:
>>> Dear Guoqing,
>>>
>>> a colleague of mine was able to produce the issue inside a vm and were able to find a procedure to run the vm into the issue within minutes (not unreliably after hours on a physical system as before). This of course helped to pinpoint the problem.
>>>
>>> My current theory of what is happening is:
>>>
>>> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
>>>
>>> - Before raid5d gets to write the superblock (its busy processing active stripes because of the sync activity) , userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action`
>>>
>>> - action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) Dump of struct mddev of email 1/19/21 confirms reconf_mutex non-zero.
>>>
>>> - raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size ( https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729 ). This is the reason, raid5d is busy looping.
>>>
>>> - raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock, if a change is pending. However to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Both is not true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.
>>>
>>> - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.
>>>
>>> Does this make sense?
>
> Yes, exactly! That was my thought too, the scenario is twisted.
>
> Then resync thread is blocked due to there are too many active stripes, because raid5d is in busy loop since SB_CHANGE_PENDING is set which means tactive stripes can't be decreased, and echo idle cmd can't make progress given resync thread is blocked while the cmd still hold reconfig_mutex which make raid5d in busy loop and can't clear SB_CHANGE_PENDING flag.
>
> And raid5 could suffer from the same issue I think.
>
>>>
>>> Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>>>
>>> ```
>>> #! /bin/bash
>>>
>>> (
>>>          while true; do
>>>                  echo "start check"
>>>                  echo check > /sys/block/md3/md/sync_action
>>>                  sleep 10
>>>                  echo "stop check"
>>>                  echo idle > /sys/block/md3/md/sync_action
>>>                  sleep 2
>>>          done
>>> ) &
>>>
>>> (
>>>          while true; do
>>>                  dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
>>>                  sync /mnt/raid_md3/bigfile
>>>                  rm /mnt/raid_md3/bigfile
>>>                  sleep .1
>>>          done
>>> ) &
>>>
>>> start="$(date +%s)"
>>> cd /sys/block/md3/md
>>> wp_count=0
>>> while true; do
>>>          array_state=$(cat array_state)
>>>          if [ "$array_state" = write-pending ]; then
>>>                  wp_count=$(($wp_count+1))
>>>          else
>>>                  wp_count=0
>>>          fi
>>>          echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
>>>          if [ $wp_count -ge 3 ]; then
>>>                  kill -- -$$
>>>                  exit
>>>          fi
>>>          sleep 1
>>> done
>>> ```
>>>
>>> The time, this needs to trigger the issue, varies from under a minute to one hour with 5 minute being typical. The output ends like this:
>>>
>>>      309 check 6283872 / 8378368 active-idle 4144
>>>      310 check 6283872 / 8378368 active 1702
>>>      311 check 6807528 / 8378368 active 4152
>>>      312 check 7331184 / 8378368 clean 3021
>>>      stop check
>>>      313 check 7331184 / 8378368 write-pending 3905
>>>      314 check 7331184 / 8378368 write-pending 3905
>>>      315 check 7331184 / 8378368 write-pending 3905
>>>      Terminated
>>>
>>> If I add
>>>
>>>      pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);
>>>
>>> in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get
>>>
>>>      [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>      [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>      [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>      [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>
>>> If I add
>>>
>>>      pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
>>>
>>> at the head of md_check_recovery, I get:
>>>
>>>      [  789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>
>>> More debug lines in md_check_recovery confirm the control flow ( `if (mddev_trylock(mddev))` block not taken )
>>>
>
> That is great that you have a reproducer now!
>
>>> What approach would you suggest to fix this?
>>
>> I naively tried the following patch and it seems to fix the problem. The test procedure didn't trigger the deadlock in 10 hours.
>>
>> D.
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 2d21c298ffa7..f40429843906 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>>               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>           if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>               mddev_lock(mddev) == 0) {
>> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>               flush_workqueue(md_misc_wq);
>>               if (mddev->sync_thread) {
>>                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>                   md_reap_sync_thread(mddev);
>>               }
>> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>               mddev_unlock(mddev);
>>           }
>>       } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>
> Yes, it could break the deadlock issue, but I am not sure if it is the right way given we only set ALLOW_SB_UPDATE in suspend which makes sense since the io will be quiesced, but write idle action can't guarantee the  similar thing.

Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of reconfig_mutex promises not to make any changes which would exclude superblocks from being written" might make it easier to accept the usage.

> I prefer to make resync thread not wait forever here.
>
> wait_event_lock_irq(
>     conf->wait_for_stripe,
>     !list_empty(conf->inactive_list + hash) &&
>     (atomic_read(&conf->active_stripes)
>     < (conf->max_nr_stripes * 3 / 4)
>     || !test_bit(R5_INACTIVE_BLOCKED,
>              &conf->cache_state)
>     *(conf->hash_locks + hash));
>
> So, could you please try the attached?
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 4337ae0..378ce5c 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -1931,7 +1931,7 @@ r5c_recovery_alloc_stripe(
> {
> struct stripe_head *sh;
>
> - sh = raid5_get_active_stripe(conf, stripe_sect, 0, noblock, 0);
> + sh = raid5_get_active_stripe(conf, stripe_sect, 0, 0, noblock, 0);
> if (!sh)
> return NULL; /* no more stripe available */
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 8888973..33a2a22 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -700,10 +700,11 @@ static int has_failed(struct r5conf *conf)
> }
>
> struct stripe_head *
> -raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
> +raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req,
> int previous, int noblock, int noquiesce)
> {
> struct stripe_head *sh;
> + struct mddev *mddev = conf->mddev;
> int hash = stripe_hash_locks_hash(conf, sector);
> int inc_empty_inactive_list_flag;
>
> @@ -738,8 +739,14 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
> (atomic_read(&conf->active_stripes)
> < (conf->max_nr_stripes * 3 / 4)
> || !test_bit(R5_INACTIVE_BLOCKED,
> - &conf->cache_state)),
> + &conf->cache_state)
> + || (test_bit(MD_RECOVERY_INTR,
> + &mddev->recovery)
> + && sync_req)),
> *(conf->hash_locks + hash));
> + if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)
> + && sync_req)
> + break;
> clear_bit(R5_INACTIVE_BLOCKED,
> &conf->cache_state);
> } else {
> @@ -4527,7 +4534,7 @@ static void handle_stripe_expansion(struct r5conf *conf, struct stripe_head *sh)
> sector_t bn = raid5_compute_blocknr(sh, i, 1);
> sector_t s = raid5_compute_sector(conf, bn, 0,
> &dd_idx, NULL);
> - sh2 = raid5_get_active_stripe(conf, s, 0, 1, 1);
> + sh2 = raid5_get_active_stripe(conf, s, 0, 0, 1, 1);
> if (sh2 == NULL)
> /* so far only the early blocks of this stripe
> * have been requested. When later blocks
> @@ -5164,7 +5171,7 @@ static void handle_stripe(struct stripe_head *sh)
> /* Finish reconstruct operations initiated by the expansion process */
> if (sh->reconstruct_state == reconstruct_state_result) {
> struct stripe_head *sh_src
> - = raid5_get_active_stripe(conf, sh->sector, 1, 1, 1);
> + = raid5_get_active_stripe(conf, sh->sector, 0, 1, 1, 1);
> if (sh_src && test_bit(STRIPE_EXPAND_SOURCE, &sh_src->state)) {
> /* sh cannot be written until sh_src has been read.
> * so arrange for sh to be delayed a little
> @@ -5705,7 +5712,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
> DEFINE_WAIT(w);
> int d;
> again:
> - sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0);
> + sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0, 0);
> prepare_to_wait(&conf->wait_for_overlap, &w,
> TASK_UNINTERRUPTIBLE);
> set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags);
> @@ -5861,7 +5868,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
> (unsigned long long)new_sector,
> (unsigned long long)logical_sector);
>
> - sh = raid5_get_active_stripe(conf, new_sector, previous,
> + sh = raid5_get_active_stripe(conf, new_sector, previous, 0,


Mistake here (fourth argument added instead of third)

> (bi->bi_opf & REQ_RAHEAD), 0);
> if (sh) {
> if (unlikely(previous)) {
> @@ -6100,7 +6107,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
> for (i = 0; i < reshape_sectors; i += RAID5_STRIPE_SECTORS(conf)) {
> int j;
> int skipped_disk = 0;
> - sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 1);
> + sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 0, 1);
> set_bit(STRIPE_EXPANDING, &sh->state);
> atomic_inc(&conf->reshape_stripes);
> /* If any of this stripe is beyond the end of the old
> @@ -6149,7 +6156,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
> if (last_sector >= mddev->dev_sectors)
> last_sector = mddev->dev_sectors - 1;
> while (first_sector <= last_sector) {
> - sh = raid5_get_active_stripe(conf, first_sector, 1, 0, 1);
> + sh = raid5_get_active_stripe(conf, first_sector, 0, 1, 0, 1);
> set_bit(STRIPE_EXPAND_SOURCE, &sh->state);
> set_bit(STRIPE_HANDLE, &sh->state);
> raid5_release_stripe(sh);
> @@ -6269,9 +6276,14 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
>
> md_bitmap_cond_end_sync(mddev->bitmap, sector_nr, false);
>
> - sh = raid5_get_active_stripe(conf, sector_nr, 0, 1, 0);
> + sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 1, 0);
> if (sh == NULL) {
> - sh = raid5_get_active_stripe(conf, sector_nr, 0, 0, 0);
> + sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 0, 0);
> + if (!sh && test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
> + *skipped = 1;
> + return 0;
> + }
> +
> /* make sure we don't swamp the stripe cache if someone else
> * is trying to get access
> */
> @@ -6334,7 +6346,7 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
> /* already done this stripe */
> continue;
>
> - sh = raid5_get_active_stripe(conf, sector, 0, 1, 1);
> + sh = raid5_get_active_stripe(conf, sector, 0, 0, 1, 1);
>
> if (!sh) {
> /* failed to get a stripe - must wait */
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 5c05acf..d9eab46 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -806,7 +806,7 @@ extern sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
> int previous, int *dd_idx,
> struct stripe_head *sh);
> extern struct stripe_head *
> -raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
> +raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req,
> int previous, int noblock, int noquiesce);
> extern int raid5_calc_degraded(struct r5conf *conf);
> extern int r5c_journal_mode_set(struct mddev *mddev, int journal_mode);

Unfortunately, this patch did not fix the issue.

root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack
[<0>] raid5_get_active_stripe+0x1e7/0x6b0
[<0>] raid5_sync_request+0x2a7/0x3d0
[<0>] md_do_sync.cold+0x3ee/0x97c
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

which is ( according to objdump -d -Sl drivers/md/raid5.o ) at https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735

Isn't it still the case that the superblocks are not written, therefore stripes are not processed, and therefore the number of active stripes is not decreasing? Who is expected to wake up the conf->wait_for_stripe waiters?
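
(For completeness, the userspace-visible symptoms are unchanged while it hangs; a quick way to re-check them:)

```sh
cat /sys/block/md3/md/array_state          # stays "write-pending"
cat /sys/block/md3/md/stripe_cache_active  # stays high and does not drop
top -b -n 1 | grep md3_raid6               # md3_raid6 at ~100% CPU
```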

Best
Donald

>
> Thanks,
> Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
Hi Donald,

On 1/26/21 10:50, Donald Buczek wrote:
[...]

>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index 2d21c298ffa7..f40429843906 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char
>>> *page, size_t len)
>>>               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>>           if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>>               mddev_lock(mddev) == 0) {
>>> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>               flush_workqueue(md_misc_wq);
>>>               if (mddev->sync_thread) {
>>>                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>>                   md_reap_sync_thread(mddev);
>>>               }
>>> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>               mddev_unlock(mddev);
>>>           }
>>>       } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>>
>> Yes, it could break the deadlock issue, but I am not sure if it is the
>> right way given we only set ALLOW_SB_UPDATE in suspend which makes
>> sense since the io will be quiesced, but write idle action can't
>> guarantee the  similar thing.
>
> Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of
> reconfig_mutex promises not to make any changes which would exclude
> superblocks from being written" might make it easier to accept the usage.

I am not sure it is safe to set the flag here, since "write idle" can't
prevent I/O from the filesystem, while mddev_suspend can guarantee that.

>
>> I prefer to make resync thread not wait forever here.
>>

[...]

>>
>> -        sh = raid5_get_active_stripe(conf, new_sector, previous,
>> +        sh = raid5_get_active_stripe(conf, new_sector, previous, 0,
>
>
> Mistake here (fourth argument added instead of third)

Thanks for checking.

[...]

> Unfortunately, this patch did not fix the issue.
>
>     root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack
>     [<0>] raid5_get_active_stripe+0x1e7/0x6b0
>     [<0>] raid5_sync_request+0x2a7/0x3d0
>     [<0>] md_do_sync.cold+0x3ee/0x97c
>     [<0>] md_thread+0xab/0x160
>     [<0>] kthread+0x11b/0x140
>     [<0>] ret_from_fork+0x22/0x30
>
> which is ( according to objdump -d -Sl drivers/md/raid5.o ) at
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
>
> Isn't it still the case that the superblocks are not written, therefore
> stripes are not processed, therefore number of active stripes are not
> decreasing? Who is expected to wake up conf->wait_for_stripe waiters?

Hmm, how about waking the waiter up in the while loop of raid5d?

@@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
                         md_check_recovery(mddev);
                         spin_lock_irq(&conf->device_lock);
                 }
+
+               if ((atomic_read(&conf->active_stripes)
+                    < (conf->max_nr_stripes * 3 / 4) ||
+                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
+                       wake_up(&conf->wait_for_stripe);
         }
         pr_debug("%d stripes handled\n", handled);

If the issue still appears, then I will change the waiter to break out
as soon as MD_RECOVERY_INTR is set, something like:

    wait_event_lock_irq(conf->wait_for_stripe,
                        (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
                        /* the previous condition */,
                        *(conf->hash_locks + hash));

Thanks,
Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
On 26.01.21 12:14, Guoqing Jiang wrote:
> Hi Donald,
>
> On 1/26/21 10:50, Donald Buczek wrote:
> [...]
>
>>>>
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index 2d21c298ffa7..f40429843906 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>>>>               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>>>           if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>>>               mddev_lock(mddev) == 0) {
>>>> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>>               flush_workqueue(md_misc_wq);
>>>>               if (mddev->sync_thread) {
>>>>                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>>>                   md_reap_sync_thread(mddev);
>>>>               }
>>>> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>>               mddev_unlock(mddev);
>>>>           }
>>>>       } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>>>
>>> Yes, it could break the deadlock issue, but I am not sure if it is the right way given we only set ALLOW_SB_UPDATE in suspend which makes sense since the io will be quiesced, but write idle action can't guarantee the  similar thing.
>>
>> Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of reconfig_mutex promises not to make any changes which would exclude superblocks from being written" might make it easier to accept the usage.
>
> I am not sure it is safe to set the flag here since write idle can't prevent IO from fs while mddev_suspend can guarantee that.
>
>>
>>> I prefer to make resync thread not wait forever here.
>>>
>
> [...]
>
>>>
>>> -        sh = raid5_get_active_stripe(conf, new_sector, previous,
>>> +        sh = raid5_get_active_stripe(conf, new_sector, previous, 0,
>>
>>
>> Mistake here (fourth argument added instead of third)
>
> Thanks for checking.
>
> [...]
>
>> Unfortunately, this patch did not fix the issue.
>>
>>      root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack
>>      [<0>] raid5_get_active_stripe+0x1e7/0x6b0
>>      [<0>] raid5_sync_request+0x2a7/0x3d0
>>      [<0>] md_do_sync.cold+0x3ee/0x97c
>>      [<0>] md_thread+0xab/0x160
>>      [<0>] kthread+0x11b/0x140
>>      [<0>] ret_from_fork+0x22/0x30
>>
>> which is ( according to objdump -d -Sl drivers/md/raid5.o ) at https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
>>
>> Isn't it still the case that the superblocks are not written, therefore stripes are not processed, therefore number of active stripes are not decreasing? Who is expected to wake up conf->wait_for_stripe waiters?
>
> Hmm, how about wake the waiter up in the while loop of raid5d?
>
> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>                         md_check_recovery(mddev);
>                         spin_lock_irq(&conf->device_lock);
>                 }
> +
> +               if ((atomic_read(&conf->active_stripes)
> +                    < (conf->max_nr_stripes * 3 / 4) ||
> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
> +                       wake_up(&conf->wait_for_stripe);
>         }
>         pr_debug("%d stripes handled\n", handled);

Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at

root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
[<0>] md_do_sync.cold+0x8ec/0x97c
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963

And, unlike before, "md: md3: data-check interrupted." from the pr_info two lines above appears in dmesg.

Best
Donald

> If the issue still appears then I will change the waiter to break just if MD_RECOVERY_INTR is set, something like.
>
> wait_event_lock_irq(conf->wait_for_stripe,
>     (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
>      /* the previous condition */,
>     *(conf->hash_locks + hash));
>
> Thanks,
> Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
On 1/26/21 13:58, Donald Buczek wrote:
>
>
>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>
>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>                          md_check_recovery(mddev);
>>                          spin_lock_irq(&conf->device_lock);
>>                  }
>> +
>> +               if ((atomic_read(&conf->active_stripes)
>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>> +                       wake_up(&conf->wait_for_stripe);
>>          }
>>          pr_debug("%d stripes handled\n", handled);
>
> Hmm... With this patch on top of your other one, we still have the basic
> symptoms (md3_raid6 busy looping), but the sync thread is now hanging at
>
>     root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>     [<0>] md_do_sync.cold+0x8ec/0x97c
>     [<0>] md_thread+0xab/0x160
>     [<0>] kthread+0x11b/0x140
>     [<0>] ret_from_fork+0x22/0x30
>
> instead, which is
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963

I am not sure why recovery_active is not zero, because it is set to 0
before blk_start_plug, raid5_sync_request returns 0, and skipped is also
set to 1. Perhaps handle_stripe calls md_done_sync.

Could you double-check the value of recovery_active? Or just don't wait
if the resync thread is interrupted:

    wait_event(mddev->recovery_wait,
               test_bit(MD_RECOVERY_INTR, &mddev->recovery) ||
               !atomic_read(&mddev->recovery_active));

> And, unlike before, "md: md3: data-check interrupted." from the pr_info
> two lines above appears in dmesg.

Yes, that is intentional since MD_RECOVERY_INTR is set by write idle.

Anyway, I will try the script and investigate the issue further.

Thanks,
Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
Dear Guoqing,

On 26.01.21 15:06, Guoqing Jiang wrote:
>
>
> On 1/26/21 13:58, Donald Buczek wrote:
>>
>>
>>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>>
>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>>                          md_check_recovery(mddev);
>>>                          spin_lock_irq(&conf->device_lock);
>>>                  }
>>> +
>>> +               if ((atomic_read(&conf->active_stripes)
>>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>>> +                       wake_up(&conf->wait_for_stripe);
>>>          }
>>>          pr_debug("%d stripes handled\n", handled);
>>
>> Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at
>>
>>      root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>      [<0>] md_do_sync.cold+0x8ec/0x97c
>>      [<0>] md_thread+0xab/0x160
>>      [<0>] kthread+0x11b/0x140
>>      [<0>] ret_from_fork+0x22/0x30
>>
>> instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963
>
> Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync.
>
> Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted.
>
> wait_event(mddev->recovery_wait,
>        test_bit(MD_RECOVERY_INTR,&mddev->recovery) ||
>        !atomic_read(&mddev->recovery_active));

With that added, md3_resync goes into a loop, too. Not 100% busy, though.

root@sloth:~# cat /proc/$(pgrep md3_resync)/stack

[<0>] raid5_get_active_stripe+0x1e7/0x6b0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
[<0>] raid5_sync_request+0x2a7/0x3d0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
[<0>] md_do_sync.cold+0x3ee/0x97c # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

Sometimes the top of the stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, is woken, but the conditions don't match, so it sleeps again.
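
For reference, the wait it presumably sits in looks roughly like this (paraphrased from raid5_get_active_stripe() in v5.11-rc5 drivers/md/raid5.c, around the line linked above; I may be off on details):

	wait_event_lock_irq(conf->wait_for_stripe,
			    !list_empty(conf->inactive_list + hash) &&
			    (atomic_read(&conf->active_stripes)
			     < (conf->max_nr_stripes * 3 / 4)
			     || !test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)),
			    *(conf->hash_locks + hash));

If raid5d doesn't release stripes, the inactive list stays empty and active_stripes stays high, so the condition never becomes true.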

Best
Donald

>
>> And, unlike before, "md: md3: data-check interrupted." from the pr_info two lines above appears in dmesg.
>
> Yes, that is intentional since MD_RECOVERY_INTR is set by write idle.
>
> Anyway, will try the script and investigate more about the issue.
>
> Thanks,
> Guoqing

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
Dear Guoqing,

a colleague of mine was able to reproduce the issue inside a VM and found a procedure to run the VM into the issue within minutes (rather than unreliably after hours on a physical system, as before). This, of course, helped to pinpoint the problem.

My current theory of what is happening is:

- MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)

- Before raid5d gets to write the superblock (it's busy processing active stripes because of the sync activity), userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action`

- action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) The dump of struct mddev from the email of 1/19/21 confirms that reconfig_mutex is non-zero.

- raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729, see the sketch below). This is the reason raid5d is busy looping.

- raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock if a change is pending. However, to do so, it needs either MD_ALLOW_SB_UPDATE set or must be able to take the reconfig_mutex (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967, https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006). Neither is true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.

- (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.
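
For reference, the check in handle_stripe() that skips the stripe reads roughly like this (paraphrased from v5.4.57 drivers/md/raid5.c, the line linked above):

	if (s.handle_bad_blocks ||
	    test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
		set_bit(STRIPE_HANDLE, &sh->state);
		goto finish;
	}

As long as MD_SB_CHANGE_PENDING stays set, each stripe is only flagged STRIPE_HANDLE again, which would explain why handle_active_stripes() keeps returning a positive batch size without making progress.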

Does this make sense?

Just for reference, I'm adding the procedure that triggers the issue on the VM (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:

```
#! /bin/bash

# toggle check/idle on md3 every few seconds
(
    while true; do
        echo "start check"
        echo check > /sys/block/md3/md/sync_action
        sleep 10
        echo "stop check"
        echo idle > /sys/block/md3/md/sync_action
        sleep 2
    done
) &

# generate continuous write I/O on the filesystem on top of md3
(
    while true; do
        dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
        sync /mnt/raid_md3/bigfile
        rm /mnt/raid_md3/bigfile
        sleep .1
    done
) &

# monitor: log the array state once per second; if it stays in
# write-pending for three consecutive samples, assume the deadlock
# hit and kill the whole process group
start="$(date +%s)"
cd /sys/block/md3/md
wp_count=0
while true; do
    array_state=$(cat array_state)
    if [ "$array_state" = write-pending ]; then
        wp_count=$(($wp_count+1))
    else
        wp_count=0
    fi
    echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
    if [ $wp_count -ge 3 ]; then
        kill -- -$$
        exit
    fi
    sleep 1
done
```

The time this needs to trigger the issue varies from under a minute to one hour, with 5 minutes being typical. The output ends like this:

309 check 6283872 / 8378368 active-idle 4144
310 check 6283872 / 8378368 active 1702
311 check 6807528 / 8378368 active 4152
312 check 7331184 / 8378368 clean 3021
stop check
313 check 7331184 / 8378368 write-pending 3905
314 check 7331184 / 8378368 write-pending 3905
315 check 7331184 / 8378368 write-pending 3905
Terminated

If I add

pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);

in raid5d after the call to handle_active_stripes and enable this debug location after the deadlock has occurred, I get:

[ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6

If I add

pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);

at the head of md_check_recovery, I get:

[ 789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
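
For reference, sb_flags 0x6 decodes to MD_SB_CHANGE_CLEAN | MD_SB_CHANGE_PENDING (bits 1 and 2), if I read enum mddev_sb_flags in drivers/md/md.h correctly (paraphrased, bit numbers are my annotation):

	enum mddev_sb_flags {
		MD_SB_CHANGE_DEVS,	/* bit 0 */
		MD_SB_CHANGE_CLEAN,	/* bit 1 */
		MD_SB_CHANGE_PENDING,	/* bit 2 */
		MD_SB_NEED_REWRITE,
	};

which matches the theory above.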

More debug lines in md_check_recovery confirm the control flow (the `if (mddev_trylock(mddev))` block is not taken).

What approach would you suggest to fix this?

Best
Donald
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
Hi Donald,

On 1/26/21 17:05, Donald Buczek wrote:
> Dear Guoqing,
>
> On 26.01.21 15:06, Guoqing Jiang wrote:
>>
>>
>> On 1/26/21 13:58, Donald Buczek wrote:
>>>
>>>
>>>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>>>
>>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>>>                          md_check_recovery(mddev);
>>>>                          spin_lock_irq(&conf->device_lock);
>>>>                  }
>>>> +
>>>> +               if ((atomic_read(&conf->active_stripes)
>>>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>>>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>>>> +                       wake_up(&conf->wait_for_stripe);
>>>>          }
>>>>          pr_debug("%d stripes handled\n", handled);
>>>
>>> Hmm... With this patch on top of your other one, we still have the
>>> basic symptoms (md3_raid6 busy looping), but the sync thread is now
>>> hanging at
>>>
>>>      root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>>      [<0>] md_do_sync.cold+0x8ec/0x97c
>>>      [<0>] md_thread+0xab/0x160
>>>      [<0>] kthread+0x11b/0x140
>>>      [<0>] ret_from_fork+0x22/0x30
>>>
>>> instead, which is
>>> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963
>>
>> Not sure why recovery_active is not zero, because it is set 0 before
>> blk_start_plug, and raid5_sync_request returns 0 and skipped is also
>> set to 1. Perhaps handle_stripe calls md_done_sync.
>>
>> Could you double check the value of recovery_active? Or just don't
>> wait if resync thread is interrupted.
>>
>> wait_event(mddev->recovery_wait,
>>         test_bit(MD_RECOVERY_INTR,&mddev->recovery) ||
>>         !atomic_read(&mddev->recovery_active));
>
> With that added, md3_resync goes into a loop, too. Not 100% busy, though.
>
> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>
> [<0>] raid5_get_active_stripe+0x1e7/0x6b0  #
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
> [<0>] raid5_sync_request+0x2a7/0x3d0       #
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
> [<0>] md_do_sync.cold+0x3ee/0x97c          #
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
> [<0>] md_thread+0xab/0x160
> [<0>] kthread+0x11b/0x140
> [<0>] ret_from_fork+0x22/0x30
>
> Sometimes top of stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of
> raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, its woken,
> but the conditions don't match so its sleeps again.

I don't know why the condition was not true after the change since the
RECOVERY_INTR is set and the caller is raid5_sync_request.

wait_event_lock_irq(conf->wait_for_stripe,
(test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
/* the previous condition */,
*(conf->hash_locks + hash));


BTW, I think there are some possible ways:

1. let "echo idle" give up the reconfig_mutex if there are limited
number of active stripe, but I feel it is ugly to check sh number from
action_store (kind of layer violation).

2. make raid5_sync_request -> raid5_get_active_stripe able to quit from the
current situation (this was what we tried, though it doesn't work so far).

3. set MD_ALLOW_SB_UPDATE as you said, though I am not sure about the safety
(but maybe I am wrong).

4. given that the write IO keeps coming from the upper layer, which decreases
the available stripes, maybe we need to call grow_one_stripe at the
beginning of raid5_make_request for this case, then call drop_one_stripe
at the end of make_request (a rough sketch follows after this list).

5. maybe don't hold the reconfig_mutex when trying to unregister sync_thread,
like this.

/* resync has finished, collect result */
mddev_unlock(mddev);
md_unregister_thread(&mddev->sync_thread);
mddev_lock(mddev);
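
A very rough, untested sketch of idea 4 (the __raid5_make_request helper is hypothetical and only stands for the existing body of raid5_make_request; stripe cache locking is ignored here):

	static bool raid5_make_request(struct mddev *mddev, struct bio *bi)
	{
		struct r5conf *conf = mddev->private;
		bool ret;

		/* best effort: add one spare stripe for the duration of this
		 * request so write IO can't eat all stripes needed by resync */
		grow_one_stripe(conf, GFP_NOIO);
		ret = __raid5_make_request(mddev, bi);	/* hypothetical: the existing body */
		drop_one_stripe(conf);
		return ret;
	}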



My suggestion would be to try 2 + 4 together, since the reproducer triggers
both sync IO and write IO. Or try 5. Perhaps there is a better way which I
just can't find.

Thanks,
Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
On 02.02.21 16:42, Guoqing Jiang wrote:
> Hi Donald,
>
> On 1/26/21 17:05, Donald Buczek wrote:
>> Dear Guoqing,
>>
>> On 26.01.21 15:06, Guoqing Jiang wrote:
>>>
>>>
>>> On 1/26/21 13:58, Donald Buczek wrote:
>>>>
>>>>
>>>>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>>>>
>>>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>>>>                          md_check_recovery(mddev);
>>>>>                          spin_lock_irq(&conf->device_lock);
>>>>>                  }
>>>>> +
>>>>> +               if ((atomic_read(&conf->active_stripes)
>>>>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>>>>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>>>>> +                       wake_up(&conf->wait_for_stripe);
>>>>>          }
>>>>>          pr_debug("%d stripes handled\n", handled);
>>>>
>>>> Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at
>>>>
>>>>      root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>>>      [<0>] md_do_sync.cold+0x8ec/0x97c
>>>>      [<0>] md_thread+0xab/0x160
>>>>      [<0>] kthread+0x11b/0x140
>>>>      [<0>] ret_from_fork+0x22/0x30
>>>>
>>>> instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963
>>>
>>> Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync.
>>>
>>> Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted.
>>>
>>> wait_event(mddev->recovery_wait,
>>>         test_bit(MD_RECOVERY_INTR,&mddev->recovery) ||
>>>         !atomic_read(&mddev->recovery_active));
>>
>> With that added, md3_resync goes into a loop, too. Not 100% busy, though.
>>
>> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>
>> [<0>] raid5_get_active_stripe+0x1e7/0x6b0  # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
>> [<0>] raid5_sync_request+0x2a7/0x3d0       # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
>> [<0>] md_do_sync.cold+0x3ee/0x97c          # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
>> [<0>] md_thread+0xab/0x160
>> [<0>] kthread+0x11b/0x140
>> [<0>] ret_from_fork+0x22/0x30
>>
>> Sometimes top of stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, its woken, but the conditions don't match so its sleeps again.
>
> I don't know why the condition was not true after the change since the RECOVERY_INTR is set and the caller is raid5_sync_request.
>
> wait_event_lock_irq(conf->wait_for_stripe,
>     (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
>      /* the previous condition */,
>     *(conf->hash_locks + hash));

Not knowing the system by heart, I'd probably need more than a full day to analyze that. Please let me know if this is needed.

> BTW, I think there some some possible ways:
>
> 1. let "echo idle" give up the reconfig_mutex if there are limited number of active stripe, but I feel it is ugly to check sh number from action_store (kind of layer violation).

My first thought is that either it is needed to hold the mutex or it is not. Why should this depend on the number of available stripes?

And wouldn't we get into the same situation if another reconfiguration from userspace takes the mutex again?

> 2. make raid5_sync_request -> raid5_get_active_stripe can quit from the current situation (this was we tried, though it doesn't work so far).
>
> 3. set MD_ALLOW_SB_UPDATE as you said though I am not sure the safety (but maybe I am wrong).
>
> 4. given the write IO keeps coming from upper layer which decrease the available stripes. Maybe we need to call grow_one_stripe at the beginning of raid5_make_request for this case, then call drop_one_stripe
> at the end of make_request.
>
> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread, like this.
>
>         /* resync has finished, collect result */
>         mddev_unlock(mddev);
>         md_unregister_thread(&mddev->sync_thread);
>         mddev_lock(mddev);

As above: While we wait for the sync thread to terminate, wouldn't it be a problem if another user space operation takes the mutex?

> My suggestion would be try 2 + 4 together since the reproducer triggers both sync io and write io. Or try 5. Perhaps there is better way which I just can't find.

I just get a bit suspicious that there might be something wrong with the abstraction if it takes special code here and here and here.

But I'm afraid that, aside from testing, I can't be of too much help here, as I don't know all the components or details of the system that well.

Best
Donald

> Thanks,
> Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
On 2/8/21 12:38, Donald Buczek wrote:
>> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread,
>> like this.
>>
>>          /* resync has finished, collect result */
>>          mddev_unlock(mddev);
>>          md_unregister_thread(&mddev->sync_thread);
>>          mddev_lock(mddev);
>
> As above: While we wait for the sync thread to terminate, wouldn't it be
> a problem, if another user space operation takes the mutex?

I don't think other places can be blocked while holding the mutex, otherwise
these places could cause a potential deadlock. Please try the above two-line
change. And perhaps others have a better idea.

Thanks,
Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
Dear Guoqing,

On 08.02.21 15:53, Guoqing Jiang wrote:
>
>
> On 2/8/21 12:38, Donald Buczek wrote:
>>> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread, like this.
>>>
>>>          /* resync has finished, collect result */
>>>          mddev_unlock(mddev);
>>>          md_unregister_thread(&mddev->sync_thread);
>>>          mddev_lock(mddev);
>>
>> As above: While we wait for the sync thread to terminate, wouldn't it be a problem, if another user space operation takes the mutex?
>
> I don't think other places can be blocked while hold mutex, otherwise these places can cause potential deadlock. Please try above two lines change. And perhaps others have better idea.

Yes, this works. No deadlock after >11000 seconds,

(Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )

Best
Donald
>
> Thanks,
> Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
Hi Donald,

On 2/8/21 19:41, Donald Buczek wrote:
> Dear Guoqing,
>
> On 08.02.21 15:53, Guoqing Jiang wrote:
>>
>>
>> On 2/8/21 12:38, Donald Buczek wrote:
>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>> sync_thread, like this.
>>>>
>>>>          /* resync has finished, collect result */
>>>>          mddev_unlock(mddev);
>>>>          md_unregister_thread(&mddev->sync_thread);
>>>>          mddev_lock(mddev);
>>>
>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>> be a problem, if another user space operation takes the mutex?
>>
>> I don't think other places can be blocked while hold mutex, otherwise
>> these places can cause potential deadlock. Please try above two lines
>> change. And perhaps others have better idea.
>
> Yes, this works. No deadlock after >11000 seconds,
>
> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )

Great. I will send a formal patch with your reported-by and tested-by.

Thanks,
Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
On 09.02.21 01:46, Guoqing Jiang wrote:

> Great. I will send a formal patch with your reported-by and tested-by.

Yes, that's fine.

Thanks a lot for your help!

Donald

>
> Thanks,
> Guoqing
Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition [ In reply to ]
On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
<guoqing.jiang@cloud.ionos.com> wrote:
>
> Hi Donald,
>
> On 2/8/21 19:41, Donald Buczek wrote:
> > Dear Guoqing,
> >
> > On 08.02.21 15:53, Guoqing Jiang wrote:
> >>
> >>
> >> On 2/8/21 12:38, Donald Buczek wrote:
> >>>> 5. maybe don't hold reconfig_mutex when try to unregister
> >>>> sync_thread, like this.
> >>>>
> >>>> /* resync has finished, collect result */
> >>>> mddev_unlock(mddev);
> >>>> md_unregister_thread(&mddev->sync_thread);
> >>>> mddev_lock(mddev);
> >>>
> >>> As above: While we wait for the sync thread to terminate, wouldn't it
> >>> be a problem, if another user space operation takes the mutex?
> >>
> >> I don't think other places can be blocked while hold mutex, otherwise
> >> these places can cause potential deadlock. Please try above two lines
> >> change. And perhaps others have better idea.
> >
> > Yes, this works. No deadlock after >11000 seconds,
> >
> > (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
> > 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>
> Great. I will send a formal patch with your reported-by and tested-by.
>
> Thanks,
> Guoqing

I'm still hitting this issue with Linux 5.4.229 -- it looks like only one of
the two patches that supposedly resolve this was applied to the stable
kernels; the other was omitted due to a regression:
md: don't unregister sync_thread with reconfig_mutex held (upstream
commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)

I don't see any follow-up on the thread from June 8th 2022 asking for
this patch to be dropped from all stable kernels since it caused a
regression.

The patch doesn't appear to be present in the current mainline kernel
(6.3-rc2) either. So I assume this issue is still present there, or it
was resolved differently and I just can't find the commit/patch.

I can induce the issue by using Donald's script above which will
eventually result in hangs:
...
[147948.504621] INFO: task md_test_2.sh:68033 blocked for more than 122 seconds.
[147948.504624] Tainted: P OE 5.4.229-esos.prod #1
[147948.504624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[147948.504625] md_test_2.sh D 0 68033 1 0x00000004
[147948.504627] Call Trace:
[147948.504634] __schedule+0x4ab/0x4f3
[147948.504637] ? usleep_range+0x7a/0x7a
[147948.504638] schedule+0x67/0x81
[147948.504639] schedule_timeout+0x2c/0xe5
[147948.504643] ? do_raw_spin_lock+0x2b/0x52
[147948.504644] __wait_for_common+0xc4/0x13a
[147948.504647] ? wake_up_q+0x40/0x40
[147948.504649] kthread_stop+0x9a/0x117
[147948.504653] md_unregister_thread+0x43/0x4d
[147948.504655] md_reap_sync_thread+0x1c/0x1d5
[147948.504657] action_store+0xc9/0x284
[147948.504658] md_attr_store+0x9f/0xb8
[147948.504661] kernfs_fop_write+0x10a/0x14c
[147948.504664] vfs_write+0xa0/0xdd
[147948.504666] ksys_write+0x71/0xba
[147948.504668] do_syscall_64+0x52/0x60
[147948.504671] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
...
[147948.504748] INFO: task md120_resync:135315 blocked for more than
122 seconds.
[147948.504749] Tainted: P OE 5.4.229-esos.prod #1
[147948.504749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[147948.504749] md120_resync D 0 135315 2 0x80004000
[147948.504750] Call Trace:
[147948.504752] __schedule+0x4ab/0x4f3
[147948.504754] ? printk+0x53/0x6a
[147948.504755] schedule+0x67/0x81
[147948.504756] md_do_sync+0xae7/0xdd9
[147948.504758] ? remove_wait_queue+0x41/0x41
[147948.504759] md_thread+0x128/0x151
[147948.504761] ? _raw_spin_lock_irqsave+0x31/0x5d
[147948.504762] ? md_start_sync+0xdc/0xdc
[147948.504763] kthread+0xe4/0xe9
[147948.504764] ? kthread_flush_worker+0x70/0x70
[147948.504765] ret_from_fork+0x35/0x40
...

This happens on 'raid6' MD RAID arrays that initially have sync_action==resync.

Any guidance would be greatly appreciated.

--Marc
