Mailing List Archive

corrupted resource can't be fixed be rolling back to old snapshot
Hey, all. Here's a question that's bugging me. I've had this happen
multiple times now (over the course of 2-3 years, so infrequent).
I've got a system set up with DRBD resources using ZFS volumes as the
block devices (for volume management and snapshots among other
reasons). I've had some obvious hardware problems lead to what I
think is corrupted DRBD metadata a few times. Now, I had expected to
be able to simply roll back to an earlier snapshot of the underlying
ZVOL on the primary, a slightly older one on the secondary node, and
sync back up nicely. But what happens instead is that no matter how old
a snapshot I use, I continue to get these types of errors:

drbdadm dump-md nautilus_data
Found meta data is "unclean", please apply-al first

drbdadm apply-al nautilus_data
extent 4746752 beyond end of bitmap!
extent 4870144 beyond end of bitmap!
extent 5436416 beyond end of bitmap!
extent 5437440 beyond end of bitmap!
...
extent 6793216 beyond end of bitmap!
extent 6793218 beyond end of bitmap!
../shared/drbdmeta.c:2028:apply_al: ASSERT(bm_pos - bm_on_disk_pos <= chunk - this_extent_size) failed.

What I'm trying to understand is how I can be corrupting my DRBD
resource so badly that even an older version of the block device
used by the resource is STILL corrupt.
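
One rough sanity check on the output above is to translate the reported
extent numbers into byte offsets and compare them with the size of the
backing device. The sketch below is only an illustration under stated
assumptions: it assumes the numbers printed by apply-al are activity-log
extent indices, and that each such extent covers 4 MiB (the size the DRBD
documentation gives for an activity-log extent). The function names and
the device size in the usage section are hypothetical, not values taken
from this system.

```python
# Assumption: each activity-log (AL) extent covers 4 MiB of the device.
AL_EXTENT_BYTES = 4 * 1024 * 1024


def extent_offset(extent: int) -> int:
    """Byte offset at which the given AL extent would start."""
    return extent * AL_EXTENT_BYTES


def extents_beyond(extents, device_bytes: int):
    """Return the extents whose start offset lies at or past the device end."""
    return [e for e in extents if extent_offset(e) >= device_bytes]


if __name__ == "__main__":
    # Extent numbers copied from the apply-al output in this thread.
    reported = [4746752, 4870144, 5436416, 5437440, 6793216, 6793218]
    # Hypothetical device size; substitute the real ZVOL size in bytes.
    device_bytes = 16 * 1024**4
    for e in reported:
        print(f"extent {e}: starts at {extent_offset(e) / 1024**4:.2f} TiB")
    print("beyond device end:", extents_beyond(reported, device_bytes))
```

If the computed offsets turn out to be far larger than the ZVOL, that
would at least be consistent with the activity log itself holding bad
entries, although it would not identify the cause.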

This is an Ubuntu 20.04 system with 5.15 kernel and DRBD 9.1.5, but as
mentioned I've seen this problem a couple times over the years with
5.10 and 5.4 kernels and whatever version of DRBD9 compiled for those
kernels at the time. I'm convinced I must be fundamentally
misunderstanding something about how DRBD works on this one (thus
drbd-user instead of drbd-dev list).

My resource config follows:

# resource nautilus_data on skywalker: not ignored, not stacked
# defined at /etc/drbd.d/nautilus_data.res:1
resource nautilus_data {
    device /dev/drbd1 minor 1;
    meta-disk internal;
    on skywalker {
        node-id 0;
        disk /dev/zdata/nautilus;
        address ipv4 10.1.20.201:7810;
    }
    on vader {
        node-id 1;
        disk /dev/zdata/nautilus;
        address ipv4 10.1.20.202:7810;
    }
    connection {
        host skywalker address ipv4 192.168.1.2:7810;
        host vader address ipv4 192.168.1.3:7810;
        net {
            _name vader;
        }
    }
    net {
        protocol C;
        max-buffers 36k;
        max-epoch-size 20000;
        sndbuf-size 2M;
        rcvbuf-size 4M;
        allow-two-primaries yes;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    disk {
        disk-barrier no;
        disk-flushes no;
        al-extents 3833;
        c-plan-ahead 1;
        c-fill-target 24M;
        c-max-rate 110M;
        c-min-rate 10M;
    }
}

--
Michael D Labriola
401-316-9844 (cell)
_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user
Re: corrupted resource can't be fixed be rolling back to old snapshot
Hi Michael,

Are you using the most recent version of drbd-utils? There have been a
few fixes over the years which might be related.

Perhaps the hardware problems affected the metadata long ago and now
the corrupted metadata is present in all the snapshots.

If that is not the case, this looks to me more like a bug than a
misunderstanding of how DRBD works. Are you able to reproduce the
issue starting from a fresh volume? It could be that this particular
combination of device size and bitmap slot count triggers a bug that
no-one else has yet encountered. A reproducer would be necessary to
work on fixing it.

Best regards,
Joel

Re: corrupted resource can't be fixed be rolling back to old snapshot
I am unable to see the older thread. Maybe it is just me. Could you
please share the older conversation as well?

Thanks and Regards,
Chitvan Chhabra

Re: corrupted resource can't be fixed be rolling back to old snapshot
On Tue, Aug 02, 2022 at 02:54:02PM +0530, Chitvan Chhabra wrote:
> I am unable to see the older thread. Maybe it is just me. Could you
> please share the older conversation as well?

We have an archive:
https://lists.linbit.com/pipermail/drbd-user/2022-July/026252.html

Re: corrupted resource can't be fixed be rolling back to old snapshot
I could be wrong here, but this is what I understand.

Scenario after the rollback:

A (primary, snapshot rolled back to, say, 12:10:00 PM), B (secondary,
snapshot rolled back to, say, 12:09:00 PM)

Current time: say 12:30:00 PM

A must have received acks from B (in the past, of course) for writes made
between 12:09:00 and 12:10:00 PM. Now B, rolled back to 12:09:00, says it
does not have the data written from 12:09:01 onwards, which might confuse
A: B already acked that data, so how can B now say it does not have it?
Hence the error. This is just my guess. Or does DRBD support such a
scenario? If so, that is awesome, as it avoids a complete
resynchronization of the data.

Anyway, with DRBD down, you can always get the data back from the ZVOL
snapshot (or its clone) itself (assuming the DRBD metadata does not
contain actual data?).




On Tue, 2 Aug 2022 at 15:04, Roland Kammerer <roland.kammerer@linbit.com>
wrote:

> On Tue, Aug 02, 2022 at 02:54:02PM +0530, Chitvan Chhabra wrote:
> > I am unable to see the older thread. Maybe it is just me. Could you
> > please share the older conversation as well?
>
> We have an archive:
> https://lists.linbit.com/pipermail/drbd-user/2022-July/026252.html
Re: corrupted resource can't be fixed be rolling back to old snapshot
On Tue, Aug 2, 2022 at 5:04 AM Joel Colledge <joel.colledge@linbit.com> wrote:
>
> Hi Michael,
>
> Are you using the most recent version of drbd-utils? There have been a
> few fixes over the years which might be related.

I was using 9.20.2 this last time. I'm fairly certain I've been using
the focal ppa from linbit for the entire life of this particular
system, so I've probably always been newer than the Ubuntu version.

>
> Perhaps the hardware problems affected the metadata long ago and now
> the corrupted metadata is present in all the snapshots.

Possible. But I'm fairly certain we recreated the DRBD resources from
scratch (new meta-data, initial sync, etc) after we fixed the
problems... granted, I could still have problems. This particular
system for whatever reason is cursed.

>
> If that is not the case, this looks to me more like a bug than a
> misunderstanding of how DRBD works. Are you able to reproduce the
> issue starting from a fresh volume? It could be that this particular
> combination of device size and bitmap slot count triggers a bug that
> no-one else has yet encountered. A reproducer would be necessary to
> work on fixing it.

Well, half of what I was looking for here was somebody else to tell me
this is odd. I *should* be able to recover by rolling back to an old
snapshot of the backing ZVOL on both nodes. I know I've done it for
proof of concept and to roll back to fix "human error" type problems...
This was the first time I've had to try to recover from something
actually going wrong (from DRBD's standpoint).
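
The recovery being described can be sketched as a short sequence of
commands per node. This is a sketch under assumptions, not a tested
procedure: the dataset name zdata/nautilus is inferred from the disk path
in the resource config, the snapshot names are hypothetical, and the
helper only builds the command lists instead of running them. Note that
zfs rollback -r also destroys any snapshots newer than the target.

```python
import shlex


def rollback_plan(resource: str, dataset: str, snapshot: str) -> list:
    """Commands to take a resource down, roll back its ZVOL, and bring it up."""
    return [
        ["drbdadm", "down", resource],
        # -r also destroys snapshots more recent than the rollback target.
        ["zfs", "rollback", "-r", f"{dataset}@{snapshot}"],
        ["drbdadm", "up", resource],
    ]


if __name__ == "__main__":
    # Hypothetical snapshot names; the secondary uses a slightly older one.
    plans = {
        "skywalker": rollback_plan("nautilus_data", "zdata/nautilus", "snap-1210"),
        "vader": rollback_plan("nautilus_data", "zdata/nautilus", "snap-1209"),
    }
    for node, plan in plans.items():
        for cmd in plan:
            print(f"{node}$ {shlex.join(cmd)}")
```

Printing the plan before running anything makes it easy to review which
snapshot each node would be rolled back to.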

For the record, I did not lose any data... I could still access the
ZVOL directly (e.g., mounted EXT4) and rsync into a newly created DRBD
resource... but this particular resource is large and takes 3-4 days
to finish the initial sync. I'd obviously like to avoid that.

>
> Best regards,
> Joel

--
Michael D Labriola
401-316-9844 (cell)

Re: corrupted resource can't be fixed be rolling back to old snapshot
On Tue, Aug 2, 2022 at 6:44 AM Chitvan Chhabra <chitvan1988@gmail.com> wrote:
>
> I could be wrong here, but this is what I understand.
>
> Scenario after the rollback:
>
> A (primary, snapshot rolled back to, say, 12:10:00 PM), B (secondary, snapshot rolled back to, say, 12:09:00 PM)

Yes.

>
> Current time: say 12:30:00 PM
> A must have received acks from B (in the past, of course) for writes made between 12:09:00 and 12:10:00 PM. Now B, rolled back to 12:09:00, says it does not have the data written from 12:09:01 onwards, which might confuse A: B already acked that data, so how can B now say it does not have it? Hence the error. This is just my guess. Or does DRBD support such a scenario? If so, that is awesome, as it avoids a complete resynchronization of the data.

Normally, yes this works and it's just as awesome as it sounds. But
in this particular case, it's utterly broken. With the resource down
on both nodes, trying to raise it on either basically wigs out and
says "what is this block device?" Any calls to drbdadm, drbdmeta,
etc, result in a couple dozen "extent beyond end of bitmap" error
messages. Rolling back to an older snapshot results in the exact same
error messages, which was quite unexpected.

>
> Anyway, with DRBD down, you can always get the data back from the ZVOL snapshot (or its clone) itself (assuming the DRBD metadata does not contain actual data?).

Yes, I was able to create a new DRBD resource and copy my data into
it... but the initial sync for this resource takes 3-4 days... and I
don't want to have to do that all the time.

--
Michael D Labriola
401-316-9844 (cell)