Mailing List Archive

The Problem of File System Corruption w/DRBD
Since DRBD lives below the filesystem, if the filesystem gets corrupted, then DRBD faithfully replicates the corruption to the other node. Thus the filesystem is the SPOF in an otherwise shared-nothing architecture. What is the recommended way (if there is one) to avoid the filesystem SPOF problem when clusters are based on DRBD?

-Eric




Re: The Problem of File System Corruption w/DRBD [ In reply to ]
On 2021-06-02 5:17 p.m., Eric Robinson wrote:
> Since DRBD lives below the filesystem, if the filesystem gets corrupted,
> then DRBD faithfully replicates the corruption to the other node. Thus
> the filesystem is the SPOF in an otherwise shared-nothing architecture.
> What is the recommended way (if there is one) to avoid the filesystem
> SPOF problem when clusters are based on DRBD?
>
> -Eric

To start, HA, like RAID, is not a replacement for backups. That is the
answer to a situation like this... HA (and other availability systems
like RAID) protects against component failure. If a node fails, the peer
recovers automatically and your services stay online. That's what DRBD
and other HA solutions strive to provide: uptime.

If you want to protect against corruption (accidental or intentional,
à la cryptolockers), you need a robust backup system to _complement_
your HA solution.

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein's brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
Re: The Problem of File System Corruption w/DRBD [ In reply to ]
> -----Original Message-----
> From: Digimer <lists@alteeve.ca>
> Sent: Wednesday, June 2, 2021 7:23 PM
> To: Eric Robinson <eric.robinson@psmnv.com>; drbd-user@lists.linbit.com
> Subject: Re: [DRBD-user] The Problem of File System Corruption w/DRBD
>
> On 2021-06-02 5:17 p.m., Eric Robinson wrote:
> > Since DRBD lives below the filesystem, if the filesystem gets
> > corrupted, then DRBD faithfully replicates the corruption to the other
> > node. Thus the filesystem is the SPOF in an otherwise shared-nothing
> architecture.
> > What is the recommended way (if there is one) to avoid the filesystem
> > SPOF problem when clusters are based on DRBD?
> >
> > -Eric
>
> To start, HA, like RAID, is not a replacement for backups. That is the answer
> to a situation like this... HA (and other availability systems like RAID) protect
> against component failure. If a node fails, the peer recovers automatically
> and your services stay online. That's what DRBD and other HA solutions strive
> to provide; uptime.
>
> If you want to protect against corruption (accidental or intentional, a-la
> cryptolockers), you need a robust backup system to _compliment_ your HA
> solution.
>

Yes, thanks, I've said for many years that HA is not a replacement for disaster recovery. Still, it is better to avoid downtime than to recover from it, and one of the main ways to achieve that is through redundancy, preferably a shared-nothing approach. If I have a cool 5-node cluster and the whole thing goes down because the filesystem gets corrupted, I can restore from backup, but management is going to wonder why a 5-node cluster could not provide availability. So the question remains: how to eliminate the filesystem as the SPOF?

-Eric
Re: The Problem of File System Corruption w/DRBD [ In reply to ]
Hi,

On 03.06.21 at 14:50, Eric Robinson wrote:

> Yes, thanks, I've said for many years that HA is not a replacement for disaster recovery. Still, it is better to avoid downtime than to recover from it, and one of the main ways to achieve that is through redundancy, preferably a shared-nothing approach. If I have a cool 5-node cluster and the whole thing goes down because the filesystem gets corrupted, I can restore from backup, but management is going to wonder why a 5-node cluster could not provide availability. So the question remains: how to eliminate the filesystem as the SPOF?

Then eliminate the shared filesystem and replicate data at the
application level.

- MySQL has Galera
- Dovecot has dsync
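
For MySQL, for example, the Galera part of my.cnf is roughly this (node
names and addresses are placeholders, and the provider path and exact
option names vary between Galera/MariaDB versions):

  [mysqld]
  binlog_format            = ROW
  default_storage_engine   = InnoDB
  innodb_autoinc_lock_mode = 2
  wsrep_on                 = ON
  # provider path varies by distro
  wsrep_provider           = /usr/lib/galera/libgalera_smm.so
  wsrep_cluster_name       = example_cluster
  wsrep_cluster_address    = gcomm://db1,db2,db3
  wsrep_node_name          = db1
  wsrep_node_address       = 10.0.0.1
  wsrep_sst_method         = rsync

With that in place each node keeps its own local filesystem and the
database layer itself handles replication and recovery.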

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin
Re: The Problem of File System Corruption w/DRBD [ In reply to ]
On 2021-06-03 11:09 a.m., Robert Sander wrote:
> Hi,
>
> Am 03.06.21 um 14:50 schrieb Eric Robinson:
>
>> Yes, thanks, I've said for many years that HA is not a replacement for disaster recovery. Still, it is better to avoid downtime than to recover from it, and one of the main ways to achieve that is through redundancy, preferably a shared-nothing approach. If I have a cool 5-node cluster and the whole thing goes down because the filesystem gets corrupted, I can restore from backup, but management is going to wonder why a 5-node cluster could not provide availability. So the question remains: how to eliminate the filesystem as the SPOF?
>
> Then eliminate the shared filesystem and replicate data on application
> level.
>
> - MySQL has Galera
> - Dovecot has dsync
>
> Regards

Even this approach just moves the SPOF up from the FS to the SQL engine.

The problem here is that you're still confusing redundancy with data
integrity. To avoid data corruption, you need a layer that understands
your data at a sufficient level to know what corruption looks like. Data
integrity is yet another topic, and still separate from HA.

DRBD, and other HA tools, don't analyze the data, nor should they
(imagine the security and privacy concerns that would open up). If the
HA layer is given data to replicate, its job is to faithfully and
accurately replicate that data.

I think the real solution is not technical, it's expectations
management. Your managers need to understand what each part of their
infrastructure does and does not do. This way, if the concerns around
data corruption are sufficient, they can invest in tools to protect the
data integrity at the logical layer.

HA protects against component failure. That's its job, and it does it
well, when well implemented.

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
Re: The Problem of File System Corruption w/DRBD [ In reply to ]
On 03/06/2021 13:50, Eric Robinson wrote:
>> -----Original Message-----
>> From: Digimer <lists@alteeve.ca>
>> Sent: Wednesday, June 2, 2021 7:23 PM
>> To: Eric Robinson <eric.robinson@psmnv.com>; drbd-user@lists.linbit.com
>> Subject: Re: [DRBD-user] The Problem of File System Corruption w/DRBD
>>
>> On 2021-06-02 5:17 p.m., Eric Robinson wrote:
>>> Since DRBD lives below the filesystem, if the filesystem gets
>>> corrupted, then DRBD faithfully replicates the corruption to the other
>>> node. Thus the filesystem is the SPOF in an otherwise shared-nothing
>> architecture.
>>> What is the recommended way (if there is one) to avoid the filesystem
>>> SPOF problem when clusters are based on DRBD?
>>>
>>> -Eric
>>
>> To start, HA, like RAID, is not a replacement for backups. That is the answer
>> to a situation like this... HA (and other availability systems like RAID) protect
>> against component failure. If a node fails, the peer recovers automatically
>> and your services stay online. That's what DRBD and other HA solutions strive
>> to provide; uptime.
>>
>> If you want to protect against corruption (accidental or intentional, a-la
>> cryptolockers), you need a robust backup system to _compliment_ your HA
>> solution.
>>
>
> Yes, thanks, I've said for many years that HA is not a replacement for disaster recovery. Still, it is better to avoid downtime than to recover from it, and one of the main ways to achieve that is through redundancy, preferably a shared-nothing approach. If I have a cool 5-node cluster and the whole thing goes down because the filesystem gets corrupted, I can restore from backup, but management is going to wonder why a 5-node cluster could not provide availability. So the question remains: how to eliminate the filesystem as the SPOF?
>

Some of the things being discussed here have nothing to do with drbd.
drbd provides a raw block-level device. It knows nothing about, nor cares,
what layers you place above it, whether they be filesystems or some
other block layer such as LVM or bcache.

It does a very specific job: ensure the blocks you write to a drbd
device get replicated and stored in real time on one or more other
distributed hosts. If you write a 512-byte block of random garbage
to a drbd device it will (and should) write the exact same garbage to
the other distributed hosts too, so that if you read that same 512-byte
block back from any one of those individual hosts, you'll get the exact
same garbage back.
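
To make that concrete, something like this illustrates it (device and
backing-LV names are made up; a Secondary DRBD device cannot be opened,
so there you read the backing device instead, and obviously don't try
this on a device holding data you care about):

  # on the Primary: write 512 bytes of random garbage to an arbitrary block
  dd if=/dev/urandom of=/dev/drbd0 bs=512 count=1 seek=1000 oflag=direct

  # on the Primary: read the block back and checksum it
  dd if=/dev/drbd0 bs=512 count=1 skip=1000 iflag=direct 2>/dev/null | sha256sum

  # on the Secondary: the same block on the backing device matches
  dd if=/dev/vg0/r0_back bs=512 count=1 skip=1000 iflag=direct 2>/dev/null | sha256sum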

The OP stated "if the filesystem gets corrupted, then DRBD faithfully
replicates the corruption to the other node." Good! That's exactly what
we want it to do. What we definitely do NOT want is for drbd to
manipulate the block data given to it in any way whatsoever; we want it
to replicate that data faithfully.
Re: The Problem of File System Corruption w/DRBD [ In reply to ]
> -----Original Message-----
> From: drbd-user-bounces@lists.linbit.com <drbd-user-
> bounces@lists.linbit.com> On Behalf Of Eddie Chapman
> Sent: Thursday, June 3, 2021 1:11 PM
> To: drbd-user@lists.linbit.com
> Subject: Re: [DRBD-user] The Problem of File System Corruption w/DRBD
>
> On 03/06/2021 13:50, Eric Robinson wrote:
> >> -----Original Message-----
> >> From: Digimer <lists@alteeve.ca>
> >> Sent: Wednesday, June 2, 2021 7:23 PM
> >> To: Eric Robinson <eric.robinson@psmnv.com>;
> >> drbd-user@lists.linbit.com
> >> Subject: Re: [DRBD-user] The Problem of File System Corruption w/DRBD
> >>
> >> On 2021-06-02 5:17 p.m., Eric Robinson wrote:
> >>> Since DRBD lives below the filesystem, if the filesystem gets
> >>> corrupted, then DRBD faithfully replicates the corruption to the
> >>> other node. Thus the filesystem is the SPOF in an otherwise
> >>> shared-nothing
> >> architecture.
> >>> What is the recommended way (if there is one) to avoid the
> >>> filesystem SPOF problem when clusters are based on DRBD?
> >>>
> >>> -Eric
> >>
> >> To start, HA, like RAID, is not a replacement for backups. That is
> >> the answer to a situation like this... HA (and other availability
> >> systems like RAID) protect against component failure. If a node
> >> fails, the peer recovers automatically and your services stay online.
> >> That's what DRBD and other HA solutions strive to provide; uptime.
> >>
> >> If you want to protect against corruption (accidental or intentional,
> >> a-la cryptolockers), you need a robust backup system to _compliment_
> >> your HA solution.
> >>
> >
> > Yes, thanks, I've said for many years that HA is not a replacement for
> disaster recovery. Still, it is better to avoid downtime than to recover from it,
> and one of the main ways to achieve that is through redundancy, preferably
> a shared-nothing approach. If I have a cool 5-node cluster and the whole
> thing goes down because the filesystem gets corrupted, I can restore from
> backup, but management is going to wonder why a 5-node cluster could not
> provide availability. So the question remains: how to eliminate the filesystem
> as the SPOF?
> >
>
> Some of the things being discussed here have nothing to do with drbd.
> drbd provides a raw block level device. It knows nothing about nor cares
> what layers you place above it, whether they be filesystems or some other
> block layer such as LVM or bcache.
>
> It does a very specific job; ensure the blocks you write to a drbd device get
> replicated and stored in real time on one or more other distributed hosts. If
> you write a 512byte size block of random garbage to a drbd device it will (and
> should) write the exact same garbage to the other distributed hosts too, so
> that if you read that same 512byte block back from any 1 of those individual
> hosts, you'll get the exact same garbage back.
>
> The OP stated "if the filesystem gets corrupted, then DRBD faithfully
> replicates the corruption to the other node." Good! That's exactly what we
> want it to do. What we definitely do NOT want is for drbd to manipulate the
> block data given to it in any way whatsoever, we want it to faithfully replicate
> this.

No need to defend DRBD. We've been using it in production clusters since 2006 and have been phenomenally happy with it. I'm not indicting DRBD at all. Yes, it's good that it faithfully replicates whatever is passed to it. However, since that is true, it does tend to enable the problem of filesystem corruption taking down a whole cluster. I'm just asking people for any suggestions they may have for alleviating that problem.

-Eric



Re: The Problem of File System Corruption w/DRBD [ In reply to ]
As others have already mentioned, the job of DRBD is to faithfully and
accurately replicate the data from the layers above it. So if there's
corruption in the filesystem above the DRBD layer, it will happily
replicate that for you, the same way RAID1 would on a pair of HDDs. If
you want to reduce the recovery time from such a situation, you could
leverage the snapshot capability of the layers below DRBD (if thin LVM
or ZFS is used) to roll back to a previous checkpoint, or implement HA
at the layers above DRBD if the application you are using supports it;
it really depends on the use case. That said, filesystem corruption
shouldn't be a common thing, and if it occurs you should investigate why
it happened in the first place.
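
Roughly, assuming the DRBD backing device sits on a ZFS zvol or a thin
LV (the names below are only examples), the rollback idea looks like:

  # ZFS below DRBD: checkpoint the backing zvol, roll back later if needed
  zfs snapshot tank/r0_back@checkpoint-2021-06-03
  zfs rollback tank/r0_back@checkpoint-2021-06-03

  # thin LVM below DRBD: same idea with a thin snapshot
  lvcreate -s -n r0_back_snap vg0/r0_back
  lvconvert --merge vg0/r0_back_snap

  # in both cases the DRBD resource must be down on this node while you
  # roll back, and the peer needs a resync afterwards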



On Wed, 2 Jun 2021 at 22:50, Eric Robinson <eric.robinson@psmnv.com> wrote:

> Since DRBD lives below the filesystem, if the filesystem gets corrupted,
> then DRBD faithfully replicates the corruption to the other node. Thus the
> filesystem is the SPOF in an otherwise shared-nothing architecture. What is
> the recommended way (if there is one) to avoid the filesystem SPOF problem
> when clusters are based on DRBD?
>
>
>
> -Eric
Re: The Problem of File System Corruption w/DRBD [ In reply to ]
> -----Original Message-----
> From: drbd-user-bounces@lists.linbit.com <drbd-user-
> bounces@lists.linbit.com> On Behalf Of Digimer
> Sent: Thursday, June 3, 2021 11:43 AM
> To: Robert Sander <r.sander@heinlein-support.de>; drbd-
> user@lists.linbit.com
> Subject: Re: [DRBD-user] The Problem of File System Corruption w/DRBD
>
> On 2021-06-03 11:09 a.m., Robert Sander wrote:
> > Hi,
> >
> > Am 03.06.21 um 14:50 schrieb Eric Robinson:
> >
> >> Yes, thanks, I've said for many years that HA is not a replacement for
> disaster recovery. Still, it is better to avoid downtime than to recover from it,
> and one of the main ways to achieve that is through redundancy, preferably
> a shared-nothing approach. If I have a cool 5-node cluster and the whole
> thing goes down because the filesystem gets corrupted, I can restore from
> backup, but management is going to wonder why a 5-node cluster could not
> provide availability. So the question remains: how to eliminate the filesystem
> as the SPOF?
> >
> > Then eliminate the shared filesystem and replicate data on application
> > level.
> >
> > - MySQL has Galera
> > - Dovecot has dsync
> >
> > Regards
>
> Even this approach just moves the SPOF up from the FS to the SQL engine.
>
> The problem here is that you're still confusing redundancy with data
> integrity. To avoid data corruption, you need a layer that understands your
> data at a sufficient level to know what corruption looks like. Data integrity is
> yet another topic, and still separate from HA.
>

> DRBD, and other HA tools, don't analyze the data, and nor should they
> (imagine the security and privacy concerns that would open up). If the HA
> layer is given data to replicate, it's job is to faithfully and accurately replicate
> the data.
>

It seems like the two are sometimes intertwined. Is GFS2, for example, about integrity or redundancy? But I'm not really asking how to prevent filesystem corruption. I'm asking (perhaps stupidly) about the best/easiest way to make a filesystem redundant.

> I think the real solution is not technical, it's expectations management. Your
> managers need to understand what each part of their infrastructure does
> and does not do. This way, if the concerns around data corruption are
> sufficient, they can invest in tools to protect the data integrity at the logical
> layer.
>
> HA protects against component failure. That's it's job, and it does it well,
> when well implemented.
>

The filesystem is not a hardware component, but it is a cluster resource. The other cluster resources are redundant, with that sole exception. I'm just looking for a way around that problem. If there isn't one, then there isn't.

Re: The Problem of File System Corruption w/DRBD [ In reply to ]
I guess I need to reiterate that I've been using DRBD in production clusters since 2006 and have been extremely happy with it. The purpose of my question is not to cast doubt or blame on DRBD for doing its job well. It's a good thing that DRBD faithfully replicates whatever is passed to it. However, since that is true, it does mean that filesystem corruption can take down a whole cluster. I'm just asking people for any suggestions they may have for alleviating that problem. If it's not fixable, then it's not fixable.



Part of the reason I'm asking is that we're about to build a whole new data center, and after 15 years of using DRBD we are beginning to look at other HA options, mainly because of the filesystem as a weak point. I should mention that such corruption has *never* happened to us before, but the thought of it is scary.



-Eric




From: drbd-user-bounces@lists.linbit.com <drbd-user-bounces@lists.linbit.com> On Behalf Of Yanni M.
Sent: Thursday, June 3, 2021 2:21 PM
Cc: drbd-user@lists.linbit.com
Subject: Re: [DRBD-user] The Problem of File System Corruption w/DRBD

As others already mentioned the job of DRBD is to faithfully and accurately replicate the data from the layers above it. So if there's a corruption on the filesystem above the DRBD layer then it will happily do it for you, same way as RAID1 would do it on a pair of hdds. If you want to reduce the recovery time from such situation then you could leverage from the snapshots capability on the layers below DRBD (if ThinLVM or ZFS are used), to rollback at a previous checkpoint or implement HA at the layers above DRBD if the application you are using supports it, it really depends on the use case. That being said a filesystem corruption shouldn't be a common thing and if it occurs you should investigate why it happened in the first place.



On Wed, 2 Jun 2021 at 22:50, Eric Robinson <eric.robinson@psmnv.com> wrote:
Since DRBD lives below the filesystem, if the filesystem gets corrupted, then DRBD faithfully replicates the corruption to the other node. Thus the filesystem is the SPOF in an otherwise shared-nothing architecture. What is the recommended way (if there is one) to avoid the filesystem SPOF problem when clusters are based on DRBD?

-Eric




Re: The Problem of File System Corruption w/DRBD [ In reply to ]
On 2021-06-03 3:35 p.m., Eric Robinson wrote:
>> Even this approach just moves the SPOF up from the FS to the SQL engine.
>>
>> The problem here is that you're still confusing redundancy with data
>> integrity. To avoid data corruption, you need a layer that understands your
>> data at a sufficient level to know what corruption looks like. Data integrity is
>> yet another topic, and still separate from HA.
>>
>
>> DRBD, and other HA tools, don't analyze the data, and nor should they
>> (imagine the security and privacy concerns that would open up). If the HA
>> layer is given data to replicate, it's job is to faithfully and accurately replicate
>> the data.
>>
>
> It seems like the two are sometimes intertwined. Is GFS2, for example, about integrity or redundancy? But I'm not really asking how to prevent filesystem corruption. I'm asking (perhaps stupidly) about the best/easiest way to make a filesystem redundant.

GFS2 coordinates access between nodes, to ensure no two nodes step on
each other's blocks and that all nodes know when to update their view of
the FS. It is still above the redundancy layer; it is still just a file
system at the end of the day.

If, for example, you were writing data to an FS on top of DRBD, and one
node's local storage started failing, the kernel would (should)
inform the DRBD driver that there has been an IO error. In such a case,
the DRBD device should detach from the local store and go diskless. All
further reads/writes on that node would (transparently) go to/from
another node.
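
If I remember right, that behaviour is set in the disk section of the
resource config, something like the following (sketch only, check
drbd.conf(5) for your version's defaults):

  resource r0 {
    disk {
      on-io-error detach;   # on a local disk error, drop to diskless and
                            # keep serving I/O through the peer
    }
    # ... hosts, volumes, etc. as usual
  }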

In this way, I think, you get as close as possible to the goal you're
describing. In such a case, though, you survived a hardware failure,
_exactly_ what HA is all about. You would have no data loss and your
managers would be happy. However, note how this example was below the
data structure... It involved the detection of a hardware fault and
mitigation of that fault.

DRBD (like a RAID array) has no concept of data structures. So if
something at the logic layer wrote bad data (i.e., a user's deletion or
saving of bad data), DRBD (again, like a RAID array) only cares to
ensure that the data is on both/all nodes, byte-for-byte accurate. This
is where the role of HA ends, and the roles of anti-virus, security, and
data integrity / backups kick in.


>> I think the real solution is not technical, it's expectations management. Your
>> managers need to understand what each part of their infrastructure does
>> and does not do. This way, if the concerns around data corruption are
>> sufficient, they can invest in tools to protect the data integrity at the logical
>> layer.
>>
>> HA protects against component failure. That's it's job, and it does it well,
>> when well implemented.
>>
>
> The filesystem is not a hardware component, but it is a cluster resource. The other cluster resources are redundant, with that sole exception. I'm just looking for a way around that problem. If there isn't one, then there isn't.

Consider the example of a virtual machine running on top of DRBD /
pacemaker (a setup I am very familiar with). If the host hardware fails,
the VM can be preventively migrated or recovered on the peer node. In
this way, the data is preserved (up to the point of failure / reboot),
and services are restored promptly. This is possible because the data
was written, byte for byte, to both host nodes. Voila! Full protection
against hardware faults.
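
For reference, a bare-bones pcs sketch of such a stack might look like
this (resource and file names are invented, and the promotable/master
syntax differs between pcs versions, so treat it as an outline only):

  pcs resource create r0_drbd ocf:linbit:drbd drbd_resource=r0 promotable
  pcs resource create db_vm ocf:heartbeat:VirtualDomain \
      config=/etc/libvirt/qemu/db_vm.xml
  pcs constraint colocation add db_vm with master r0_drbd-clone
  pcs constraint order promote r0_drbd-clone then start db_vm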

Consider now that your VM gets hit with a cryptolocker virus. That
attack is, faithfully, replicated to both nodes (exactly as it would
replicate to both hard drives in a RAID 1 array). In this case, you're
out of luck. Why? Because HA doesn't protect data integrity; it can't.
Its role is to protect against hardware faults. This is true of a
filesystem inside a VM, or a file system directly on top of a DRBD resource.

The key take-away here is the role of different technologies in your
overall corporate resilience planning. HA is one (very powerful) tool in
the toolbox to protect your services and data. Backups, DR and
anti-malware each play their own role in the big-picture planning.

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
Re: The Problem of File System Corruption w/DRBD [ In reply to ]
On 03 Jun 2021, at 21:41, Eric Robinson <eric.robinson@psmnv.com> wrote:
>
> It's a good thing that DRBD faithfully replicates whatever is passed to it. However, since that is true, it does tend to enable the problem of filesystem corruption taking down a whole cluster. I'm just asking people for any suggestions they may have for alleviating that problem. If it’s not fixable, then it’s not fixable.
>
> Part of the reason I’m asking is because we’re about to build a whole new data center, and after 15 years of using DRBD we are beginning to look at other HA options, mainly because of the filesystem as a weak point. I should mention that it has *never* happened before, but the thought of it is scary.

Oh, you’ve opened that can of worms, one of my favorite topics ;)

I guess I have bad news for you, because you have only just found the entrance to that rabbit hole. There are *lots* of things that can take down your entire cluster, and the filesystem is probably the least of your concerns, so I think you're looking at the wrong thing here. Unfortunately, none of them can be fixed by high-availability, because the problem area that you are talking about is not high-availability, it's high-reliability.

Let me give you a few examples of why high-reliability is something completely different from high-availability:

1. Imagine your application ends up in a corrupted state, but keeps running. Pacemaker might not even see that - the monitoring possibly just sees that the application is still running, so the cluster does not see any need to do anything, but the application does not work anymore.

2. Imagine your application crashes and leaves its data behind in a corrupted state in a file on a perfectly good filesystem - e.g., it crashes after having written only 20% of the file's content. Now Pacemaker restarts the application, but due to the corrupted content in its data file, the application cannot start. Pacemaker migrates the application to another node, which obviously - due to synchronous replication - has the same data. The application cannot start there either. The whole game continues until Pacemaker runs out of nodes to try to start the application, because it doesn't work anywhere.

3. Even worse, there could be a bug hidden in Pacemaker or Corosync that crashes the cluster software on all nodes at the same time, so that high-availability is lost. Then, your application crashes. Nothing’s there to restart it anywhere.

4. Ultimate worst case: there could be a bug in the Linux kernel, especially somewhere in the network or I/O stack, that crashes all nodes simultaneously - especially on operations where all of the nodes are doing the same thing, which is not that atypical for clusters - e.g., replication to all nodes, or distributed locking, etc.
It's not even that unlikely.

You might be shocked to hear that it has already happened to me - while developing or testing/experimenting, e.g. with experimental code. I have even crashed all nodes of an 8 node cluster simultaneously, and not just once. I have also had cases where my cluster fenced all its nodes.
It’s not impossible - BUT it’s also not common on a well-tested production system that doesn’t continuously run tests of crazy corner cases like I do on my test systems.

Obviously, adding more nodes does not solve any of those problems. But the real question is whether your use case is so critical that you really need to prevent any of those from ever occurring (because they don't seem to happen that often, otherwise we would have heard about it).

If it's really that level of critical, then you're running the wrong hardware, the wrong operating system and the wrong applications, and what you're really looking for is a custom-designed high-reliability (not just high-availability) solution, with dissimilar hardware platforms, multiple independent code implementations, formally verified software design and implementation, etc. - like the ones used for special-purpose medical equipment, safety-critical industrial equipment, avionics systems, nuclear reactor control, etc. - you get the idea. Now you know why those aren't allowed to run on general-purpose hardware and software.

Re: The Problem of File System Corruption w/DRBD [ In reply to ]
> -----Original Message-----
> From: drbd-user-bounces@lists.linbit.com <drbd-user-
> bounces@lists.linbit.com> On Behalf Of Robert Altnoeder
> Sent: Friday, June 4, 2021 6:15 AM
> To: drbd-user@lists.linbit.com
> Subject: Re: [DRBD-user] The Problem of File System Corruption w/DRBD
>
> On 03 Jun 2021, at 21:41, Eric Robinson <eric.robinson@psmnv.com> wrote:
> >
> > It's a good thing that DRBD faithfully replicates whatever is passed to it.
> However, since that is true, it does tend to enable the problem of filesystem
> corruption taking down a whole cluster. I'm just asking people for any
> suggestions they may have for alleviating that problem. If it’s not fixable,
> then it’s not fixable.
> >
> > Part of the reason I’m asking is because we’re about to build a whole new
> data center, and after 15 years of using DRBD we are beginning to look at
> other HA options, mainly because of the filesystem as a weak point. I should
> mention that it has *never* happened before, but the thought of it is scary.
>
> Oh, you’ve opened that can of worms, one of my favorite topics ;)
>
> I guess, I have bad news for you, because you have only just found the
> entrance to that rabbit hole. There are *lots* of things that can take down
> your entire cluster, and the filesystem is probably the least of your concerns
> here, so I think you’re looking at the wrong thing here. Unfortunately, none
> of them can be fixed by high-availability, because the problem area that you
> are talking about is not high-availability, it’s high-reliability.
>
> Let me give you a few examples on why high-reliability is something
> completely different than high-availability:
>
> 1. Imagine your application ends up in a corrupted state, but keeps running.
> Pacemaker might not even see that - the monitoring possibly just sees that
> the application is still running, so the cluster does not see any need to do
> anything, but the application does not work anymore.
>
> 2. Imagine your application crashes and leaves its data behind in a corrupted
> state in a file on a perfectly good filesystem - e.g., crashes after having
> written only 20% of the file’s content. Now Pacemaker restarts the
> application, but due to the corrupted content in its data file, the application
> cannot start. Pacemaker migrates the application to another node, which
> obviously - due to synchronous replication - has the sama data. The
> application cannot start there. The whole game continues until Pacemaker
> runs out of nodes to try and start the application, because it doesn’t work
> anywhere.
>
> 3. Even worse, there could be a bug hidden in Pacemaker or Corosync that
> crashes the cluster software on all nodes at the same time, so that high-
> availability is lost. Then, your application crashes. Nothing’s there to restart it
> anywhere.
>
> 4. Ultimate worst case: there could be a bug in the Linux kernel, especially
> somewhere in the network or I/O stack, that crashes all nodes
> simultaneously - especially on operations, where all of the nodes are doing
> the same thing, which is not that atypical for clusters - e.g., repliaction to all
> nodes, or distributed locking, etc.
> It’s not even that unlikely.
>
> You might be shocked to hear that it has already happened to me - while
> developing or testing/experimenting, e.g. with experimental code. I have
> even crashed all nodes of an 8 node cluster simultaneously, and not just
> once. I have also had cases where my cluster fenced all its nodes.
> It’s not impossible - BUT it’s also not common on a well-tested production
> system that doesn’t continuously run tests of crazy corner cases like I do on
> my test systems.
>
> Obviously, adding more nodes does not solve any of those problems. But the
> real question is whether your use case is so critical that you really need to
> prevent any of those from occuring once (because those don’t seem to
> happen that often, otherwise we would have heard about it).
>
> If it’s really that level of critical, then you’re running the wrong hardware, the
> wrong operating system and the wrong applications, and what you’re really
> looking for is a custom-designed high-reliability (not just high-availability)
> solution, with dissimilar hardware platforms, multiple independent code
> implementations, formally verified software design and implementation, etc.
> - like the ones used for special purpose medical equipment, safety-critical
> industrial equipment, avionics systems, nuclear reactor control, etc. - you get
> the idea. Now you know why those aren’t allowed run on general-purpose
> hardware and software.
>

Those are all good points. Since the three legs of the information security triad are confidentiality, integrity, and availability, this is ultimately a security issue. We all know that information security is not about eliminating all possible risks, as that is an unattainable goal. It is about mitigating risks to acceptable levels. So I guess it boils down to how each person evaluates the risks in their own environment. Over my 38-year career, and especially the past 15 years of using Linux HA, I've seen more filesystem-type issues than the other possible issues you mentioned, so that one tends to feature more prominently on my risk radar.
Re: The Problem of File System Corruption w/DRBD [ In reply to ]
On 2021-06-04 15:08, Eric Robinson wrote:
> Those are all good points. Since the three legs of the information
> security triad are confidentiality, integrity, and availability, this
> is ultimately a security issue. We all know that information security
> is not about eliminating all possible risks, as that is an
> unattainable goal. It is about mitigating risks to acceptable levels.
> So I guess it boils down to how each person evaluates the risks in
> their own environment. Over my 38-year career, and especially the past
> 15 years of using Linux HA, I've seen more filesystem-type issues than
> the other possible issues you mentioned, so that one tends to feature
> more prominently on my risk radar.

For the very limited goal of protecting against filesystem corruption,
you can use a snapshot/CoW layer such as thin LVM. Keep multiple rolling
snapshots and you can recover from sudden filesystem corruption. However,
this simply moves the SPOF down to the CoW layer (thin LVM, which is
quite complex by itself and can be considered a stripped-down
filesystem/allocator) or up to the application layer (where corruption
is relatively common).
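
As a sketch of what "multiple rolling snapshots" can look like with thin
LVM (the volume name and retention are arbitrary, and option spellings
are from memory):

  #!/bin/sh
  # /etc/cron.hourly/snap-data - keep the 24 newest thin snapshots of vg0/data_thin
  lvcreate -s -n data_$(date +%Y%m%d%H%M) vg0/data_thin
  lvs --noheadings -o lv_name -S 'origin=data_thin' -O -lv_name vg0 \
      | awk 'NR>24 {print $1}' \
      | xargs -r -I{} lvremove -y vg0/{}

Recovery then means merging or cloning the snapshot taken before the
corruption, with the usual caveat that whatever sits above the volume
must be quiesced first.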

That said, nowadays a mature filesystem such as EXT4 or XFS can be corrupted
(barring obscure bugs) only by:
- a double mount from different machines;
- a direct write to the underlying raw disks;
- a serious hardware issue.

For what it is worth, I am now accustomed to ZFS's strong data integrity
guarantees, but I fully realize that this does *not* protect against
every corruption scenario by itself, not even in
XFS-over-ZVOL-over-DRBD-over-ZFS setups. If anything, a more complex
filesystem (and I/O setup) has a *greater* chance of exposing uncommon
bugs.

So: I strongly advise placing your filesystem over a snapshot layer,
but do not expect this to shield you from every storage-related issue.
Regards.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
Re: The Problem of File System Corruption w/DRBD [ In reply to ]
> -----Original Message-----
> From: Gionatan Danti <g.danti@assyoma.it>
> Sent: Sunday, June 6, 2021 11:02 AM
> To: Eric Robinson <eric.robinson@psmnv.com>
> Cc: Robert Altnoeder <robert.altnoeder@linbit.com>; drbd-
> user@lists.linbit.com
> Subject: Re: [DRBD-user] The Problem of File System Corruption w/DRBD
>
> Il 2021-06-04 15:08 Eric Robinson ha scritto:
> > Those are all good points. Since the three legs of the information
> > security triad are confidentiality, integrity, and availability, this
> > is ultimately a security issue. We all know that information security
> > is not about eliminating all possible risks, as that is an
> > unattainable goal. It is about mitigating risks to acceptable levels.
> > So I guess it boils down to how each person evaluates the risks in
> > their own environment. Over my 38-year career, and especially the past
> > 15 years of using Linux HA, I've seen more filesystem-type issues than
> > the other possible issues you mentioned, so that one tends to feature
> > more prominently on my risk radar.
>
> For the very limited goal of protecting from filesystem corruptions, you can
> use a snapshot/CoW layer as thinlvm. Keep multiple rolling snapshots and
> you can recover from sudden filesystem corruption. However this is simply
> move the SPOF down to the CoW layer (thinlvm, which is quite complex by
> itself and can be considered a stripped-down
> filesystem/allocator) or up to the application layer (where corruptions are
> relatively quite common).
>
> That said, nowadays a mature filesystem as EXT4 and XFS can be corrupted
> (barring obscure bugs) only by:
> - a double mount from different machines;
> - a direct write to the underlying raw disks;
> - a serious hardware issue.
>
> For what it is worth I am now accustomed to ZFS strong data integrity
> guarantee, but I fully realize that this does *not* protect from any
> corruptions scenario by itself, not even on XFS-over-ZVOL-over-DRBD-over-
> ZFS setups. If anything, a more complex filesystem (and I/O setup) has
> *greater* chances of exposing uncommon bugs.
>
> So: I strongly advise on placing your filesystem over a snapshot layer, but do
> not expect this to shield from any storage related issue.
> Regards.
>

That would require a model where DRBD is sandwiched between two LVM layers. First, the DRBD backing device is an LVM logical volume. Then we create another LVM volume on top of DRBD, and create our filesystem on top of that. I've tried that approach before and had very poor success with cluster failover, due to the LVM resource agent not working as expected, volumes going inactive when they should be active, etc. Maybe it's just too complex for my brain.

If rolling snapshots are an acceptable solution, why not just periodically snapshot the whole drbd volume? Then, in the unlikely event that filesystem corruption occurs, fall back to the snapshot from before the corruption happened. I assume that would require a full drbd resync from primary to secondary, but that's probably easier than restoring from backup media.
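
Something like the following is what I have in mind, assuming the backing
device is a thin LV with a pre-corruption snapshot (untested sketch; I'd
want confirmation that DRBD's metadata survives this before trusting it):

  # on the node we want to recover from, with the resource down on all nodes:
  drbdadm down r0
  lvconvert --merge vg0/r0_back_snap   # roll the backing LV back to the snapshot
  drbdadm up r0
  drbdadm primary r0
  drbdadm invalidate-remote r0         # force a full resync so the peer
                                       # matches the rolled-back data
  # then bring the resource up on the peer and let it resync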
_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user