Mailing List Archive

Soliciting new RAID ideas
Hi all,
The list is quiet. Please excuse me waking it up. (Or trying to...) ;-)

I'm at the point where I'm a few months from running out of disk
space on my RAID6 so I'm considering how to move forward. I thought
I'd check in here and get any ideas folks have. Thanks in advance.

The system is 64-bit Gentoo, mostly stable, using an i7-980x
Extreme Edition processor with 24GB DRAM. Large chassis, 6 removable
HD bays, room for 6 other drives, a large power supply.

The disk subsystem is a 1.4TB RAID6 built from five SATA2 500GB WD
RAID-Edition 3 drives. The RAID has not had a single glitch in the 4+
years I've used this machine.

Generally there are 4 classes of data on the RAID:

1) Gentoo (obviously), configs backed up every weekend. I plan to
rebuild from scratch using existing configs if there's a failure.
Being down for a couple of days is not an issue.
2) VMs - about 300GB. Loaded every morning, stopped & saved every
night, backed up every weekend.
3) Financial data - lots of it - stocks, futures, options, etc.
Performance requirements are pretty low. Backed up every weekend.
4) Video files - backed up to a different location than items 1/2/3
whenever there are changes

After eclean-dist/eclean-pkg I'm down to about 80GB free and this
will fill up in 3-6 months so it's time to make some changes.

My thoughts:

1) Buy three (or even just two) 5400 RPM 3TB WD Red drives and go with
RAID1. This would use the internal SATA2 ports so it wouldn't be the
highest performance but likely a lot better than my SATA2 RAID6.

2) Buy two 7200 RPM 3TB WD Red drives and an LSI logic hardware RAID
controller. This would be SATA3 so probably way more performance than
I have now. MUCH more expensive though.

3) #1 + an SSD. I have an unused 120GB SSD so I could get another,
make a 2-disk RAID1, put Gentoo on that and everything else on the
newer 3TB drives. More complex, probably lower reliability and I'm not
sure I gain much.

Beyond this I need to talk file system types. I'm fat dumb and
happy with Ext4 and don't really relish dealing with new stuff but
now's the time to at least look.

Anyway, that's the basic outline. Any thoughts, ideas, corrections,
expansions, etc., I'm very interested in talking about.

Cheers,
Mark
Re: Soliciting new RAID ideas
Mark Knecht, mused, then expounded:
> Hi all,
> The list is quiet. Please excuse me waking it up. (Or trying to...) ;-)
>
> I'm at the point where I'm a few months from running out of disk
> space on my RAID6 so I'm considering how to move forward. I thought
> I'd check in here and get any ideas folks have. Thanks in advance.
>

Beware - if Adobe acroread is used, and you opt for a 3TB home
directory, there is a chance it will not work. Or more specifically,
acroread is still 32-bit. It's only something I've seen with the xfs
filesystem. And Adobe has ignored it for approx. 3yrs now.

> The system is a Gentoo 64-bit, mostly stable, using a i7-980x
> Extreme Edition processor with 24GB DRAM. Large chassis, 6 removable
> HD bays, room for 6 other drives, a large power supply.
>
> The disk subsystem is a 1.4TB RAID6 built from five SATA2 500GB WD
> RAID-Edition 3 drives. The RAID has not had a single glitch in the 4+
> years I've used this machine.
>
> Generally there are 4 classes of data on the RAID:
>
> 1) Gentoo (obviously), configs backed up every weekend. I plan to
> rebuild from scratch using existing configs if there's a failure.
> Being down for a couple of days is not an issue.
> 2) VMs - about 300GB. Loaded every morning, stopped & saved every
> night, backed up every weekend.
> 3) Financial data - lots of it - stocks, futures, options, etc.
> Performance requirements are pretty low. Backed up every weekend.
> 4) Video files - backed up to a different location than items 1/2/3
> whenever there are changes
>
> After eclean-dist/eclean-pkg I'm down to about 80GB free and this
> will fill up in 3-6 months so it's time to make some changes.
>
> My thoughts:
>
> 1) Buy three (or even just two) 5400 RPM 3TB WD Red drives and go with
> RAID1. This would use the internal SATA2 ports so it wouldn't be the
> highest performance but likely a lot better than my SATA2 RAID6.
>
> 2) Buy two 7200 RPM 3TB WD Red drives and an LSI logic hardware RAID
> controller. This would be SATA3 so probably way more performance than
> I have now. MUCH more expensive though.
>

RAID 1 is fine; RAID 10 is better, but consumes 4 drives and SATA ports.

> 3) #1 + an SSD. I have an unused 120GB SSD so I could get another,
> make a 2-disk RAID1, put Gentoo on that and everything else on the
> newer 3TB drives. More complex, probably lower reliability and I'm not
> sure I gain much.
>
> Beyond this I need to talk file system types. I'm fat dumb and
> happy with Ext4 and don't really relish dealing with new stuff but
> now's the time to at least look.
>

If you change, do not use ZFS (and possibly BTRFS) if the system does not
have ECC DRAM. A single, unnoticed memory error - a bit flip that ECC would
have caught - can corrupt the data pool and be written to the file system,
which effectively renders it corrupt without a way to recover.

FWIW - a Synology DS414slim can hold 4 x 1TB WD Red NAS 2.5" drives and
serve NFS or iSCSI storage your VMs can boot from. The downside is the NAS box
and drives would go for a bit north of $636. The upside is all your
movies and VM files could move off your workstation and the workstation
would still host the VMs via a mount of the NAS box.
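
If you did go the iSCSI route, attaching a LUN from the workstation side
with open-iscsi is roughly this (the portal address and target name below
are made-up placeholders, not real Synology defaults):

  iscsiadm -m discovery -t sendtargets -p 192.168.1.50
  iscsiadm -m node -T iqn.2000-01.com.synology:nas.vms -p 192.168.1.50 --login
  # the LUN then shows up as an ordinary local block device to partition/format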

> Anyway, that's the basic outline. Any thoughts, ideas, corrections,
> expansions, etc., I'm very interested in talking about.
>
> Cheers,
> Mark
>

--
-
Re: Soliciting new RAID ideas
On May 27, 2014 6:39 PM, "Bob Sanders" <rsanders@sgi.com> wrote:
>
> Mark Knecht, mused, then expounded:
> > Hi all,
> > The list is quiet. Please excuse me waking it up. (Or trying to...)
;-)
> >
> > I'm at the point where I'm a few months from running out of disk
> > space on my RAID6 so I'm considering how to move forward. I thought
> > I'd check in here and get any ideas folks have. Thanks in advance.
> >
>
> Beware - if Adobe acroread is used, and you opt for a 3TB home
> directory, there is a chance it will not work. Or more specifically,
> acroread is still 32-bit. It's only something I've seen with the xfs
> filesystem. And Adobe has ignored it for approx. 3yrs now.
>
> > The system is a Gentoo 64-bit, mostly stable, using a i7-980x
> > Extreme Edition processor with 24GB DRAM. Large chassis, 6 removable
> > HD bays, room for 6 other drives, a large power supply.
> >
> > The disk subsystem is a 1.4TB RAID6 built from five SATA2 500GB WD
> > RAID-Edition 3 drives. The RAID has not had a single glitch in the 4+
> > years I've used this machine.
> >
> > Generally there are 4 classes of data on the RAID:
> >
> > 1) Gentoo (obviously), configs backed up every weekend. I plan to
> > rebuild from scratch using existing configs if there's a failure.
> > Being down for a couple of days is not an issue.
> > 2) VMs - about 300GB. Loaded every morning, stopped & saved every
> > night, backed up every weekend.
> > 3) Financial data - lots of it - stocks, futures, options, etc.
> > Performance requirements are pretty low. Backed up every weekend.
> > 4) Video files - backed up to a different location than items 1/2/3
> > whenever there are changes
> >
> > After eclean-dist/eclean-pkg I'm down to about 80GB free and this
> > will fill up in 3-6 months so it's time to make some changes.
> >
> > My thoughts:
> >
> > 1) Buy three (or even just two) 5400 RPM 3TB WD Red drives and go with
> > RAID1. This would use the internal SATA2 ports so it wouldn't be the
> > highest performance but likely a lot better than my SATA2 RAID6.
> >
> > 2) Buy two 7200 RPM 3TB WD Red drives and an LSI logic hardware RAID
> > controller. This would be SATA3 so probably way more performance than
> > I have now. MUCH more expensive though.
> >
>
> RAID 1 is fine, RAID 10 is better, but consumes 4 drives and SATA ports.
>
> > 3) #1 + an SSD. I have an unused 120GB SSD so I could get another,
> > make a 2-disk RAID1, put Gentoo on that and everything else on the
> > newer 3TB drives. More complex, probably lower reliability and I'm not
> > sure I gain much.
> >
> > Beyond this I need to talk file system types. I'm fat dumb and
> > happy with Ext4 and don't really relish dealing with new stuff but
> > now's the time to at least look.
> >
>
> If you change, do not use ZFS and possibly BTRFS if the system does not
> have ECC DRAM. A single, unnoticed, ECC error can corrupt the data pool
> and be written to the file system, which effectively renders it corrupt
> without a way to recover.
>
> FWIW - a Synology DS414slim can hold 4 x 1TB WD Red NAS 2.5" drives and
> provide a boot of nfs or iSCSI to your VMs. The downside is the NAS box
> and drives would go for a bit north of $636. The upside is all your
> movies and VM files could move off your workstation and the workstation
> would still host the VMs via a mount of the NAS box.

+1 for the Synology NAS boxes - those things are awesome, fast, reliable,
upgradable (if you buy a larger one), and the best value available for
iSCSI-attached VMs.

>
> > Anyway, that's the basic outline. Any thoughts, ideas, corrections,
> > expansions, etc., I'm very interested in talking about.
> >
> > Cheers,
> > Mark
> >
>
> --
> -
>
>
Re: Soliciting new RAID ideas
On Wed, May 28, 2014 at 1:13 AM, Mark Knecht <markknecht@gmail.com> wrote:

> 1) Buy three (or even just two) 5400 RPM 3TB WD Red drives and go with
> RAID1. This would use the internal SATA2 ports so it wouldn't be the
> highest performance but likely a lot better than my SATA2 RAID6.
>

This.

Thinking into the future is important - drives tend to fill up faster when
you have more free space available.
Get three drives if possible, go RAID5, then when you run out of space (you
will), you just add one more and you're happy again.
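
A minimal sketch of that grow step with mdadm (array and device names here
are made up - adjust to whatever your setup actually uses):

  mdadm --add /dev/md0 /dev/sde1            # add the new disk to the RAID5
  mdadm --grow /dev/md0 --raid-devices=4    # reshape from 3 to 4 drives
  resize2fs /dev/md0                        # grow the ext4 on top once the reshape finishes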

This setup has one more advantage: You get to keep your old drives and
re-use them.

One interesting idea would be to use 3 of your old drives in a RAID5 setup
for Gentoo. It wouldn't be as fast as a couple of SSDs, but you're already
used to the speed and you instantly get two backup drives just in case one
of the old drives fails. You could also use the spare space on this array
for backups of critical stuff from the main raid.

You can always switch to SSDs for the main system later :)


>
> Beyond this I need to talk file system types. I'm fat dumb and
> happy with Ext4 and don't really relish dealing with new stuff but
> now's the time to at least look.


New tech is nice, but I'd stick with ext4. Data is one of the few things on
my systems that I don't like to toy with.

Cheers,
--
Alex Alexander
+ wired
+ www.linuxized.com
+ www.leetworks.com
Re: Soliciting new RAID ideas
On Tue, May 27, 2014 at 3:39 PM, Bob Sanders <rsanders@sgi.com> wrote:
> Mark Knecht, mused, then expounded:
>> Hi all,
>> The list is quiet. Please excuse me waking it up. (Or trying to...) ;-)
>>
>> I'm at the point where I'm a few months from running out of disk
>> space on my RAID6 so I'm considering how to move forward. I thought
>> I'd check in here and get any ideas folks have. Thanks in advance.
>>
>
> Beware - if Adobe acroread is used, and you opt for a 3TB home
> directory, there is a chance it will not work. Or more specifically,
> acroread is still 32-bit. It's only something I've seen with the xfs
> filesystem. And Adobe has ignored it for approx. 3yrs now.
>

acroread isn't critical to me but it does get used now and then so
thanks for the heads-up.

<SNIP>
>
> RAID 1 is fine, RAID 10 is better, but consumes 4 drives and SATA ports.

Humm...I suppose I might consider building a 4-drive 1TB RAID10 from
my existing 500GB RE3 drives, and then buy a couple of 2TB Red drives
and do a RAID1 for data storage. If I did that I'd end up with 6
drives in the box, 4 of them old, but old ain't necessarily bad. ;-)
However, that forces me to manage what data goes where instead of just
having a big, flat RAID1, which would be easy to live with. Still, it
would probably save some money.
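
(For reference, assuming I stay with mdadm, building that array would be
something along these lines - device names are just placeholders:

  mdadm --create /dev/md1 --level=10 --raid-devices=4 \
        /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

with the two new Red drives going into a separate --level=1 array for data.)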

<SNIP>
>
> If you change, do not use ZFS and possibly BTRFS if the system does not
> have ECC DRAM. A single, unnoticed, ECC error can corrupt the data pool
> and be written to the file system, which effectively renders it corrupt
> without a way to recover.

Thanks. No ECC and no real interest in doing anything very exotic.

>
> FWIW - a Synology DS414slim can hold 4 x 1TB WD Red NAS 2.5" drives and
> provide a boot of nfs or iSCSI to your VMs. The downside is the NAS box
> and drives would go for a bit north of $636. The upside is all your
> movies and VM files could move off your workstation and the workstation
> would still host the VMs via a mount of the NAS box.
>

NAS is an interesting idea. I'll do a little study but my initial
feeling is that it's more money than I really want to spend. Summer's
coming. Time for Margaritas!

Thanks,
Mark
Re: Soliciting new RAID ideas
On 2014-05-27 23:58, Harry Holt wrote:
> On May 27, 2014 6:39 PM, "Bob Sanders" <rsanders@sgi.com> wrote:
> >
> > Mark Knecht, mused, then expounded:
> > > Hi all,
> > >    The list is quiet. Please excuse me waking it up. (Or trying
> to...) ;-)
> > >
> > >    I'm at the point where I'm a few months from running out of
> disk
> > > space on my RAID6 so I'm considering how to move forward. I
> thought
> > > I'd check in here and get any ideas folks have. Thanks in
> advance.
> > >
> >
> > Beware - if Adobe acroread is used, and you opt for a 3TB home
> > directory, there is a chance it will not work.  Or more
> specifically,
> > acroread is still 32-bit.  It's only something I've seen with the
> xfs
> > filesystem.  And Adobe has ignored it for approx. 3yrs now.
> >
> > >    The system is a Gentoo 64-bit, mostly stable, using a
> i7-980x
> > > Extreme Edition processor with 24GB DRAM. Large chassis, 6
> removable
> > > HD bays, room for 6 other drives, a large power supply.
> > >
> > >    The disk subsystem is a 1.4TB RAID6 built from five SATA2
> 500GB WD
> > > RAID-Edition 3 drives. The RAID has not had a single glitch in
> the 4+
> > > years I've used this machine.
> > >
> > >    Generally there are 4 classes of data on the RAID:
> > >
> > > 1) Gentoo (obviously), configs backed up every weekend. I plan to
> > > rebuild from scratch using existing configs if there's a failure.
> > > Being down for a couple of days is not an issue.
> > > 2) VMs - about 300GB. Loaded every morning, stopped & saved every
> > > night, backed up every weekend.
> > > 3) Financial data - lots of it - stocks, futures, options, etc.
> > > Performance requirements are pretty low. Backed up every weekend.
> > > 4) Video files - backed up to a different location than items
> 1/2/3
> > > whenever there are changes
> > >
> > >    After eclean-dist/eclean-pkg I'm down to about 80GB free and
> this
> > > will fill up in 3-6 months so it's time to make some changes.
> > >
> > >    My thoughts:
> > >
> > > 1) Buy three (or even just two) 5400 RPM 3TB WD Red drives and go
> with
> > > RAID1. This would use the internal SATA2 ports so it wouldn't be
> the
> > > highest performance but likely a lot better than my SATA2 RAID6.
> > >
> > > 2) Buy two 7200 RPM 3TB WD Red drives and an LSI logic hardware
> RAID
> > > controller. This would be SATA3 so probably way more performance
> than
> > > I have now. MUCH more expensive though.
> > >
> >
> > RAID 1 is fine, RAID 10 is better, but consumes 4 drives and SATA
> ports.
> >
> > > 3) #1 + an SSD. I have an unused 120GB SSD so I could get
> another,
> > > make a 2-disk RAID1, put Gentoo on that and everything else on
> the
> > > newer 3TB drives. More complex, probably lower reliability and
> I'm not
> > > sure I gain much.
> > >
> > >    Beyond this I need to talk file system types. I'm fat dumb
> and
> > > happy with Ext4 and don't really relish dealing with new stuff
> but
> > > now's the time to at least look.
> > >
> >
> > If you change, do not use ZFS and possibly BTRFS if the system does
> not
> > have ECC DRAM.  A single, unnoticed, ECC error can corrupt the
> data pool
> > and be written to the file system, which effectively renders it
> corrupt
> > without a way to recover.
> >
> > FWIW - a Synology DS414slim can hold 4 x 1TB WD Red NAS 2.5" drives
> and
> > provide a boot of nfs or iSCSI to your VMs.  The downside is the
> NAS box
> > and drives would go for a bit north of $636.  The upside is all
> your
> > movies and VM files could move off your workstation and the
> workstation
> > would still host the VMs via a mount of the NAS box.
>
> +1 for the Synology NAS boxes, those things are awesome, fast,
> reliable, upgradable (if you buy a larger one), and the best value
> available for iSCSI attached VMs.

while i agree on the +1 for iscsi storage, there are a few drawbacks.
yes the modularity is awesome primarily -- super simple to spin up a
backup system and "move" data with a simple connection command.
also a top tip would be to have the "data" part of the vm as an iscsi
connection too, so you can easily detach/reattach it to another vm.

however, depending on the vm's you have you will probably start needing
to use more than one gigabit connection to max out speeds: 1gigabit
ethernet is not the same as 6gigabit sata3, and spinning rust is not the
same as ssd.

looking at the spec of the existing workstation, i'd be tempted to stay
with mdadm rather than a hardware raid card (which is probably running
embedded anyway) - though with that i7, you have disabled turboboost, right?

what would be an interesting comparison is pci-express speed vs
motherboard sata - cpu bridge speed. obviously spinning disks will not
max out 6gbit, and the motherboard may not give you 6x 6gbit of real
throughput, whereas dedicated hardware raid _might_ do if it had
intelligent caching.

other fun to look at would be lvm cos i personally think it's awesome.
for example, the first half of a spinning disk is substantially faster
than the second half due to the tracks on the outer part, so i split
each disk into three partitions (fast, med, slow) and add them to an lvm
volume group; you can then group the fasts into a raid, the mediums into
a raid and the slows into a raid too. mdadm allows similar configs with
partitions.
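
a rough sketch of that, with made-up device names and just two disks:

  mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1  # outer "fast" partitions
  mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2  # "med"
  mdadm --create /dev/md12 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3  # inner "slow"
  pvcreate /dev/md10 /dev/md11 /dev/md12
  vgcreate vg0 /dev/md10 /dev/md11 /dev/md12
  lvcreate -n fastdata -L 100G vg0 /dev/md10   # pin a volume to the fast array explicitly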

ZFS for me lost its lustre when the minimum requirement was 1GB RAM per
terabyte... i may have my gigabytes and gigabits mixed up on this one,
happy for someone to correct me. BTRFS looks very very interesting to
me, though i've still not played with it - mostly for the checksums, the
rest i can do with lvm.

you might also like to consider fun with deduplication, by having a raid
base, with lvm on top with block level dedupe a la lessfs, then lvm
inside the deduped lvm (yeah i know i'm sick, but the doctor tells me
the layers of abstraction eventually combine happily :) but i'm not sure
you'll get much benefit from virtual machines and movies being deduped.

if you add an ssd into the mix you can also look at ssd caching layers
such as bcache and dm-cache, or even just moving the journal of your
ext4 partition there instead.
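
moving the journal would be roughly this (device names made up, the
filesystem has to be unmounted, and journal/fs block sizes must match):

  mke2fs -O journal_dev /dev/sdf1          # turn a small ssd partition into an external journal
  tune2fs -O ^has_journal /dev/md0         # drop the internal journal from the data filesystem
  tune2fs -j -J device=/dev/sdf1 /dev/md0  # re-add journalling, pointing at the ssd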

crucially you need to think about which issues you _need_ to solve
and which you would just like to solve. space is obviously one issue,
and performance is not really an issue for you. depending on your budget
a pair of large sata drives + mdadm will be ideal; if you had lvm
already you could simply 'move' then 'enlarge' your existing stuff (tm)
- i'd like to know how btrfs would do the same, for anyone who can let
me know.
you have raid6 because you probably know that raid5 is just waiting for
trouble, so i'd probably start looking at btrfs for your financial data
to be checksummed. also consider ECC memory if your motherboard
supports it - never mind the hosing of filesystems, if you are running
vm's you do _not_ want memory making them behave oddly or worse, and if
you have lots of active financial data (bloomberg + analytics) you run
the risk of the butterfly effect producing odd results.
Re: Soliciting new RAID ideas
Am Tue, 27 May 2014 15:39:38 -0700
schrieb Bob Sanders <rsanders@sgi.com>:

> Mark Knecht, mused, then expounded:
[...]
> > Beyond this I need to talk file system types. I'm fat dumb and
> > happy with Ext4 and don't really relish dealing with new stuff but
> > now's the time to at least look.
> >
>
> If you change, do not use ZFS and possibly BTRFS if the system does not
> have ECC DRAM. A single, unnoticed, ECC error can corrupt the data pool
> and be written to the file system, which effectively renders it corrupt
> without a way to recover.
[...]

As someone who recently switched an mdraid to BTRFS (with / on EXT4 on an
SSD, which will be migrated at a later point, once I feel more at ease with
BTRFS), I was curious about this, so I googled it. I found two threads, [0]
and [3], which dispute (and most likely refute) this notion that BTRFS is more
susceptible to memory errors than other file systems.

While I am far from a filesystem/storage expert (I see myself as a mere user),
the cited threads lead me to believe that this is most likely an
overhyped/misunderstood class of errors (e.g., posts [1] and [2]), so I would
suggest reading them in their entirety.

[0] http://comments.gmane.org/gmane.comp.file-systems.btrfs/31832
[1] http://permalink.gmane.org/gmane.comp.file-systems.btrfs/31871
[2] http://permalink.gmane.org/gmane.comp.file-systems.btrfs/31877
[3] http://comments.gmane.org/gmane.comp.file-systems.btrfs/31821

HTH
--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup
Re: Soliciting new RAID ideas
On Tue, May 27, 2014 at 7:38 PM, <thegeezer@thegeezer.net> wrote:
> if you had lvm already you could
> simply 'move' then 'enlarge' your existing stuff (tm)

Yup - if you're not running btrfs/zfs you probably should be running
lvm. One thing I would do is back up your lvm metadata when it changes
- I once got burned by an lvm error of some kind and an fsck scrambled
the living daylights out of my disk (an fsck on one ext3 partition
scrambled a different partition). That is pretty rare though (but I
did find one or two mentions online of similar situations).
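
The backup itself is trivial - something like this, where "vg0" stands in
for whatever your volume group is actually called:

  vgcfgbackup vg0                            # writes the metadata to /etc/lvm/backup/vg0
  cp /etc/lvm/backup/vg0 /some/safe/place/   # keep a copy somewhere off the array too

vgcfgrestore can put it back if the on-disk metadata ever gets mangled.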

> : i'd like to know how
> btrfs would do the same for anyone who can let me know.

A btrfs filesystem pools storage. You can add devices to the pool,
and remove devices from the pool. If you remove a device with data on
it, the data will get moved. When adding devices btrfs does not
automatically shuffle data around - you can issue a balance command to
do so, but I wouldn't do this until you're done adding/removing
drives.

A nice thing about btrfs is that devices do not have to be of the same
size and it generally does the right thing.
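
The basic operations look like this (mountpoint and device names here are
just placeholders):

  btrfs device add /dev/sdd /mnt/pool      # grow the pool with another disk
  btrfs balance start /mnt/pool            # optionally spread existing data across all devices
  btrfs device delete /dev/sdb /mnt/pool   # shrink - data on sdb gets migrated off first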

The downside of btrfs right now for raid is that raid5/6 are still
very experimental. They will support reshaping though, which is one
of the reasons I've stayed away from zfs. Zfs also lets you
add/remove devices from a pool, but it does not allow you to reshape a
raid.

Rich
Re: Soliciting new RAID ideas
Marc Joliet, mused, then expounded:
> Am Tue, 27 May 2014 15:39:38 -0700
> schrieb Bob Sanders <rsanders@sgi.com>:
>
> While I am far from a filesystem/storage expert (I see myself as a mere user),
> the cited threads lead me to believe that this is most likely an
> overhyped/misunderstood class of errors (e.g., posts [1] and [2]), so I would
> suggest reading them in their entirety.
>
> [0] http://comments.gmane.org/gmane.comp.file-systems.btrfs/31832
> [1] http://permalink.gmane.org/gmane.comp.file-systems.btrfs/31871
> [2] http://permalink.gmane.org/gmane.comp.file-systems.btrfs/31877
> [3] http://comments.gmane.org/gmane.comp.file-systems.btrfs/31821
>

FWIW - here's the FreeNAS ZFS ECC discussion on what happens with a bad
memory bit and no ECC memory:

http://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/


Thanks Mark! Interesting discussion on btrfs.

Bob

> HTH
> --
> Marc Joliet
> --
> "People who think they know everything really annoy those of us who know we
> don't" - Bjarne Stroustrup



--
-
Re: Soliciting new RAID ideas
Bob Sanders, mused, then expounded:
>
> Marc Joliet, mused, then expounded:
> > Am Tue, 27 May 2014 15:39:38 -0700
> > schrieb Bob Sanders <rsanders@sgi.com>:
> >
> > While I am far from a filesystem/storage expert (I see myself as a mere user),
> > the cited threads lead me to believe that this is most likely an
> > overhyped/misunderstood class of errors (e.g., posts [1] and [2]), so I would
> > suggest reading them in their entirety.
> >
> > [0] http://comments.gmane.org/gmane.comp.file-systems.btrfs/31832
> > [1] http://permalink.gmane.org/gmane.comp.file-systems.btrfs/31871
> > [2] http://permalink.gmane.org/gmane.comp.file-systems.btrfs/31877
> > [3] http://comments.gmane.org/gmane.comp.file-systems.btrfs/31821
> >
>
> FWIW - here's the FreeNAS ZFS ECC discussion on what happens with a bad
> memory bit and no ECC memory:
>
> http://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/
>
>
> Thanks Mark! Interesting discussion on btrfs.
>

Apologies - that should have been - Thanks Marc!


> Bob
>
> > HTH
> > --
> > Marc Joliet
> > --
> > "People who think they know everything really annoy those of us who know we
> > don't" - Bjarne Stroustrup
>
>
>
> --
> -
>
>

--
-
Re: Soliciting new RAID ideas
On Wed, May 28, 2014 at 11:26 AM, Bob Sanders <rsanders@sgi.com> wrote:
> Marc Joliet, mused, then expounded:
>> Am Tue, 27 May 2014 15:39:38 -0700
>> schrieb Bob Sanders <rsanders@sgi.com>:
>>
>> While I am far from a filesystem/storage expert (I see myself as a mere user),
>> the cited threads lead me to believe that this is most likely an
>> overhyped/misunderstood class of errors (e.g., posts [1] and [2]), so I would
>> suggest reading them in their entirety.
>>
>> [0] http://comments.gmane.org/gmane.comp.file-systems.btrfs/31832
>> [1] http://permalink.gmane.org/gmane.comp.file-systems.btrfs/31871
>> [2] http://permalink.gmane.org/gmane.comp.file-systems.btrfs/31877
>> [3] http://comments.gmane.org/gmane.comp.file-systems.btrfs/31821
>>
>
> FWIW - here's the FreeNAS ZFS ECC discussion on what happens with a bad
> memory bit and no ECC memory:
>
> http://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/
>

I don't think that anybody debates that if you use btrfs/zfs with
non-ECC RAM you can potentially lose some of the protection afforded
by the checksumming.

What I'd question is that this is some concern unique to btrfs/zfs.
I'd think the same failure modes would all apply to any other
filesystem.

So, the message should be that ECC RAM is better than non-ECC RAM, not
that those who use non-ECC RAM are better off using ext4 instead of
zfs/btrfs. I'd think that any RAM-related issue that would impact
zfs/btrfs would affect ext4 just as badly, and with ext4 you're also
vulnerable to all the non-RAM-related errors that checksumming was
created to solve.

If your RAM is bad then all kinds of stuff can go wrong. Ditto for
your cache memory in the CPU, logic circuitry in the CPU, your busses,
etc. Most systems are not fault-tolerant of these system components
and the cost to make them fault-tolerant tends to be fairly high. On
the other hand, the good news is that you're far more likely to have
problems with data stored on a disk than in RAM, which is probably why
we haven't bothered to improve the other components.

Rich
Re: Soliciting new RAID ideas
Am Wed, 28 May 2014 08:26:58 -0700
schrieb Bob Sanders <rsanders@sgi.com>:

>
> Marc Joliet, mused, then expounded:
> > Am Tue, 27 May 2014 15:39:38 -0700
> > schrieb Bob Sanders <rsanders@sgi.com>:
> >
> > While I am far from a filesystem/storage expert (I see myself as a mere user),
> > the cited threads lead me to believe that this is most likely an
> > overhyped/misunderstood class of errors (e.g., posts [1] and [2]), so I would
> > suggest reading them in their entirety.
> >
> > [0] http://comments.gmane.org/gmane.comp.file-systems.btrfs/31832
> > [1] http://permalink.gmane.org/gmane.comp.file-systems.btrfs/31871
> > [2] http://permalink.gmane.org/gmane.comp.file-systems.btrfs/31877
> > [3] http://comments.gmane.org/gmane.comp.file-systems.btrfs/31821
> >
>
> FWIW - here's the FreeNAS ZFS ECC discussion on what happens with a bad
> memory bit and no ECC memory:
>
> http://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/

Thanks for explicitly linking that. I didn't read it the first time around,
but just read through most of it, then reread the threads [0] and [3] above and
*think* that I understand the problem (and how it doesn't apply to BTRFS)
better now.

IIUC, the claim is: data is written to disk, but it must go through the RAM
first, obviously, where it is corrupted (due to a permanent bit flip caused,
e.g., by deteriorating hardware). At some later point, when the data is read
back from disk, it might happen to be loaded at the damaged location in RAM,
where it is further corrupted. At this point the checksum fails, and ZFS
corrects the data in RAM (using parity information!), where it is immediately
corrupted again (because apparently it is corrected at the same physical
location in RAM? perhaps this is specific to correction via parity?). This
*additionally* corrupted data is then written back to disk (without any further
checks).

So the point is that, apparently, without ECC RAM, you could get a (long-term)
cascade of errors, especially during a scrub. The likelihood of such permanent
RAM corruption happening in the first place is another question entirely.

The various posts in [0] then basically say that regardless of whether this
really is true of ZFS, it certainly doesn't apply to BTRFS, for various
reasons. I suppose this quote from [1] (see above) says it most clearly:

> In hxxp://forums.freenas.org/threads/ecc-vs-non-ecc-ram-and-zfs.15449, they talk about
> reconstructing corrupted data from parity information:
>
> > Ok, no problem. ZFS will check against its parity. Oops, the parity failed since we have a new corrupted
> bit. Remember, the checksum data was calculated after the corruption from the first memory error
> occurred. So now the parity data is used to "repair" the bad data. So the data is "fixed" in RAM.
>
> i.e. that there is parity information stored with every piece of data, and ZFS will "correct" errors
> automatically from the parity information. I start to suspect that there is confusion here between
> checksumming for data integrity and parity information. If this is really how ZFS works, then if memory
> corruption interferes with this process, then I can see how a scrub could be devastating. I don't know if
> ZFS really works like this. It sounds very odd to do this without an additional checksum check. This sounds
> very different to what you say below that btrfs does, which is only to check against redundantly-stored
> copies, which I agree sounds much safer.

The rest is also relevant, but I think the point that the data is corrected via
parity information, as opposed to using a known-good redundant copy of the data
(which I originally missed, and thus got confused), is the key point in
understanding the (supposed) difference in behaviour between ZFS and BTRFS.

All this assumes, of course, that the FreeNAS forum post that ignited this
discussion is correct in the first place.

> Thanks Mark! Interesting discussion on btrfs.
>
> Bob

You're welcome! I agree, it's an interesting discussion. And regarding the
misspelling of my name: no problem :-) .

--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup
Re: Soliciting new RAID ideas
Marc Joliet, mused, then expounded:
> Am Wed, 28 May 2014 08:26:58 -0700
> schrieb Bob Sanders <rsanders@sgi.com>:
>
> >
> > Marc Joliet, mused, then expounded:
> > > Am Tue, 27 May 2014 15:39:38 -0700
> > > schrieb Bob Sanders <rsanders@sgi.com>:
> > >
> > > While I am far from a filesystem/storage expert (I see myself as a mere user),
> > > the cited threads lead me to believe that this is most likely an
> > > overhyped/misunderstood class of errors (e.g., posts [1] and [2]), so I would
> > > suggest reading them in their entirety.
> > >
> > > [0] http://comments.gmane.org/gmane.comp.file-systems.btrfs/31832
> > > [1] http://permalink.gmane.org/gmane.comp.file-systems.btrfs/31871
> > > [2] http://permalink.gmane.org/gmane.comp.file-systems.btrfs/31877
> > > [3] http://comments.gmane.org/gmane.comp.file-systems.btrfs/31821
> > >
> >
> > FWIW - here's the FreeNAS ZFS ECC discussion on what happens with a bad
> > memory bit and no ECC memory:
> >

Just to beat this dead horse some more, here's an analysis of an academic
study on drive failures -

http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/

And it links to the actual study here -

https://www.usenix.org/legacy/events/fast07/tech/schroeder.html

Which shows that memory has a fairly high failure rate as well, though
the focus is on hard drives.

> > http://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/
>
> Thanks for explicitly linking that. I didn't read it the first time around,
> but just read through most of it, then reread the threads [0] and [3] above and
> *think* that I understand the problem (and how it doesn't apply to BTRFS)
> better now.
>
> IIUC, the claim is: data is written to disk, but it must go through the RAM
> first, obviously, where it is corrupted (due to a permanent bit flip caused,
> e.g., by deteriorating hardware). At some later point, when the data is read
> back from disk, it might happen to load around the damaged location in RAM,
> where it is further corrupted. At this point the checksum fails, and ZFS
> corrects the data in RAM (using parity information!), where it is immediately
> corrupted again (because apparently it is corrected at the same physical
> location in RAM? perhaps this is specific to correction via parity?). This
> *additionally* corrupted data is then written back to disk (without any further
> checks).
>
> So the point is that, apparently, without ECC RAM, you could get a (long-term)
> cascade of errors, especially during a scrub. The likelihood of such permanent
> RAM corruption happening in the first place is another question entirely.
>
> The various posts in [0] then basically say that regardless of whether this
> really is true of ZFS, it certainly doesn't apply to BTRFS, for various
> reasons. I suppose this quote from [1] (see above) says it most clearly:
>
> > In hxxp://forums.freenas.org/threads/ecc-vs-non-ecc-ram-and-zfs.15449, they talk about
> > reconstructing corrupted data from parity information:
> >
> > > Ok, no problem. ZFS will check against its parity. Oops, the parity failed since we have a new corrupted
> > bit. Remember, the checksum data was calculated after the corruption from the first memory error
> > occurred. So now the parity data is used to "repair" the bad data. So the data is "fixed" in RAM.
> >
> > i.e. that there is parity information stored with every piece of data, and ZFS will "correct" errors
> > automatically from the parity information. I start to suspect that there is confusion here between
> > checksumming for data integrity and parity information. If this is really how ZFS works, then if memory
> > corruption interferes with this process, then I can see how a scrub could be devastating. I don't know if
> > ZFS really works like this. It sounds very odd to do this without an additional checksum check. This sounds
> > very different to what you say below that btrfs does, which is only to check against redundantly-stored
> > copies, which I agree sounds much safer.
>
> The rest is also relevant, but I think the point that the data is corrected via
> parity information, as opposed to using a known-good redundant copy of the data
> (which I originally missed, and thus got confused), is the key point in
> understanding the (supposed) difference in behaviour between ZFS and BTRFS.
>
> All this assumes, of course, that the FreeNAS forum post that ignited this
> discussion is correct in the first place.
>
> > Thanks Mark! Interesting discussion on btrfs.
> >
> > Bob
>
> You're welcome! I agree, it's an interesting discussion. And regarding the
> misspelling of my name: no problem :-) .
>
> --
> Marc Joliet
> --
> "People who think they know everything really annoy those of us who know we
> don't" - Bjarne Stroustrup



--
-
Re: Soliciting new RAID ideas
Marc Joliet posted on Wed, 28 May 2014 21:20:18 +0200 as excerpted:

> Am Wed, 28 May 2014 08:26:58 -0700 schrieb Bob Sanders
> <rsanders@sgi.com>:
>
>> Marc Joliet, mused, then expounded: [snipped]
>
>> Thanks Mark! Interesting discussion on btrfs.
>>
>> [followup] Apologies - that should have been - Thanks Marc!
>
> You're welcome! I agree, it's an interesting discussion. And regarding
> the misspelling of my name: no problem :-) .

=:^)

But seriously, thanks Bob for pointing out the misspelling.

There's a Mark (with a k) that's quite active on the btrfs list (and has
in fact done quite a bit of testing on the raid56 stuff, and written most
of several related pages on the btrfs wiki), and I guess my brain has so
associated him with the btrfs discussion context that without actually
thinking about it, I was thinking this was the same "Mark" here.

So pointing out that it's actually Marc-with-a-c here actually alerted me
to the fact that it's not the same person, and very possibly saved a very
confused Duncan from making quite a fool of himself in some future post
either here or there as a result!

So thanks VERY MUCH, Bob! =:^)

(FWIW, my first name is John. But at least in my generation there's so
many Johns around, and Duncan as a last name isn't uncommon either, that
in fact there are quite a few John Duncans around too, and it's all
horribly confusing. I even worked with a Donna at one point, and in a
fairly noisy environment all you hear for either is the ON bit, so we
were always either both or neither answering to calls for either one of
us, since neither could easily hear which one they actually called. So I
switched to the mononym "Duncan". That has been MUCH less confusing over
the decades I've been using it, now. Anyway, I can definitely identify
with first-name confusion. =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman