Mailing List Archive

Am 20.06.2013 21:10, schrieb Mark Knecht:
> Hi,
> Does anyone know of info on how the starting sector number might
> impact RAID performance under Gentoo? The drives are WD-500G RE3
> drives shown here:
>
> http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/B001EMZPD0/ref=cm_cr_pr_product_top
>
> These are NOT 4k sector sized drives.
>
> Specifically I'm a 5-drive RAID6 for about 1.45TB of storage. My
> benchmarking seems abysmal at around 40MB/S using dd copying large
> files. It's higher, around 80MB/S if the file being transferred is
> coming from an SSD, but even 80MB/S seems slow to me. I see a LOT of
> wait time in top. And my 'large file' copies might not be large enough
> as the machine has 24GB of DRAM and I've only been copying 21GB so
> it's possible some of that is cached.
>
> Then I looked again at how I partitioned the drives originally and
> see the starting sector of sector 3 as 8594775. I started wondering if
> something like 4K block sizes at the file system level might be
> getting munged across 16k chunk sizes in the RAID. Maybe the blocks
> are being torn apart in bad ways for performance? That led me down a
> bunch of rabbit holes and I haven't found any light yet.
>
> Looking for some thoughtful ideas from those more experienced in this area.
>
> Cheers,
> Mark
>
>

man mkfs.xfs

man mkfs.ext4

look for stripe size etc.

Have fun.

Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 20, 2013, 12:27 PM

Post #3 of 46 (9725 views)

On Thu, Jun 20, 2013 at 3:10 PM, Mark Knecht <markknecht@gmail.com> wrote:
> Looking for some thoughtful ideas from those more experienced in this area.

Please do share your findings. I suspect my own RAID+LVM+EXT3/4
system is not optimized - especially with LVM I have no idea how
blocks in ext3/4 end up mapping to stripes and physical blocks. Oh,
and this is on 4k disks.

Honestly, this is one of the reasons I REALLY want to move to btrfs
when it fully supports raid5. Right now the various layers don't talk
to each other and that means a lot of micro-management if you don't
want a lot of read-write-read cycles (to say nothing of what you can
buy with a filesystem that can aim to overwrite entire stripes at a
time).

Rich

Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 20, 2013, 12:28 PM

Post #4 of 46 (9731 views)

On Thu, Jun 20, 2013 at 12:16 PM, Volker Armin Hemmann
<volkerarmin@googlemail.com> wrote:
<SNIP>
>> Looking for some thoughtful ideas from those more experienced in this area.
>>
>> Cheers,
>> Mark
>>
>>
>
> man mkfs.xfs
>
> man mkfs.ext4
>
> look for stripe size etc.
>
> Have fun.
>

I am probably mistaken but I thought that stuff was for hardware RAID
and that for mdadm type software RAID it was handled by mdadm?

I certainly don't remember any of the Linux software RAID pages I've
read about setting up RAID suggesting that these options are
important, but I'll go look around.

Thanks!

Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 20, 2013, 12:31 PM

Post #5 of 46 (9720 views)

On Thu, Jun 20, 2013 at 12:27 PM, Rich Freeman <rich0@gentoo.org> wrote:
> On Thu, Jun 20, 2013 at 3:10 PM, Mark Knecht <markknecht@gmail.com> wrote:
>> Looking for some thoughtful ideas from those more experienced in this area.
>
> Please do share your findings. I suspect my own RAID+LVM+EXT3/4
> system is not optimized - especially with LVM I have no idea how
> blocks in ext3/4 end up mapping to stripes and physical blocks. Oh,
> and this is on 4k disks.
>
> Honestly, this is one of the reasons I REALLY want to move to btrfs
> when it fully supports raid5. Right now the various layers don't talk
> to each other and that means a lot of micro-management if you don't
> want a lot of read-write-read cycles (to say nothing of what you can
> buy with a filesystem that can aim to overwrite entire stripes at a
> time).
>
> Rich
>

I'll share everything I find, true or false, and maybe as a group we
can figure out what's right.

In the meantime, please be careful with your RAID5 and do good backups
:-) I ran RAID5 for awhile but moved to RAID6 due to the number of
reports I read where one drive went bad on a RAID5 and then the RAID
lost a second drive before the original bad drive was replaced and
everything was gone.

Cheers,
Mark

Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 20, 2013, 1:45 PM

Post #6 of 46 (9715 views)

On Thu, Jun 20, 2013 at 12:16 PM, Volker Armin Hemmann
<volkerarmin@googlemail.com> wrote:
<SNIP>
> man mkfs.xfs
>
> man mkfs.ext4
>
> look for stripe size etc.
>
> Have fun.
>
Volker,
I find way down at the bottom of the RAID setup page that they do
say stride & stripe are important for RAID4 & RAID5, but remain
non-committal for RAID6. None the less thanks for the idea. Now I
guess I have to figure out how to test it in less than 10 weeks. I
think I'm in trouble at this point having only 1 file system. Possibly
it would be better to have a second just to be able to change settings
with tune2fs and being able to do it quickly.

None the less thanks.

- Mark

Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 21, 2013, 12:31 AM

Post #7 of 46 (9715 views)

Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted:

> Does anyone know of info on how the starting sector number might
> impact RAID performance under Gentoo? The drives are WD-500G RE3 drives
> shown here:
>
> http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/
B001EMZPD0/ref=cm_cr_pr_product_top
>
> These are NOT 4k sector sized drives.
>
> Specifically I'm a 5-drive RAID6 for about 1.45TB of storage. My
> benchmarking seems abysmal at around 40MB/S using dd copying large
> files.
> It's higher, around 80MB/S if the file being transferred is coming from
> an SSD, but even 80MB/S seems slow to me. I see a LOT of wait time in
> top.
> And my 'large file' copies might not be large enough as the machine has
> 24GB of DRAM and I've only been copying 21GB so it's possible some of
> that is cached.

I /suspect/ that the problem isn't striping, tho that can be a factor,
but rather, your choice of raid6. Note that I personally ran md/raid-6
here for awhile, so I know a bit of what I'm talking about. I didn't
realize the full implications of what I was setting up originally, or I'd
have not chosen raid6 in the first place, but live and learn as they say,
and that I did.

General rule, raid6 is abysmal for writing and gets dramatically worse as
fragmentation sets in, tho reading is reasonable. The reason is that in
ordered to properly parity-check and write out less-than-full-stripe
writes, the system must effectively read-in the existing data and merge
it with the new data, then recalculate the parity, before writing the new
data AND 100% of the (two-way in raid-6) parity. Further, because raid
sits below the filesystem level, it knows nothing about what parts of the
filesystem are actually used, and must read and write the FULL data
stripe (perhaps minus the new data bit, I'm not sure), including parts
that will be empty on a freshly formatted filesystem.

So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k
in data across three devices, and 8k of parity across the other two
devices. Now you go to write a 1k file, but in ordered to do so the full
12k of existing data must be read in, even on an empty filesystem,
because the RAID doesn't know it's empty! Then the new data must be
merged in and new checksums created, then the full 20k must be written
back out, certainly the 8k of parity, but also likely the full 12k of
data even if most of it is simply rewrite, but almost certainly at least
the 4k strip on the device the new data is written to.

As I said that gets much worse as a filesystem ages, due to fragmentation
meaning writes are more often writes to say 3 stripe fragments instead of
a single whole stripe. That's what proper stride size, etc, can help
with, if the filesystem's reasonably fragmentation resistant, but even
then filesystem aging certainly won't /help/.

Reads, meanwhile, are reasonable speed (in normal non-degraded mode),
because on a raid6 the data is at least two-way striped (on a 4-device
raid, your 5-device would be three-way striped data, the other two being
parity of course), so you do get moderate striping read bonuses.

Then there's all that parity information available and written out at
every write, but it's not actually used to check the reliability of the
data in normal operation, only to reconstruct if a device or two goes
missing.

On a well laid out system, I/O to the separate drives at least shouldn't
interfere with each other, assuming SATA and a chipset and bus layout
that can handle them in parallel, not /that/ big a feat on today's
hardware at least as long as you're still doing "spinning rust", as the
mechanical drive latency is almost certainly the bottleneck there, and at
least that can be parallelized to a reasonable degree across the
individual drives.

What I ultimately came to realize here is that unless the job at hand is
nearly 100% read on the raid, with the caveat that you have enough space,
raid1 is almost certainly at least as good if not a better choice. If
you have the devices to support it, you can go for raid10/50/60, and a
raid10 across 5 devices is certainly possible with mdraid, but a straight
raid-6... you're generally better off with an N-way raid-1, for a couple
reasons.

First, md/raid1 is surprisingly, even astoundingly, good at multi-task
scheduling reads. So any time there's multiple I/O read tasks going on
(like during boot), raid1 works really well, with the scheduler
distributing tasks among the available devices, this minimizing seek-
latency. So take a 5-device raid-1, you can very likely accomplish at
least 5 and possibly 6 or even 7 read jobs in say 110 or 120% of the time
it would take to do just the longest one on a single device, almost
certainly well before a single device could have done the two longest
read jobs. This also works if there's a single task alternating reads of
N different files/directories, since the scheduler will again distribute
jobs among the devices, so say one device head stays over the directory
information, while another goes to read the first file, the second reads
another file, etc, and the heads stay where they are until they're needed
elsewhere so the more devices in raid1 you have the more likely it is
that more data read from the same location still has a head located right
over it and can just read it as the correct portion of the disk spins
underneath, instead of first seeking to the correct spot on the disk.

It's worth pointing out that in the case of parallel job read access, due
to this parallel read-scheduling md/raid1 can often best raid0
performance, despite raid0's technically better single-job thruput
numbers. This was something I learned by experience as well, that makes
sense but that I had TOTALLY not realized or calculated for in my
original setup, as I was running raid0 for things like the gentoo ebuild
tree and the kernel sources, since I didn't need redundancy for them. My
raid0 performance there was rather disappointing, because both portage
tree updates and dep calculation and the kernel build process don't
optimize well for thruput, which is what raid0 does, but optimize rather
better for parallel I/O, where raid1 shines especially for read.

Second, md/raid1 writes, because they happen in parallel with the
bottleneck being the spinning rust, basically occur at the speed of the
slowest disk. So you don't get N-way parallel job write speed, just
single disk speed, but it's still *WAY* *WAY* better than raid6, which
has to read in the existing data and do the merge before it can write
back out. **THAT'S THE RAID6 PERFORMANCE KILLER**, or at least it was
for me, effectively giving you half-device speed writes because the data
in too many cases must be read in first before it can be written. Raid1
doesn't have that problem -- it doesn't get a write performance
multiplier from the N devices, but at least it doesn't get device
performance cut in half like raid5/6 does.

Third, the read-scheduling benefits of #1 help to a lessor extent with
large same-raid1 copies as well. Consider, the first block must be read
by one device, then written to all at the new location. The second
similarly, then the third, etc. But, with proper scheduling an N-way
raid1 doing an N-block copy has done N+1 operations on all devices at the
end of that N-block copy. IOW, given the memory to use as a buffer, the
read can be done in parallel, reading N blocks in at once, one from each
device, then the writes, one block at a time to all devices. So a 5-way
raid1 will have done 6 jobs on each of the 5 devices at the end, 1 read
and 5 writes, to write out 5 blocks. (In actuality due to read-ahead I
think it's optimally 64-k blocks per device, 16 4-k blocks on each, 320k
total, but that's well within the usual minimal 2MB drive buffer size,
and the drive will probably do that on its own if both read and write-
caching are on, given scheduling that forces a cache-flush only at the
end, not multiple times in the middle. So all the kernel has to do is be
sure it's not interfering by forcing untimely flushes, and the drives
should optimize on their own.)

Forth, back to the parity. Remember, raid5/6 has all that parity that it
writes out (but basically never reads in normal mode, only when degraded,
in ordered to reconstruct the data from the missing device(s)), but
doesn't actually use it for integrity checking. So while raid1 doesn't
have the benefit of that parity data, it's not like raid5/6 used it
anyway, and an N-way raid1 means even MORE missing-device protection
since you can lose all but one device and keep right on going as if
nothing happened. So a 5-way raid1 can lose 4 devices, not just the two
devices of a 5-way raid6 or the single device of a raid5. Yes, there's
the loss of parity/integrity data with raid1, BUT RAID5/6 DOESN'T USE
THAT DATA FOR INTEGRITY CHECKING ANYWAY, ONLY FOR RECONSTRUCTION IN THE
CASE OF DEVICE LOSS! So the N-way raid1 is far more redundant since you
have N copies of the data, not one copy plus two-way-parity-that's-never-
used-except-for-reconstruction.

Fifth, in the event of device loss, a raid1 continues to function at
normal speed, because it's simply an N-way copy with a bit of extra
metadata to keep track of the number of N-ways. Of course you'll lose
the benefit of read-parallelization that the missing device provided, and
you'll lose the redundancy of the missing device, but in general,
performance remains pretty much the same no matter how many ways it's
raid-1 mirrored. Contrast that with raid5/6 which is SEVERELY read
performance impacted by device loss, since it then must reconstruct the
data using the parity data, not simply read it from somewhere else, which
is what raid1 does.

The single down side to raid1 as opposed to raid5/6 is the loss of the
extra space made available by the data striping, 3*single-device-space in
the case of 5-way raid6 (or 4-way raid5) vs. 1*single-device-space in the
case of raid1. Otherwise, no contest, hands down, raid1 over raid6.

IOW, you're seeing now exactly why raid6 and to a lessor extent raid5
have such terrible performance (as opposed to reliability) reputations.
Really, unless you simply don't have the space to make it raid1, I
**STRONGLY** urge you to try that instead. I know I was very happily
surprised by the results I got, and only then realized what all the
negativity I'd seen around raid5/6 had been about, as I really hadn't
understood it at all when I was doing my original research.

Meanwhile, Rich0 already brought up btrfs, which really does promise a
better solution to many of these issues than md/raid, in part due to that
is arguably "layering violation", but really DOES allow for some serious
optimizations in the multiple-drive case, because as a filesystem, it
DOES know what's real data and what's empty space that isn't worth
worrying about, and because unlike raid5/6 parity, it really DOES care
about data integrity, not just rebuilding in case of device failure.

So several points on btrfs:

1) It's still in heavy development. The base single-device filesystem
case works reasonably well now and is /almost/ stable, tho I'd still urge
people to keep good backups as it's simply not time tested and settled,
and won't be for at least a few more kernels as they're still busy with
the other features. Second-level raid0/raid1/raid10 is at an
intermediate level. Primary development and initial bug testing and
fixing is done but they're still working on bugs that people doing only
traditional single-device filesystems simply don't have to worry about.
Third-round raid5/6 is still very new, introduced as VERY experimental
only with 3.9 IIRC, and is currently EXPECTED to eat data in power-loss
or crash events, so it's ONLY good for preliminary testing at this point.

Thus, if you're using btrfs at all, keep good backups, and keep current,
even -rc (if not live-git) on the kernel, because there really are fixes
in every single kernel for very real corner-case problems they are still
coming across. But single-device is /relatively/ stable now, so provided
you keep good *TESTED* backups and are willing and able to use them if it
comes to it, and keep current on the kernel, go for that. And I'm
personally running dual-device raid-1 mode across two SSDs, at the second
stage deployment level. I tried that (but still on spinning rust) a year
ago and decided btrfs simply wasn't ready for me yet, so it has come
quite a way in the last year. But raid5/6 mode is still fresh third-tier
development, which I'd not consider usable until at LEAST 3.11 and
probably 3.12 or later (maybe a year from now, since it's less mature
than raid1 was at this point last year, but should mature a bit faster).

Takeaway: If you don't have a backup you're prepared to use, you
shouldn't be even THINKING about btrfs at this point, no matter WHAT type
of deployment you're considering. If you do, you're probably reasonably
safe with traditional single-device btrfs, intermediately risky/safe with
raid0/1/10, don't even think about raid5/6 for real deployment yet,
period.

2) RAID levels work QUITE a bit differently on btrfs. In particular,
what btrfs calls raid1 mode (with the same applying to raid10) is simply
two-way-mirroring, NO MATTER THE NUMBER OF DEVICES. There's no multi-way
mirroring yet available, unless you're willing to apply not-yet-
mainstreamed patches. It's planned, but not yet applied. The roadmap
says it'll happen after raid5/6 are introduced (they have been, but
aren't yet really finished including power-loss-recovery, etc), so I'm
guessing 3.12 at the earliest as I think 3.11 is still focused on raid5/6
completion.

3) Btrfs raid1 mode is used to provide second-source for its data
integrity feature as well, such that if one copy's checksum doesn't
verify, it'll try the other one. Unfortunately #2 means there's only the
single fallback to try, but that's better than most filesystems, without
data integrity at all, or if they have it, no fallback if it fails.

The combination of #2 and 3 was a bitter pill for me a year ago, when I
was still running on aging spinning rust, and thus didn't trust two-copy-
only redundancy. I really like the data integrity feature, but just a
single backup copy was a great disappointment since I didn't trust my old
hardware, and unfortunately two-copy-max remains the case for so-called
raid1. (Raid5/6 mode apparently introduces N-way copies or some such,
but as I said, it's not complete yet and is EXPECTED to eat data. N-way-
mirroring will build on that and is on the horizon, but it has been on
the horizon and not seeming to get much closer for over a year now...)
Fortunately for me, my budget is in far better shape this year, and with
the dual new SSDs I purchased and with spinning rust for backup still, I
trust my hardware enough now to run the 2-way-only mirroring that btrfs
calls raid1 mode.

4) As mentioned above in the btrfs intro paragraph, btrfs, being a
filesystem, actually knows what data is actual data, and what is safely
left untracked and thus unsynced. Thus, the read-data-in-before-writing-
it problem will be rather less, certainly on freshly formated disks where
most existing data WILL be garbage/zeros (trimmed if on SSD, as mkfs.btrfs
issues a trim command for the entire filesystem range before it creates
the superblocks, etc, so empty space really /is/ zeroed). Similarly with
"slack space" that's not currently used but was used previously, as the
filesystem ages -- btrfs knows that it can ignore that data too, and thus
won't have to read it in to update the checksum when writing to a raid5/6
mode btrfs.

5) There's various other nice btrfs features and a few caveats as well,
but with the exception of anything btrfs-raid pertaining I totally forgot
about, they're out of scope for this thread, which is after all, on raid,
so I'll skip discussing them here.

So bottom line, I really recommend md/raid1 for now. Unless you want to
go md/raid10, with three-way-mirroring on the raid1 side. AFAIK that's
doable with 5 devices, but it's simpler, certainly conceptually simpler
which can make a different to an admin trying to work with it, with 6.

If the data simply won't fit on the 5-way raid1 and you want to keep at
least 2-device-loss protection, consider splitting it up, raid1 with
three devices for the first half, then either get a sixth device to do
the same with the second half, or go raid1 with two devices and put your
less critical data on the second set. Or, do the raid10 with 5 devices
thing, but I'll admit that while I've read that it's possible, I don't
really conceptually understand it myself, and haven't tried it, so I have
no personal opinion or experience to offer on that. But in that case I
really would try to scrap up the money for a sixth device if possible,
and do raid10 with 3-way redundancy 2-way-striping across the six, simply
because it's easier to conceptualize and thus to properly administer.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Re: Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 21, 2013, 3:28 AM

Post #8 of 46 (9704 views)

On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k
> in data across three devices, and 8k of parity across the other two
> devices.

With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a
stripe, not 20k. If you modify one block it needs to read all 1.5M,
or it needs to read at least the old chunk on the single drive to be
modified and both old parity chunks (which on such a small array is 3
disks either way).

> Forth, back to the parity. Remember, raid5/6 has all that parity that it
> writes out (but basically never reads in normal mode, only when degraded,
> in ordered to reconstruct the data from the missing device(s)), but
> doesn't actually use it for integrity checking.

I wasn't aware of this - I can't believe it isn't even an option
either. Note to self - start doing weekly scrubs...

> The single down side to raid1 as opposed to raid5/6 is the loss of the
> extra space made available by the data striping, 3*single-device-space in
> the case of 5-way raid6 (or 4-way raid5) vs. 1*single-device-space in the
> case of raid1. Otherwise, no contest, hands down, raid1 over raid6.

This is a HUGE downside. The only downside to raid1 over not having
raid at all is that your disk space cost doubles. raid5/6 is
considerably cheaper in that regard. In a 5-disk raid5 the cost of
redundancy is only 25% more, vs a 100% additional cost for raid1. To
accomplish the same space as a 5-disk raid5 you'd need 8 disks. Sure,
read performance would be vastly superior, but if you're going to
spend $300 more on hard drives and whatever it takes to get so many
SATA ports on your system you could instead add an extra 32GB of RAM
or put your OS on a mirrored SSD. I suspect that both of those
options on a typical workload are going to make a far bigger
improvement in performance.

Which is better really depends on your workload. In my case much of
my raid space is used my mythtv, or for storage of stuff I only
occasionally use. In these use cases the performance of the raid5 is
more than adequate, and I'd rather be able to keep shows around for an
extra 6 months in HD than have the DVR respond a millisecond faster
when I hit play. If you really have sustained random access of the
bulk of your data than a raid1 would make much more sense.

> So several points on btrfs:
>
> 1) It's still in heavy development.

That is what is keeping me away. I won't touch it until I can use it
with raid5, and the first common containing that hit the kernel weeks
ago I think (and it has known gaps). Until it is stable I'm sticking
with my current setup.

> 2) RAID levels work QUITE a bit differently on btrfs. In particular,
> what btrfs calls raid1 mode (with the same applying to raid10) is simply
> two-way-mirroring, NO MATTER THE NUMBER OF DEVICES. There's no multi-way
> mirroring yet available

Odd, for some reason I thought it let you specify arbitrary numbers of
copies, but looking around I think you're right. It does store two
copies of metadata regardless of the number of drives unless you
override this.

However, if one considered raid1 expensive, having multiple layers of
redundancy is REALLY expensive if you aren't using Reed Solomon and
many data disks.

From my standpoint I don't think raid1 is the best use of money in
most cases, either for performance OR for data security. If you want
performance the money is probably better spent on other components.
If you want data security the money is probably better spent on
offline backups. However, this very-much depends on how the disks
will be used - there are certainly cases where raid1 is your best
option.

Rich

Re: Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

rsanders at sgi

Jun 21, 2013, 7:23 AM

Post #9 of 46 (9706 views)

Rich Freeman, mused, then expounded:
> On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>
> > The single down side to raid1 as opposed to raid5/6 is the loss of the
> > extra space made available by the data striping, 3*single-device-space in
> > the case of 5-way raid6 (or 4-way raid5) vs. 1*single-device-space in the
> > case of raid1. Otherwise, no contest, hands down, raid1 over raid6.
>
> This is a HUGE downside. The only downside to raid1 over not having
> raid at all is that your disk space cost doubles. raid5/6 is
> considerably cheaper in that regard. In a 5-disk raid5 the cost of
> redundancy is only 25% more, vs a 100% additional cost for raid1. To
> accomplish the same space as a 5-disk raid5 you'd need 8 disks. Sure,
> read performance would be vastly superior, but if you're going to
> spend $300 more on hard drives and whatever it takes to get so many
> SATA ports on your system you could instead add an extra 32GB of RAM
> or put your OS on a mirrored SSD. I suspect that both of those
> options on a typical workload are going to make a far bigger
> improvement in performance.
>

However, the incidence of failure is less with RAID1 than RAID5/6. As
the number of devices increases, the failure rate increases. Indeed,
the performance and total space can outweigh the increase in device
failure. However, more devices - especially more devices that have
motrs and bearings, takes more power, generates more heat, and increases
the need for more backups to avert an increase in failures.

Bob
--
-

Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 21, 2013, 7:27 AM

Post #10 of 46 (9719 views)

Rich Freeman posted on Fri, 21 Jun 2013 06:28:35 -0400 as excerpted:

> On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k
>> in data across three devices, and 8k of parity across the other two
>> devices.
>
> With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a
> stripe, not 20k. If you modify one block it needs to read all 1.5M, or
> it needs to read at least the old chunk on the single drive to be
> modified and both old parity chunks (which on such a small array is 3
> disks either way).

I'll admit to not fully understanding chunks/stripes/strides in terms of
actual size tho I believe you're correct, it's well over the filesystem
block size and a half-meg is probably right. However, the original post
went with a 4k blocksize, which is pretty standard as that's the usual
memory page size as well so it makes for a convenient filesystem blocksize
too, so that's what I was using as a base for my numbers. If it's 4k
blocksize, then 5-device raid6 stripe would be 3*4k=12k of data, plus
2*4k=8k of parity.

>> Forth, back to the parity. Remember, raid5/6 has all that parity that
>> it writes out (but basically never reads in normal mode, only when
>> degraded,
>> in ordered to reconstruct the data from the missing device(s)), but
>> doesn't actually use it for integrity checking.
>
> I wasn't aware of this - I can't believe it isn't even an option either.
> Note to self - start doing weekly scrubs...

Indeed. That's one of the things that frustrated me with mdraid -- all
that data integrity metadata there, but just going to waste in normal
operation, only used for device recovery.

Which itself can be a problem as well, because if there *IS* an
undetected cosmic-ray-error or whatever and a device goes out, that means
you'll lose integrity on a second device in the rebuild as well (if it
was a data device that dropped out and not parity anyway), because the
parity's screwed against the undetected error and will thus rebuild a bad
copy of the data on the replacement device.

And it's one of the things which so attracted me to btrfs, too, and why I
was so frustrated to see it could only be a single redundancy (two-way-
mirrored), no way to do more. The btrfs sales pitch talks about how
great data integrity and the ability to go find a good copy when the
data's bad, but what if the only allowed second copy is bad as well?
OOPS!

But as I said, N-way mirroring is on the btrfs roadmap, it's simply not
there yet.

>> The single down side to raid1 as opposed to raid5/6 is the loss of the
>> extra space made available by the data striping, 3*single-device-space
>> in the case of 5-way raid6 (or 4-way raid5) vs. 1*single-device-space
>> in the case of raid1. Otherwise, no contest, hands down, raid1 over
>> raid6.
>
> This is a HUGE downside. The only downside to raid1 over not having
> raid at all is that your disk space cost doubles. raid5/6 is
> considerably cheaper in that regard. In a 5-disk raid5 the cost of
> redundancy is only 25% more, vs a 100% additional cost for raid1. To
> accomplish the same space as a 5-disk raid5 you'd need 8 disks. Sure,
> read performance would be vastly superior, but if you're going to spend
> $300 more on hard drives and whatever it takes to get so many SATA ports
> on your system you could instead add an extra 32GB of RAM or put your OS
> on a mirrored SSD. I suspect that both of those options on a typical
> workload are going to make a far bigger improvement in performance.

I'd suggest that with the exception of large database servers where the
object is to be able to cache the entire db in RAM, the SSDs are likely a
better option.

FWIW, my general "gentoo/amd64 user" rule of thumb is 1-2 gig base, plus
1-2 gig per core. Certainly that scale can slide either way and it'd
probably slide down for folks not doing system rebuilds in tmpfs, as
gentooers often do, but up or down, unless you put that ram in a battery-
backed ramdisk, 32-gig is a LOT of ram, even for an 8-core.

FWIW, with my old dual-dual-core (so four cores), 8 gig RAM was nicely
roomy, tho I /did/ sometimes bump the top and thus end up either dumping
either cache or swapping. When I upgraded to the 6-core, I used that
rule of thumb and figured ~12 gig, but due to the power-of-twos
efficiency rule, I ended up with 16 gig, figuring that was more than I'd
use in practice, but better than limiting it to 8 gig.

I was right. The 16 gig is certainly nice, but in reality, I'm typically
entirely wasting several gigs of it, not even cache filling it up. I
typically run ~1 gig in application memory and several gigs in cache,
with only a few tens of MB in buffer. But while I'll often exceed my old
capacity of 8 gig, it's seldom by much, and 12 gig would handle
everything including cache without dumping at well over the 90th
percentile, probably 97% or there abouts. Even with parallel make at
both the ebuild and global portage level and with PORTAGE_TMPDIR in
tmpfs, I hit 100% on the cores well before I run out of RAM and start
dumping cache or swapping. The only time that has NOT been the case is
when I deliberately saturate, say a kernel build with an open-ended -j so
it stacks up several-hundred jobs at once.

Meanwhile, the paired SSDs in btrfs raid1 make a HUGE practical
difference, especially in things like the (cold-cache) portage tree (and
overlays) sync, kernel git pull, etc. (In my case actual booting didn't
get a huge boost as I run ntp-client and ntpd at boot, and the ntp-client
time sync takes ~12 seconds, more than the rest of the boot put
together. But cold-cache loading kde happens faster now -- I actually
uninstalled the ksplash and just go text-console login to x-black-screen
to kde/plasma desktop, now. But the tree sync and kernel pull are still
the places I appreciate the SSDs most.)

And notably, because the cold-cache system is so much faster with the
SSDs, I tend to actually shut-down instead of suspending, now, so I tend
to cache even less and thus use less memory with the SSDs than before.
I /could/ probably do 8-gig RAM now instead of 16, and not miss it. Even
a gig per core, 6-gig, wouldn't be terrible, tho below that would start
to bottleneck and pinch a bit again I suspect.

> Which is better really depends on your workload. In my case much of my
> raid space is used my mythtv, or for storage of stuff I only
> occasionally use. In these use cases the performance of the raid5 is
> more than adequate, and I'd rather be able to keep shows around for an
> extra 6 months in HD than have the DVR respond a millisecond faster when
> I hit play. If you really have sustained random access of the bulk of
> your data than a raid1 would make much more sense.

Definitely. For mythTV or similar massive media needs, raid5 will be
fast enough. And I suspect just the single device-loss tolerance is a
reasonable risk tradeoff for you too, since after all it /is/ just media,
so tolerating loss of a single device is good, but the risk of losing two
before a full rebuild with a replacement if one fails is acceptable,
given the cost vs. size tradeoff with the massive size requirements of
video.

But again, the OP seemed to find his speed benchmarks disappointing, to
say the least, and I believe pointing out raid6 as the culprit is
accurate. Which, given his production-rating reliability stock trading
VMs usage, I'm guessing raid5/6 really isn't the ideal match. Massive
media, yes, definitely. Massive VMs, not so much.

>> So several points on btrfs:
>>
>> 1) It's still in heavy development.
>
> That is what is keeping me away. I won't touch it until I can use it
> with raid5, and the first common containing that hit the kernel weeks
> ago I think (and it has known gaps). Until it is stable I'm sticking
> with my current setup.

Question: Would you use it for raid1 yet, as I'm doing? What about as a
single-device filesystem? Do you believe my estimates of reliability in
those cases (almost but not quite stable for single-device, kind of in
the middle for raid1/raid0/raid10, say a year behind single-device and
raid5/6/50/60 about a year behind that) reasonably accurate?

Because if you're waiting until btrfs raid5 is fully stable, that's
likely to be some wait yet -- I'd say a year, likely more given that
everything btrfs has seemed to take longer than people expected. But if
you're simply waiting until it matures to the point that say btrfs raid1
is at now, or maybe even a bit less, but certainly to where it's complete
plus say a kernel release to work out a few more wrinkles, then that's
quite possible by year-end.

>> 2) RAID levels work QUITE a bit differently on btrfs. In particular,
>> what btrfs calls raid1 mode (with the same applying to raid10) is
>> simply two-way-mirroring, NO MATTER THE NUMBER OF DEVICES. There's no
>> multi-way mirroring yet available
>
> Odd, for some reason I thought it let you specify arbitrary numbers of
> copies, but looking around I think you're right. It does store two
> copies of metadata regardless of the number of drives unless you
> override this.

Default is single-copy data, dual-copy metadata, regardless of number of
devices (single device does DUP metadata, two copies on the same device,
by default), with the exception of SSDs, where the metadata default is
single since many of the SSD firmwares (sandforce firmware, with its
compression features, is known to do this, tho mine, I forgot the
firmware brand ATM but it's Corsair Neutron SSDs aimed at the server/
workstation market where unpredictability isn't considered a feature,
doesn't as one of its features is stable performance and usage regardless
of the data its fed) do dedup on identical copy data anyway. At least
that's the explanation given for the SSD exception.

But the real gotcha is that there's no way to setup N-way (N>2)
redundancy on btrfs raid1/10, and I know for a fact that catches some
admins by nasty surprise, as I've seen it come up on the btrfs list as
well as had my own personal disappointment with it, tho luckily I did my
research and figured that out before I actually installed on btrfs.

I just wish they'd called it 2-way-mirroring instead of raid1, as that
wouldn't be the deception in labeling that I consider the btrfs raid1
moniker at this point, and admins would be far less likely to be caught
unaware when a second device goes haywire that they /thought/ they'd be
covered for. Of course at this point it's all still development anyway,
so no sane admin is going to be lacking backups in any case, but there's
a lot of people flying by the seat of their pants out there, who have NOT
done the research, and I they show up frequently on the btrfs list, after
it's too late. (Tho certainly there's less of them showing up now than
a year ago, when I first investigated btrfs, I think both due to btrfs
maturing quite a bit since then and to a lot of the original btrfs hype
dying down, which is a good thing considering the number of folks that
were installing it, only to find out once they lost data that it was
still development.)

> However, if one considered raid1 expensive, having multiple layers of
> redundancy is REALLY expensive if you aren't using Reed Solomon and many
> data disks.

Well, depending on the use case. In your media case, certainly.
However, that's one of the few cases that still gobbles storage space as
fast as the manufacturers up their capacities, and that is likely to
continue to do so for at least a few more years, given that HD is still
coming in, so a lot of the media is still SD, and with quad-HD in the
wings as well, now. But once we hit half-petabyte, I suppose even quad-HD
won't be gobbling the space as fast as they can upgrade it, any more. So
a half-decade or so, maybe?

Plus of course the shear bandwidth requirements for quad-HD are
astounding, so at that point either some serious raid0/x0 raid or ssds
for the speed will be pretty mandatory anyway, remaining SSD size limits
or no SSD size limits.

> From my standpoint I don't think raid1 is the best use of money in most
> cases, either for performance OR for data security. If you want
> performance the money is probably better spent on other components. If
> you want data security the money is probably better spent on offline
> backups. However, this very-much depends on how the disks will be used
> - there are certainly cases where raid1 is your best option.

I agree when the use is primarily video media. Other than that, a pair
of 2 TB spinning rust drives tends to still go quite a long way, and
tends to be a pretty good cost/risk tradeoff IMO. Throwing in a third 2-
TB drive for three-way raid1 mirroring is often a good idea as well,
where the additional data security is needed, but beyond that, the cost/
benefit balance probably doesn't make a whole lot of sense, agreed.

And offline backups are important too, but with dual 2TB drives, many
people can live with a TB of data and do multiple raid1s, giving
themselves both logically offline backup and physical device redundancy.
And if that means they do backups to the second raid set on the same
physical devices more reliably than they would with an external that they
have to physically look for and/or attach each time (as turned out to be
the case for me), then the pair of 2TB drives is quite a reasonable
investment indeed.

But if you're going for performance, spinning rust raid simply doesn't
cut it at the consumer level any longer. SSD at least the commonly used
data, leaving say the media data on spinning rust for the time being if
the budget doesn't work otherwise, as I've actually done here with my
(much smaller than yours) media collection, figuring it not worth the
cost to put /it/ on SSD just yet.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Re: Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 21, 2013, 8:13 AM

Post #11 of 46 (9719 views)

On Fri, Jun 21, 2013 at 10:27 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Rich Freeman posted on Fri, 21 Jun 2013 06:28:35 -0400 as excerpted:
>
>> That is what is keeping me away. I won't touch it until I can use it
>> with raid5, and the first common containing that hit the kernel weeks
>> ago I think (and it has known gaps). Until it is stable I'm sticking
>> with my current setup.
>
> Question: Would you use it for raid1 yet, as I'm doing? What about as a
> single-device filesystem? Do you believe my estimates of reliability in
> those cases (almost but not quite stable for single-device, kind of in
> the middle for raid1/raid0/raid10, say a year behind single-device and
> raid5/6/50/60 about a year behind that) reasonably accurate?

If I wanted to use raid1 I might consider using btrfs now. I think it
is still a bit risky, but the established use cases have gotten a fair
bit of testing now. I'd be more confident in using it with a single
device.

>
> Because if you're waiting until btrfs raid5 is fully stable, that's
> likely to be some wait yet -- I'd say a year, likely more given that
> everything btrfs has seemed to take longer than people expected.

That's my thought as well. Right now I'm not running out of space, so
I'm hoping that I can wait until the next time I need to migrate my
data (from 1TB to 5+TB drives, for example). With such a scenario I
don't need to have 10 drives mounted at once to migrate the data - I
can migrate existing data to 1-2 drives, remove the old ones, and
expand the new array. To migrate today would require finding someplace
to dump all the data offline and migrate the drives, as there is no
in-place way to migrate multiple ext3/4 logical volumes on top of
mdadm to a single btrfs on bare metal.

Without replying to anything in particular both you and Bob have
mentioned the importance of multiple redundancy.

Obviously risk goes down as redundancy goes up. If you protect 25
drives of data with 1 drive of parity then you need 2/26 drives to
fail to hose 25 drives of data. If you protect 1 drive of data with
25 drives of parity (call them mirrors or parity or whatever - they're
functionally equivalent) then you need 25/26 drives to fail to lose 1
drive of data. RAID 1 is actually less effective - if you protect 13
drives of data with 13 mirrors you need 2/26 drives to fail to lose 1
drive of data (they just have to be the wrong 2). However, you do
need to consider that RAID is not the only way to protect data, and
I'm not sure that multiple-redundancy raid-1 is the most
cost-effective strategy.

If I had 2 drives of data to protect and had 4 spare drives to do it
with, I doubt I'd set up a 3x raid-1/5/10 setup (or whatever you want
to call it - imho raid "levels" are poorly named as there really is
just striping and mirroring and adding RS parity and everything else
is just combinations). Instead I'd probably set up a
RAID1/5/10/whatever with single redundancy for faster storage and
recovery, and an offline backup (compressed and with
incrementals/etc). The backup gets you more security and you only
need it in a very unlikely double-failure. I'd only invest in
multiple redundancy in the event that the risk-weighted cost of having
the node go down exceeds the cost of the extra drives. Frankly in
that case RAID still isn't the right solution - you need a backup node
someplace else entirely as hard drives aren't the only thing that can
break in your server.

This sort of rationale is why I don't like arguments like "RAM is
cheap" or "HDs are cheap" or whatever. The fact is that wasting money
on any component means investing less in some other component that
could give you more space/performance/whatever-makes-you-happy. If
you have $1000 that you can afford to blow on extra drives then you
have $1000 you could blow on RAM, CPU, an extra server, or a trip to
Disney. Why not blow it on something useful?

Rich

Re: Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 21, 2013, 10:40 AM

Post #12 of 46 (9711 views)

On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted:
>
>> Does anyone know of info on how the starting sector number might
>> impact RAID performance under Gentoo? The drives are WD-500G RE3 drives
>> shown here:
>>
>> http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/
> B001EMZPD0/ref=cm_cr_pr_product_top
>>
>> These are NOT 4k sector sized drives.
>>
>> Specifically I'm a 5-drive RAID6 for about 1.45TB of storage. My
>> benchmarking seems abysmal at around 40MB/S using dd copying large
>> files.
>> It's higher, around 80MB/S if the file being transferred is coming from
>> an SSD, but even 80MB/S seems slow to me. I see a LOT of wait time in
>> top.
>> And my 'large file' copies might not be large enough as the machine has
>> 24GB of DRAM and I've only been copying 21GB so it's possible some of
>> that is cached.
>
> I /suspect/ that the problem isn't striping, tho that can be a factor,
> but rather, your choice of raid6. Note that I personally ran md/raid-6
> here for awhile, so I know a bit of what I'm talking about. I didn't
> realize the full implications of what I was setting up originally, or I'd
> have not chosen raid6 in the first place, but live and learn as they say,
> and that I did.
>
> General rule, raid6 is abysmal for writing and gets dramatically worse as
> fragmentation sets in, tho reading is reasonable. The reason is that in
> ordered to properly parity-check and write out less-than-full-stripe
> writes, the system must effectively read-in the existing data and merge
> it with the new data, then recalculate the parity, before writing the new
> data AND 100% of the (two-way in raid-6) parity. Further, because raid
> sits below the filesystem level, it knows nothing about what parts of the
> filesystem are actually used, and must read and write the FULL data
> stripe (perhaps minus the new data bit, I'm not sure), including parts
> that will be empty on a freshly formatted filesystem.
>
> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k
> in data across three devices, and 8k of parity across the other two
> devices. Now you go to write a 1k file, but in ordered to do so the full
> 12k of existing data must be read in, even on an empty filesystem,
> because the RAID doesn't know it's empty! Then the new data must be
> merged in and new checksums created, then the full 20k must be written
> back out, certainly the 8k of parity, but also likely the full 12k of
> data even if most of it is simply rewrite, but almost certainly at least
> the 4k strip on the device the new data is written to.
>
<SNIP>

Hi Duncan,
Wonderful post but much too long to carry on a conversation
in-line. As you sound pretty sure of your understanding/history I'll
assume you're right 100% of the time, but only maybe 80% of the post
feels right to me at this time so let's assume I have much to learn
and go from there. I expect that others here are in a similar
situation to me - they use RAID but are laboring with little hard data
on what different portions of the system are doing and how to optimize
it. I certainly feel that's true in my case. I hope this thread over
the near or far term future might help a bit for me and potentially
others.

In thinking about this issue this morning I think it's important to
me to get down to basics and verify as much as possible, step-by-step,
so that I don't layer good work on top of bad assumptions. To that
end, and before I move too much farther forward, let me document a few
things about my system and the hardware available to work with and see
if you, Rich, Bob, Volker or anyone else wants to chime in about what
is correct, not correct or a better way to use it.

Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB
DDR3 + Core i7-980x Extreme 12 core processor
1 SDD - 120GB SATA3 on it's own controller
5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives using Intel integrated
controllers

(NOTE: I can possibly go to a 6-drive RAID if I made some changes in
the box but that's for later)

According to the WD spec
(http://www.wdc.com/en/library/spec/2879-701281.pdf) the 500GB drives
sustain 113MB/S to the drive. Using hdparm I measure 107MB/S or higher
for all 5 drives:

c2RAID6 ~ # hdparm -tT /dev/sdb

/dev/sdb:
Timing cached reads: 17374 MB in 2.00 seconds = 8696.12 MB/sec
Timing buffered disk reads: 322 MB in 3.00 seconds = 107.20 MB/sec
c2RAID6 ~ #

The SDD on it's own PCI Express controller clocks in at about 250MB/S for reads.

c2RAID6 ~ # hdparm -tT /dev/sda

/dev/sda:
Timing cached reads: 17492 MB in 2.00 seconds = 8754.42 MB/sec
Timing buffered disk reads: 760 MB in 3.00 seconds = 253.28 MB/sec
c2RAID6 ~ #

TESTING: I'm using dd to test. It gives an easy to read anyway result
and seems to be used a lot. I can use bonnie++ or IOzone later but I
don't think that's necessary quite yet. Being that I have 24GB and
don't want cached data to effect the test speeds I do the following:

1) Using dd I created a 50GB file for copying using the following commands:

cd /mnt/fastVM
dd if=/dev/random of=random1 bs=1000 count=0 seek=$[1000*1000*50]

mark@c2RAID6 /VirtualMachines/bonnie $ ls -alh /mnt/fastVM/ran*
-rw-r--r-- 1 mark mark 47G Jun 21 07:10 /mnt/fastVM/random1
mark@c2RAID6 /VirtualMachines/bonnie $

2) To ensure that nothing is cached and the copies are (hopefully)
completely fair as root I do the following between each test:

sync
free -h
echo 3 > /proc/sys/vm/drop_caches
free -h

An example:

c2RAID6 ~ # sync
c2RAID6 ~ # free -h
total used free shared buffers cached
Mem: 23G 23G 129M 0B 8.5M 21G
-/+ buffers/cache: 1.6G 21G
Swap: 12G 0B 12G
c2RAID6 ~ # echo 3 > /proc/sys/vm/drop_caches
c2RAID6 ~ # free -h
total used free shared buffers cached
Mem: 23G 2.6G 20G 0B 884K 1.3G
-/+ buffers/cache: 1.3G 22G
Swap: 12G 0B 12G
c2RAID6 ~ #

3) As a first test I copy using dd the 50GB file from the SDD to the
RAID6. As long as reading the SDD is much faster than writing the
RAID6 then it should be a test of primarily the RAID6 write speed:

mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/mnt/fastVM/random1 of=SDDCopy
97656250+0 records in
97656250+0 records out
50000000000 bytes (50 GB) copied, 339.173 s, 147 MB/s
mark@c2RAID6 /VirtualMachines/bonnie $

If I clear cache as above and rerun the test it's always 145-155MB/S

4) As a second test I read from the RAID6 and write back to the RAID6.
I see MUCH lower speeds, again repeatable:

mark@c2RAID6 /VirtualMachines/bonnie $ dd if=SDDCopy of=HDDWrite
97656250+0 records in
97656250+0 records out
50000000000 bytes (50 GB) copied, 1187.07 s, 42.1 MB/s
mark@c2RAID6 /VirtualMachines/bonnie $

5) As a final test, and just looking for problems if any, I do an SDD
to SDD copy which clocked in at close to 200MB/S

mark@c2RAID6 /mnt/fastVM $ dd if=random1 of=SDDCopy
97656250+0 records in
97656250+0 records out
50000000000 bytes (50 GB) copied, 251.105 s, 199 MB/s
mark@c2RAID6 /mnt/fastVM $

So, being that this RAID6 was grown yesterday from something that
has existed for a year or two I'm not sure of it's fragmentation, or
even how to determine that at this time. However it seems my problem
are RAID6 reads, not RAID6 writes, at least to new an probably never
used disk space.

I will also report more later but I can state that just using top
there's never much CPU usage doing this but a LOT of WAIT time when
reading the RAID6. It really appears the system is spinning it's
wheels waiting for the RAID to get data from the disk.

One place where I wanted to double check your thinking. My thought
is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as
it has to read from three drives and make sure they are all good
before returning data to the user. I don't see how that could ever be
faster than what a single drive file system could do which for these
drives would be the 113MB/S WD spec number, correct? As I'm currently
getting 145MB/S it appears on the surface that the RAID6 is providing
some value, at least in these early days of use. Maybe it will degrade
over time though.

Comments?

Cheers,
Mark

Re: Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

rsanders at sgi

Jun 21, 2013, 10:56 AM

Post #13 of 46 (9714 views)

Mark Knecht, mused, then expounded:
> On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> > Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted:
> >
>
> Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB
> DDR3 + Core i7-980x Extreme 12 core processor

The hit to iop performance is mainly due to the large number of cores in
the the high end Intel cpu. I suggest you find a nice 4-core Intel
processor, something non-extreme. you'll find all your IO will improve.

> 1 SDD - 120GB SATA3 on it's own controller
> 5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives using Intel integrated
> controllers
>

Again, if you're serious about RAID, get an LSI MegaRAID card. While I
have my dislikes about the LSI controller, it's a lot faster than using
MD and much faster (and more reliable) than any bios software RAID.

Oh, and don't believe all the published numbers on drives,
etc...benchmarking is an art.

Bob
--
-

Re: Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 21, 2013, 10:57 AM

Post #14 of 46 (9707 views)

On Fri, Jun 21, 2013 at 1:40 PM, Mark Knecht <markknecht@gmail.com> wrote:
> One place where I wanted to double check your thinking. My thought
> is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as
> it has to read from three drives and make sure they are all good
> before returning data to the user.

That isn't correct. In theory it could be done that way, but every
raid1 implementation I've heard of makes writes to all drives
(obviously), but reads from only a single drive (assuming it is
correct). That means that read latency is greatly reduced since they
can be split across two drives which effectively means two heads per
"platter." Also, raid1 typically does not include checksumming, so if
there is a discrepancy between the drives there is no way to know
which one is right. With raid5 at least you can always correct
discrepancies if you have all the disks (though as Duncan pointed out
in practice this only happens if you do an explicit scrub on mdadm).
With btrfs every block is checksummed and so as long as there is one
good (err, consistent) copy somewhere it will be used.

Rich

Re: Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

gem at rellim

Jun 21, 2013, 11:10 AM

Post #15 of 46 (9710 views)

Yo Rich!

On Fri, 21 Jun 2013 13:57:20 -0400
Rich Freeman <rich0@gentoo.org> wrote:

> In theory it could be done that way, but every
> raid1 implementation I've heard of makes writes to all drives
> (obviously), but reads from only a single drive (assuming it is
> correct). That means that read latency is greatly reduced since they
> can be split across two drives which effectively means two heads per
> "platter."

Yes, that is what I see in practice. A much reduced average read time.
And if you are really pressed for speed, add more stripes and get even
more speed.

> Also, raid1 typically does not include checksumming, so if
> there is a discrepancy between the drives there is no way to know
> which one is right.

Uh, not exactly correct. Remember each HDD has ECC for each sector. If
there is a read error the HDD will detecct the bad ECC and report the
error to the RAID1 hardware/software. Then RAID1 is smart enough to try
to read from the 2nd drive.

> With raid5 at least you can always correct
> discrepancies if you have all the disks

Not really. If 2 disks fail in an n+1 RAID5 you are out of luck.
Not as uncommon occurance as one might think.

> (though as Duncan pointed out
> in practice this only happens if you do an explicit scrub on mdadm).

Which you should be doing at least weekly. Otherwise you only find out
your disks have failed when you try to do a full copy or backup, and then
you likely have multiple failures and you are out of luck.

RGDS
GARY
---------------------------------------------------------------------------
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97701
gem@rellim.com Tel:+1(541)382-8588

Re: Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 21, 2013, 11:12 AM

Post #16 of 46 (9708 views)

On Fri, Jun 21, 2013 at 10:56 AM, Bob Sanders <rsanders@sgi.com> wrote:
> Mark Knecht, mused, then expounded:
>> On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> > Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted:
>> >
>>
>> Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB
>> DDR3 + Core i7-980x Extreme 12 core processor
>
> The hit to iop performance is mainly due to the large number of cores in
> the the high end Intel cpu. I suggest you find a nice 4-core Intel
> processor, something non-extreme. you'll find all your IO will improve.
>

Interesting point but not likely to happen. I run 3 Windows VMs all
day, most of which are doing numerical calculations and not a huge
amount of IO in the Windows environment itself. In my usage model the
12 cores get a workout nearly every day.

>
>> 1 SDD - 120GB SATA3 on it's own controller
>> 5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives using Intel integrated
>> controllers
>>
>
> Again, if you're serious about RAID, get an LSI MegaRAID card. While I
> have my dislikes about the LSI controller, it's a lot faster than using
> MD and much faster (and more reliable) than any bios software RAID.
>

I suppose if I accept your assertion above then an LSI MegaRAID might
be a better solution specifically because I _am_ using the 12 core
Extreme processor.

Will consider, at least in the long run, if this thread & work doesn't
yield significantly improved results over the next few weeks.

> Oh, and don't believe all the published numbers on drives,
> etc...benchmarking is an art.
>
> Bob

Absolutely! :-)

Thanks,
Mark

Re: Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 21, 2013, 11:38 AM

Post #17 of 46 (9703 views)

On Fri, Jun 21, 2013 at 10:57 AM, Rich Freeman <rich0@gentoo.org> wrote:
> On Fri, Jun 21, 2013 at 1:40 PM, Mark Knecht <markknecht@gmail.com> wrote:
>> One place where I wanted to double check your thinking. My thought
>> is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as
>> it has to read from three drives and make sure they are all good
>> before returning data to the user.
>
> That isn't correct. In theory it could be done that way, but every
> raid1 implementation I've heard of makes writes to all drives
> (obviously), but reads from only a single drive (assuming it is
> correct). That means that read latency is greatly reduced since they
> can be split across two drives which effectively means two heads per
> "platter." Also, raid1 typically does not include checksumming, so if
> there is a discrepancy between the drives there is no way to know
> which one is right. With raid5 at least you can always correct
> discrepancies if you have all the disks (though as Duncan pointed out
> in practice this only happens if you do an explicit scrub on mdadm).
> With btrfs every block is checksummed and so as long as there is one
> good (err, consistent) copy somewhere it will be used.
>
> Rich
>

Humm...

OK, we agree on RAID1 writes. All data must be written to all drives
so there's no way to implement any real speed up in that area. If I
simplistically assume that write speeds are similar to hdparm -tT read
speeds then that's that.

On the read side I'm not sure if I'm understanding your point. I agree
that a so-designed RAID1 system could/might read smaller portions of a
larger read from RAID1 drives in parallel, taking some data from one
drive and some from another drive, and then only take action
corrective if one of the drives had troubles. However I don't know
that mdadm-based RAID1 does anything like that. Does it?

It seems to me that unless I at least _request_ all data from all
drives and minimally compare at least some error flag from the
controller telling me one drive had trouble reading a sector then how
do I know if anything bad is happening?

Or maybe you're saying it's RAID1 and I don't know if anything bad is
happening _unless_ I do a scrub and specifically check all the drives
for consistency?

Just trying to get clear what you're saying.

I do mdadm scrubs at least once a week. I still do them by hand. They
have never appeared terribly expensive watching top or iotop but
sometimes when I'm watching NetFlix or Hulu in a VM I get more pauses
when the scrub is taking place, but it's not huge.

I agree that RAID5 gives you an opportunity to get things fixed, but
there are folks who lose a disk in a RAID5, start the rebuild, and
then lose a second disk during the rebuild. That was my main reason to
go to RAID6. Not that I would ever run the array degraded but that I
could still tolerate a second loss while the rebuild was happening and
hopefully get by. That was similar to my old 3-disk RAID1 where I'd
have to lose all 3 disks to be out of business.

Thanks,
Mark

Re: Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

gem at rellim

Jun 21, 2013, 11:50 AM

Post #18 of 46 (9704 views)

Yo Mark!

On Fri, 21 Jun 2013 11:38:00 -0700
Mark Knecht <markknecht@gmail.com> wrote:

> On the read side I'm not sure if I'm understanding your point. I agree
> that a so-designed RAID1 system could/might read smaller portions of a
> larger read from RAID1 drives in parallel, taking some data from one
> drive and some from another drive, and then only take action
> corrective if one of the drives had troubles. However I don't know
> that mdadm-based RAID1 does anything like that. Does it?

It surely does. I have confirmed that at least monthly since md has
existed in the kernel.

> It seems to me that unless I at least _request_ all data from all
> drives and minimally compare at least some error flag from the
> controller telling me one drive had trouble reading a sector then how
> do I know if anything bad is happening?

Correct. You cant' tell if you can read something without trying to
read it. Which is why you should do a full raid rebuild every week.
>
> Or maybe you're saying it's RAID1 and I don't know if anything bad is
> happening _unless_ I do a scrub and specifically check all the drives
> for consistency?

No. A simple read will find the problem. But given it is RAID1 the only
way to be sure to read from both dirves is a raid rebuild.

> I do mdadm scrubs at least once a week. I still do them by hand. They
> have never appeared terribly expensive watching top or iotop but
> sometimes when I'm watching NetFlix or Hulu in a VM I get more pauses
> when the scrub is taking place, but it's not huge.

Which is why you should cron jothem at oh-dark-thirty.
>
> I agree that RAID5 gives you an opportunity to get things fixed, but
> there are folks who lose a disk in a RAID5, start the rebuild, and
> then lose a second disk during the rebuild.

Because they failed to do weekly rebuilds.

> Not that I would ever run the array degraded but that I
> could still tolerate a second loss while the rebuild was happening and
> hopefully get by.

Sadly most people make their RAID5 or RAID6 out of brand new,
consecutively serial numbered drives. They then get the exactly the
same temp, voltage, humidity, seek stress until they all fail within
days of each other. I have personally seen 4 of 5 drives in a RAID5
fail within 3 days many times. Usually on a Friday where the tech
decides the drive replacement can wait until Monday.

Your only protection against a full RAIDx failure is an offsite backup.

RGDS
GARY
---------------------------------------------------------------------------
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97701
gem@rellim.com Tel:+1(541)382-8588

Re: Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

rsanders at sgi

Jun 21, 2013, 11:53 AM

Post #19 of 46 (9714 views)

Mark Knecht, mused, then expounded:
>
>
> I agree that RAID5 gives you an opportunity to get things fixed, but
> there are folks who lose a disk in a RAID5, start the rebuild, and
> then lose a second disk during the rebuild. That was my main reason to
> go to RAID6. Not that I would ever run the array degraded but that I
> could still tolerate a second loss while the rebuild was happening and
> hopefully get by. That was similar to my old 3-disk RAID1 where I'd
> have to lose all 3 disks to be out of business.
>

If the drives in the RAID came from the same build lot, the chances of
multi-drive failure are fairly high, if one fails.

I've had 3 out of four drives, from the same lot build, fail at the same
time. I've had others never fail. And a few that fail over time where
others from the same lot failed within a month of the first failure.

Bob
--
-

Re: Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 21, 2013, 11:57 AM

Post #20 of 46 (9709 views)

On Fri, Jun 21, 2013 at 2:50 PM, Gary E. Miller <gem@rellim.com> wrote:
> On Fri, 21 Jun 2013 11:38:00 -0700
> Mark Knecht <markknecht@gmail.com> wrote:
>> Or maybe you're saying it's RAID1 and I don't know if anything bad is
>> happening _unless_ I do a scrub and specifically check all the drives
>> for consistency?
>
> No. A simple read will find the problem. But given it is RAID1 the only
> way to be sure to read from both dirves is a raid rebuild.

Keep in mind that a read will only find the problem if it is visible
to the hard drive's ECC. A silent error would not be detected. It
could be detected by a rebuild, though it could not be reliably fixed
in this way. With raid5 a silent error in a single drive per stripe
could be fixed in a rebuild.

>
> Your only protection against a full RAIDx failure is an offsite backup.

++

That's why I'm not big on crazy levels of redundancy. RAID is first
and foremost a restoration avoidance tool, not a backup solution. It
reduces the risk of needing restoration, but it does not cover as many
failure modes as an offline backup. If btrfs eats your data it really
won't matter how many platters it had to chew on in the process. So,
by all means use RAID, but if you're going to spend a lot of money on
redundant disks, spend it on a backup solution instead (which might
very well involve disks, though you should move them offsite).

Rich

Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 22, 2013, 3:29 AM

Post #21 of 46 (9706 views)

Rich Freeman posted on Fri, 21 Jun 2013 11:13:51 -0400 as excerpted:

> On Fri, Jun 21, 2013 at 10:27 AM, Duncan <1i5t5.duncan@cox.net> wrote:

>> Question: Would you use [btrfs] for raid1 yet, as I'm doing?
>> What about as a single-device filesystem?

> If I wanted to use raid1 I might consider using btrfs now. I think it
> is still a bit risky, but the established use cases have gotten a fair
> bit of testing now. I'd be more confident in using it with a single
> device.

OK, so we agree on the basic confidence level of various btrfs features.
I trust my own judgement a bit more now. =:^)

> To migrate today would require finding someplace to dump all
> the data offline and migrate the drives, as there is no in-place way to
> migrate multiple ext3/4 logical volumes on top of mdadm to a single
> btrfs on bare metal.

... Unless you have enough unpartitioned space available still.

What I did a few years ago is buy a 1 TB USB drive I found at a good
deal. (It was very near the price of half-TB drives at the time, I
figured out later they must have gotten shipped a pallet of the wrong
ones for a sale on the half-TB version of the same thing, so it was a
single-store, get-it-while-they're-there-to-get, deal.)

That's how I was able to migrate from the raid6 I had back to raid1. I
had to squeeze the data/partitions a bit to get everything to fit, but it
did, and that's how I ended up with 4-way raid1, since it /had/ been a 4-
way raid6. All 300-gig drives at the time, so the TB USB had /plenty/ of
room. =:^)

> Without replying to anything in particular both you and Bob have
> mentioned the importance of multiple redundancy.
>
> Obviously risk goes down as redundancy goes up. If you protect 25
> drives of data with 1 drive of parity then you need 2/26 drives to fail
> to hose 25 drives of data.

Ouch!

> If you protect 1 drive of data with 25 drives of parity (call them
> mirrors or parity or whatever - they're functionally equivalent) then
> you need 25/26 drives to fail to lose 1 drive of data.

Almost correct.

Except that with 25/26 failed, you'd still have 1 working, which with
raid1/mirroring would be enough. (AFAIK that's the difference with
parity. Parity is generally done on a minimum of two devices with the
third as parity, and going down to just one isn't enough, you can lose
only one, or two if you have two-way-parity as with raid6. With
mirroring/raid1, they're all essentially identical, so one is enough to
keep going, you'd have to loose 26/26 to be dead in the water. But 25/26
dead or 26/26 dead, you better HOPE it never comes down to where that
matters!)

> RAID 1 is actually less effective - if you protect 13
> drives of data with 13 mirrors you need 2/26 drives to fail to lose 1
> drive of data (they just have to be the wrong 2). However, you do need
> to consider that RAID is not the only way to protect data, and I'm not
> sure that multiple-redundancy raid-1 is the most cost-effective
> strategy.

The first time I read that thru I read it wrong, and was about to
disagree. Then I realized what you meant... and that it was an equally
valid read of what you wrote, except...

AFAIK 13 drives of data with 13 mirrors wouldn't (normally) be called
raid1 (unless it's 13 individual raid1s). Normally, an arrangement of
that nature if configured together would be configured as raid10, 2-way-
mirrored, 13-way-striped (or possibly raid0+1, but that's not recommended
for technical reasons having to do with rebuild thruput), tho it could
also be configured as what mdraid calls linear mode (which isn't really
raid, but it happens to be handled by the same md/raid driver in Linux)
across the 13, plus raid1, or if they're configured as separate volumes,
13 individual two-disk raid1s, any of which might be what you meant (and
the wording appears to favor 13 individual raid1s).

What I interpreted it as initially was a 13-way raid1, mirrored again at
a second level to 13 additional drives, which would be called raid11,
except that there's no benefit of that over a simple single-layer 26-way
raid1 so the raid11 term is seldom seen, and that's clearly not what you
meant.

Anyway, you're correct if it's just two-way-mirrored. However, at that
level, if one was to do only two-way-mirroring, one would usually do
either raid10 for the 13-way striping, or 13 separate raid1s, which would
give one the opportunity to make some of them 3-way-mirrored (or more)
raid1s for the really vital data, leaving the less vital data as simple
2-way-mirror-raid1s.

Or raid6 and get loss-of-two tolerance, but as this whole subthread is
discussing, that can be problematic for thruput. (I've occasionally seen
reference to raid7, which is said to be 3-way-parity, loss-of-three-
tolerance, but AFAIK there's no support for it in the kernel, and I
wouldn't be surprised if all implementations are proprietary. AFAIK, in
practice, raid10 with N-way mirroring on the raid1 portion is implemented
once that many devices get involved, or other multi-level raid schemes.)

> If I had 2 drives of data to protect and had 4 spare drives to do it
> with, I doubt I'd set up a 3x raid-1/5/10 setup (or whatever you want to
> call it - imho raid "levels" are poorly named as there really is just
> striping and mirroring and adding RS parity and everything else is just
> combinations). Instead I'd probably set up a RAID1/5/10/whatever with
> single redundancy for faster storage and recovery, and an offline backup
> (compressed and with incrementals/etc). The backup gets you more
> security and you only need it in a very unlikely double-failure. I'd
> only invest in multiple redundancy in the event that the risk-weighted
> cost of having the node go down exceeds the cost of the extra drives.
> Frankly in that case RAID still isn't the right solution - you need a
> backup node someplace else entirely as hard drives aren't the only thing
> that can break in your server.

So we're talking six drives, two of data and four "spares" to play with.

Often that's setup as raid10, either two-way-striped and 3-way-mirrored,
or 3-way-striped and 2-way-mirrored, depending on whether the loss-of-two
tolerance of 3-way-mirroring or thruput of three-way-striping, is
considered of higher value.

You're right that at that level, you DO need a real backup, and it should
take priority over raid-whatever. HOWEVER, in addition to creating a
SINGLE raid across all those drives, it's possible to partition them up,
and create multiple raids out of the partitions, with one set being a
backup of the other. And since you've already stated that there's only
two drives worth of data, there's certainly room enough amongst the six
drives total to do just that.

This is in fact how I ran my raids, both my raid6 config, and my raid1
config, for a number of years, and is in fact how I have my (raid1-mode)
btrfs filesystems setup now on the SSDs.

Effectively I had/have each drive partitioned up into two sets of
partitions, my "working" set, and my "backup" set. Then I md-raided at
my chosen level each partition across all devices. So on each physical
device partition 5 might be the working rootfs partition, partition 6 the
woriing home partition... partition 9 the backup rootfs partition, and
partition 10 the backup home partition. They might end up being md3
(rootwork), md4 (homework), md7 (rootbak) and md8 (homebak).

That way, you're protected against physical device death by the
redundancy of the raids, and from fat-fingering or an update gone wrong
by the redundancy of the backup partitions across the same physical
devices.

What's nice about an arrangement such as this is that it gives you quite
a bit more flexibility than you'd have with a single raid, since it's now
possible to decide "Hmm, I don't think I actually need a backup of /var/
log, so I think I'll only run with one log partition/raid, instead of the
usual working/backup arrangement." Similarly, "You know, I ultimately
don't need backups of the gentoo tree and overlays, or of the kernel git
tree, at all, since as Linus says, 'Real men upload it to the net and let
others be their backup', and I can always redownload that from the net,
so I think I'll raid0 this partition and not keep any copies at all,
since re-downloading's less trouble than dealing with the backups
anyway." Finally, and possibly critically, it's possible to say, "You
know, what happens if I've just wiped rootbak in ordered to make a new
root backup, and I have a crash and working-root refuses to boot. I
think I need a rootbak2, and with the space I saved by doing only one log
partition and by making the sources trees raid0, I have room for it now,
without using any more space than I would had I had everything on the
same raid."

Another nice thing about it, and this is what I would have ended up doing
if I hadn't conveniently found that 1 TB USB drive at such a good price,
is that while the whole thing is partitioned up and in use, it's very
possible to wipe out the backup partitions temporarily, and recreate them
as a different raid level or a different filesystem, or otherwise
reorganize that area, then reboot into the new version, and do the same
to what was the working copies. (For the area that was raid0, well, it
was raid0 because it's easy to recreate, so just blow it away and
recreate it on the new layout. And for the single-raid log without a
backup copy, it's simple enough to simply point the log elsewhere or keep
it on rootfs for long enough to redo that set of partitions across all
physical devices.)

Again, this isn't just theory, it really works, as I've done it to
various degrees at various times, even if I found copying to the external
1 TB USB drive and booting from it more convenient to do when I
transferred from raid6 to raid1.

And being I do run ~arch, there's been a number of times I've needed to
boot to rootbak instead of rootworking, including once when a ~arch
portage was hosing symlinks just as a glibc update came along, thus
breaking glibc (!!), once when a bash update broke, and another time when
a glibc update mostly worked but I needed to downgrade and the protection
built into the glibc ebuild wasn't letting me do it from my working root.

What's nice about this setup in regard to booting to rootbak instead of
the usual working root, is that unlike booting to a liveCD/DVD rescue
disk, you have the full working system installed, configured and running
just as it was when the backup was made. That makes it much easier to
pickup and run from where you left off, with all the tools you're used to
having and modes of working you're used to using, instead of being
limited to some artificial rescue environment often with limited tools,
and in any case setup and configured differently than you have your own
system, because rootbak IS your own system, just from a few days/weeks/
months ago, whenever it was that you last did the backup.

Anyway, with the parameters you specified, two drives full of data and
four spare drives (presumably of a similar size), there's a LOT of
flexibility. There's raid10 across four drives (two-mirror, two-stripe)
with the other two as backup (this would probably be my choice given the
2-disks of data, 6 disk total, constraints, but see below, and it appears
this might be your choice as well), or raid6 across four drives (two
mirror, two parity) with two as backups (not a choice I'd likely make,
but a choice), or a working pair of drives plus two sets of backups (not
a choice I'd likely make), or raid10 across all six drives in either 3-
mirror/2-stripe or 3-stripe/2-mirror mode (I'd probably elect for this
with 3-stripe/2-mirror for the 3X speed and space, and prioritize a
separate backup, see the discussion below), or two independent 3-disk
raid5s (IMO there's better options for most cases, with the possible
exception of primarily slow media usage, just which options are better
depends on usage and priorities tho), or some hybrid combination of these.

> This sort of rationale is why I don't like arguments like "RAM is cheap"
> or "HDs are cheap" or whatever. The fact is that wasting money on any
> component means investing less in some other component that could give
> you more space/performance/whatever-makes-you-happy. If you have $1000
> that you can afford to blow on extra drives then you have $1000 you
> could blow on RAM, CPU, an extra server, or a trip to Disney. Why not
> blow it on something useful?

[ This gets philosophical. OK to quit here if uninterested. ]

You're right. "RAM and HDs are cheap"... relative to WHAT, the big-
screen TV/monitor I WOULD have been replacing my much smaller monitor
with, if I hadn't been spending the money on the "cheap" RAM and HDs?

Of course, "time is cheap" comes with the same caveats, and can actually
end up being far more dear. Stress and hassle of administration
similarly. And sometimes, just a bit of investment in another
"expensive" HD, saves you quite a bit of "cheap" time and stress, that's
actually more expensive.

"It's all relative"... to one's individual priorities. Because one
thing's for sure, both money and time are fungible, and if they aren't
spent on one thing, they WILL be on another (even if that "spent" is
savings, for money), and ultimately, it's one's individual priorities
that should rank where that spending goes. And I can't set your
priorities and you can't set mine, so... But from my observation, a LOT
of folks don't realize that and/or don't take the time necessary to
reevaluate their own priorities from time to time, so end up spending out
of line with their real priorities, and end up rather unhappy people as a
result! That's one reason why I have a personal policy to deliberately
reevaluate personal priorities from time to time (as well as being aware
of them constantly), and rearrange spending, money time and otherwise, in
accordance with those reranked priorities. I'm absolutely positive I'm a
happier man for doing so! =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Re: Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 22, 2013, 4:12 AM

Post #22 of 46 (9719 views)

anonymous.pseudonym.88 at gmail

On Sat, Jun 22, 2013 at 6:29 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Rich Freeman posted on Fri, 21 Jun 2013 11:13:51 -0400 as excerpted:
>> If you protect 1 drive of data with 25 drives of parity (call them
>> mirrors or parity or whatever - they're functionally equivalent) then
>> you need 25/26 drives to fail to lose 1 drive of data.
>
> Almost correct.

DOH - good catch. Would need 26 fails.

> AFAIK 13 drives of data with 13 mirrors wouldn't (normally) be called
> raid1 (unless it's 13 individual raid1s)...

That's why I commented that I find RAID "levels" extremely unhelpful.
There is striping, mirroring, and RS parity, and every possible
combination of the above. We have a special name raid5 for striping
with one RS parity drive. We have another special name raid6 for
striping with two RS parity drives. We don't have a special name for
striping with 37 RS parity drives. Yet, all three of these are the
same thing.

I was referring to 13 data drives with one mirror each . If you lose
two drives you could potential lose one drive of data. If you made
that one big raid10 then if you lose two drives you could lose 13
drives of data. Both scenarios involve bad luck in terms of what pair
goes.

> You're right that at that level, you DO need a real backup, and it should
> take priority over raid-whatever. HOWEVER, in addition to creating a
> SINGLE raid across all those drives, it's possible to partition them up,
> and create multiple raids out of the partitions, with one set being a
> backup of the other.

I wouldn't consider that a great strategy. Sure, it is convenient,
but it does you no good at all if your computer burns up in a fire.

Multiple-level redundancy just seems to be past the point of
diminishing returns to me. If I wanted to spend that kind of money
I'd probably spend it differently.

However, I do agree that mdadm should support more flexible arrays.
For example, my boot partition is raid1 (since grub doesn't support
anything else), and I have it set up across all 5 of my drives.
However, the reality is that only two get used and the others are
treated only as spares. So, that is just a waste of space, and it is
actually more annoying from a config perspective because it would be
really nice if my system could boot from an arbitrary drive.

Oh, as far as raid on partitions goes - I do use this for a different
purpose. If you have a collection of drives of different sizes it can
reduce space waste. Suppose you have 3 500GB drives and 2 1TB drives.
If you put them all directly in a raid5 you get 2TB of space. If you
chop the 1TB drives into 2 500GB partitions then you can get two
raid5s - one 2TB in space, and the other 500GB in space. That is
500GB more data for the same space. Oh, and I realize I wrote raid5.
With mdadm you can set up a 2-drive raid5. It is functionally
equivalent to a raid1 I think, and I believe you can convert between
them, but since I generally intend to expand arrays I prefer to just
set them up as raid5 from the start. Since I stick lvm on top I
don't care if the space is chopped up.

Rich

Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 22, 2013, 5:49 AM

Post #23 of 46 (9710 views)

On Thu, 2013-06-20 at 12:10 -0700, Mark Knecht wrote:
> Hi,
> Does anyone know of info on how the starting sector number might
> impact RAID performance under Gentoo? The drives are WD-500G RE3
> drives shown here:
>
> http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/B001EMZPD0/ref=cm_cr_pr_product_top
>
> These are NOT 4k sector sized drives.
>
> Specifically I'm a 5-drive RAID6 for about 1.45TB of storage. My
> benchmarking seems abysmal at around 40MB/S using dd copying large
> files. It's higher, around 80MB/S if the file being transferred is
> coming from an SSD, but even 80MB/S seems slow to me. I see a LOT of
> wait time in top. And my 'large file' copies might not be large enough
> as the machine has 24GB of DRAM and I've only been copying 21GB so
> it's possible some of that is cached.
>
> Then I looked again at how I partitioned the drives originally and
> see the starting sector of sector 3 as 8594775. I started wondering if
> something like 4K block sizes at the file system level might be
> getting munged across 16k chunk sizes in the RAID. Maybe the blocks
> are being torn apart in bad ways for performance? That led me down a
> bunch of rabbit holes and I haven't found any light yet.
>
> Looking for some thoughtful ideas from those more experienced in this area.
>
> Cheers,
> Mark
>

Not necessarily the kind of answer you are looking for, but a year or so
back I converted my NAS from Hardware RAID1 to linux software RAID1 to
RAID1 on ZFS. Before the conversion to ZFS I had issues with the NAS
being unable to keep up with requests. Since then I have been able to
hit the SAN relatively hard with no visible effects. Just to give an
idea, a normal load involves streaming an HD movie to the TV, streaming
music to a second system, being used as the shared storage for four
computers, two of which almost constantly hit the shared drive for data
(keep the distfile directory for all the systems on it as well as using
it as the local rsync) , and once a month, transferring data to
removable storage devices. All of this going over cat6 Ethernet and
occasionally USB2.

I'm unsure how I would go about measuring the throughput, mainly because
I never cared in the past as long as the files transferred at a
reasonable pace and the video/audio didn't stutter. By no means is my
NAS a high-end system. It's stats are:

AMD64 X2, 4200
ASUS A8V MoBo (I think)
4GB RAM
2 x Silicon Image Sil 3114 SATA RAID cards (4 port PCI cards)
3 x 1.5TB Seagate drives (on Raid Cards)
4 x 2TB Western Digital drives (On Raid Cards)
2 x Western Digital antique 80GB drives (mirrored on motherboard for OS)
Marvell GigE network cards (Have a second card to add once I figure how
to automatically load balance through two cards)
Case with 2 x 120mm fans on top, 3 x 120mm fans on the front, 1 x 240mm
fan on the side
Total storage available 6.3TB, of which 3.4TB is used.
An image of the pool is created on a daily basis via cron jobs, which
are overwritten every 3 days. (Image of Day 1, Day 2, Day 3 then Day 4
overwrites Day 1.)The pool started with 5 750GB drives and has been
grown slowly as I find deals on better drives.

Main advantage of using ZFS on linux is the ease of growing your pools.
As long as you know the id of the drive (preferably the hardware id not
the delegated one), its so simple I can manage it. Since I'm nowhere
near the technical level of most folk here, anyone can do it. For what
it's worth (very little I know), I think that ZFS has too many
advantages over linux software RAID for it to be a real competition.

YMMV

B. Vance

Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 22, 2013, 6:12 AM

Post #24 of 46 (9712 views)

On Sat, Jun 22, 2013 at 8:49 AM, B Vance
<anonymous.pseudonym.88@gmail.com> wrote:
> Main advantage of using ZFS on linux is the ease of growing your pools.
> As long as you know the id of the drive (preferably the hardware id not
> the delegated one), its so simple I can manage it. Since I'm nowhere
> near the technical level of most folk here, anyone can do it. For what
> it's worth (very little I know), I think that ZFS has too many
> advantages over linux software RAID for it to be a real competition.

I'm holding out for btrfs but for all the same reasons. I really
don't want to mess with zfs on linux (fuse, etc - and the license
issues - the thing I don't get is that Oracle maintains both).

However, the last time I checked ZFS does not support reshaping of
RAID-Z. That is a major limitation for me, as I almost always expand
arrays gradually. You can add additional raid-z's to a zpool, but if
you have a raid-z with 5 drives you can't add 1 more drive to it as
part of the same raid-z. That means that it get treated as a mirror
and not a stripe, and that means that if you add 10 drives in this
manner one at a time you get 5 drives of capacity and not 9. Btrfs
targets making raids re-shapeable, just like mdadm.

But in general COW makes a LOT more sense with RAID because the
layer-breaking allows them to often avoid read-write cycles by writing
complete stripes more often, and files aren't modified in place so you
can consolidate changes for many files into a single stripe (granted,
that can cause fragmentation). ZFS has all those advantages being
COW, as will btrfs when it is ready for prime time.

Rich

Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]

Jun 22, 2013, 7:23 AM

Post #25 of 46 (9713 views)