Mailing List Archive

Re: Is my RAID performance bad possibly due to starting sector value?
Gary E. Miller posted on Fri, 21 Jun 2013 11:50:43 -0700 as excerpted:

> On Fri, 21 Jun 2013 11:38:00 -0700 Mark Knecht <markknecht@gmail.com>
> wrote:
>
>> On the read side I'm not sure if I'm understanding your point. I agree
>> that a so-designed RAID1 system could/might read smaller portions of a
>> larger read from RAID1 drives in parallel, taking some data from one
>> drive and some from another drive, and then only take corrective
>> action if one of the drives had troubles. However I don't know that
>> mdadm-based RAID1 does anything like that. Does it?
>
> It surely does. I have confirmed that at least monthly since md has
> existed in the kernel.

Out of curiosity, /how/ do you confirm that? I agree based on real usage
experience, but with a claim that you're confirming it at least monthly,
it sounds like you have a standardized/scripted test, and I'm interested
in what/how you do it.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: Is my RAID performance bad possibly due to starting sector value?
Rich Freeman posted on Sat, 22 Jun 2013 07:12:25 -0400 as excerpted:

> Multiple-level redundancy just seems to be past the point of diminishing
> returns to me. If I wanted to spend that kind of money I'd probably
> spend it differently.

My point was that for me, it wasn't multiple level redundancy. It was
simply device redundancy (raid), and fat-finger redundancy (backups), on
the same set of drives so I was protected from either scenario.

The fire/flood scenario would certainly get me if I didn't have offsite
backups, but just as you call multiple redundancy past your point of
diminishing returns, I call the fire/flood scenario past mine. If that
happens, I figure I'll have far more important things to worry about than
rebuilding my computer for awhile. And chances are, when I do get around
to it, things will have progressed enough that much of the data won't be
worth so much any more anyway. Besides, the real /important/ data is in
my head. What's worth rebuilding, will be nearly as easy to rebuild due
to what's in my head, as it would be to go thru what's now historical
data and try to pick up the pieces, sorting thru what's still worth
keeping around and what's not.

Tho as I said, I do/did keep an additional level of backup on that 1 TB
drive, but it's on-site too, and while not in the computer, it's
generally nearby enough that it'd be lost too in case of flood/fire.
It's more a convenience than a real backup, and I don't really keep it
upto date, but if it survived and what's in the computer itself didn't, I
do have old copies of much of my data, simply because it's still there
from the last time I used that drive as convenient temporary storage
while I switched things around.

> However, I do agree that mdadm should support more flexible arrays. For
> example, my boot partition is raid1 (since grub doesn't support anything
> else), and I have it set up across all 5 of my drives. However, the
> reality is that only two get used and the others are treated only as
> spares. So, that is just a waste of space, and it is actually more
> annoying from a config perspective because it would be really nice if my
> system could boot from an arbitrary drive.

Three points on that. First, obviously you're not on grub2 yet. It
handles all sorts of raid, lvm, newer filesystems like btrfs (and zfs for
those so inclined), various filesystems, etc, natively, thru its modules.

Second, /boot is an interesting case. Here, originally (with grub1 and
the raid6s across 4 drives) I setup a 4-drive raid1. But, I actually
installed grub to the boot sector of all four drives, and tested each one
booting just to grub by itself (the other drives off), so I knew it was
using its own grub, not pointed somewhere else.

But I was still worried about it as while I could boot from any of the
drives, they were a single raid1, which meant no fat-finger redundancy,
and doing a usable backup of /boot isn't so easy.

So I think it was when I switched from raid6 to raid1 for almost the
entire system, that I switched to dual dual-drive raid1s for /boot as
well, and of course tested booting to each one alone again, just to be
sure. That gave me fat-finger redundancy, as well as added convenience
since I run git kernels, and I was able to update just the one dual-drive
raid1 /boot with the git kernels, then update the backup with the
releases once they came out, which made for a nice division of stable
kernel vs pre-release there.

That dual dual-drive raid-1 setup proved very helpful when I upgraded to
grub2 as well, since I was able to play around with it on the one dual-
drive raid1 /boot while the other one stayed safely bootable grub1 until
I had grub2 working the way I wanted on the working /boot, and had again
installed and tested it on both component hard drives to boot to grub and
to the full raid1 system just from the one drive by itself, with the
others entirely shut off.

Only when I had both drives of the working /boot up and running grub2,
did I mount the backup /boot as well, and copy over the now working
config to it, before running grub2-install on those two drives.

Of course somewhere along the way, IIRC at the same time as the raid6 to
raid1 conversion as well, I had also upgraded to gpt partitions from
traditional mbr. When I did I had the foresight to create BOTH dedicated
BIOS boot partitions and EFI partitions on each of the four drives.
grub1 wasn't using them, but that was fine; they were small (tiny). That
made the upgrade to grub2 even easier, since grub2 could install its core
into the dedicated BIOS partitions. The EFI partitions remain unused to
this day, but as I said, they're tiny, and with gpt they're specifically
typed and labeled so they can't mix me up, either.

(BTW, talking about data integrity, if you're not on GPT yet, do consider
it. It keeps a second partition table at the end of the drive as well as
the one at the beginning, and unlike mbr they're checksummed, so
corruption is detected. It also kills the primary/extended/logical
difference so no more worrying about that, and allows partition labels,
much like filesystem labels, which makes tracking and managing what's
what **FAR** easier. I GPT partition everything now, including my USB
thumbdrives if I partition them at all!)
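
For anyone wanting to try it, here's roughly what that looks like with
sgdisk from gptfdisk (just a sketch; /dev/sdX and the sizes are
placeholders, not my actual layout, and --zap-all is destructive):

sgdisk --zap-all /dev/sdX                                  # wipe old MBR/GPT
sgdisk --new=1:0:+2M --typecode=1:ef02 --change-name=1:bios-boot /dev/sdX
sgdisk --new=2:0:+128M --typecode=2:ef00 --change-name=2:efi /dev/sdX
sgdisk --new=3:0:+256M --typecode=3:fd00 --change-name=3:boot /dev/sdX
sgdisk --print /dev/sdX                 # verify; both GPT copies are checksummed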

When that machine slowly died and I transferred to a new half-TB drive
thinking it was the aging 300-gigs (it wasn't, caps were dying on the by
then 8 year old mobo), and then transferred that into my new machine
without raid, I did the usual working/backup partition arrangement, but
got frustrated without the ability to have a backup /boot, because with
just one device, the boot sector could point just one place, at the core
grub2 in the dedicated BIOS boot partition, which in turn pointed at the
usual /boot. Now grub2's better in this regard than grub1, since that
core grub2 has an emergency mode that would give me limited ability to
load a backup /boot, but that's an entirely manual process with a
comparatively limited grub2 emergency shell without additional modules
available, and I didn't actually take advantage of that to configure a
backup /boot that it could reach.

But when I switched to the SSDs, I again had multiple devices, the pair
of SSDs, which I setup with individual /boots, and the original one still
on the spinning rust. Again I installed grub2 to each one, pointed at
its own separately configured /boot, so now I actually have three
separately configured and bootable /boots, one on each of the SSDs and a
third on the spinning rust half-TB.

(FWIW the four old 300-gigs are sitting on the shelf. I need to badblocks
or dd them to wipe, and I have a friend that'll buy them off me.)

Third point. /boot partition raid1 across all five drives and three are
wasted? How? I believe if you check, all five will have a mirror of the
data (not just two unless it's btrfs raid1 not mdadm raid1, but btrfs is /
entirely/ different in that regard). Either they're all wasted but one,
or none are wasted, depending on how you look at it.

Meanwhile, do look into installing grub on each drive, so you can boot
from any of them. I definitely know it's possible as that's what I've
been doing, tested, for quite some time.
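
For reference, with grub2 on BIOS/GPT drives that's roughly the
following (a sketch; substitute your own device names):

for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    # write grub2's core image into each drive's BIOS boot partition
    grub2-install --target=i386-pc "$d"
done
grub2-mkconfig -o /boot/grub/grub.cfg    # one shared config on the raid1 /boot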

> Oh, as far as raid on partitions goes - I do use this for a different
> purpose. If you have a collection of drives of different sizes it can
> reduce space waste. Suppose you have 3 500GB drives and 2 1TB drives.
> If you put them all directly in a raid5 you get 2TB of space. If you
> chop the 1TB drives into 2 500GB partitions then you can get two raid5s
> - one 2TB in space, and the other 500GB in space. That is 500GB more
> data for the same space. Oh, and I realize I wrote raid5. With mdadm
> you can set up a 2-drive raid5. It is functionally equivalent to a
> raid1 I think,

You better check. Unless I'm misinformed, which I could be as I've not
looked at this in awhile and both mdadm and the kernel have changed quite
a bit since then, that'll be setup as a degraded raid5, which means if
you lose one...

But I do know raid10 can be setup like that, on fewer drives than it'd
normally take, with the mirrors in "far" mode I believe, and it just
arranges the stripes as it needs to. It's quite possible that they fixed
it so raid5 works similarly and can do the same thing now, in which case
that degraded thing I knew about is obsolete. But unless you know for
sure, please do check.
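
For the record, the few-drive raid10 version of that looks something
like this (a sketch; placeholder devices):

# 2-device raid10 in "far 2" layout: raid1-equivalent redundancy,
# but reads laid out more like a stripe
mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=2 \
      /dev/sda1 /dev/sdb1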

> and I believe you can convert between them, but since I generally intend
> to expand arrays I prefer to just set them up as raid5 from the start.
> Since I stick lvm on top I don't care if the space is chopped up.

There's a lot of raid conversion ability in modern mdadm. I think most
levels can be converted between, given sufficient devices. Again, a lot
has changed in that regard since I set my originals up, I'd guess
somewhere around 2008.
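
As one example of the kind of conversion I mean, a raid1 to raid5
reshape goes something like this (a sketch; device names are made up,
and whether your mdadm/kernel combo supports it is something to verify
in the man page first):

mdadm /dev/md0 --add /dev/sdc1                  # add the extra device as a spare
mdadm --grow /dev/md0 --level=5 --raid-devices=3 \
      --backup-file=/root/md0-reshape-backup    # reshape raid1 -> 3-device raid5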

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: Re: Is my RAID performance bad possibly due to starting sector value?
Yo Duncan!

On Sat, 22 Jun 2013 14:34:36 +0000 (UTC)
Duncan <1i5t5.duncan@cox.net> wrote:

> >> On the read side I'm not sure if I'm understanding your point. I
> >> agree that a so-designed RAID1 system could/might read smaller
> >> portions of a larger read from RAID1 drives in parallel, taking
> >> some data from one drive and some from another drive, and then
> >> only take corrective action if one of the drives had troubles.
> >> However I don't know that mdadm-based RAID1 does anything like
> >> that. Does it?
> >
> > It surely does. I have confirmed that at least monthly since md has
> > existed in the kernel.
>
> Out of curiosity, /how/ do you confirm that? I agree based on real
> usage experience, but with a claim that you're confirming it at least
> monthly, it sounds like you have a standardized/scripted test, and
> I'm interested in what/how you do it.

I have around 30 RAID1 sets in production right now. Some of them
doing mostly reads and some mostly writes. Some are HDD and some SSD.
The RAID sets are pushed pretty hard 24x7 and we watch the performance
pretty closely to plan updates. I have collectd performance graphs
going way back.

RGDS
GARY
---------------------------------------------------------------------------
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97701
gem@rellim.com Tel:+1(541)382-8588
Re: Re: Is my RAID performance bad possibly due to starting sector value?
On Fri, Jun 21, 2013 at 3:28 AM, Rich Freeman <rich0@gentoo.org> wrote:
> On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k
>> in data across three devices, and 8k of parity across the other two
>> devices.
>
> With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a
> stripe, not 20k. If you modify one block it needs to read all 1.5M,
> or it needs to read at least the old chunk on the single drive to be
> modified and both old parity chunks (which on such a small array is 3
> disks either way).
>

Hi Rich,
I've been rereading everyone's posts as well as trying to collect
my own thoughts. One question I have at this point, being that you and
I seem to be the two non-RAID1 users (but not necessarily devotees) at
this time, is what chunk size, stride & stripe width you are
using? Are you currently using 512K chunks on your RAID5? If so that's
potentially quite different than my 16K chunk RAID6. The more I read
through this thread and other things on the web the more I am
concerned that 16K chunks have possibly forced far more IO operations
than really makes sense for performance. Unfortunately there's no easy
way for me to really test this right now as the RAID6 uses the whole
drive. However for every 512K I want to get off the drive you might
need 1 chunk whereas I'm going to need what, 32 chunks? That's got to
be a lot more IO operations on my machine, isn't it?

For clarity, I'm using a 16K chunk, a stride of 4 blocks, and a stripe width of 12 blocks:

c2RAID6 ~ # tune2fs -l /dev/md3 | grep RAID
Filesystem volume name: RAID6root
RAID stride: 4
RAID stripe width: 12
c2RAID6 ~ #

c2RAID6 ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid6 sdb3[9] sdf3[5] sde3[6] sdd3[7] sdc3[8]
1452264480 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

unused devices: <none>
c2RAID6 ~ #
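
As far as I can tell those numbers are at least self-consistent,
assuming the usual 4K ext4 block size:

# stride = chunk size / filesystem block size
echo $(( 16 / 4 ))        # 4 blocks = 16K, matches the chunk
# stripe width = stride * data disks (5 devices minus 2 parity for RAID6)
echo $(( 4 * (5 - 2) ))   # 12 blocks = 48K per full data stripe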

As I understand one of your earlier responses I think you are using
4K sector drives, which again has that extra level of complexity in
terms of creating the partitions initially, but after that should be
fairly straightforward to use. (I think) That said there are
trade-offs between RAID5 & RAID6 but have you measured speeds using
anything like the dd method I posted yesterday, or any other way that
we could compare?

As I think Duncan asked about storage usage requirements in another
part of this thread I'll just document it here. The machine serves
main 3 purposes for me:

1) It's my day in, day out desktop. I run almost totally Gentoo
64-bit stable unless I need to keyword a package to get what I need.
Over time I tend to let my keyworded packages go stable if they are
working for me. The overall storage requirements for this, including
my home directory, typically don't run over 50GB.

2) The machine runs 3 Windows VMs every day - 2 Win 7 & 1 Win XP.
Total storage for the basic VMs is about 150GB. XP is just for things
like NetFlix. These 3 VMs typically have 9 cores allocated
to them (6+2+1) leaving 3 for Gentoo to run the hardware, etc. The 6
core VM is often using 80-100% of its CPUs sustained for long stretches
(hours to days). It's doing a lot of stock market math...

3) More recently, and really the reason to consolidate into a single
RAID of any type, I have about 900GB of mp4s which has been on an
external USB drive, and backed up to a second USB drive. However this
is mostly storage. We watch most of this video on the TV using the
second copy drive hooked directly to the TV or copied onto Kindles.
I've been having to keep multiple backups of this outside the machine
(poor man's RAID1 - two separate USB drives hooked up one at a time!)
;-) I'd rather just keep it safe on the RAID6. That said, I've not
yet put it on the RAID6 as I have these performance issues I'd like to
solve first. (If possible. Duncan is making me worry that they cannot
be solved...)

Lastly, even if I completely buy into Duncan's well formed reasons
about why RAID1 might be faster, using 500GB drives I see no single
RAID solution for me other than RAID5/6. The real RAID1/RAID6
comparison from a storage standpoint would be a (conceptual) 3-drive
RAID6 vs 3 drive RAID1. Both create 500GB of storage and can
(conceptually) lose 2 drives and still recover data. However adding
another drive to the RAID1 gains you more speed but no storage (buying
into Duncan's points) vs adding storage to the RAID6 and probably
reducing speed. As I need storage what other choices do I have?

Answering myself, take the 5 drives, create two RAIDs - a 500GB
2-drive RAID1 for the system + VMs, and then a 3-drive RAID5 for video
data maybe? I don't know...
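
Something like this, if I went that way (just a sketch, with made-up
partition names):

mdadm --create /dev/md10 --level=1 --raid-devices=2 \
      /dev/sda3 /dev/sdb3                          # system + VMs
mdadm --create /dev/md11 --level=5 --raid-devices=3 \
      /dev/sdc3 /dev/sdd3 /dev/sde3                # video data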

Or buy more hardware and do a 2 drive SSD RAID1 for the system, or
a hardware RAID controller, etc. The options explode if I start buying
more hardware.

Also, THANKS TO EVERYONE for the continued conversation.

Cheers,
Mark
Re: Re: Is my RAID performance bad possibly due to starting sector value?
I would recommend that anyone concerned about mdadm software raid
performance on gentoo test via tools like bonnie++ before putting any data
on the drives, and separate the OS from data into different sets/volumes.

I did testing two years ago watching read, write burst and sustained rates,
file ops per second, etc. I ended up getting 7 2TB enterprise data drives:
Disk 1 is OS, no raid
Disks 2-5 are data, raid 10
Disks 6-7 are backups and test/scratch space, raid 0
On Jun 22, 2013 4:04 PM, "Mark Knecht" <markknecht@gmail.com> wrote:

> On Fri, Jun 21, 2013 at 3:28 AM, Rich Freeman <rich0@gentoo.org> wrote:
> > On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> >> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k
> >> in data across three devices, and 8k of parity across the other two
> >> devices.
> >
> > With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a
> > stripe, not 20k. If you modify one block it needs to read all 1.5M,
> > or it needs to read at least the old chunk on the single drive to be
> > modified and both old parity chunks (which on such a small array is 3
> > disks either way).
> >
>
> Hi Rich,
> I've been rereading everyone's posts as well as trying to collect
> my own thoughts. One question I have at this point, being that you and
> I seem to be the two non-RAID1 users (but not necessarily devotees) at
> this time, is what chunk size, stride & stripe width with you are
> using? Are you currently using 512K chunks on your RAID5? If so that's
> potentially quite different than my 16K chunk RAID6. The more I read
> through this thread and other things on the web the more I am
> concerned that 16K chunks has possibly forced far more IO operations
> that really makes sense for performance. Unfortunately there's no easy
> way to me to really test this right now as the RAID6 uses the whole
> drive. However for every 512K I want to get off the drive you might
> need 1 chuck whereas I'm going to need what, 32 chunks? That's got to
> be a lot more IO operations on my machine isn't it?
>
> For clarity, I'm a 16K chunk, stride of 4K, stripe of 12K:
>
> c2RAID6 ~ # tune2fs -l /dev/md3 | grep RAID
> Filesystem volume name: RAID6root
> RAID stride: 4
> RAID stripe width: 12
> c2RAID6 ~ #
>
> c2RAID6 ~ # cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> md3 : active raid6 sdb3[9] sdf3[5] sde3[6] sdd3[7] sdc3[8]
> 1452264480 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5]
> [UUUUU]
>
> unused devices: <none>
> c2RAID6 ~ #
>
> As I understand one of your earlier responses I think you are using
> 4K sector drives, which again has that extra level of complexity in
> terms of creating the partitions initially, but after that should be
> fairly straight forward to use. (I think) That said there are
> trade-offs between RAID5 & RAID6 but have you measured speeds using
> anything like the dd method I posted yesterday, or any other way that
> we could compare?
>
> As I think Duncan asked about storage usage requirements in another
> part of this thread I'll just document it here. The machine serves
> main 3 purposes for me:
>
> 1) It's my day in, day out desktop. I run almostly totally Gentoo
> 64-bit stable unless I need to keyword a package to get what I need.
> Over time I tend to let my keyworded packages go stable if they are
> working for me. The overall storage requirements for this, including
> my home directory, typically don't run over 50GB.
>
> 2) The machine runs 3 Windows VMs every day - 2 Win 7 & 1 Win XP.
> Total storage for the basic VMs is about 150GB. XP is just for things
> like NetFlix. These 3 VMs typically have allocated 9 cores allocated
> to them (6+2+1) leaving 3 for Gentoo to run the hardware, etc. The 6
> core VM is often using 80-100% of its CPUs sustained for times. (hours
> to days.) It's doing a lot of stock market math...
>
> 3) More recently, and really the reason to consolidate into a single
> RAID of any type, I have about 900GB of mp4s which has been on an
> external USB drive, and backed up to a second USB drive. However this
> is mostly storage. We watch most of this video on the TV using the
> second copy drive hooked directly to the TV or copied onto Kindles.
> I've been having to keep multiple backups of this outside the machine
> (poor man's RAID1 - two separate USB drives hooked up one at a time!)
> ;-) I'd rather just keep it safe on the RAID 6, That said, I've not
> yet put it on the RAID6 as I have these performance issues I'd like to
> solve first. (If possible. Duncan is making me worry that they cannot
> be solved...)
>
> Lastly, even if I completely buy into Duncan's well formed reasons
> about why RAID1 might be faster, using 500GB drives I see no single
> RAID solution for me other than RAID5/6. The real RAID1/RAID6
> comparison from a storage standpoint would be a (conceptual) 3-drive
> RAID6 vs 3 drive RAID1. Both create 500GB of storage and can
> (conceptually) lose 2 drives and still recover data. However adding
> another drive to the RAID1 gains you more speed but no storage (buying
> into Duncan's points) vs adding storage to the RAID6 and probably
> reducing speed. As I need storage what other choices do I have?
>
> Answering myself, take the 5 drives, create two RAIDS - a 500GB
> 2-drive RAID1 for the system + VMs, and then a 3-drive RAID5 for video
> data maybe? I don't know...
>
> Or buy more hardware and do a 2 drive SSD RAID1 for the system, or
> a hardware RAID controller, etc. The options explode if I start buying
> more hardware.
>
> Also, THANKS TO EVERYONE for the continued conversation.
>
> Cheers,
> Mark
>
>
Re: Re: Is my RAID performance bad possibly due to starting sector value?
On Sat, Jun 22, 2013 at 7:23 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Mark Knecht posted on Fri, 21 Jun 2013 10:40:48 -0700 as excerpted:
>
>> On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> <SNIP>
<SNIP>
>
> ... Assuming $PWD is now on the raid. You had the path shown too, which
> I snipped, but that doesn't tell /me/ (as opposed to you, who should know
> based on your mounts) anything about whether it's on the raid or not.
> However, the above including the drop-caches demonstrates enough care
> that I'm quite confident you'd not make /that/ mistake.
>
>> 4) As a second test I read from the RAID6 and write back to the RAID6.
>> I see MUCH lower speeds, again repeatable:
>>
>> dd if=SDDCopy of=HDDWrite
>> 97656250+0 records in 97656250+0 records out
>> 50000000000 bytes (50 GB) copied, 1187.07 s, 42.1 MB/s
>
>> 5) As a final test, and just looking for problems if any, I do an SDD to
>> SDD copy which clocked in at close to 200MB/S
>>
>> dd if=random1 of=SDDCopy
>> 97656250+0 records in 97656250+0 records out
>> 50000000000 bytes (50 GB) copied, 251.105 s, 199 MB/s
>
>> So, being that this RAID6 was grown yesterday from something that
>> has existed for a year or two I'm not sure of its fragmentation, or
>> even how to determine that at this time. However it seems my problems are
>> RAID6 reads, not RAID6 writes, at least to new and probably never used
>> disk space.
>
> Reading all that, one question occurs to me. If you want to test read
> and write separately, why the intermediate step of dd-ing from /dev/
> random to ssd, then from ssd to raid or ssd?
>
> Why not do direct dd if=/dev/random (or urandom, see note below)
> of=/desired/target ... for write tests, and then (after dropping caches),
> if=/desired/target of=/dev/null ... for read tests? That way there's
> just the one block device involved, not both.
>

1) I was a bit worried about using it in a way it might not have been
intended to be used.

2) I felt that if I had a specific file then results should be
repeatable, or at least not dependent on what's in the file.


<SNIP>
>
> Meanwhile, dd-ing either from /dev/urandom as source, or to /dev/null as
> sink, with only the test-target block device as a real block device,
> should give you "purer" read-only and write-only tests. In theory it
> shouldn't matter much given your method of testing, but as we all know,
> theory and reality aren't always well aligned.
>

Will try some tests this way tomorrow morning.

>
> Of course the next question follows on from the above. I see a write to
> the raid, and a copy from the raid to the raid, so read/write on the
> raid, and a copy from the ssd to the ssd, read/write on it, but no test
> of from the raid read.
>
> So
>
> if=/dev/urandom of=/mnt/raid/target ... should give you raid write.
>
> drop-caches
>
> if=/mnt/raid/target of=/dev/null ... should give you raid read.
>
> *THEN* we have good numbers on both to compare the raid read/write to.
>
> What I suspect you'll find, unless fragmentation IS your problem, is that
> both read (from the raid) alone and write (to the raid) alone should be
> much faster than read/write (from/to the raid).
>
> The problem with read/write is that you're on "rotating rust" hardware
> and there's some latency as it repositions the heads from the read
> location to the write location and back.
>

If this lack of performance is truly driven by the drive rotational
issues then I completely agree.

> If I'm correct and that's what you find, a workaround specific to dd
> would be to specify a much larger block size, so it reads in far more
> data at once, then writes it out at once, with far fewer switches between
> modes. In the above you didn't specify bs (or the separate input/output
> equivilents, ibs/obs respectively) at all, so it's using 512-byte
> blocksize defaults.
>

So help me clarify this before I do the work and find out I didn't
understand. Whereas earlier I created a file using:

dd if=/dev/random of=random1 bs=1000 count=0 seek=$[1000*1000*50]

if what you are suggesting is more like this very short example:

mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/dev/urandom of=urandom1
bs=4096 count=$[1000*100]
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 25.8825 s, 15.8 MB/s
mark@c2RAID6 /VirtualMachines/bonnie $

then the results for writing this 400MB file are very slow, but I'm
sure I don't understand what you're asking, or urandom is the limiting
factor here.

I'll look for a reply (you or anyone else that has Duncan's idea
better than I do) before I do much more.

Thanks!

- Mark
Re: Re: Is my RAID performance bad possibly due to starting sector value?
On Sat, Jun 22, 2013 at 6:02 PM, Mark Knecht <markknecht@gmail.com> wrote:

> if what you are suggesting is more like this very short example:
>
> mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/dev/urandom of=urandom1
> bs=4096 count=$[1000*100]
> 100000+0 records in
> 100000+0 records out
> 409600000 bytes (410 MB) copied, 25.8825 s, 15.8 MB/s
> mark@c2RAID6 /VirtualMachines/bonnie $
>


Duncan,
Actually, using your idea of piping things to /dev/null it appears
that the random number generator itself is only capable of 15MB/S on
my machine.

mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/dev/urandom of=/dev/null
bs=4096 count=$[1000]
1000+0 records in
1000+0 records out
4096000 bytes (4.1 MB) copied, 0.260608 s, 15.7 MB/s
mark@c2RAID6 /VirtualMachines/bonnie $

It doesn't change much based on block size or number of bytes I pipe.

If this speed is representative of how well that works then I think
I have to use a file. It appears this guy gets similar values:

http://www.globallinuxsecurity.pro/quickly-fill-a-disk-with-random-bits-without-dev-urandom/

On the other hand, piping /dev/zero appears to be very fast -
basically the speed of the processor I think:

mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/dev/zero of=/dev/null
bs=4096 count=$[1000]
1000+0 records in
1000+0 records out
4096000 bytes (4.1 MB) copied, 0.000622594 s, 6.6 GB/s
mark@c2RAID6 /VirtualMachines/bonnie $

- Mark
Re: Is my RAID performance bad possibly due to starting sector value?
Howdy,
My own 2c on the issue is to suggest LVM.
It looks at things in a slightly different way and allows me to treat
all my disks as one large volume I can carve up.
It supports multi-way mirroring, so I can choose to create a volume for
all my pictures which is on at least 3 drives.
It supports volume striping (RAID0) so I can put swap and scratch
files there.
It does support other RAID levels, but I can't find where the scrub option is.
It supports volume concatenation so I can just keep growing my MythTV
recordings volume by just adding another disk.
It supports encrypted volumes so I can put all my guarded stuff in there.
It supports (with some magic) nested volumes, so I can have an encrypted
volume sitting inside a mirrored volume so my secrets are protected.
I can partition my drives in 3 parts, so that I can create volume
groups of fast, medium and slow based on where on the disk the partition
is (start track ~150MB/sec, end track ~60MB/sec; numbers sort of
remembered, sort of made up).
I can have a bunch of disks for long-term storage and hdparm can spin
them down all the time.
Live movement, even of a root volume, also means that I can keep moving
storage to the storage drives, or decide to use a fast disk as a storage
disk and have that spin down too.

I think the crucial aspect is to also consider what you wish to put on
the drives.
If it is just pr0n, do you really care if it gets lost?
If it is just scratch areas that need to be fast, ditto.
Where the different RAID levels are good is the use of parity, so you don't
lose half of your potential storage size as you would with a mirror.
Bit rot is real; all it takes is a single misaligned charged particle
from that nuclear furnace in the sky to knock a single bit out of
magnetic alignment, so it will require regular scrubbing, maybe in a cron job.
https://wiki.archlinux.org/index.php/Software_RAID_and_LVM#Data_scrubbing
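
For md raid specifically, a monthly scrub can be as simple as a cron
job like this (a sketch; use whatever md device name you actually have):

cat > /etc/cron.monthly/raid-scrub <<'EOF'
#!/bin/sh
# kick off an md consistency check ("scrub"); progress shows in /proc/mdstat
echo check > /sys/block/md3/md/sync_action
EOF
chmod +x /etc/cron.monthly/raid-scrub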

Specifically on the bandwidth issue, I'd suggest
1. Take all the drives out of RAID if you can and run a benchmark against
them individually; I like the benchmark tool in palimpsest, but that's me.
2. Concurrently run dd if=/dev/zero of=/dev/sdX on all drives and see
how it compares to the individual scores; this will show you the computer
mainboard/chipset effect (see the sketch below the list).
3. You might find this
https://raid.wiki.kernel.org/index.php/RAID_setup#Calculation a good
starting point for calculating strides and stripes,
and this http://forums.gentoo.org/viewtopic-t-942794-start-0.html
shows the benefit of adjusting the numbers.
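
For item 2, something along these lines would do it (a sketch; the
drive letters are placeholders, and writing zeros to the raw devices is
destructive, so only do it on drives pulled out of the array):

for d in sdb sdc sdd sde sdf; do
    # ~4GB sequential write per drive, bypassing the page cache
    dd if=/dev/zero of=/dev/$d bs=1M count=4096 oflag=direct &
done
wait    # then compare per-drive rates here with the individual benchmarks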


hope this helps!


On 06/20/2013 08:10 PM, Mark Knecht wrote:
> Hi,
> Does anyone know of info on how the starting sector number might
> impact RAID performance under Gentoo? The drives are WD-500G RE3
> drives shown here:
>
> http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/B001EMZPD0/ref=cm_cr_pr_product_top
>
> These are NOT 4k sector sized drives.
>
> Specifically I'm running a 5-drive RAID6 for about 1.45TB of storage. My
> benchmarking seems abysmal at around 40MB/S using dd copying large
> files. It's higher, around 80MB/S if the file being transferred is
> coming from an SSD, but even 80MB/S seems slow to me. I see a LOT of
> wait time in top. And my 'large file' copies might not be large enough
> as the machine has 24GB of DRAM and I've only been copying 21GB so
> it's possible some of that is cached.
>
> Then I looked again at how I partitioned the drives originally and
> see the starting sector of partition 3 as 8594775. I started wondering if
> something like 4K block sizes at the file system level might be
> getting munged across 16k chunk sizes in the RAID. Maybe the blocks
> are being torn apart in bad ways for performance? That led me down a
> bunch of rabbit holes and I haven't found any light yet.
>
> Looking for some thoughtful ideas from those more experienced in this area.
>
> Cheers,
> Mark
>
Re: Re: Is my RAID performance bad possibly due to starting sector value?
On Sat, Jun 22, 2013 at 7:04 PM, Mark Knecht <markknecht@gmail.com> wrote:
> I've been rereading everyone's posts as well as trying to collect
> my own thoughts. One question I have at this point, being that you and
> I seem to be the two non-RAID1 users (but not necessarily devotees) at
> this time, is what chunk size, stride & stripe width with you are
> using?

I'm using 512K chunks on the two RAID5s which are my LVM PVs:
md7 : active raid5 sdc3[0] sdd3[6] sde3[7] sda4[2] sdb4[5]
971765760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
bitmap: 1/2 pages [4KB], 65536KB chunk

md6 : active raid5 sda3[0] sdd2[4] sdb3[3] sde2[5]
2197687296 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
bitmap: 2/6 pages [8KB], 65536KB chunk

On top of this I have a few LVs with ext4 filesystems:
tune2fs -l /dev/vg1/root | grep RAID
RAID stride: 128
RAID stripe width: 384
(this is root, bin, sbin, lib)

tune2fs -l /dev/vg1/data | grep RAID
RAID stride: 19204
(this is just about everything else)

tune2fs -l /dev/vg1/video | grep RAID
RAID stride: 11047
(this is mythtv video)

Those were all the defaults picked, and with the exception of root I
believe the array was quite different when the others were created.
I'm pretty confident that none of these are optimized, and I'd be
shocked if any of them are aligned unless this is automated (including
across pvmoves, reshaping, and such).

That is part of why I'd like to move to btrfs - optimizing raid with
mdadm+lvm+mkfs.ext4 involves a lot of micromanagement as far as I'm
aware. Docs are very spotty at best, and it isn't at all clear that
things get adjusted as needed when you actually take advantage of
things like pvmove or reshaping arrays. I suspect that having btrfs
on bare metal will be more likely to result in something that keeps
itself in-tune.

Rich
Re: Re: Is my RAID performance bad possibly due to starting sector value?
On Sun, Jun 23, 2013 at 4:43 AM, Rich Freeman <rich0@gentoo.org> wrote:
> On Sat, Jun 22, 2013 at 7:04 PM, Mark Knecht <markknecht@gmail.com> wrote:
>> I've been rereading everyone's posts as well as trying to collect
>> my own thoughts. One question I have at this point, being that you and
>> I seem to be the two non-RAID1 users (but not necessarily devotees) at
>> this time, is what chunk size, stride & stripe width with you are
>> using?
>
> I'm using 512K chunks on the two RAID5s which are my LVM PVs:
> md7 : active raid5 sdc3[0] sdd3[6] sde3[7] sda4[2] sdb4[5]
> 971765760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
> bitmap: 1/2 pages [4KB], 65536KB chunk
>
> md6 : active raid5 sda3[0] sdd2[4] sdb3[3] sde2[5]
> 2197687296 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
> bitmap: 2/6 pages [8KB], 65536KB chunk
>
> On top of this I have a few LVs with ext4 filesystems:
> tune2fs -l /dev/vg1/root | grep RAID
> RAID stride: 128
> RAID stripe width: 384
> (this is root, bin, sbin, lib)
>
> tune2fs -l /dev/vg1/data | grep RAID
> RAID stride: 19204
> (this is just about everything else)
>
> tune2fs -l /dev/vg1/video | grep RAID
> RAID stride: 11047
> (this is mythtv video)
>
> Those were all the defaults picked, and with the exception of root I
> believe the array was quite different when the others were created.
> I'm pretty confident that none of these are optimizes, and I'd be
> shocked if any of them are aligned unless this is automated (including
> across pvmoves, reshaping, and such).
>
> That is part of why I'd like to move to btrfs - optimizing raid with
> mdadm+lvm+mkfs.ext4 involves a lot of micromanagement as far as I'm
> aware. Docs are very spotty at best, and it isn't at all clear that
> things get adjusted as needed when you actually take advantage of
> things like pvmove or reshaping arrays. I suspect that having btrfs
> on bare metal will be more likely to result in something that keeps
> itself in-tune.
>
> Rich
>

Thanks Rich. I'm finding that helpful.

I completely agree on the micromanagement comment. At one level or
another that's sort of what this whole thread is about!

On your root partition I sort of wonder about the stripe width.
Assuming I did it right (5, 5, 512, 4), this little page calculates 128
for the stride and 512 for the stripe width (4 data disks * 128, I think).
Just a piece of info.

http://busybox.net/~aldot/mkfs_stride.html
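
The arithmetic behind that, assuming 4K ext4 blocks (just my
back-of-envelope check):

# stride = chunk / block = 512K / 4K
echo $(( 512 / 4 ))       # 128
# stripe width = stride * data disks = 128 * (5 - 1) for a 5-device RAID5
echo $(( 128 * 4 ))       # 512

The existing 384 works out to 128 * 3, which would fit a raid5 with
three data disks - maybe the 4-device md6, or the array as it was when
that filesystem was created.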

Returning to the title of the thread, asking about partition location
essentially, I woke up this morning and had sort of decided to just
try changing the chunk size to something large like your 512K. It
seems I'm out of luck as my partition size is not (apparently)
divisible by 512K:

c2RAID6 ~ # mdadm --grow /dev/md3 --chunk=512
--backup-file=/backups/ChunkSizeBackup
mdadm: component size 484088160K is not a multiple of chunksize 512K
c2RAID6 ~ # mdadm --grow /dev/md3 --chunk=256
--backup-file=/backups/ChunkSizeBackup
mdadm: component size 484088160K is not a multiple of chunksize 256K
c2RAID6 ~ # mdadm --grow /dev/md3 --chunk=128
--backup-file=/backups/ChunkSizeBackup
mdadm: component size 484088160K is not a multiple of chunksize 128K
c2RAID6 ~ # mdadm --grow /dev/md3 --chunk=64
--backup-file=/backups/ChunkSizeBackup
mdadm: component size 484088160K is not a multiple of chunksize 64K
c2RAID6 ~ #
c2RAID6 ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid6 sdb3[9] sdf3[5] sde3[6] sdd3[7] sdc3[8]
1452264480 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

unused devices: <none>
c2RAID6 ~ # fdisk -l /dev/sdb

Disk /dev/sdb: 500.1 GB, 500107862016 bytes, 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x8b45be24

Device Boot Start End Blocks Id System
/dev/sdb1 * 63 112454 56196 83 Linux
/dev/sdb2 112455 8514449 4200997+ 82 Linux swap / Solaris
/dev/sdb3 8594775 976773167 484089196+ fd Linux raid autodetect
c2RAID6 ~ #

I suspect I might be much better off if all the partition sizes were
divisible by 2048 and started on a 2048-sector multiple, like the newer fdisk
tools enforce.
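
The arithmetic bears that out (a quick check, nothing mdadm prints):

echo $(( 484088160 % 512 ))   # 352 -> not a multiple of 512K
echo $(( 484088160 % 64 ))    # 32  -> not even a multiple of 64K
echo $(( 484088160 % 32 ))    # 0   -> 32K is the biggest power-of-two chunk that fits

so short of shrinking the component size slightly first, nothing bigger
than a 32K chunk is going to go in place on these partitions.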

I am thinking I won't make much headway unless I completely rebuild
the system from bare metal up. If I'm going to do that then I need to
get a good copy of the whole RAID onto some other drive which is a big
scary job, then start over with an install disk I guess.

Not sure I'm up for that just yet on a Sunday morning...

Take care,
Mark
Re: Is my RAID performance bad possibly due to starting sector value?
Am 20.06.2013 22:45, schrieb Mark Knecht:
> On Thu, Jun 20, 2013 at 12:16 PM, Volker Armin Hemmann
> <volkerarmin@googlemail.com> wrote:
> <SNIP>
>> man mkfs.xfs
>>
>> man mkfs.ext4
>>
>> look for stripe size etc.
>>
>> Have fun.
>>
> Volker,
> I find way down at the bottom of the RAID setup page that they do
> say stride & stripe are important for RAID4 & RAID5, but remain
> non-committal for RAID6.

raid 6 is just raid5 with additional parity. So stripe size is not less
important.
Re: Is my RAID performance bad possibly due to starting sector value?
On Mon, Jun 24, 2013 at 11:47 AM, Volker Armin Hemmann
<volkerarmin@googlemail.com> wrote:
> Am 20.06.2013 22:45, schrieb Mark Knecht:
>> On Thu, Jun 20, 2013 at 12:16 PM, Volker Armin Hemmann
>> <volkerarmin@googlemail.com> wrote:
>> <SNIP>
>>> man mkfs.xfs
>>>
>>> man mkfs.ext4
>>>
>>> look for stripe size etc.
>>>
>>> Have fun.
>>>
>> Volker,
>> I find way down at the bottom of the RAID setup page that they do
>> say stride & stripe are important for RAID4 & RAID5, but remain
>> non-committal for RAID6.
>
> raid 6 is just raid5 with additional parity. So stripe size is not less
> important.
>

Yeah, as I continued to study that became more apparent. The Linux
RAID wiki not saying anything about it was (apparently) just an oversight on
their part.

At this point I'm basically getting set up to tear my whole machine
apart and rebuild it from scratch. When I do I'll benchmark whatever
RAID options I think will meet my long term needs and then report back
anything I find.

Personally, I think that RAID6 should be just slightly slower than
RAID5, and use slightly more CPU power doing it. How RAID5/6 really
compare with RAID1 isn't really that much of an issue for me as using
only RAID1 won't give me enough storage using any combinations of my 5
500GB drives.

I think if I was into spending some money I'd look at buying a second
SSD and do RAID1 for my / and then just use the disks for the VMs &
video, but don't see that as an option right now.

Cheers,
Mark
Re: Is my RAID performance bad possibly due to starting sector value?
Gary E. Miller posted on Sat, 22 Jun 2013 15:15:16 -0700 as excerpted:

>> >> [Does md/raid1 do parallel reads of multiple files at once?]
>> >
>> > It surely does. I have confirmed that at least monthly since md has
>> > existed in the kernel.
>>
>> Out of curiosity, /how/ do you confirm that? I agree based on real
>> usage experience, but with a claim that you're confirming it at least
>> monthly, it sounds like you have a standardized/scripted test, and I'm
>> interested in what/how you do it.
>
> I have around 30 RAID1 sets in production right now. Some of them doing
> mostly reads and some mostly writes. Some are HDD and some SSD.
> The RAID sets are pushed pretty hard 24x7 and we watch the performance
> pretty closely to plan updates. I have collectd performance graphs
> going way back.

So you're basically confirming it with normal usage as well, but have
documented performance history going pretty well all the way back. Not
the simple test script I was hoping for, but pretty impressive, none-the-
less.

Thanks.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: Re: Is my RAID performance bad possibly due to starting sector value?
Yo Duncan!

On Fri, 28 Jun 2013 00:20:45 +0000 (UTC)
Duncan <1i5t5.duncan@cox.net> wrote:

> > I have around 30 RAID1 sets in production right now. Some of them
> > doing mostly reads and some mostly writes. Some are HDD and some
> > SSD. The RAID sets are pushed pretty hard 24x7 and we watch the
> > performance pretty closely to plan updates. I have collectd
> > performance graphs going way back.
>
> So you're basically confirming it with normal usage as well, but have
> documented performance history going pretty well all the way back.
> Not the simple test script I was hoping for, but pretty impressive,
> none-the-less.

I find that 'hdparm -tT', 'dd' and 'bonnie++' will match up pretty well
with what I see in production. Just be sure to use really large test
file sizes with bonnie++ and dd. dd also needs a pretty large block size
(bs=) and pretty large/fast source of bits when writing.

With bonnie++ you can easily see the speed differences between raw disks
and various RAID types.


RGDS
GARY
---------------------------------------------------------------------------
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97701
gem@rellim.com Tel:+1(541)382-8588
Re: Is my RAID performance bad possibly due to starting sector value?
Mark Knecht posted on Sat, 22 Jun 2013 16:04:06 -0700 as excerpted:

> Lastly, even if I completely buy into Duncan's well formed reasons about
> why RAID1 might be faster, using 500GB drives I see no single RAID
> solution for me other than RAID5/6. The real RAID1/RAID6 comparison from
> a storage standpoint would be a (conceptual) 3-drive RAID6 vs 3 drive
> RAID1. Both create 500GB of storage and can (conceptually) lose 2 drives
> and still recover data. However adding another drive to the RAID1 gains
> you more speed but no storage (buying into Duncan's points) vs adding
> storage to the RAID6 and probably reducing speed. As I need storage what
> other choices do I have?
>
> Answering myself, take the 5 drives, create two RAIDS - a 500GB
> 2-drive RAID1 for the system + VMs, and then a 3-drive RAID5 for video
> data maybe? I don't know...
>
> Or buy more hardware and do a 2 drive SSD RAID1 for the system, or
> a hardware RAID controller, etc. The options explode if I start buying
> more hardware.

Finally getting back to this on what's my "weekend"...

Unfortunately, given 900 gigs media data and 150 gigs of VMs, with 5 500
gig drives to work with, you're right, simply making a raid1 out of
everything isn't possible.

You could do a 4-drive raid10, two-way striped and two-way mirrored, for
a TB of storage for the media files and possibly squeeze the VMs between
the SSD and the raid, with the 5th half-TB as a backup, but it'd be quite
tight and non-optimal, plus losing the wrong two drives on the raid10
would put it out of commission so you'd have only one-drive-loss-
tolerance there.

You could buy a sixth half-TB and try either three-way-striping and two-
way mirroring for the same one-drive-loss tolerance but a good 1.5 TB (3-
way half-TB stripe) space, giving you plenty of space and thruput speed
but at the cost of only single-drive-loss-tolerance.

You could use the same six in a raid10 with the reverse configuration,
two-way-stripe three-way-mirror, for better loss-of-two-tolerance but at
only a TB of space and have the same squeeze as the 4-way raid10 (but now
without the extra drive for backup), or...

Personally, I'd probably be intensely motivated enough to try the 2-way-
stripe 3-way-mirror 6-drive raid10, squeezing the media space as
necessary to do it (maybe by using external drives for what wouldn't
fit), but that's still a compromise... and includes buying that sixth
drive.

So the raid6 might well be the best alternative you have, given the data
size AND physical device size constraints.

But some time testing the performance of different configs and
familiarizing yourself with the options and operation, as you've decided
to do now, certainly won't hurt. I DID say I wasn't real strong on the
chunk options, etc, myself, and you're using ext4, not the reiserfs I was
using, and I believe ext4 has at least some potential performance upside
compared to reiserfs, so it's quite possible that with some chunk/stride/
etc tweaking, you can get something better, performance-wise. Tho I
expect raid6 will never be a speed demon, and may well never perform as
you had originally expected/hoped. But better than the initial results
should be possible, hopefully, and familiarizing yourself with things
while experimenting has benefits of its own, so that's an idea I can
agree with 100%. =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: Re: Is my RAID performance bad possibly due to starting sector value?
I supported about 250 Gentoo VMs using about 30 SAS 15K rpm 144GB
drives awhile back. Drives were split into 14 disk RAID10 sets. Then
each RAID10 set was split into 200-500GB virtual drives, and the
virtual machines were grouped into sets of 3-5 and matched with a
virtual drive. Virtual machines on the same virtual drive were set up
to use thin provisioning, so they only used up as much storage space
as their data differed from the canonical Gentoo OS image, which was
usually less than 20%. The virtual drives were usually only 30-50%
full and we could virtually provision 2TB+ of virtual machines on a
single 500GB virtual drive.

Don't underestimate what you can do with small drives, especially if
they are fast and you have a lot of them....

On Thu, Jun 27, 2013 at 5:51 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Mark Knecht posted on Sat, 22 Jun 2013 16:04:06 -0700 as excerpted:
>
>> Lastly, even if I completely buy into Duncan's well formed reasons about
>> why RAID1 might be faster, using 500GB drives I see no single RAID
>> solution for me other than RAID5/6. The real RAID1/RAID6 comparison from
>> a storage standpoint would be a (conceptual) 3-drive RAID6 vs 3 drive
>> RAID1. Both create 500GB of storage and can (conceptually) lose 2 drives
>> and still recover data. However adding another drive to the RAID1 gains
>> you more speed but no storage (buying into Duncan's points) vs adding
>> storage to the RAID6 and probably reducing speed. As I need storage what
>> other choices do I have?
>>
>> Answering myself, take the 5 drives, create two RAIDS - a 500GB
>> 2-drive RAID1 for the system + VMs, and then a 3-drive RAID5 for video
>> data maybe? I don't know...
>>
>> Or buy more hardware and do a 2 drive SSD RAID1 for the system, or
>> a hardware RAID controller, etc. The options explode if I start buying
>> more hardware.
>
> Finally getting back to this on what's my "weekend"...
>
> Unfortunately, given 900 gigs media data and 150 gigs of VMs, with 5 500
> gig drives to work with, you're right, simply making a raid1 out of
> everything isn't possible.
>
> You could do a 4-drive raid10, two-way striped and two-way mirrored, for
> a TB of storage for the media files and possibly squeeze the VMs between
> the SSD and the raid, with the 5th half-TB as a backup, but it'd be quite
> tight and non-optimal, plus losing the wrong two drives on the raid10
> would put it out of commission so you'd have only one-drive-loss-
> tolerance there.
>
> You could buy a sixth half-TB and try either three-way-striping and two-
> way mirroring for the same one-drive-loss tolerance but a good 1.5 TB (3-
> way half-TB stripe) space, giving you plenty of space and thruput speed
> but at the cost of only single-drive-loss-tolerance.
>
> You could use the same six in a raid10 with the reverse configuration,
> two-way-stripe three-way-mirror, for better loss-of-two-tolerance but at
> only a TB of space and have the same squeeze as the 4-way raid10 (but now
> without the extra drive for backup), or...
>
> Personally, I'd probably be intensely motivated enough to try the 2-way-
> stripe 3-way-mirror 6-drive raid10, squeezing the media space as
> necessary to do it (maybe by using external drives for what wouldn't
> fit), but that's still a compromise... and includes buying that sixth
> drive.
>
> So the raid6 might well be the best alternative you have, given the data
> size AND physical device size constraints.
>
> But some time testing the performance of different configs and
> familiarizing yourself with the options and operation, as you've decided
> to do now, certainly won't hurt. I DID say I wasn't real strong on the
> chunk options, etc, myself, and you're using ext4, not the reiserfs I was
> using, and I believe ext4 has at least some potential performance upside
> compared to reiserfs, so it's quite possible that with some chunk/stride/
> etc tweaking, you can get something better, performance-wise. Tho I
> expect raid6 will never be a speed demon, and may well never perform as
> you had originally expected/hoped. But better than the initial results
> should be possible, hopefully, and familiarizing yourself with things
> while experimenting has benefits of its own, so that's an idea I can
> agree with 100%. =:^)
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman
>
>
Re: Is my RAID performance bad possibly due to starting sector value?
Mark Knecht posted on Sat, 22 Jun 2013 18:48:15 -0700 as excerpted:

> Duncan,

Again, following up now that it's my "weekend" and I have a chance...

> Actually, using your idea of piping things to /dev/null it appears
> that the random number generator itself is only capable of 15MB/S on my
> machine. It doesn't change much based on block size of number of bytes
> I pipe.

=:^(

Well, you tried.

> If this speed is representative of how well that works then I think
> I have to use a file. It appears this guy gets similar values:
>
> http://www.globallinuxsecurity.pro/quickly-fill-a-disk-with-random-bits-
without-dev-urandom/

Wow, that's a very nice idea he has there! I'll have to remember that!
The same idea should work for creating any relatively large random file,
regardless of final use. Just crypt-setup the thing and dd /dev/zero
into it.
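
Roughly like this, if I have that page right (a sketch; the device and
mapping name are placeholders, and it overwrites whatever is on the
target):

# plain dm-crypt mapping with a throwaway key read from /dev/urandom
cryptsetup --key-file /dev/urandom create scratch /dev/sdX
# zeros in, effectively-random ciphertext out, at near-device speed
dd if=/dev/zero of=/dev/mapper/scratch bs=1M
cryptsetup remove scratch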

FWIW, you're doing better than my system does, however. I seem to run
about 13 MB/s from /dev/urandom (upto 13.7 depending on blocksize). And
back to the random vs urandom discussion, random totally blocked here
after a few dozen bytes, waiting for more random data to be generated.
So the fact that you actually got a usefully sized file out of it does
indicate that you must have hardware random and that it's apparently
working well.

> On the other hand, piping /dev/zero appears to be very fast -
> basically the speed of the processor I think:
>
> $ dd if=/dev/zero of=/dev/null bs=4096 count=$[1000]
> 1000+0 records in 1000+0 records out 4096000 bytes (4.1 MB) copied,
> 0.000622594 s, 6.6 GB/s

What's most interesting to me when I tried that here is that unlike
urandom, zero's output varies DRAMATICALLY by blocksize. With
bs=$((1024*1024)) (aka 1MB), I get 14.3 GB/s, tho at the default bs=512,
I get only 1.2 GB/s. (Trying a few more values, 1024*512 gives me very
similar 14.5 GB/s, 1024*64 is already down to 13.2 GB/s, 1024*128=13.9
and 1024*256=14.1, while on the high side 1024*1024*2 is already down to
10.2 GB/s. So quarter MB to one MB seems the ideal range, on my
hardware.)

But of course, if your device is compressible-data speed-sensitive, as
are say the sandforce-controller-based ssds, /dev/zero isn't going to
give you anything like the real-world benchmark random data would (tho it
should be a great best-case compressible-data test). Tho it's unlikely
to matter on most spinning rust, AFAIK, and SSDs like my Corsair Neutrons
(Link_A_Media/LAMD-based controller), which have as a bullet-point
feature that they're data compression agnostic, unlike the sandforce-
based SSDs.

Since /dev/zero is so fast, I'd probably do a few initial tests to
determine whether compressible data makes a difference on what you're
testing, then use /dev/zero if it doesn't appear to, to get a reasonable
base config, then finally double-check that against random data again.

Meanwhile, here's another idea for random data, seeing as /dev/urandom is
speed limited. Upto your memory constraints anyway, you should be able
to dd if=/dev/urandom of=/some/file/on/tmpfs . Then you can
dd if=/tmpfs/file, of=/dev/test/target, or if you want a bigger file than
a direct tmpfs file will let you use, try something like this:

cat /tmpfs/file /tmpfs/file /tmpfs/file | dd of=/dev/test/target

... which would give you 3X the data size of /tmpfs/file.

(Man, testing that with a 10 GB tmpfs file (on a 12 GB tmpfs /tmp), I can
see how slow that 13 MB/s /dev/urandom actually is as I'm creating
it! OUCH! I waited awhile before I started typing this comment... I've
been typing slowly and looking at the usage graph as I type, and I'm
still only at maybe 8 gigs, depending on where my cache usage was when I
started, right now!)

cd /tmp

dd if=/dev/urandom of=/tmp/10gig.testfile bs=$((1024*1024)) count=10240

(10240 records, 10737418240 bytes, but it says 11 GB copied, I guess dd
uses 10^3 multipliers, anyway, ~783 s, 13.7 MB/s)

ls -l 10gig.testfile

(confirm the size, 10737418240 bytes)

cat 10gig.testfile 10gig.testfile 10gig.testfile \
10gig.testfile 10gig.testfile | dd of=/dev/null

(that's 5x, yielding 50 GiB (power of 2), 104857600+0 records, 53687091200
bytes, ~140s, 385 MB/s at the default 512-byte blocksize)

Wow, what a difference block size makes there, too! Trying the above cat/
dd with bs=$((1024*1024)) (1MB) yields ~30s, 1.8 GB/s!

1GB block size (1024*1024*1024) yields about the same, 30s, 1.8 GB/s.

LOL dd didn't like my idea to try a 10 GB buffer size!

dd: memory exhausted by input buffer of size 10737418240 bytes (10 GiB)

(No wonder, as that'd be 10GB in tmpfs/cache and a 10GB buffer, and I'm
/only/ running 16 gigs RAM and no swap! But it won't take 2 GB either.
Checking, looks like as my normal user I'm running a ulimit of 1-gig
memory size, 2-gig virtual-size, so I'm sort of surprised it took the 1GB
buffer... maybe that counts against virtual only or something?)
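For the record, checking those limits is just a couple of bash builtins
(a quick sketch; both report in KiB):

ulimit -m   # max resident set size
ulimit -v   # max virtual memory size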

Low side again, ~90s, 599 MB/s @ 1KB (1024 byte) bs, already a dramatic
improvement from the 140s 385 MB/s of the default 512-byte block.

2KB bs yields 52s, 1 GB/s

16KB bs yields 31s, 1.7 GB/s, near optimum already.

High side again, 1024*1024*4 (4MB) bs appears to be best-case, just under
29s, 1.9 GB/s. Going to 8MB takes another second, 1.8 GB/s again. (The
normal memory page size is only 4 KiB, not 4 MB, so I'd guess the peak is
simply the point where per-call overhead is fully amortized, and still
larger buffers start costing more in cache pressure than they save in
syscalls.)

FWIW, cat seems to run just over 100% single-core saturation while dd
seems to run just under, @97% or so.

Running two instances in parallel (using the peak 4MB block size, 1.9 GB/
s with a single run) seems to cut performance some, but not nearly in
half. (I got 1.5 GB/s and 1.6 GB/s, but I started one then switched to a
different terminal to start the other, so they only overlapped by maybe
30s or so of the 35s on each.)
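For a cleaner overlap, something like this sketch starts both instances
at once and waits for both, instead of the switch-terminals shuffle:

cat 10gig.testfile 10gig.testfile 10gig.testfile \
    10gig.testfile 10gig.testfile | dd of=/dev/null bs=$((1024*1024*4)) &
cat 10gig.testfile 10gig.testfile 10gig.testfile \
    10gig.testfile 10gig.testfile | dd of=/dev/null bs=$((1024*1024*4)) &
wait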

OK, so that's all memory/cpu since neither end is actual storage, but
that does give me a reasonable base against which to benchmark actual
storage (rust or ssd), if I wished.

What's interesting is that by, I guess, pure coincidence, my 385 MB/s
original 512-byte blocksize figure is reasonably close to what the SSD
read benchmarks are with hdparm. IIRC the hdparm/ssd numbers were somewhat
higher, but not by much (470 MB/sec I just tested). But the bus speed
maxes out not /too/ far above that (500-600 MB/sec, theoretically 600 MB/
sec on SATA-600, but real world obviously won't /quite/ hit that; IIRC the
best numbers I've seen anywhere are 585 or so).
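For reference, hdparm's built-in read-timing options are presumably what
produced those numbers; run as root against the raw device (the device
name here is just a placeholder):

hdparm -T /dev/sda   # cached reads, mostly memory/bus speed
hdparm -t /dev/sda   # buffered device reads, actual media speed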

So now I guess I send this and do some more testing of real device, now
that you've provoked my curiosity and I have the 50 GB (mostly)
pseudorandom file sitting in tmpfs already. Maybe I'll post those
results later.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]
Duncan posted on Fri, 28 Jun 2013 03:36:10 +0000 as excerpted:

> So now I guess I send this and do some more testing of real device, now
> that you've provoked my curiosity and I have the 50 GB (mostly)
> pseudorandom file sitting in tmpfs already. Maybe I'll post those
> results later.

Well, I decided to use something rather smaller, both because I wanted to
run it against my much smaller btrfs partitions on the ssd, and because
the big file was taking too long for the benchmarks I wanted to do in the
time I wanted to do them.

I settled on a 4 GiB file. Speeds are power-of-10-based since that's
what dd reports, unless otherwise stated. Sizes are power-of-2-based
unless otherwise stated. This was filesystem-layer-based, not direct to
device, and single I/O task, plus whatever the system might have had
going on in the background.

Also note that after reading the dd manpage, I added the conv=fsync
parameter, hoping that gave me more accurate speed ratings by reducing
the effect of write caching.
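So the runs looked roughly like this (a sketch of the method rather than
the exact command history; the paths are placeholders and the cache drop
needs root):

# write test: conv=fsync makes dd flush before reporting, so the time
# includes actually hitting the device, not just filling the page cache
dd if=/tmp/4gig.testfile of=/mnt/btrfs/dd.test bs=$((1024*1024)) conv=fsync

# read test: drop the page cache first so the reads come off the device
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/btrfs/dd.test of=/dev/null bs=$((1024*1024))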

SSD speeds, dual Corsair Neutron n256gp3 SATA-600 ssds, running btrfs
raid1 data and metadata:

To SSD: peak was upper 250s MB/s over a wide blocksize range of 1 MiB to
1 GiB. I believe the btrfs checksumming might lower speeds here somewhat,
as it's well below the rated 450 MB/s sequential write speed.

From SSD: peak was lower 480s MB/s, blocksize 32 KiB to 512 KiB (a
narrower blocksize range, and much smaller blocks, than I expected). This
is MUCH better, far closer to the 540 MB/s ratings.

To/from SSD: At around 220 MB/s, peak was somewhat lower than write-only
peak, as might be expected. Best-case blocksize range seemed to be 256
KiB to 2 MiB.

So, best mixed-access case would seem to be a blocksize near 1 MiB.

I did a few timed cps also, then did the math to confirm the dd numbers.
They were close enough.
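That cross-check is just a stopwatch plus division; a hypothetical
example (the 17 s figure is made up to show the math, not a measurement):

time ( cp /tmp/4gig.testfile /mnt/btrfs/cp.test && sync )
echo $(( 4 * 1024 * 1024 * 1024 / 17 / 1000000 ))   # ~252 MB/s for a 17 s copy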

Spinning rust speeds, single Seagate st9500424as, 7200rpm 2.5" 16MB
buffer SATA-300 disk drive, reiserfs. Tests were done on a partition
located roughly 40% thru the drive. I didn't test this one as closely
and didn't do rust-to-rust tests at all, but:

To rust: upper 70s MB/s, blocksize didn't seem to matter much.

From rust: upper 90s MB/s, blocksize up to 4 MiB.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]
Yo Duncan!

On Fri, 28 Jun 2013 09:12:24 +0000 (UTC)
Duncan <1i5t5.duncan@cox.net> wrote:

> Duncan posted on Fri, 28 Jun 2013 03:36:10 +0000 as excerpted:
>
> I settled on a 4 GiB file. Speeds are power-of-10-based since that's
> what dd reports, unless otherwise stated.

dd is pretty good at testing linear file performance, pretty useless
for testing mysql performance.

> To SSD: peak was upper 250s MB/s over a wide blocksize range of 1 MiB
> From SSD: peak was lower 480s MB/s, blocksize 32 KiB to 512 KiB

Sounds about right. Your speeds are now so high that small differences
in the SATA controller chip will be bigger than the differences between
some SSD drives. Use a PCIe/SATA card and your performance will drop from
what you see.

> Spinning rust speeds, single Seagate st9500424as, 7200rpm 2.5" 16MB

Those are pretty old and slow. If you are going to test an HDD against a
newer SSD you should at least test a newer HDD. A new 2TB drive could
get a lot closer to your SSD performance in linear tests.

> To rust: upper 70s MB/s, blocksize didn't seem to matter much.
> From rust: upper 90s MB/s, blocksize up to 4 MiB.

Seems about right, for that drive.

I think your numbers are about right, if your workload is just reading
and writing big linear files. For a MySQL workload there would be a lot
of random reads/writes/seeks and the SSD would really shine.

RGDS
GARY
---------------------------------------------------------------------------
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97701
gem@rellim.com Tel:+1(541)382-8588
Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]
Gary E. Miller posted on Fri, 28 Jun 2013 10:50:08 -0700 as excerpted:

> Yo Duncan!

Nice greeting, BTW. Good to cheer a reader up after a long day with
things not going right, especially after seeing it several times in a row
so there's a bit of familiarity now. =:^)

> On Fri, 28 Jun 2013 09:12:24 +0000 (UTC)
> Duncan <1i5t5.duncan@cox.net> wrote:
>
>> Duncan posted on Fri, 28 Jun 2013 03:36:10 +0000 as excerpted:
>>
>> I settled on a 4 GiB file. Speeds are power-of-10-based since that's
>> what dd reports, unless otherwise stated.
>
> dd is pretty good at testing linear file performance, pretty useless for
> testing mysql performance.

Recognized. A single i/o job test, but it's something, it's reasonably
repeatable, and when done on the actual filesystem, it's real-world times
and flexible in data and block size, if single-job limited.

Plus, unlike some of the more exotic tests which need to be installed
separately, it's commonly already installed and available for use on most
*ix systems. =:^)

>> To SSD: peak was upper 250s MB/s over a wide blocksize range of 1 MiB
>> From SSD: peak was lower 480s MB/s, blocksize 32 KiB to 512 KiB
>
> Sounds about right. Your speeds are now so high that small differences
> in the SATA controller chip will be bigger than that between some SSD
> drives. Use a PCIe/SATA card and your performance will drop from what
> you see.

Good point.

I was thinking about that the other day. SSDs are fast enough that they
saturate a single SATA-600 channel, and get close to filling a PCIe 2.0
1x lane, all by themselves. SATA port-multipliers are arguably still
useful for slower spinning rust, but not so much for SSD, where the
bottleneck is often already the SATA and/or PCIe, so doubling up will
indeed only slow things down.

And most add-on SATA cards have several SATA ports hanging off the same
1x PCIe, which means they'll bottleneck if actually using more than a
single port, too.

I believe I have seen 4x PCIe SATA cards, which would allow four or so
SATA ports (I think 5), but they tend to be higher priced.
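The napkin math (rough numbers, shell-arithmetic style, ignoring protocol
overhead): a PCIe 2.0 x1 lane runs 5 GT/s with 8b/10b encoding, so

echo $(( 5000 * 8 / 10 / 8 ))   # ~500 MB/s usable per x1 lane
echo $(( 500 / 4 ))             # ~125 MB/s per port, 4 SATA ports on one lane

which is why several SSDs behind a 1x card is pointless, while a 4x card
at least gives each port a fighting chance.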

After pondering that for a bit, I decided I'd take a closer look next
time I was at Fry's Electronics, to see what was actually available, as
well as the prices. Until last year I was still running old PCI-X boxes,
so the whole PCI-E thing itself is still relatively new to me, and I'm
still reorienting myself to the modern bus and its implications in terms
of addon cards, etc.

>> Spinning rust speeds, single Seagate st9500424as, 7200rpm 2.5" 16MB
>
> Those are pretty old and slow. If you are going to test an HDD against
> a newer SSD you should at least test a newer HDD. A new 2TB drive could
> get pretty close to your SSD performance in linear tests.

Well, it's not particularly old, but it *IS* a 2.5 inch, down from the
old 3.5 inch standard, which due to the smaller diameter does mean lower
rim/maximum speeds at the same RPM. And of course 7200 RPM is middle of
the pack as well. The fast stuff (tho calling any spinning rust "fast"
in the age of SSDs does rather jar, it's relative!) is 15000 RPM.

But 2.5 inch does seem to be on its way to becoming the new standard for
desktops and servers as well, helped along by the three factors of storage
density, SSDs (which are invariably 2.5 inch, and even that's due to the
standard form factor as often the circuit boards aren't full height and/
or are largely empty space, 3.5 inch is just /ridiculously/ huge for
them), and the newer focus on power efficiency (plus raw spindle
density!) in the data center.

There's still a lot of inertia behind the 3.5 inch standard, just as
there is behind spinning rust, and it's not going away overnight, but in
the larger picture, 3.5 inch tends to look as anachronistic as a full size
desktop in an age when even the laptop is being displaced by the tablet
and mobile phone. Which isn't to say there's no one still using them, by
far (my main machine is still a mid tower, easier to switch out parts on
them, after all), but just sayin' what I'm sayin'.

Anyway...

>> To rust: upper 70s MB/s, blocksize didn't seem to matter much.
>> From rust: upper 90s MB/s, blocksize up to 4 MiB.
>
> Seems about right, for that drive.
>
> I think your numbers are about right, if your workload is just reading
> and writing big linear files. For a MySQL workload there would be a lot
> of random reads/writes/seeks and the SSD would really shine.

Absolutely. And perhaps more to the point given the list and thus the
readership...

As I said in a different thread on a different list, recently, I didn't
see my boot times change much, the major factor there being the ntp-
client time-sync, at ~12 seconds usually (just long enough to trigger
openrc's first 10-second warning in the minute timeout...), but *WOW*,
did the SSDs drop my emerge sync, as well as kernel git pull, time!
Those are both many smaller files that will tend to highly fragment over
time due to constant churn, and that's EXACTLY the job type where good
SSDs can (and do!) really shine!

Like your MySQL db example (tho that's high activity large-file rather
than high-activity huge number of smaller files), except this one's
likely more directly useful to a larger share of the list readership. =:^)

Meanwhile, the other thing with the boot times is that I boot to a CLI
login, so don't tend to count the X and kde startup times as boot. But
kde starts up much faster too, and that would count as boot time for many
users.

Additionally, I have one X app, pan, that in my (not exactly design
targeted) usage, had a startup time that really suffered on spinning
rust, so much so that for years I've had it start with the kde session,
so that if it takes five minutes on cold-cache to startup, no big deal, I
have other things to do and it's ready when I'm ready for it. It's a
newsgroups (nntp) app, which as designed (by default) ships with a 10 MB
article cache, and expires headers in (IIRC) two weeks. But my usage, in
addition to following my various lists with it using gmane's list2news
service, is as a long-time technical group and list archive. My text-
instance pan (the one I use the most) has a cache size of several gig
(with about a gig actually used) and is set to no-expiry on messages. In
fact, I have ISP newsgroups archived in pan for an ISP server that hasn't
even existed for several years, now, as well as the archives for
several mailing lists going back over a decade to 2002.

So this text-instance pan tends to be another prime example of best-use-
case-for-SSDs. Thousands, actually tens of thousands of files I believe,
all in the same cache dir, with pan accessing them all to rebuild its
threading tree in memory at startup. (For years there's been talk of
switching that to a database, so it doesn't have to all be in memory at
once, but the implementation has yet to be coded up and switched to.) On
spinning rust, I did note a good speed boost if I backed up everything
and did a mkfs and restore from time to time, so it's definitely a high
fragmentation use-case as well. *GREAT* use case for SSD, and there too,
I noticed a HUGE difference. Tho I've not actually timed the startup
since switching to SSD, I do know that the pan icon appears in the system
tray far earlier than it did, such that I almost think it's there as soon
as the system tray is, now, whereas on the spinning rust, it would take
five minutes or more to appear.

... Which is something those dd results don't, and can't, show at all.
Single-i/o-thread access to a single rather large (GBs) file, unchanged
since it was originally written, is one thing. Access to thousands or
tens of thousands of constantly changing or multi-write-thread interwoven
little files, or for that matter to a high-activity large file, thus
(depending on the filesystem) potentially triggering COW fragmentation
there, is something entirely different.

And the many-parallel-job seek latency of spinning rust is something that
dd does not and cannot really measure, as it's simply the wrong tool for
that sort of purpose.
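If I did want to measure that, something like fio is the usual tool for
the job; it's in the separately-installed "exotic" category mentioned
above, and this is only a sketch of a mixed random-I/O run (the path is a
placeholder), not a tuned job file:

fio --name=randrw --filename=/mnt/test/fio.dat --size=4g \
    --rw=randrw --rwmixread=70 --bs=4k --direct=1 \
    --ioengine=libaio --iodepth=16 --numjobs=4 \
    --runtime=60 --time_based --group_reporting

That reports IOPS and latency, which is where SSDs leave spinning rust in
the dust far more dramatically than any sequential dd number shows.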

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: Re: Is my RAID performance bad possibly due to starting sector value? [ In reply to ]
On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> BUT RAID5/6 DOESN'T USE
> THAT DATA FOR INTEGRITY CHECKING ANYWAY, ONLY FOR RECONSTRUCTION IN THE
> CASE OF DEVICE LOSS!

Well, to drive this point home in the case of the thread that wouldn't
die, I had put an entry in crontab a week ago to do a weekly forced
check of all my arrays. Last week it passed. Today, towards the end of
the check, drive performance seriously deteriorated, and eventually smartd
sent me an email about pending sectors (these are read errors).

Long story short I ended up failing the drive out of the array (at
which point my system stopped crawling), and tried wiping the bad
sectors individually, and after self-tests kept failing I even tried
zeroing the drive. With sustained read failures under those
circumstances, I decided the drive was a clear candidate for RMA. The
drive was almost a year old.
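For anyone hitting the same situation, the sequence was along these lines
(a sketch with placeholder device names, not my exact command history):

mdadm /dev/md0 --fail /dev/sdX       # kick the flaky member out of the array
mdadm /dev/md0 --remove /dev/sdX
smartctl -t long /dev/sdX            # start an extended self-test
smartctl -l selftest /dev/sdX        # check the result once it finishes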

So, I'm crossing my fingers that I don't suffer another failure, and I'll
be careful about clean shutdowns. Since the problem was discovered
before I had dual failures, the RAID should be recoverable without
further loss.

If you don't already, check your arrays weekly in crontab. Scripts
for this can be found online or I'd be happy to post the one I dug up
somewhere...
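A minimal version (a sketch; distros often ship something fancier, e.g.
Debian's checkarray script) is just a root crontab entry that pokes the
md sync_action knob:

# /etc/crontab: start a check on every md array early Sunday morning
0 3 * * 0  root  for a in /sys/block/md*/md/sync_action; do echo check > "$a"; done

Progress shows up in /proc/mdstat, and each array's mismatch count in
/sys/block/mdX/md/mismatch_cnt is worth a look once the check completes.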

Rich
