Mailing List Archive

Is there a best practice for adjusting RAID "scrub" options?
Hello NetApp Gurus. I'm hoping mostly to get some guidance on RAID
group scrubbing options and with some luck, perhaps pointers to
documentation that would help me determine appropriate values in
our environment.

We've been seeing what feels like a large number of scrubbing
"timeouts" on our filers. Log entries similar to this one:

May 28 05:00:02 fc1-ev-n1.console.private [kern.notice] [fc1-ev-n1:raid.scrub.suspended.timer:notice]: Disk scrub suspended because the scrub time limit 240 was exceeded. It will resume at the next weekly/scheduled scrub.

What we'd like to know is how concerned we should be about this.
Ideally we'd be seeing "scrubs" complete reasonably frequently, but I'm
honestly not sure how I could determine how frequently that happens,
or how frequently it *should* happen for that matter.
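
One thing I can do is pull scrub-related messages out of the event log;
a sketch, assuming -message-name accepts a wildcard pattern and that
completion events are logged under a similar raid.scrub.* name:

fc1-ev::> event log show -node * -message-name raid.scrub.*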

The scrub options on the filer have not been changed from the ONTAP
defaults (NetApp Release 9.5P2, but we've been seeing this with
earlier versions as well):

Node Option Value Constraint
------- ----------------------- ------ ----------
fc1-n1 raid.media_scrub.rate 600 only_one
fc1-n1 raid.scrub.perf_impact low only_one
fc1-n1 raid.scrub.schedule none

(and the same for the partner node, of course)

The "storage raid-options" manual page indicates that when no explicit
schedule is defined, the default applies: daily at 1 a.m. for 4 hours,
except Sundays, when it runs for 12 hours. (That 4-hour limit matches
the "time limit 240" minutes in the log entry above.)

If I examine the scrub status of our aggregates:

fc1-ev::> storage aggregate scrub -aggregate * -action status

Raid Group:/e1n2_tssd/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:39:10 2019
Raid Group:/e1n2_t01/plex0/rg0, Is Suspended:true, Last Scrub:Sun May 26 02:55:15 2019, Percentage Completed:65%
Raid Group:/e1n2_t02/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:28:37 2019
Raid Group:/e1n2_root/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 03:39:03 2019
Raid Group:/e1n1_root/plex0/rg0, Is Suspended:false, Last Scrub:Wed May 29 02:03:43 2019
Raid Group:/e1n1_d01/plex0/rg0, Is Suspended:true, Last Scrub:Tue May 28 03:22:52 2019, Percentage Completed:7%
Raid Group:/e1n1_d01/plex0/rg1, Is Suspended:true, Last Scrub:Wed May 29 02:00:56 2019, Percentage Completed:4%
Raid Group:/e1n1_d01/plex0/rg2, Is Suspended:false, Last Scrub:Wed May 29 04:02:50 2019
Raid Group:/e1n1_d01/plex0/rg3, Is Suspended:false, Last Scrub:Wed May 29 04:08:45 2019
Raid Group:/e1n1_d00/plex0/rg0, Is Suspended:true, Last Scrub:Sun Apr 28 06:00:40 2019, Percentage Completed:81%
Raid Group:/e1n1_d00/plex0/rg1, Is Suspended:true, Last Scrub:Sun Apr 28 07:38:30 2019, Percentage Completed:80%

The truth is I'm not sure how to interpret this output:

- Is it the case that each RAID group where "Is Suspended:false"
*completed* its scrub at the "Last Scrub" time, while those that
are suspended are those for which we're seeing log entries?

- Given the default schedule that has the scrub run for 12 hours
on Sunday mornings, does it seem odd that /e1n2_t01/plex0/rg0 was
suspended last Sunday at 02:55:15, prior to completion? In fact,
all those interrupted on a Sunday were interrupted well before
12 hours. Might there be other reasons for suspending scrub
operations? The load on this filer is not excessive in any way:
CPU utilization is typically comfortably below 50%.

- How do I determine why the two RAID groups in aggregate e1n1_d00
haven't been scrubbed in over a month? Is there something I should
do about that (for example, kick off a manual scrub, as sketched
below)?

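For what it's worth, I assume I could start one by hand with something
like the following, reusing the same command I ran for the status output
(so treat the exact -action value as my guess), but I'd rather understand
why the scheduled scrubs aren't reaching those RAID groups:

fc1-ev::> storage aggregate scrub -aggregate e1n1_d00 -action start
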
I've found documentation that explains the options and how to change
them, but none that explains how to decide whether I *should* change
them, or how to determine what to change them to. My reading is
that raid.media_scrub.rate and raid.scrub.schedule could be used
together to tune the scrubbing, but I'm quite unsure how to determine
what the best values would be for our filers. Any pointers to
documentation that would help here would be hugely appreciated.

Thanks in advance ...

--
----------------------------------------------------------------------
Sylvain Robitaille syl@encs.concordia.ca

Systems analyst / AITS Concordia University
Faculty of Engineering and Computer Science Montreal, Quebec, Canada
----------------------------------------------------------------------
_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters
Re: Is there a best practice for adjusting RAID "scrub" options?
This is normal. I forget the options at this point (don't really use them
any more), but there is a default limit on how long scrubs will run. They
remember where they left off and pick up one week later. It will also, if I
recall correctly, only do so many scrubs at the same time. Remember, it
is not just scrubbing the aggregate, but looking at each RAID group for
consistency.

--tmac

Re: Is there a best practice for adjusting RAID "scrub" options?
On Thu, 30 May 2019, tmac wrote:

> This is normal. I forget the options at this point (don't really use
> them any more), but there is a default limit on how long scrubs
> will run. They remember where they left off and pick up one week
> later.

Right. I understand all that. I was really hoping more for pointers to
documentation that would help me decide whether or not to make any
adjustments, and what to adjust _to_. The default, at least for the
version of ONTAP we're using, is described in my original message (as
well as, come to think of it, which options are relevant ...).

> It will also, if I recall correctly, only do so many scrubs at the same
> time.

I haven't found any documentation to that effect (though, of course it
makes sense, and I do expect that's the case). Do you have any you can
point me to?

> Remember, it is not just scrubbing the aggregate, but looking at each
> RAID group for consistency.

Yes, I understand that.

--
----------------------------------------------------------------------
Sylvain Robitaille syl@encs.concordia.ca

Systems analyst / AITS Concordia University
Faculty of Engineering and Computer Science Montreal, Quebec, Canada
----------------------------------------------------------------------
Re: Is there a best practice for adjusting RAID "scrub" options?
So it’s possible that someone at NetApp has done further analysis on
this, but my take is that this is the process that:
* validates that it can read data from a disk
* validates the RAID checksums
* validates the WAFL block checksums
* does other validity checks on the RG, aggr, filesystem, etc.

Because you don’t want problems to add up (especially, you don’t want problems
to add up to the point where you discover 3 read errors in the same stripe during
a rebuild!), you want to find and repair the issues relatively quickly (and also trigger
any disk health thresholds sooner rather than later).

So my take has always been to aim for doing a full scrub of all the media in a filer
within a month, and have it able to restart from the beginning and repeat the next
month, etc.

On large production filers, I’ve changed it from the default of a couple of hours once a
week to running for several early-AM hours every day; on DR filers that are mostly only
doing snapmirrors, I tend to be more aggressive - give it 8-12 hours each day and maybe
raise the scrub priority, as long as it doesn’t impact snapmirror update times (and it usually
doesn’t).
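
As a rough sketch of the "several early-AM hours every day" idea (the
duration@weekday@start_time syntax is documented in the storage raid-options
man page, so double-check it against your release before using this), the
nodeshell setting might look something like:

options raid.scrub.schedule 3h@mon@1,3h@tue@1,3h@wed@1,3h@thu@1,3h@fri@1,3h@sat@1,12h@sun@1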

I’ve had another brand of disk array with so much horsepower in the controllers
that they would run continuous scans of every raid group, with enough smarts to
immediately give control over the disks to live I/O and pick up again when things
went idle; that may not be necessary here, but it gave decent peace of mind.

I have seen these scrubs pick up and repair errors before but haven’t checked logs to
see how often it happens nowadays; with 10/12/14 TB drives I’d expect it to happen more
often, but don’t know how true that is.

Someone let me know if any of my takes are incorrect, but I definitely don’t see a harm
in raising the schedule so that each bit gets scrubbed more often.

-dalvenjah

Re: Is there a best practice for adjusting RAID "scrub" options?
So it looks like the "storage raid-options" command is the best way to
manipulate scrubbing in ONTAP. Here are some of the relevant options
(these are from 9.6; YMMV with versions before ONTAP 9.6!)


*raid.media_scrub.enable*

This option enables/disables continuous background media scrubs for all the
aggregates in the system. Valid values are on and off. The default value is
on. When enabled, a low-overhead version of scrub that checks only for
media errors runs continuously on all aggregates in the system. Background
media scrub has a negligible performance impact on the user workload and
this is achieved by aggressive disk and CPU throttling.


*raid.media_scrub.rate*

This option sets the rate of media scrub on an aggregate. Valid values for
this option range from 300 to 3000, where a rate of 300 represents a media
scrub of approximately 512 MB per hour, and 3000 represents a media scrub
of approximately 5 GB per hour. The default value for this option is 600,
which is a rate of approximately 1 GB per hour.
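
To put those numbers in perspective (back-of-the-envelope, and I am not
certain whether the rate applies per disk or per aggregate): at the
default 600 (~1 GB/hour), covering a single 4 TB drive end to end works
out to roughly 4,000 hours, or about five and a half months; at 3000
(~5 GB/hour) that drops to roughly 800 hours, a little over a month.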


*raid.scrub.duration*

This option sets the duration of automatically started scrubs, in minutes.
If this is not set or set to 0, the default duration is 4 hours (240
minutes). If set to -1, all automatic scrubs run to completion.
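
If the goal is simply to let automatic scrubs run until they finish, the
-1 value is the knob to reach for; a sketch using the nodeshell "options"
syntax from the example further below (verify against your release):

options raid.scrub.duration -1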


*raid.scrub.enable*

This option enables/disables the RAID scrub feature. Valid values are on or
off. The default value is on. This option only affects the scrubbing
process that gets started from cron. This option is ignored for
user-requested scrubs.


*raid.scrub.perf_impact*

This option sets the overall performance impact of RAID scrubbing (whether
started automatically or manually). When the CPU and disk bandwidth are not
consumed by serving clients, scrubbing consumes as much bandwidth as it
needs. If the serving of clients is already consuming most or all of the
CPU and disk bandwidth, this option allows control over the CPU and disk
bandwidth that can be taken away for scrubbing, and thereby enables control
over the negative performance impact on the serving of clients. As the
value of this option is increased, the speed of scrubbing also increases.
The possible values for this option are low, medium, and high. The default
value is low. When scrub and mirror verify are running at the same time,
the system does not distinguish between their separate resource consumption
on shared resources (like CPU or a shared disk). In this case, the combined
resource utilization of these operations is limited to the maximum resource
entitlement for individual operations.

*raid.scrub.schedule*

This option specifies the weekly schedule (day, time and duration) for
scrubs started automatically by the raid.scrub.enable option. On a non-AFF
system, the default schedule is daily at 1 a.m. for the duration of 4 hours
except on Sunday when it is 12 hours. On an AFF system, the default
schedule is weekly at 1 a.m. on Sunday for the duration of 6 hours. If an
empty string ("") is specified as an argument, it will delete the previous
scrub schedule and add the default schedule. One or more schedules can be
specified using this option. The syntax is
duration[h|m]@weekday@start_time[,duration[h|m]@weekday@start_time,...],
where duration is the time period for which the scrub operation is allowed
to run, in hours or minutes ('h' or 'm' respectively). If duration is not
specified, the raid.scrub.duration option value is used as the duration for
the schedule.


weekday is the day on which the scrub is scheduled to start. The valid
values are sun, mon, tue, wed, thu, fri, sat.

start_time is the time when the scrub is scheduled to start. It is specified
in 24-hour format. Only the hour (0-23) needs to be specified.

For example, options raid.scrub.schedule 240m@tue@2,8h@sat@22 will cause
scrub to start on every Tuesday at 2 a.m. for 240 minutes, and on every
Saturday at 10 p.m. for 480 minutes.
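
In clustered ONTAP the same knobs are reached through the storage
raid-options command set. A sketch of applying and then verifying the
example schedule above; I believe the parameters are -node, -name and
-value, but confirm against the man page for your release:

fc1-ev::> storage raid-options modify -node fc1-n1 -name raid.scrub.schedule -value 240m@tue@2,8h@sat@22
fc1-ev::> storage raid-options show -node * -name raid.scrub*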


--tmac

Tim McCarthy, Principal Consultant

Proud Member of the #NetAppATeam <https://twitter.com/NetAppATeam>

I Blog at TMACsRack <https://tmacsrack.wordpress.com/>



Re: Is there a best practice for adjusting RAID "scrub" options?
On Fri, 31 May 2019, Dalvenjah FoxFire wrote:

> So my take has always been to aim for doing a full scrub of all
> the media in a filer within a month, and have it able to restart
> from the beginning and repeat the next month, etc.

Thanks. That's at least a data point I can use.

--
----------------------------------------------------------------------
Sylvain Robitaille syl@encs.concordia.ca

Systems analyst / AITS Concordia University
Faculty of Engineering and Computer Science Montreal, Quebec, Canada
----------------------------------------------------------------------