Mailing List Archive

Attack of the Zombies
Is anyone familiar with messages such as these, and how to diagnose which volume(s) on which aggregate are triggering them?

4/27/2018 11:24:31 node1 NOTICE wafl.zombie.susp.msg.threshold: WAFL(R) is experiencing zombie throttling possibly due to requests for large number of file deletions. This can be mitigated by a combination of a) reducing the load on the system, b) issuing the file deletion requests in smaller batches and c) increasing the limits to allow more zombies to be queued on the system (Please contact technical support).
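As a starting point for correlating these notices with latency spikes, here is a minimal Python sketch that counts them per node and hour from exported event-log text. It assumes every line has the same shape as the notice quoted above (date, time, node, severity, event name, text). The function names are hypothetical and not part of any NetApp tooling. The notice as quoted names only the node, so this narrows down when and where, not which volume.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line shape, taken from the notice quoted above:
# "<M/D/YYYY> <H:MM:SS> <node> <SEVERITY> <event.name>: <text>"
EMS_LINE = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}/\d{4})\s+(?P<time>\d{1,2}:\d{2}:\d{2})\s+"
    r"(?P<node>\S+)\s+(?P<severity>[A-Z]+)\s+(?P<event>[\w.]+):\s*(?P<text>.*)$"
)

ZOMBIE_EVENT = "wafl.zombie.susp.msg.threshold"


def zombie_events(lines):
    """Yield (timestamp, node) for each zombie-throttling notice found."""
    for line in lines:
        m = EMS_LINE.match(line.strip())
        if m and m.group("event") == ZOMBIE_EVENT:
            stamp = f"{m.group('date')} {m.group('time')}"
            yield datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"), m.group("node")


def count_by_node_hour(lines):
    """Count notices per (node, hour) so they can be lined up with latency graphs."""
    counts = Counter()
    for ts, node in zombie_events(lines):
        counts[(node, ts.replace(minute=0, second=0))] += 1
    return counts
```

Feeding it the lines of a saved log (for example `count_by_node_hour(open("events.txt"))`, where `events.txt` is a hypothetical export) gives per-hour counts that can be compared against the times of the latency complaints.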

I'm troubleshooting a latency/performance issue and this caught my eye in the event logs. Web searches aren't turning up anything useful; the NetApp KB mentions it but gives no real background information or troubleshooting steps.

Also, the (R)egistered Trademark after WAFL in the notice is weird to have in a technical event log.



Ian Ehrenwald
Senior Infrastructure Engineer
Hachette Book Group, Inc.
1.617.263.1948 / ian.ehrenwald@hbgusa.com

This may contain confidential material. If you are not an intended recipient, please notify the sender, delete immediately, and understand that no disclosure or reliance on the information herein is permitted. Hachette Book Group may monitor email to and from our network.


_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters
Re: Attack of the Zombies [ In reply to ]
NetApp is rather silent on how challenged it is at deleting large
numbers of files, large numbers of blocks, or both, depending on what
version you are on.

And depending on what version you are on, you have multiple ways to manage
it, or none.

This would be a good support call to find out what you can and cannot do.

What you are probably seeing is something like this:
https://www.flickr.com/photos/28804666@N08/shares/t9s941

A funner example here:
https://www.flickr.com/photos/28804666@N08/shares/x32YM1

A bump in read -and- write latency, which is quite odd, as you don't see
much more throughput than you did before; maybe the client(s) did a lookup
storm to go find things to delete as well. In these examples, yes,
throughput for the cluster went up, but it's actually capable of ~4GB/sec,
so I know in my environment 1.4 is scratch.

But what happens under the covers in our release (9.1xx) is that the
background delete workload clogs up the CP (consistency point) process and
chokes the whole box, and you see B2B (back-to-back) CPs as a result. There
are some dials and bootargs to remediate this, and since applying them I
can wipe out 16-20TB at once with no impact.
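Separately from any filer-side tuning, option (b) in the notice (issuing deletions in smaller batches) can be approximated from the client. The Python sketch below is only an illustration under stated assumptions. The batch size and pause are arbitrary placeholders, not NetApp recommendations, and the function names are hypothetical.

```python
import os
import time
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def delete_in_batches(paths, batch_size=1000, pause_seconds=5.0,
                      remove=os.remove, sleep=time.sleep):
    """Delete files in small batches, pausing between batches.

    batch_size and pause_seconds are placeholder values (assumptions);
    tune them against observed latency. Returns the number of files removed.
    """
    deleted = 0
    for index, chunk in enumerate(batched(paths, batch_size)):
        if index:
            # Give the storage time to work through queued deletes
            # before sending the next batch.
            sleep(pause_seconds)
        for path in chunk:
            try:
                remove(path)
                deleted += 1
            except FileNotFoundError:
                # Already gone (e.g. removed by another job); skip it.
                pass
    return deleted
```

A nightly cleanup job could pass a generator of paths, so the full file list never has to sit in memory. The `remove` and `sleep` parameters exist only so the pacing logic can be exercised without touching real files.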

With those dials and bootargs applied to our code on a SATA HA pair, it now
looks like this. We delete huge amounts of HBase data every night, so
it's tight.

https://www.flickr.com/photos/28804666@N08/shares/E0fz56

Re: Attack of the Zombies [ In reply to ]
What version of ONTAP (cmode) are you on? Was it a recent update?
