Mailing List Archive

NFS issue after upgrading filers to 9.2P2
Hi gents, today we upgraded our Coventry cluster from 9.1P6 to 9.2P2 and we are about 99% working; we just have a strange issue with the SAP database servers' NFS mounts. When a server is bounced the mounts attach with no problems, but after a few minutes a df -h becomes very slow reporting the NFS-mounted directories, and if the databases are started up they hang and a df -h then also hangs. This sometimes recovers enough to allow a df -h to work again, but the databases are a lost cause right now.

In the server messages we get lots of entries for the SVM:

Jan 23 07:01:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:01:47 jwukccsbci last message repeated 5 times
Jan 23 07:02:07 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
Jan 23 07:02:07 jwukccsbci last message repeated 5 times
Jan 23 07:02:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:02:47 jwukccsbci last message repeated 5 times
Jan 23 07:02:48 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
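
As a rough cross-check (assuming a fairly standard Linux client with nfs-utils installed), the client-side RPC retransmission counter should be climbing while this is happening:

nfsstat -rc     # client RPC statistics; a rising "retrans" count confirms the timeouts/retries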

Is there anything that would have changed in the upgrade to lock down NFS, or any options that changed that we might need to change back?

The Red Hat servers are on an old kernel, 2.6.18-371.el5, which has some bugs, but this was working fine before the filer upgrade was carried out.


Regards
Mark
Data Centre Sysadmin Team
Managed Services
Phone:- 02476 694455 Ext 2567
The Sysadmin Team promoting PCMS Values ~Integrity~Respect~Commitment~ ~Continuous Improvement~
The information contained in this e-mail is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. If you are not the intended recipient of this e-mail, the use of this information or any disclosure, copying or distribution is prohibited and may be unlawful. If you received this in error, please contact the sender and delete the material from any computer. The views expressed in this e-mail may not necessarily be the views of the PCMS Group plc and should not be taken as authority to carry out any instruction contained. The PCMS Group reserves the right to monitor and examine the content of all e-mails.

The PCMS Group plc is a company registered in England and Wales with company number 1459419 whose registered office is at PCMS House, Torwood Close, Westwood Business Park, Coventry CV4 8HX, United Kingdom. VAT No: GB 705338743
RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
Those messages are indicative of a network problem. The packets stop getting through, then succeed when the NFS client retries, and then they pause again.

I can't think why an ONTAP upgrade of this type would cause such a problem. If it was working before, it should be working now. If you had any kind of locking, firewall, or general configuration problem, you would have no access whatsoever.

I've seen some weird NFS bugs in SUSE, but that RHEL version should be fine.

What are the mount options used, and are you using DNFS?

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Mark Saunders
Sent: Tuesday, January 23, 2018 4:29 PM
To: toasters@teaparty.net
Subject: NFS issue after upgrading filers to 9.2P2

RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
It takes a lot for an ONTAP system to flat-out be unable to respond. Unless the timeout parameters are exceedingly short, you shouldn't reach that point, especially with anything capable of running ONTAP 9.2.

I'd open a support case on this one. In addition, if you want to trigger an autosupport and send me the serial numbers directly I can take a glance at a few stats to see if anything looks odd.
From: Fenn, Michael [mailto:fennm@DEShawResearch.com]
Sent: Tuesday, January 23, 2018 6:23 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; toasters@teaparty.net
Subject: Re: NFS issue after upgrading filers to 9.2P2

The messages are not necessarily indicative of a network problem.

The kernel prints "nfs: server … not responding, still trying" when an operation has timed out (after timeo deciseconds) for the configured number of retries (retrans). Once the server responds again, it prints "nfs: server … OK".

Networking problems are certainly one reason that an operation would time out, but not the only reason. An overloaded or down file server will cause the same effect.
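
If it helps, the timeout values actually in effect on the client can be confirmed rather than guessed (assuming a standard Linux nfs-utils install; these are just generic examples):

nfsstat -m             # per-mount options as negotiated, including timeo and retrans
grep nfs /proc/mounts  # the same information straight from the kernel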

Thanks,
Michael

RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
Thanks for the quick replies; sorry for the delay in responding, but I had been working on this since 5am so had to get some sleep.

I have a call open with NetApp but have had the cookie-cutter response that it isn't on the Interoperability Matrix Tool as a supported version (it wasn't when we were on 9.1 anyway).

A third party we have contact with has sent me a link to details about fastpath being removed, but I don't think we were using it, so maybe that's another false trail to look down.

The mount options were kept fairly straightforward:

nfs nolock,_netdev,udp 0 0

and we have also tried the same options as one of the production servers, which had tuned settings; that server is on another cluster so isn't affected by this yet.

nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600 0 0

How would I be able to tell if we are using DNFS ?

I will send you the support details tomorrow when I am back in the office.

Regards

Mark


From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 23 January 2018 17:29
To: Fenn, Michael; Mark Saunders; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
I should have asked - is this SAP HANA or something like SAP on an Oracle database?

Also, what do they mean by "it's not on the IMT"? Virtually everything NFS is on the IMT. We support any NFSv3 and NFSv4 client that obeys the specification. There's a tiny number of exceptions, but generally speaking we'll support Linux, Solaris, AIX, mainframe, OpenVMS, HP-UX, Oracle DNFS, AS/400, etc. There really should be no issue there.

The thing about fastpath does ring a few bells.

From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Tuesday, January 23, 2018 11:18 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
The network stack changed in 9.2 and IP fastpath was removed. But fastpath was mainly for more efficient routing.

https://library.netapp.com/ecmdocs/ECMP1114171/html/GUID-8276014A-16EB-4902-9EDC-868C5292381B.html

The stack was changed to a more standard BSD stack, so fastpath was no longer needed. It’s possible that’s an issue here, but I’d suggest getting network sniffs on each endpoint of the network to see where the packet is being dropped.
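
On the client side, a capture along these lines is usually enough to show whether requests are leaving the host and whether replies are coming back (interface name, output file and the SVM hostname are just placeholders for whatever applies locally):

tcpdump -i eth0 -s 0 -w /tmp/nfs-hang.pcap host JWUKCSVM01 and port 2049

Comparing that with a mirror of the filer-facing switch port would narrow down which hop is dropping traffic.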

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Steiner, Jeffrey
Sent: Tuesday, January 23, 2018 5:24 PM
To: Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
This community post also does a good job explaining it:

https://community.netapp.com/t5/Data-ONTAP-Discussions/NetApp-Ontap-9-2-Upgrade-review-your-network-first/td-p/136657

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Parisi, Justin
Sent: Tuesday, January 23, 2018 5:28 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
In fact, maybe look at this as a root cause… do your NFS interfaces share nodes with admin interfaces?

“NFS issues were caused by using a NAS interface on the same node as the SVM admin interface, once I realised we moved all servers NFS to the node without the admin interface.”
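
One quick way to check where the LIFs currently live, from the cluster shell (the SVM name here is just a placeholder), is something like:

network interface show -vserver JWUKCSVM01 -fields lif,role,home-node,curr-node

If the NAS data LIFs and the SVM management LIF report the same home/current node, that matches the scenario described in that post.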

From: Parisi, Justin
Sent: Tuesday, January 23, 2018 5:30 PM
To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

Re: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
Oracle DNFS requires extra setup and bypasses the system NFS client; Oracle built their own NFS client, and it is pretty efficient.
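
If these are Oracle databases, one way to tell whether DNFS is actually in use (assuming 11g or later, and the usual default diag paths) is that the instance alert log mentions the Direct NFS ODM library at startup, and the v$dnfs_servers view is only populated when DNFS is active. For example:

grep -i "direct nfs" $ORACLE_BASE/diag/rdbms/*/*/trace/alert_*.log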

Just out of curiosity, take a look at your Pause Frames (Xon/Xoff). Look on
the NetApp side and the client side.
(netapp -> system node run -node node-0x ifconfig e0a)
Client side will depend, but you want to see the eth stats.
If it is possible, even check the ports on the switch.
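
On the Linux side (interface name is a placeholder, and the counters depend on what the NIC driver exports), something like this shows the negotiated pause settings and any pause-frame counters:

ethtool -a eth0                   # pause frame (flow control) autoneg / RX / TX settings
ethtool -S eth0 | grep -i pause   # per-driver pause counters, where the driver provides them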

Maybe they are being generated more frequently after the upgrade?

As Justin suggested, looking at a packet trace from both ends would be
helpful.


--tmac

Tim McCarthy, Principal Consultant
Proud Member of the #NetAppATeam <https://twitter.com/NetAppATeam>
I Blog at TMACsRack <https://tmacsrack.wordpress.com/>
RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
I faintly remember a customer or two who had issues with their network that were somewhat remediated by fastpath, and when fastpath went away they got bit by the weirdness in their network config.

Also, having udp in the mount options doesn't make sense.

Justin - I thought UDP was totally desupported in cDOT, and it's probably risky to use anyway.

When you finish reading your 250 emails on the subject after you wake up, let us know whether this is SAP HANA or SAP on Oracle.
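
For what it's worth, if you do move away from UDP, a minimal fstab entry along these lines is the sort of thing I'd try first (server name, export and mount point are placeholders, and the sizes are only a starting point):

JWUKCSVM01:/sapdata   /sapdata   nfs   nfsvers=3,tcp,hard,nolock,_netdev,rsize=65536,wsize=65536,timeo=600   0 0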

From: Parisi, Justin
Sent: Tuesday, January 23, 2018 11:33 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

In fact, maybe look at this as a root cause… do your NFS interfaces share nodes with admin interfaces?

“NFS issues were caused by using a NAS interface on the same node as the SVM admin interface, once I realised we moved all servers NFS to the node without the admin interface.”

From: Parisi, Justin
Sent: Tuesday, January 23, 2018 5:30 PM
To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

This community post also does a good job explaining it:

https://community.netapp.com/t5/Data-ONTAP-Discussions/NetApp-Ontap-9-2-Upgrade-review-your-network-first/td-p/136657

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Parisi, Justin
Sent: Tuesday, January 23, 2018 5:28 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

The network stack changed in 9.2 and IP fastpath was removed. But fastpath was mainly for more efficient routing.

https://library.netapp.com/ecmdocs/ECMP1114171/html/GUID-8276014A-16EB-4902-9EDC-868C5292381B.html

The stack was changed to a more standard BSD stack, so fastpath was no longer needed. It’s possible that’s an issue here, but I’d suggest getting network sniffs on each endpoint of the network to see where the packet is being dropped.

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Steiner, Jeffrey
Sent: Tuesday, January 23, 2018 5:24 PM
To: Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

I should have asked - is this SAP HANA or something like SAP on an Oracle database?

Also, what do they mean "it's not on the IMT?" Virtually everything NFS is on the IMT. We support any NFSv3 and NFSv4 client that obeys the specification. There's a tiny number of exceptions, but generally speaking we'll support linux, Solaris, AIX, mainframe, OpenVMS, HP-UX, Oracle DNFS, AS/400, etc. There really should be no issue there.

The thing about fastpath does ring a few bells.

From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Tuesday, January 23, 2018 11:18 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

Thanks for the quick replies; sorry for the delay in responding, but I had been working on this since 5am so had to go and sleep.

I have a call open with NetApp but have had the cookie-cutter response that it isn't on the Interoperability Matrix Tool as a supported version (it wasn't when on 9.1 anyway).

A third party we have contact with has sent me a link to details about fastpath being removed, but I don't think we were using it, so maybe that's another false line to look down.

The mount options were kept fairly straightforward as

nfs nolock,_netdev,udp 0 0

and we have also tried the same as one of the production servers which had tuned options; this is on another cluster so isn't affected by this yet.

nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600 0 0

How would I be able to tell if we are using DNFS?

I will send you the support details tomorrow when I am back in the office.

Regards

Mark


From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 23 January 2018 17:29
To: Fenn, Michael; Mark Saunders; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

It takes a lot for an ONTAP system to flat-out be unable to respond. Unless the timeout parameters are exceedingly short, you shouldn't reach that point, especially with anything capable of running ONTAP 9.2.

I'd open a support case on this one. In addition, if you want to trigger an autosupport and send me the serial numbers directly I can take a glance at a few stats to see if anything looks odd.

From: Fenn, Michael [mailto:fennm@DEShawResearch.com]
Sent: Tuesday, January 23, 2018 6:23 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; toasters@teaparty.net
Subject: Re: NFS issue after upgrading filers to 9.2P2

The messages are not necessarily indicative of a network problem.

The kernel prints "nfs: server … not responding, still trying" when an operation times out (timeo deciseconds) for the configured (retrans) number of tries. Once the server responds, then it prints "nfs: server … OK".

Networking problems are certainly one reason that an operation would time out, but not the only reason. An overloaded or down file server will cause the same effect.

Thanks,
Michael

RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
Justin

I have just checked the SVM and there are no admin/management interfaces configured for it; there are three data LIFs for different VLANs. I have checked through our other systems this morning and there are no issues in VMware (5.5) or SLES 11/12, so this is just with the Red Hat servers.

I have checked the interfaces at the server end and they are not showing errors or dropped packets. On the filer end we have 4 physical ports in an interface group with VLANs on top. I have run "statistics start -object nfs_exports_access_cache", which when checked doesn't report any errors.

On the server interface

eth1 Link encap:Ethernet HWaddr 00:50:56:A5:0D:6A
inet addr:10.240.1.30 Bcast:10.240.1.31 Mask:255.255.255.224
inet6 addr: fe80::250:56ff:fea5:d6a/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:127209 errors:0 dropped:0 overruns:0 frame:0
TX packets:26100 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:104158360 (99.3 MiB) TX bytes:14489402 (13.8 MiB)



While investigating we have found that the file systems are fine just after a reboot and you can ls each mount, so they are initially all OK. It is when the application is started, putting a bigger load over the network, that the file systems stop responding.
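
For the filer-side counterpart of that check, something like the following from the cluster shell and node shell should show whether anything is being dropped on the ifgrp member ports (the node and port names below are just placeholders):

cluster::*> network port show -node <node>
cluster::*> node run -node <node> -command ifstat <ifgrp-member-port>

ifstat in the node shell reports per-port receive/transmit counters, including errors and discards.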


Regards

Mark

RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
This is SAP on Oracle.

I have found that on our production servers there is a Red Hat kernel bug, so a network restart has been put into the boot sequence; we are going to replicate that on one of the servers that is having issues.

We were using UDP for the mount options as it was giving better performance than TCP; we have put changes in to test today switching it back to TCP.
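
For reference, the TCP flavour of the tuned entry would look something like this in fstab (the volume path and mount point are placeholders; the options just swap udp for tcp):

JWUKCSVM01:/vol_sap_data   /oracle/data   nfs   nfsvers=3,nolock,_netdev,rw,tcp,rsize=32768,wsize=32768,timeo=600   0 0

followed by an umount/mount of each filesystem to pick the change up.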

Regards

Mark

RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
What's the bug number?

I can't find an ASUP in the system, but if the problem persists you can run "node run local netstat -sp tcp" and send output. That might indicate whether flow control is happening.
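
If it's easier, the interesting counters can be pulled in one pass from an admin host (the cluster management address below is a placeholder):

$ ssh admin@cluster-mgmt "node run -node local -command netstat -sp tcp" | egrep -i "flow control|retrans|window"

That should make it obvious whether the node is going into flow control or the retransmits are stacking up.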

RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
Could this be TCP slot tables? Flow control capabilities on ONTAP continue to improve. If you don't have TCP slot tables capped at 128 you could see quasi-hangs like this.

Complete details are in TR-3633, but these are the two that you want to watch:

[root@stlrx300s7-145 mkdb]# sysctl -a | grep slot
sunrpc.tcp_max_slot_table_entries = 128
sunrpc.tcp_slot_table_entries = 128

Newer versions of Linux will allow a ridiculous number of unacknowledged RPC operations to build up. The result can be ONTAP going into flow-control mode until the OS catches up. We see problems mostly with slow clients. For example, if you're reading a lot of data on a host with 1Gb connectivity from a high-end ONTAP system, the OS can ask for data quicker than it can process the responses.
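
If those come back higher than 128, a minimal way to pin them (per TR-3633; the file name below is just a convention) is:

# /etc/modprobe.d/sunrpc.conf
options sunrpc tcp_slot_table_entries=128
options sunrpc tcp_max_slot_table_entries=128

# apply on a running host without a reboot
sysctl -w sunrpc.tcp_slot_table_entries=128
sysctl -w sunrpc.tcp_max_slot_table_entries=128

Note that the max entry only exists on newer kernels; a 2.6.18-era RHEL 5 host only has tcp_slot_table_entries, and it defaults low there anyway.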

RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
I will try to find the kernel bug number as I can't see it in the documentation for the server; there is just the following note.


RHEL 5.11 has a bug where NFS mounts mounted after network initialization at boot run with an increased number of TCP requests (approx. 10x more), which causes an RPC backlog and restricts network throughput on the NFS mounts.

To resolve this, a script has been created to restart the networking before the NFS mounts are mounted by netfs at boot. By default netfs runs at S25 on runlevels 3, 4 and 5, so we will set the NFS fix to run at S24 on the same runlevels.
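
A rough sketch of what that script looks like (the script name and comments are our own convention, not anything shipped by Red Hat):

#!/bin/bash
# /etc/rc.d/init.d/nfs-netfix
# chkconfig: 345 24 76
# description: restart networking before netfs (S25) mounts the NFS shares,
#              to work around the RHEL 5.11 rpc backlog issue
case "$1" in
  start)
    /sbin/service network restart
    ;;
  *)
    # nothing needed on stop/status
    ;;
esac
exit 0

It is registered with chkconfig --add nfs-netfix, which gives S24 on runlevels 3, 4 and 5 from the chkconfig header.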

PGUKCSTGCL01::*> node run -node PGUKCSTGCL01-01 -command netstat -sp tcp
---- Default IPSpace ----
tcp:
900103907 packets sent
476280230 data packets (4676048494764 bytes)
61984 data packets (82328048 bytes) retransmitted
2065 data packets unnecessarily retransmitted
0 resends initiated by MTU discovery
235945463 ack-only packets (517654 delayed)
0 URG only packets
0 window probe packets
187429557 window update packets
333130 control packets
1097649475 packets received
399065895 acks (for 4676054895668 bytes)
2174268 duplicate acks
0 acks for unsent data
723809875 packets (4886339861169 bytes) received in-sequence
1649638 completely duplicate packets (98637034 bytes)
2 old duplicate packets
990 packets with some dup. data (214519 bytes duped)
10872239 out-of-order packets (15192422547 bytes)
0 packets (0 bytes) of data after window
0 window probes
26845 window update packets
2 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
37581 discarded due to memory problems
1441 connection requests
412966 connection accepts
0 bad connection attempts
0 listen queue overflows
305109 ignored RSTs in the windows
414403 connections established (including accepts)
443890 connections closed (including 139609 drops)
151376 connections updated cached RTT on close
151388 connections updated cached RTT variance on close
140203 connections updated cached ssthresh on close
0 embryonic connections dropped
388403781 segments updated rtt (of 258539924 attempts)
6843 retransmit timeouts
11 connections dropped by rexmit timeout
3 persist timeouts
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
92323 keepalive timeouts
92323 keepalive probes sent
0 connections dropped by keepalive
351415606 correct ACK header predictions
684179955 correct data packet header predictions
412966 syncache entries added
155 retransmitted
302 dupsyn
0 dropped
412966 completed
0 bucket overflow
0 cache overflow
0 reset
0 stale
0 aborted
0 badack
0 unreach
0 zone failures
412966 cookies sent
0 cookies received
112 hostcache entries added
0 bucket overflow
16181 SACK recovery episodes
51541 segment rexmits in SACK recovery episodes
70735551 byte rexmits in SACK recovery episodes
277116 SACK options (SACK blocks) received
11457931 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 packets with ECN CE bit set
0 packets with ECN ECT(0) bit set
0 packets with ECN ECT(1) bit set
0 successful ECN handshakes
0 times ECN reduced the congestion window
251543 times in ONTAP flow control
0 times exited ONTAP flow control
0 times in ONTAP flow control for zero send window
251543 times in ONTAP flow control for non-zero send window
0 connection resets due to ONTAP extreme flow control
0 times in ONTAP extreme flow control
0 is the maximum flow control reset threshold reached during receive
4 is the maximum flow control reset threshold reached during send
0 bytes is send buffer value during last reset
0 bytes is send buffer hiwat mark during last reset
79 times the receive window was closed
44 dropped due to flowcontrol
188382441 segments sent using TSO
4595103991390 bytes sent using TSO
73883767 TSO segments truncated
1069 TSO wrapped sequence space segments
0 segments sent using TSO6
0 bytes sent using TSO6
0 TSO6 segments truncated
0 TSO6 wrapped sequence space segments
366670238 recv upcalls batched in HP
302647105 recv upcalls made in HP
296877004 recv upcalls made in HP because of PSH
2291336 recv upcalls made in HP because of sb_hiwat
3481239 recv upcalls made in HP because of both PSH and sb_hiwat
6733214 recv upcall batch timeouts
16594187 times recv upcall read partial sb_cc in HP
631681762 segments received using LRO
4816721023400 bytes received using LRO
0 segments received using LRO6
0 bytes received using LRO6
---- ANYVSERVER IPSpace ----
tcp:
0 packets sent
0 data packets (0 bytes)
0 data packets (0 bytes) retransmitted
0 data packets unnecessarily retransmitted
0 resends initiated by MTU discovery
0 ack-only packets (0 delayed)
0 URG only packets
0 window probe packets
0 window update packets
0 control packets
0 packets received
0 acks (for 0 bytes)
0 duplicate acks
0 acks for unsent data
0 packets (0 bytes) received in-sequence
0 completely duplicate packets (0 bytes)
0 old duplicate packets
0 packets with some dup. data (0 bytes duped)
0 out-of-order packets (0 bytes)
0 packets (0 bytes) of data after window
0 window probes
0 window update packets
0 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
0 discarded due to memory problems
0 connection requests
0 connection accepts
0 bad connection attempts
0 listen queue overflows
0 ignored RSTs in the windows
0 connections established (including accepts)
7 connections closed (including 0 drops)
0 connections updated cached RTT on close
0 connections updated cached RTT variance on close
0 connections updated cached ssthresh on close
0 embryonic connections dropped
0 segments updated rtt (of 0 attempts)
0 retransmit timeouts
0 connections dropped by rexmit timeout
0 persist timeouts
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
0 keepalive timeouts
0 keepalive probes sent
0 connections dropped by keepalive
0 correct ACK header predictions
0 correct data packet header predictions
0 syncache entries added
0 retransmitted
0 dupsyn
0 dropped
0 completed
0 bucket overflow
0 cache overflow
0 reset
0 stale
0 aborted
0 badack
0 unreach
0 zone failures
0 cookies sent
0 cookies received
0 hostcache entries added
0 bucket overflow
0 SACK recovery episodes
0 segment rexmits in SACK recovery episodes
0 byte rexmits in SACK recovery episodes
0 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 packets with ECN CE bit set
0 packets with ECN ECT(0) bit set
0 packets with ECN ECT(1) bit set
0 successful ECN handshakes
0 times ECN reduced the congestion window
0 times in ONTAP flow control
0 times exited ONTAP flow control
0 times in ONTAP flow control for zero send window
0 times in ONTAP flow control for non-zero send window
0 connection resets due to ONTAP extreme flow control
0 times in ONTAP extreme flow control
0 is the maximum flow control reset threshold reached during receive
0 is the maximum flow control reset threshold reached during send
0 bytes is send buffer value during last reset
0 bytes is send buffer hiwat mark during last reset
0 times the receive window was closed
0 dropped due to flowcontrol
0 segments sent using TSO
0 bytes sent using TSO
0 TSO segments truncated
0 TSO wrapped sequence space segments
0 segments sent using TSO6
0 bytes sent using TSO6
0 TSO6 segments truncated
0 TSO6 wrapped sequence space segments
0 recv upcalls batched in HP
0 recv upcalls made in HP
0 recv upcalls made in HP because of PSH
0 recv upcalls made in HP because of sb_hiwat
0 recv upcalls made in HP because of both PSH and sb_hiwat
0 recv upcall batch timeouts
0 times recv upcall read partial sb_cc in HP
0 segments received using LRO
0 bytes received using LRO
0 segments received using LRO6
0 bytes received using LRO6
---- Cluster IPSpace ----
tcp:
350960787 packets sent
253625385 data packets (2042642509989 bytes)
11525 data packets (120517203 bytes) retransmitted
63 data packets unnecessarily retransmitted
0 resends initiated by MTU discovery
38550609 ack-only packets (15348627 delayed)
0 URG only packets
1 window probe packet
56728197 window update packets
2035396 control packets
341097715 packets received
224460892 acks (for 2042726150883 bytes)
6840725 duplicate acks
0 acks for unsent data
271870811 packets (3031038679110 bytes) received in-sequence
195650 completely duplicate packets (4506 bytes)
49 old duplicate packets
0 packets with some dup. data (0 bytes duped)
205398 out-of-order packets (565766073 bytes)
0 packets (0 bytes) of data after window
0 window probes
2011210 window update packets
123 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
0 discarded due to memory problems
923539 connection requests
456892 connection accepts
0 bad connection attempts
0 listen queue overflows
529 ignored RSTs in the windows
1271558 connections established (including accepts)
1379180 connections closed (including 1101 drops)
369895 connections updated cached RTT on close
370750 connections updated cached RTT variance on close
12122 connections updated cached ssthresh on close
108207 embryonic connections dropped
224454663 segments updated rtt (of 207849890 attempts)
48471 retransmit timeouts
14 connections dropped by rexmit timeout
1 persist timeout
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
152128 keepalive timeouts
147328 keepalive probes sent
4800 connections dropped by keepalive
45057764 correct ACK header predictions
104981779 correct data packet header predictions
457057 syncache entries added
61 retransmitted
0 dupsyn
0 dropped
456892 completed
0 bucket overflow
0 cache overflow
165 reset
0 stale
0 aborted
0 badack
0 unreach
0 zone failures
457057 cookies sent
0 cookies received
61 hostcache entries added
0 bucket overflow
1684 SACK recovery episodes
2491 segment rexmits in SACK recovery episodes
5618157 byte rexmits in SACK recovery episodes
17518 SACK options (SACK blocks) received
86946 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 packets with ECN CE bit set
0 packets with ECN ECT(0) bit set
0 packets with ECN ECT(1) bit set
0 successful ECN handshakes
0 times ECN reduced the congestion window
0 times in ONTAP flow control
0 times exited ONTAP flow control
0 times in ONTAP flow control for zero send window
0 times in ONTAP flow control for non-zero send window
0 connection resets due to ONTAP extreme flow control
0 times in ONTAP extreme flow control
0 is the maximum flow control reset threshold reached during receive
0 is the maximum flow control reset threshold reached during send
0 bytes is send buffer value during last reset
0 bytes is send buffer hiwat mark during last reset
0 times the receive window was closed
0 dropped due to flowcontrol
56607835 segments sent using TSO
1679494142753 bytes sent using TSO
36473474 TSO segments truncated
394 TSO wrapped sequence space segments
0 segments sent using TSO6
0 bytes sent using TSO6
0 TSO6 segments truncated
0 TSO6 wrapped sequence space segments
4879278 recv upcalls batched in HP
90401291 recv upcalls made in HP
90401967 recv upcalls made in HP because of PSH
52 recv upcalls made in HP because of sb_hiwat
325 recv upcalls made in HP because of both PSH and sb_hiwat
32882 recv upcall batch timeouts
524 times recv upcall read partial sb_cc in HP
160827213 segments received using LRO
2789346524807 bytes received using LRO
0 segments received using LRO6
0 bytes received using LRO6
---- ips_4294967289 IPSpace ----
tcp:
0 packets sent
0 data packets (0 bytes)
0 data packets (0 bytes) retransmitted
0 data packets unnecessarily retransmitted
0 resends initiated by MTU discovery
0 ack-only packets (0 delayed)
0 URG only packets
0 window probe packets
0 window update packets
0 control packets
0 packets received
0 acks (for 0 bytes)
0 duplicate acks
0 acks for unsent data
0 packets (0 bytes) received in-sequence
0 completely duplicate packets (0 bytes)
0 old duplicate packets
0 packets with some dup. data (0 bytes duped)
0 out-of-order packets (0 bytes)
0 packets (0 bytes) of data after window
0 window probes
0 window update packets
0 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
0 discarded due to memory problems
0 connection requests
0 connection accepts
0 bad connection attempts
0 listen queue overflows
0 ignored RSTs in the windows
0 connections established (including accepts)
0 connections closed (including 0 drops)
0 connections updated cached RTT on close
0 connections updated cached RTT variance on close
0 connections updated cached ssthresh on close
0 embryonic connections dropped
0 segments updated rtt (of 0 attempts)
0 retransmit timeouts
0 connections dropped by rexmit timeout
0 persist timeouts
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
0 keepalive timeouts
0 keepalive probes sent
0 connections dropped by keepalive
0 correct ACK header predictions
0 correct data packet header predictions
0 syncache entries added
0 retransmitted
0 dupsyn
0 dropped
0 completed
0 bucket overflow
0 cache overflow
0 reset
0 stale
0 aborted
0 badack
0 unreach
0 zone failures
0 cookies sent
0 cookies received
0 hostcache entries added
0 bucket overflow
0 SACK recovery episodes
0 segment rexmits in SACK recovery episodes
0 byte rexmits in SACK recovery episodes
0 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 packets with ECN CE bit set
0 packets with ECN ECT(0) bit set
0 packets with ECN ECT(1) bit set
0 successful ECN handshakes
0 times ECN reduced the congestion window
0 times in ONTAP flow control
0 times exited ONTAP flow control
0 times in ONTAP flow control for zero send window
0 times in ONTAP flow control for non-zero send window
0 connection resets due to ONTAP extreme flow control
0 times in ONTAP extreme flow control
0 is the maximum flow control reset threshold reached during receive
0 is the maximum flow control reset threshold reached during send
0 bytes is send buffer value during last reset
0 bytes is send buffer hiwat mark during last reset
0 times the receive window was closed
0 dropped due to flowcontrol
0 segments sent using TSO
0 bytes sent using TSO
0 TSO segments truncated
0 TSO wrapped sequence space segments
0 segments sent using TSO6
0 bytes sent using TSO6
0 TSO6 segments truncated
0 TSO6 wrapped sequence space segments
0 recv upcalls batched in HP
0 recv upcalls made in HP
0 recv upcalls made in HP because of PSH
0 recv upcalls made in HP because of sb_hiwat
0 recv upcalls made in HP because of both PSH and sb_hiwat
0 recv upcall batch timeouts
0 times recv upcall read partial sb_cc in HP
0 segments received using LRO
0 bytes received using LRO
0 segments received using LRO6
0 bytes received using LRO6
---- ACP IPSpace ----
tcp:
86643 packets sent
17496 data packets (419904 bytes)
0 data packets (0 bytes) retransmitted
0 data packets unnecessarily retransmitted
0 resends initiated by MTU discovery
33848 ack-only packets (0 delayed)
0 URG only packets
0 window probe packets
23 window update packets
35276 control packets
74406 packets received
51152 acks (for 436064 bytes)
4798 duplicate acks
0 acks for unsent data
20938 packets (1251746 bytes) received in-sequence
0 completely duplicate packets (0 bytes)
0 old duplicate packets
0 packets with some dup. data (0 bytes duped)
0 out-of-order packets (0 bytes)
0 packets (0 bytes) of data after window
0 window probes
0 window update packets
1686 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
0 discarded due to memory problems
17605 connection requests
176 connection accepts
0 bad connection attempts
0 listen queue overflows
0 ignored RSTs in the windows
17672 connections established (including accepts)
17781 connections closed (including 2 drops)
0 connections updated cached RTT on close
0 connections updated cached RTT variance on close
0 connections updated cached ssthresh on close
0 embryonic connections dropped
51152 segments updated rtt (of 52750 attempts)
109 retransmit timeouts
0 connections dropped by rexmit timeout
0 persist timeouts
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
0 keepalive timeouts
0 keepalive probes sent
0 connections dropped by keepalive
17474 correct ACK header predictions
4954 correct data packet header predictions
176 syncache entries added
0 retransmitted
0 dupsyn
0 dropped
176 completed
0 bucket overflow
0 cache overflow
0 reset
0 stale
0 aborted
0 badack
0 unreach
0 zone failures
176 cookies sent
0 cookies received
0 hostcache entries added
0 bucket overflow
0 SACK recovery episodes
0 segment rexmits in SACK recovery episodes
0 byte rexmits in SACK recovery episodes
0 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 packets with ECN CE bit set
0 packets with ECN ECT(0) bit set
0 packets with ECN ECT(1) bit set
0 successful ECN handshakes
0 times ECN reduced the congestion window
0 times in ONTAP flow control
0 times exited ONTAP flow control
0 times in ONTAP flow control for zero send window
0 times in ONTAP flow control for non-zero send window
0 connection resets due to ONTAP extreme flow control
0 times in ONTAP extreme flow control
0 is the maximum flow control reset threshold reached during receive
0 is the maximum flow control reset threshold reached during send
0 bytes is send buffer value during last reset
0 bytes is send buffer hiwat mark during last reset
0 times the receive window was closed
0 dropped due to flowcontrol
0 segments sent using TSO
0 bytes sent using TSO
0 TSO segments truncated
0 TSO wrapped sequence space segments
0 segments sent using TSO6
0 bytes sent using TSO6
0 TSO6 segments truncated
0 TSO6 wrapped sequence space segments
0 recv upcalls batched in HP
0 recv upcalls made in HP
0 recv upcalls made in HP because of PSH
0 recv upcalls made in HP because of sb_hiwat
0 recv upcalls made in HP because of both PSH and sb_hiwat
0 recv upcall batch timeouts
0 times recv upcall read partial sb_cc in HP
0 segments received using LRO
0 bytes received using LRO
0 segments received using LRO6
0 bytes received using LRO6


Server TCP slot table entries

[root@jwukccsbci ~]# sysctl -a | grep slot
sunrpc.tcp_slot_table_entries = 128
sunrpc.udp_slot_table_entries = 128
dev.cdrom.info = drive # of slots: 1

From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 24 January 2018 11:53
To: Mark Saunders; Parisi, Justin; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

Could this be TCP slot tables? Flow control capabilities on ONTAP continue to improve. If you don't have TCP slot tables capped at 128 you could see quasi-hangs like this.

Complete details are in TR-3633, but these are the two that you want to watch:

[root@stlrx300s7-145 mkdb]# sysctl -a | grep slot
sunrpc.tcp_max_slot_table_entries = 128
sunrpc.tcp_slot_table_entries = 128

Newer versions of Linux will allow a ridiculous number of unacknowledged RPC operations to build up, which can push ONTAP into a flow control mode until the OS catches up. We see problems mostly with slow clients; for example, if you're trying to read a lot of data onto a host with 1Gb connectivity from a high-end ONTAP system, the OS can ask for data quicker than it can process the responses.
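
For reference, the usual way to pin these on a Linux client is via the sunrpc module options and/or sysctl; a sketch is below. Which parameters exist depends on the kernel (the EL5 output earlier in this thread only shows tcp_slot_table_entries; the tcp_max_slot_table_entries parameter arrived in later kernels), so treat the exact names as version-dependent.

# /etc/modprobe.d/sunrpc.conf (or /etc/modprobe.conf on EL5)
options sunrpc tcp_slot_table_entries=128 udp_slot_table_entries=128

# and/or persist it via sysctl; the sunrpc module must already be loaded
echo "sunrpc.tcp_slot_table_entries = 128" >> /etc/sysctl.conf
sysctl -p
sysctl -a | grep slot    # verify, as in the outputs shown in this thread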

From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Wednesday, January 24, 2018 12:26 PM
To: Parisi, Justin <Justin.Parisi@netapp.com<mailto:Justin.Parisi@netapp.com>>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com<mailto:Jeffrey.Steiner@netapp.com>>; Fenn, Michael <fennm@DEShawResearch.com<mailto:fennm@DEShawResearch.com>>; toasters@teaparty.net<mailto:toasters@teaparty.net>
Subject: RE: NFS issue after upgrading filers to 9.2P2

Justin

I have just checked the SVM and there are no admin/management interfaces configured for it; there are three data LIFs for different VLANs. I have checked through our other systems this morning and there are no issues with VMware (5.5) or SLES 11/12, so this is just the Red Hat servers.

I have checked the interfaces at the server end and they are not showing errors or dropped packets. On the filer end we have 4 physical ports in an interface group with VLANs on top. I have run "statistics start -object nfs_exports_access_cache", which when checked doesn't report any errors.

On the server interface

eth1 Link encap:Ethernet HWaddr 00:50:56:A5:0D:6A
inet addr:10.240.1.30 Bcast:10.240.1.31 Mask:255.255.255.224
inet6 addr: fe80::250:56ff:fea5:d6a/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:127209 errors:0 dropped:0 overruns:0 frame:0
TX packets:26100 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:104158360 (99.3 MiB) TX bytes:14489402 (13.8 MiB)
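
(A rough sketch of deeper error checks on both ends is below. The node name is a placeholder and the exact counters reported by ethtool vary by NIC driver, so treat this as a starting point rather than the exact commands run here.)

# client side: driver-level counters that ifconfig does not show
ethtool -S eth1 | grep -iE 'err|drop|discard'

# cluster side: port status, then per-interface counters from the nodeshell
network port show
node run -node <nodename> -command ifstat -a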



While investigating we have found that the file systems are fine just after a reboot and you can ls each mount, so they are initially all OK. It is when starting the application, and so putting a bigger load over the network, that the file systems stop responding.
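
A rough way to generate a similar read load without starting the databases is sketched below; it simply reads the first few files it finds under every NFS mount in parallel. The mount list comes from /proc/mounts, so nothing here is specific to this environment.

# hammer every NFS mount with parallel sequential reads
for m in $(awk '$3 == "nfs" {print $2}' /proc/mounts); do
    ( find "$m" -type f 2>/dev/null | head -20 | xargs -I{} dd if={} of=/dev/null bs=1M 2>/dev/null ) &
done
wait
# in another terminal, watch for the timeouts:
#   tail -f /var/log/messages | grep 'nfs: server'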


Regards

Mark

From: Parisi, Justin [mailto:Justin.Parisi@netapp.com]
Sent: 23 January 2018 22:33
To: Steiner, Jeffrey; Mark Saunders; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

In fact, maybe look at this as a root cause… do your NFS interfaces share nodes with admin interfaces?

“NFS issues were caused by using a NAS interface on the same node as the SVM admin interface, once I realised we moved all servers NFS to the node without the admin interface.”

From: Parisi, Justin
Sent: Tuesday, January 23, 2018 5:30 PM
To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

This community post also does a good job explaining it:

https://community.netapp.com/t5/Data-ONTAP-Discussions/NetApp-Ontap-9-2-Upgrade-review-your-network-first/td-p/136657

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Parisi, Justin
Sent: Tuesday, January 23, 2018 5:28 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

The network stack changed in 9.2 and IP fastpath was removed. But fastpath was mainly for more efficient routing.

https://library.netapp.com/ecmdocs/ECMP1114171/html/GUID-8276014A-16EB-4902-9EDC-868C5292381B.html

The stack was changed to a more standard BSD stack, so fastpath was no longer needed. It’s possible that’s an issue here, but I’d suggest getting network sniffs on each endpoint of the network to see where the packet is being dropped.
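
A sketch of capturing both ends during a hang is below. The interface names, node name and LIF address are placeholders; pktt writes its trace files on the node (here under /etc/crash) for download and analysis afterwards.

# on the client, capture traffic to and from the SVM data LIF
tcpdump -i eth1 -s 0 -w /tmp/nfs-client.pcap host <lif-ip> and port 2049

# on the node owning that LIF, via the nodeshell
node run -node <nodename> -command pktt start e0a -d /etc/crash
# ... reproduce the hang, then ...
node run -node <nodename> -command pktt stop all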

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Steiner, Jeffrey
Sent: Tuesday, January 23, 2018 5:24 PM
To: Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

I should have asked - is this SAP HANA or something like SAP on an Oracle database?

Also, what do they mean by "it's not on the IMT"? Virtually everything NFS is on the IMT. We support any NFSv3 and NFSv4 client that obeys the specification. There's a tiny number of exceptions, but generally speaking we'll support Linux, Solaris, AIX, mainframe, OpenVMS, HP-UX, Oracle DNFS, AS/400, etc. There really should be no issue there.

The thing about fastpath does ring a few bells.

From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Tuesday, January 23, 2018 11:18 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

Thanks for the quick replies, and sorry for the delay in responding, but I had been working on this since 5am so had to get some sleep.

I have a call open with NetApp but have had the cookie-cutter response that it isn't on the Interoperability Matrix Tool as a supported version (it wasn't when we were on 9.1 either).

A third party we are in contact with has sent me a link to details about fastpath being removed, but I don't think we were using it, so that may be another false lead.

The mount options were kept fairly straightforward:

nfs nolock,_netdev,udp 0 0

and we have also tried the same options as one of the production servers, which had tuned options; that server is on another cluster so isn't affected by this yet.

nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600 0 0

How would I be able to tell if we are using DNFS?
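
Assuming this is SAP on an Oracle database, two quick checks are sketched below: Direct NFS announces itself in the alert log at instance startup, and the v$dnfs_servers view only returns rows when DNFS is actually in use. The alert log path assumes an 11g-style diagnostic destination, so adjust it for your layout.

# look for the Direct NFS ODM banner from the last startup
grep -i "direct nfs" $ORACLE_BASE/diag/rdbms/*/*/trace/alert_*.log

# or ask the instance directly; any rows returned mean DNFS is in use
sqlplus -s / as sysdba <<'EOF'
select svrname, dirname from v$dnfs_servers;
EOF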

I will send you the support details tomorrow when I am back in the office.

Regards

Mark


From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 23 January 2018 17:29
To: Fenn, Michael; Mark Saunders; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

It takes a lot for an ONTAP system to flat-out be unable to respond. Unless the timeout parameters are exceedingly short, you shouldn't reach that point, especially with anything capable of running ONTAP 9.2.

I'd open a support case on this one. In addition, if you want to trigger an autosupport and send me the serial numbers directly I can take a glance at a few stats to see if anything looks odd.

From: Fenn, Michael [mailto:fennm@DEShawResearch.com]
Sent: Tuesday, January 23, 2018 6:23 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; toasters@teaparty.net
Subject: Re: NFS issue after upgrading filers to 9.2P2

The messages are not necessarily indicative of a network problem.

The kernel prints "nfs: server … not responding, still trying" when an operation times out (timeo deciseconds) for the configured (retrans) number of tries. Once the server responds, then it prints "nfs: server … OK".

Networking problems are certainly one reason that an operation would time out, but not the only reason. An overloaded or down file server will cause the same effect.
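
For anyone checking what timeo and retrans a mount is actually running with, either of the commands below will show the effective options on the client; nothing here is specific to this environment.

# per-mount NFS options as the kernel is using them (timeo is in tenths of a second)
nfsstat -m
grep ' nfs ' /proc/mounts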

Thanks,
Michael

RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
If that's 441463, I'm skeptical that's the problem. That might cause problems during boot, but I wouldn’t expect it to cause problems later. Also, an ONTAP upgrade shouldn't affect this.

I'll subscribe to the case and follow along. The stats you posted do show some possible problems: there was some flow control activity, and the SACK numbers look high to me.
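
A crude way to watch those counters while the hang is reproduced is sketched below; it just re-runs the same netstat command shown earlier and pulls out the flow control, SACK and retransmit lines. The cluster management address and node name are placeholders for the affected cluster.

# sample the interesting counters every 30 seconds; steadily climbing values
# during a hang point at flow control or retransmission trouble
while true; do
    date
    ssh admin@<cluster-mgmt> "node run -node <nodename> -command netstat -sp tcp" \
        | grep -E "ONTAP flow control|SACK recovery|retransmit timeouts|retransmitted"
    sleep 30
done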

From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Wednesday, January 24, 2018 1:02 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Parisi, Justin <Justin.Parisi@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
After a bit of an email search, it was this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=321111


From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 24 January 2018 12:07
To: Mark Saunders; Parisi, Justin; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

If that's 441463, I'm skeptical that's the problem. That might cause problems during boot, but I wouldn’t expect it to cause problems later. Also, an ONTAP upgrade shouldn't affect this.

I'll subscribe to the case and follow along. The stats below do show some possible problems. There was some flow control activity, and the SACK numbers look high to me.

From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Wednesday, January 24, 2018 1:02 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Parisi, Justin <Justin.Parisi@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

I will try to find the kernel bug number, as I can't see it in the documentation for the server; there is just the following note.


RHEL 5.11 has a bug where NFS mounts mounted after network initialization at boot run with an increased number of TCP requests (approx 10x more), which causes an RPC backlog and restricts network throughput on the NFS mounts.

To resolve this a script has been created to restart the networking before the NFS mounts are mounted by netfs at boot. By default netfs runs at S25 on runlevels 3, 4 and 5, so the NFS fix is set to run at S24 on the same runlevels (sketched below).
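A rough sketch of that fix script, for reference (the script name is ours, and it assumes the stock RHEL 5 chkconfig/init.d layout):

#!/bin/bash
# /etc/init.d/nfs-netfix
# chkconfig: 345 24 76
# description: restart networking before netfs (S25) mounts the NFS shares
case "$1" in
  start)
    # bounce the network so the NFS mounts start with a clean RPC queue
    /sbin/service network restart
    ;;
  stop|status)
    ;;
esac
exit 0

registered with "chkconfig --add nfs-netfix" so it runs at S24 on runlevels 3, 4 and 5.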

PGUKCSTGCL01::*> node run -node PGUKCSTGCL01-01 -command netstat -sp tcp
---- Default IPSpace ----
tcp:
900103907 packets sent
476280230 data packets (4676048494764 bytes)
61984 data packets (82328048 bytes) retransmitted
2065 data packets unnecessarily retransmitted
0 resends initiated by MTU discovery
235945463 ack-only packets (517654 delayed)
0 URG only packets
0 window probe packets
187429557 window update packets
333130 control packets
1097649475 packets received
399065895 acks (for 4676054895668 bytes)
2174268 duplicate acks
0 acks for unsent data
723809875 packets (4886339861169 bytes) received in-sequence
1649638 completely duplicate packets (98637034 bytes)
2 old duplicate packets
990 packets with some dup. data (214519 bytes duped)
10872239 out-of-order packets (15192422547 bytes)
0 packets (0 bytes) of data after window
0 window probes
26845 window update packets
2 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
37581 discarded due to memory problems
1441 connection requests
412966 connection accepts
0 bad connection attempts
0 listen queue overflows
305109 ignored RSTs in the windows
414403 connections established (including accepts)
443890 connections closed (including 139609 drops)
151376 connections updated cached RTT on close
151388 connections updated cached RTT variance on close
140203 connections updated cached ssthresh on close
0 embryonic connections dropped
388403781 segments updated rtt (of 258539924 attempts)
6843 retransmit timeouts
11 connections dropped by rexmit timeout
3 persist timeouts
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
92323 keepalive timeouts
92323 keepalive probes sent
0 connections dropped by keepalive
351415606 correct ACK header predictions
684179955 correct data packet header predictions
412966 syncache entries added
155 retransmitted
302 dupsyn
0 dropped
412966 completed
0 bucket overflow
0 cache overflow
0 reset
0 stale
0 aborted
0 badack
0 unreach
0 zone failures
412966 cookies sent
0 cookies received
112 hostcache entries added
0 bucket overflow
16181 SACK recovery episodes
51541 segment rexmits in SACK recovery episodes
70735551 byte rexmits in SACK recovery episodes
277116 SACK options (SACK blocks) received
11457931 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 packets with ECN CE bit set
0 packets with ECN ECT(0) bit set
0 packets with ECN ECT(1) bit set
0 successful ECN handshakes
0 times ECN reduced the congestion window
251543 times in ONTAP flow control
0 times exited ONTAP flow control
0 times in ONTAP flow control for zero send window
251543 times in ONTAP flow control for non-zero send window
0 connection resets due to ONTAP extreme flow control
0 times in ONTAP extreme flow control
0 is the maximum flow control reset threshold reached during receive
4 is the maximum flow control reset threshold reached during send
0 bytes is send buffer value during last reset
0 bytes is send buffer hiwat mark during last reset
79 times the receive window was closed
44 dropped due to flowcontrol
188382441 segments sent using TSO
4595103991390 bytes sent using TSO
73883767 TSO segments truncated
1069 TSO wrapped sequence space segments
0 segments sent using TSO6
0 bytes sent using TSO6
0 TSO6 segments truncated
0 TSO6 wrapped sequence space segments
366670238 recv upcalls batched in HP
302647105 recv upcalls made in HP
296877004 recv upcalls made in HP because of PSH
2291336 recv upcalls made in HP because of sb_hiwat
3481239 recv upcalls made in HP because of both PSH and sb_hiwat
6733214 recv upcall batch timeouts
16594187 times recv upcall read partial sb_cc in HP
631681762 segments received using LRO
4816721023400 bytes received using LRO
0 segments received using LRO6
0 bytes received using LRO6
---- ANYVSERVER IPSpace ----
tcp:
0 packets sent
0 data packets (0 bytes)
0 data packets (0 bytes) retransmitted
0 data packets unnecessarily retransmitted
0 resends initiated by MTU discovery
0 ack-only packets (0 delayed)
0 URG only packets
0 window probe packets
0 window update packets
0 control packets
0 packets received
0 acks (for 0 bytes)
0 duplicate acks
0 acks for unsent data
0 packets (0 bytes) received in-sequence
0 completely duplicate packets (0 bytes)
0 old duplicate packets
0 packets with some dup. data (0 bytes duped)
0 out-of-order packets (0 bytes)
0 packets (0 bytes) of data after window
0 window probes
0 window update packets
0 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
0 discarded due to memory problems
0 connection requests
0 connection accepts
0 bad connection attempts
0 listen queue overflows
0 ignored RSTs in the windows
0 connections established (including accepts)
7 connections closed (including 0 drops)
0 connections updated cached RTT on close
0 connections updated cached RTT variance on close
0 connections updated cached ssthresh on close
0 embryonic connections dropped
0 segments updated rtt (of 0 attempts)
0 retransmit timeouts
0 connections dropped by rexmit timeout
0 persist timeouts
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
0 keepalive timeouts
0 keepalive probes sent
0 connections dropped by keepalive
0 correct ACK header predictions
0 correct data packet header predictions
0 syncache entries added
0 retransmitted
0 dupsyn
0 dropped
0 completed
0 bucket overflow
0 cache overflow
0 reset
0 stale
0 aborted
0 badack
0 unreach
0 zone failures
0 cookies sent
0 cookies received
0 hostcache entries added
0 bucket overflow
0 SACK recovery episodes
0 segment rexmits in SACK recovery episodes
0 byte rexmits in SACK recovery episodes
0 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 packets with ECN CE bit set
0 packets with ECN ECT(0) bit set
0 packets with ECN ECT(1) bit set
0 successful ECN handshakes
0 times ECN reduced the congestion window
0 times in ONTAP flow control
0 times exited ONTAP flow control
0 times in ONTAP flow control for zero send window
0 times in ONTAP flow control for non-zero send window
0 connection resets due to ONTAP extreme flow control
0 times in ONTAP extreme flow control
0 is the maximum flow control reset threshold reached during receive
0 is the maximum flow control reset threshold reached during send
0 bytes is send buffer value during last reset
0 bytes is send buffer hiwat mark during last reset
0 times the receive window was closed
0 dropped due to flowcontrol
0 segments sent using TSO
0 bytes sent using TSO
0 TSO segments truncated
0 TSO wrapped sequence space segments
0 segments sent using TSO6
0 bytes sent using TSO6
0 TSO6 segments truncated
0 TSO6 wrapped sequence space segments
0 recv upcalls batched in HP
0 recv upcalls made in HP
0 recv upcalls made in HP because of PSH
0 recv upcalls made in HP because of sb_hiwat
0 recv upcalls made in HP because of both PSH and sb_hiwat
0 recv upcall batch timeouts
0 times recv upcall read partial sb_cc in HP
0 segments received using LRO
0 bytes received using LRO
0 segments received using LRO6
0 bytes received using LRO6
---- Cluster IPSpace ----
tcp:
350960787 packets sent
253625385 data packets (2042642509989 bytes)
11525 data packets (120517203 bytes) retransmitted
63 data packets unnecessarily retransmitted
0 resends initiated by MTU discovery
38550609 ack-only packets (15348627 delayed)
0 URG only packets
1 window probe packet
56728197 window update packets
2035396 control packets
341097715 packets received
224460892 acks (for 2042726150883 bytes)
6840725 duplicate acks
0 acks for unsent data
271870811 packets (3031038679110 bytes) received in-sequence
195650 completely duplicate packets (4506 bytes)
49 old duplicate packets
0 packets with some dup. data (0 bytes duped)
205398 out-of-order packets (565766073 bytes)
0 packets (0 bytes) of data after window
0 window probes
2011210 window update packets
123 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
0 discarded due to memory problems
923539 connection requests
456892 connection accepts
0 bad connection attempts
0 listen queue overflows
529 ignored RSTs in the windows
1271558 connections established (including accepts)
1379180 connections closed (including 1101 drops)
369895 connections updated cached RTT on close
370750 connections updated cached RTT variance on close
12122 connections updated cached ssthresh on close
108207 embryonic connections dropped
224454663 segments updated rtt (of 207849890 attempts)
48471 retransmit timeouts
14 connections dropped by rexmit timeout
1 persist timeout
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
152128 keepalive timeouts
147328 keepalive probes sent
4800 connections dropped by keepalive
45057764 correct ACK header predictions
104981779 correct data packet header predictions
457057 syncache entries added
61 retransmitted
0 dupsyn
0 dropped
456892 completed
0 bucket overflow
0 cache overflow
165 reset
0 stale
0 aborted
0 badack
0 unreach
0 zone failures
457057 cookies sent
0 cookies received
61 hostcache entries added
0 bucket overflow
1684 SACK recovery episodes
2491 segment rexmits in SACK recovery episodes
5618157 byte rexmits in SACK recovery episodes
17518 SACK options (SACK blocks) received
86946 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 packets with ECN CE bit set
0 packets with ECN ECT(0) bit set
0 packets with ECN ECT(1) bit set
0 successful ECN handshakes
0 times ECN reduced the congestion window
0 times in ONTAP flow control
0 times exited ONTAP flow control
0 times in ONTAP flow control for zero send window
0 times in ONTAP flow control for non-zero send window
0 connection resets due to ONTAP extreme flow control
0 times in ONTAP extreme flow control
0 is the maximum flow control reset threshold reached during receive
0 is the maximum flow control reset threshold reached during send
0 bytes is send buffer value during last reset
0 bytes is send buffer hiwat mark during last reset
0 times the receive window was closed
0 dropped due to flowcontrol
56607835 segments sent using TSO
1679494142753 bytes sent using TSO
36473474 TSO segments truncated
394 TSO wrapped sequence space segments
0 segments sent using TSO6
0 bytes sent using TSO6
0 TSO6 segments truncated
0 TSO6 wrapped sequence space segments
4879278 recv upcalls batched in HP
90401291 recv upcalls made in HP
90401967 recv upcalls made in HP because of PSH
52 recv upcalls made in HP because of sb_hiwat
325 recv upcalls made in HP because of both PSH and sb_hiwat
32882 recv upcall batch timeouts
524 times recv upcall read partial sb_cc in HP
160827213 segments received using LRO
2789346524807 bytes received using LRO
0 segments received using LRO6
0 bytes received using LRO6
---- ips_4294967289 IPSpace ----
tcp:
0 packets sent
0 data packets (0 bytes)
0 data packets (0 bytes) retransmitted
0 data packets unnecessarily retransmitted
0 resends initiated by MTU discovery
0 ack-only packets (0 delayed)
0 URG only packets
0 window probe packets
0 window update packets
0 control packets
0 packets received
0 acks (for 0 bytes)
0 duplicate acks
0 acks for unsent data
0 packets (0 bytes) received in-sequence
0 completely duplicate packets (0 bytes)
0 old duplicate packets
0 packets with some dup. data (0 bytes duped)
0 out-of-order packets (0 bytes)
0 packets (0 bytes) of data after window
0 window probes
0 window update packets
0 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
0 discarded due to memory problems
0 connection requests
0 connection accepts
0 bad connection attempts
0 listen queue overflows
0 ignored RSTs in the windows
0 connections established (including accepts)
0 connections closed (including 0 drops)
0 connections updated cached RTT on close
0 connections updated cached RTT variance on close
0 connections updated cached ssthresh on close
0 embryonic connections dropped
0 segments updated rtt (of 0 attempts)
0 retransmit timeouts
0 connections dropped by rexmit timeout
0 persist timeouts
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
0 keepalive timeouts
0 keepalive probes sent
0 connections dropped by keepalive
0 correct ACK header predictions
0 correct data packet header predictions
0 syncache entries added
0 retransmitted
0 dupsyn
0 dropped
0 completed
0 bucket overflow
0 cache overflow
0 reset
0 stale
0 aborted
0 badack
0 unreach
0 zone failures
0 cookies sent
0 cookies received
0 hostcache entries added
0 bucket overflow
0 SACK recovery episodes
0 segment rexmits in SACK recovery episodes
0 byte rexmits in SACK recovery episodes
0 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 packets with ECN CE bit set
0 packets with ECN ECT(0) bit set
0 packets with ECN ECT(1) bit set
0 successful ECN handshakes
0 times ECN reduced the congestion window
0 times in ONTAP flow control
0 times exited ONTAP flow control
0 times in ONTAP flow control for zero send window
0 times in ONTAP flow control for non-zero send window
0 connection resets due to ONTAP extreme flow control
0 times in ONTAP extreme flow control
0 is the maximum flow control reset threshold reached during receive
0 is the maximum flow control reset threshold reached during send
0 bytes is send buffer value during last reset
0 bytes is send buffer hiwat mark during last reset
0 times the receive window was closed
0 dropped due to flowcontrol
0 segments sent using TSO
0 bytes sent using TSO
0 TSO segments truncated
0 TSO wrapped sequence space segments
0 segments sent using TSO6
0 bytes sent using TSO6
0 TSO6 segments truncated
0 TSO6 wrapped sequence space segments
0 recv upcalls batched in HP
0 recv upcalls made in HP
0 recv upcalls made in HP because of PSH
0 recv upcalls made in HP because of sb_hiwat
0 recv upcalls made in HP because of both PSH and sb_hiwat
0 recv upcall batch timeouts
0 times recv upcall read partial sb_cc in HP
0 segments received using LRO
0 bytes received using LRO
0 segments received using LRO6
0 bytes received using LRO6
---- ACP IPSpace ----
tcp:
86643 packets sent
17496 data packets (419904 bytes)
0 data packets (0 bytes) retransmitted
0 data packets unnecessarily retransmitted
0 resends initiated by MTU discovery
33848 ack-only packets (0 delayed)
0 URG only packets
0 window probe packets
23 window update packets
35276 control packets
74406 packets received
51152 acks (for 436064 bytes)
4798 duplicate acks
0 acks for unsent data
20938 packets (1251746 bytes) received in-sequence
0 completely duplicate packets (0 bytes)
0 old duplicate packets
0 packets with some dup. data (0 bytes duped)
0 out-of-order packets (0 bytes)
0 packets (0 bytes) of data after window
0 window probes
0 window update packets
1686 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
0 discarded due to memory problems
17605 connection requests
176 connection accepts
0 bad connection attempts
0 listen queue overflows
0 ignored RSTs in the windows
17672 connections established (including accepts)
17781 connections closed (including 2 drops)
0 connections updated cached RTT on close
0 connections updated cached RTT variance on close
0 connections updated cached ssthresh on close
0 embryonic connections dropped
51152 segments updated rtt (of 52750 attempts)
109 retransmit timeouts
0 connections dropped by rexmit timeout
0 persist timeouts
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
0 keepalive timeouts
0 keepalive probes sent
0 connections dropped by keepalive
17474 correct ACK header predictions
4954 correct data packet header predictions
176 syncache entries added
0 retransmitted
0 dupsyn
0 dropped
176 completed
0 bucket overflow
0 cache overflow
0 reset
0 stale
0 aborted
0 badack
0 unreach
0 zone failures
176 cookies sent
0 cookies received
0 hostcache entries added
0 bucket overflow
0 SACK recovery episodes
0 segment rexmits in SACK recovery episodes
0 byte rexmits in SACK recovery episodes
0 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 packets with ECN CE bit set
0 packets with ECN ECT(0) bit set
0 packets with ECN ECT(1) bit set
0 successful ECN handshakes
0 times ECN reduced the congestion window
0 times in ONTAP flow control
0 times exited ONTAP flow control
0 times in ONTAP flow control for zero send window
0 times in ONTAP flow control for non-zero send window
0 connection resets due to ONTAP extreme flow control
0 times in ONTAP extreme flow control
0 is the maximum flow control reset threshold reached during receive
0 is the maximum flow control reset threshold reached during send
0 bytes is send buffer value during last reset
0 bytes is send buffer hiwat mark during last reset
0 times the receive window was closed
0 dropped due to flowcontrol
0 segments sent using TSO
0 bytes sent using TSO
0 TSO segments truncated
0 TSO wrapped sequence space segments
0 segments sent using TSO6
0 bytes sent using TSO6
0 TSO6 segments truncated
0 TSO6 wrapped sequence space segments
0 recv upcalls batched in HP
0 recv upcalls made in HP
0 recv upcalls made in HP because of PSH
0 recv upcalls made in HP because of sb_hiwat
0 recv upcalls made in HP because of both PSH and sb_hiwat
0 recv upcall batch timeouts
0 times recv upcall read partial sb_cc in HP
0 segments received using LRO
0 bytes received using LRO
0 segments received using LRO6
0 bytes received using LRO6


Server tcp entries

[root@jwukccsbci ~]# sysctl -a | grep slot
sunrpc.tcp_slot_table_entries = 128
sunrpc.udp_slot_table_entries = 128
dev.cdrom.info = drive # of slots: 1

From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 24 January 2018 11:53
To: Mark Saunders; Parisi, Justin; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

Could this be TCP slot tables? Flow control capabilities on ONTAP continue to improve. If you don't have TCP slot tables capped at 128 you could see quasi-hangs like this.

Complete details are in TR-3633, but these are the two that you want to watch:

[root@stlrx300s7-145 mkdb]# sysctl -a | grep slot
sunrpc.tcp_max_slot_table_entries = 128
sunrpc.tcp_slot_table_entries = 128

Newer versions of linux will allow a ridiculous number of unacknowledged RPC operations to build up. The result can be sending ONTAP into a flow control mode until the OS catches up. We see problems mostly in slow clients. For example, if you're trying to read a lot of data from a host with 1Gb connectivity on a high-end ONTAP system the OS can ask for data quicker than it can process the responses.
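If they aren't capped already, the usual way to pin them (per TR-3633; the file name below is just a convention) is via the sunrpc module options:

# /etc/modprobe.d/sunrpc.conf
options sunrpc tcp_slot_table_entries=128 tcp_max_slot_table_entries=128

or at runtime:

sysctl -w sunrpc.tcp_slot_table_entries=128
sysctl -w sunrpc.tcp_max_slot_table_entries=128

Note that on a 2.6.18 kernel like RHEL 5 only tcp_slot_table_entries exists; the max variant appeared in later kernels.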

From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Wednesday, January 24, 2018 12:26 PM
To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

Justin

I have just checked the SVM and there are no admin/management interfaces configured for it; there are three data LIFs for different VLANs. I have checked through our other systems this morning and there are no issues in VMware (5.5) or SLES 11/12, so this is just with the Red Hat servers.

I have checked the interfaces at the server end and they are not showing errors or dropped packets. On the filer end we have four physical ports in an interface group with VLANs on top. I have run "statistics start -object nfs_exports_access_cache", which when checked doesn't report any errors.

On the server interface

eth1      Link encap:Ethernet  HWaddr 00:50:56:A5:0D:6A
          inet addr:10.240.1.30  Bcast:10.240.1.31  Mask:255.255.255.224
          inet6 addr: fe80::250:56ff:fea5:d6a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:127209 errors:0 dropped:0 overruns:0 frame:0
          TX packets:26100 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:104158360 (99.3 MiB)  TX bytes:14489402 (13.8 MiB)



While investigating we have found that the file system is fine just after a reboot and you can ls each mount, so they are initially all OK. It is when starting the application, and therefore putting a bigger load over the network, that the file systems stop responding.
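(A quick way to watch for that from the client side, assuming the stock nfsstat from nfs-utils is installed, is something like:

watch -d 'nfsstat -rc'

and see whether the RPC retrans counter climbs at the point the hang starts.)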


Regards

Mark

From: Parisi, Justin [mailto:Justin.Parisi@netapp.com]
Sent: 23 January 2018 22:33
To: Steiner, Jeffrey; Mark Saunders; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

In fact, maybe look at this as a root cause… do your NFS interfaces share nodes with admin interfaces?

“NFS issues were caused by using a NAS interface on the same node as the SVM admin interface, once I realised we moved all servers NFS to the node without the admin interface.”
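A quick way to check (field names from memory, so treat the syntax as approximate):

network interface show -vserver JWUKCSVM01 -fields lif,role,home-node,curr-node

and compare where the data LIFs and the SVM management LIF are currently homed.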

From: Parisi, Justin
Sent: Tuesday, January 23, 2018 5:30 PM
To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

This community post also does a good job explaining it:

https://community.netapp.com/t5/Data-ONTAP-Discussions/NetApp-Ontap-9-2-Upgrade-review-your-network-first/td-p/136657

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Parisi, Justin
Sent: Tuesday, January 23, 2018 5:28 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

The network stack changed in 9.2 and IP fastpath was removed. But fastpath was mainly for more efficient routing.

https://library.netapp.com/ecmdocs/ECMP1114171/html/GUID-8276014A-16EB-4902-9EDC-868C5292381B.html

The stack was changed to a more standard BSD stack, so fastpath was no longer needed. It’s possible that’s an issue here, but I’d suggest getting network sniffs on each endpoint of the network to see where the packet is being dropped.
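On the client side a minimal capture would be something along these lines (the interface name and LIF address are placeholders):

tcpdump -i eth1 -s 0 -w /tmp/nfs-client.pcap host <SVM-data-LIF> and port 2049

run while the hang is being reproduced, with a matching trace taken at the switch or filer end so the two can be compared.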

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Steiner, Jeffrey
Sent: Tuesday, January 23, 2018 5:24 PM
To: Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

I should have asked - is this SAP HANA or something like SAP on an Oracle database?

Also, what do they mean "it's not on the IMT?" Virtually everything NFS is on the IMT. We support any NFSv3 and NFSv4 client that obeys the specification. There's a tiny number of exceptions, but generally speaking we'll support linux, Solaris, AIX, mainframe, OpenVMS, HP-UX, Oracle DNFS, AS/400, etc. There really should be no issue there.

The thing about fastpath does ring a few bells.

RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
Jeffrey> Could this be TCP slot tables? Flow control capabilities on
Jeffrey> ONTAP continue to improve. If you don't have TCP slot tables
Jeffrey> capped at 128 you could see quasi-hangs like this.

But the problem was with UDP NFS traffic, right?

I've run into weird problems in the past. I think we had a problem
where the volumes holding the Oracle tablespaces were mounted with
"forcedirectio", but if we had any executables on there, they just
wouldn't work and we'd have all kinds of problems.

Maybe it's something like that?

And just to confirm: other clients running newer RHEL versions, or
SLES, using the exact same interfaces/IPs/mountpoints from the
NetApp cluster don't show the problem?

Just because VMware and SLES 11/12 aren't seeing the problem doesn't
mean you don't have some weird configuration issue somewhere. What
does "Config Advisor" say when run against your cluster?

Remember, change just one thing at a time, otherwise you're going to go
mad. Of course I'm sure the business is jumping up and down
screaming, which makes it hard to be methodical.

Good luck and let us know what you find please!

John

RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
Well, as is always the way with these random issues, it was the simplest thing that fixed the problem. After a test mount of the storage on one of the SLES servers, which worked fine, we have changed the Red Hat server mounts to TCP, as that was what we used for the SLES mounts, and the file system is fine and all databases have started up and been running for a number of hours with no problems.
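For the record, and assuming nothing else in the line was changed, the fstab options now read as the original entry with udp swapped for tcp, i.e.:

nfs nolock,_netdev,tcp 0 0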

I have chased down the person who set the servers up; he was trying different options to see what gave the best performance and left UDP in the options as, while he wasn't sure it was any better, it hadn't got any worse.

Thank you to everyone for the quick replies, as if I had been waiting on the ticket I logged I would be no further forward than Tuesday morning.



-----Original Message-----
From: John Stoffel [mailto:john@stoffel.org]
Sent: 24 January 2018 17:20
To: Steiner, Jeffrey
Cc: Mark Saunders; Parisi, Justin; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2




RE: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
Mark> Well as is always the way with these random issues it is the
Mark> simplest thing to fix the problem. After a test to mount the
Mark> storage on one of the SLES servers which worked fine we have
Mark> changed the redhat server mounts to tcp as that was what we used
Mark> for the SLES mounts and the file system is fine and all
Mark> databases have started up and been running for a number of hours
Mark> with no problems.

That's good to hear! The takeaway I have from this is that NFS over
UDP is not something you should ever be using.

Mark> I have chased down the person who set the servers up and he was
Mark> trying different options to see what gave the best performance
Mark> and left udp in the options as while he wasn't sure that it was
Mark> any better it hadn't got any worse.

I'm curious how they did their testing, and what the cutoff was
for deciding whether a change was worth keeping or not. All the
docs I've read from NetApp and Oracle say to use NFS over TCP, with
large read/write block sizes and then some other options in special
cases.

In my mind, the advantages of TCP over UDP even for regular NFS
traffic make it a no brainer.
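Even a crude comparison usually settles it: mount the same export twice,
once over each protocol, and stream a big file through each (paths are
placeholders, and drop caches or use different files between runs):

mount -o vers=3,proto=tcp,rsize=32768,wsize=32768 filer:/vol/test /mnt/tcp
mount -o vers=3,proto=udp,rsize=32768,wsize=32768 filer:/vol/test /mnt/udp
dd if=/mnt/tcp/bigfile of=/dev/null bs=1M count=4096
dd if=/mnt/udp/bigfile of=/dev/null bs=1M count=4096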

Mark> Thank you to everyone with the quick replies as if I had been
Mark> waiting on the ticket I logged I would be no further forward
Mark> that Tuesday morning.

This is why I love this mailing list, so many helpful people on here.

John
Re: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
On Thu, Jan 25, 2018 at 08:44:00AM -0500, John Stoffel wrote:
>
> That's good to hear! The takeaway I have from this is that NFS over
> UDP is not something you should ever be using.
>
> Mark> I have chased down the person who set the servers up and he was
> Mark> trying different options to see what gave the best performance
> Mark> and left udp in the options as while he wasn't sure that it was
> Mark> any better it hadn't got any worse.
>
> I'm curious about how they did their testing? And what the cutoff was
> for making changes and whether it was worth keeping or not. All the
> docs I've read from Netapp and Oracle say to use NFS over tcp, with
> large read/write block sizes and then some other options in special
> cases.
>
> In my mind, the advantages of TCP over UDP even for regular NFS
> traffic make it a no brainer.
>

Hmmmm..... disagree.

In the best of all possible worlds UDP wins. It's fast,
and you can overlap multiple reads and writes much
more easily than with TCP. Those guys who invented NFS used it
for a reason. If I wanted raw performance I would use UDP.

However, in lots of cases UDP has problems.
Network devices are often optimized for TCP (firewalls
are a prime example) and as you say
packet sizes can be larger with TCP.

I agree that TCP is a better bet in general but I do understand
why people may want to use UDP.

It's interesting that the newer data transfer protocols seem to
be UDP based, Aspera for example. I wonder if those
types of protocols could make sense for NFS?

Regards,
pdg
Re: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
>>>>> "Peter" == Peter D Gray <pdg@uow.edu.au> writes:

Peter> On Thu, Jan 25, 2018 at 08:44:00AM -0500, John Stoffel wrote:
>>
>> That's good to hear! The takeaway I have from this is that NFS over
>> UDP is not something you should ever be using.
>>
Mark> I have chased down the person who set the servers up and he was
Mark> trying different options to see what gave the best performance
Mark> and left udp in the options as while he wasn't sure that it was
Mark> any better it hadn't got any worse.
>>
>> I'm curious about how they did their testing? And what the cutoff was
>> for making changes and whether it was worth keeping or not. All the
>> docs I've read from Netapp and Oracle say to use NFS over tcp, with
>> large read/write block sizes and then some other options in special
>> cases.
>>
>> In my mind, the advantages of TCP over UDP even for regular NFS
>> traffic make it a no brainer.
>>

Peter> Hmmmm..... disagree.

Peter> In the best of all possible worlds UDP wins. Its fast and you
Peter> can overlap multiple reads and write much easier than
Peter> TCP. Those guys who invented NFS used it for a reason. If I
Peter> wanted raw performance I would use UDP.

They used UDP at the time because computers and networks were *slow*
and the TCP overhead was much higher then, especially since they mostly
had hubs back then. Under contention, NFS over TCP would slow way down.
I would argue that this is a false economy today when we have 10g
networks. *grin*

Peter> However, in lots of cases UDP has problems. Network devices
Peter> are often optimized for TCP (firewalls are a prime example) and
Peter> as you say packet sizes can be larger with TCP.

Exactly. Chasing a few percent of speed (or even 10%!) by using UDP
is not a great idea. Especially since I suspect that NFS over UDP is a
much less tested version of the protocol these days.

Peter> I agree that TCP is a better bet in general but I do understand
Peter> why people may want to use UDP.

Peter> Its interesting that the new data transfer algorithms seem to
Peter> be UDP based. Aspera for example. I wonder if those types of
Peter> protocols could make sense for NFS?

If you're willing to do your own congestion control and packet
handling, then sure, it can make sense. Especially if you're working
over a WAN link and you don't mind out-of-order packets and can handle
it better in your own server software. But how many filesystems are
doing this? Especially for POSIX compatibility?

John
Re: NFS issue after upgrading filers to 9.2P2 [ In reply to ]
On 30 Jan 2018, at 7:57 am, John Stoffel <john@stoffel.org> wrote:

They used UDP at the time because computers and networks were *slow*
and the TCP overhead was much higher then, especially since they mostly
had hubs back then. Under contention, NFS over TCP would slow way down.
I would argue that this is a false economy today when we have 10g
networks. *grin*

Indeed. NFS (over UDP) was invented when Ethernet meant a thick coaxial cable running around the building, shared between all machines, and the speed was 10Mb/s (that’s bits not bytes). Processor speeds were typically 10-20MHz in high end servers. I was around then.

Nowadays, network hardware is all optimised for TCP, whereas there is not much you can do with UDP without being aware of the application layer (7).

TCP offload engines in the network interfaces handle packet assembly/disassembly, checksum computations and other things. Network switches and routers can optimise TCP traffic.

UDP still has its place for things like VPN, media streaming and specialised applications like Aspera. But it doesn’t make sense for standard applications and certainly not file sharing, where data integrity is paramount.

Jeremy


--
Jeremy Webber
Senior Systems Engineer

T: +61 2 9383 4800 (main)
D: +61 2 8310 3577 (direct)
E: Jeremy.Webber@al.com.au

Building 54 / FSA #19, Fox Studios Australia, 38 Driver Avenue
Moore Park, NSW 2021
AUSTRALIA
