Mailing List Archive

Summary: Gig Enet and Filer NFS problems
> Problem:
> Clients using NFS (Sun and HP Unix workstations) were taking much
> longer to open files.
> Home, Tools, and Project data had been moved to the filer, which was
> running GigE.
> Opening a drawing took 2 hrs, vs. 20 min on an older Sun server
> running 100 Mb Ethernet.
> Data ONTAP v6.1.2R1
>
> Once we made changes, it now takes 18 min to open the same drawing
> (this is a normal time for the types of schematics they are working on).
>
> Solution (so far):
> Use options nfs.udp.xfersize 8192 (instead of the default 32768).
> Also, set the filer to match the Cisco switch for flow control
> (the Cisco side has send flow control; the filer now has receive --
> see the sketch just below).
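>
> (For reference, a rough sketch of the filer-side commands involved; the
> interface name "e10" is made up, and the ifconfig flowcontrol syntax
> should be checked against your ONTAP release:
>
> options nfs.udp.xfersize 8192
> ifconfig e10 flowcontrol receive
>
> The options setting persists on its own, but the ifconfig line also
> needs to go into the filer's /etc/rc to survive a reboot.)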
>
> ToDo:
> Investigate whether full flow control will improve performance further.
> Try enabling NFS over TCP.
> Make sure that the xfersize change does not seriously affect the few
> machines we have on GigE that mount from the filer.
>
> Thanks to the following for all the help and email (hope I didn't forget
> anyone).
>
> Philip Meidell
> John F. Detke
> Paul Jones
> Neil Stichbury
> Mike Ball
> Adam Fox
> devnull@adc.idt.com (whoever you are :-)
>
> ----------------- Details from some of the responses
>
> The major known issue is NFS over UDP: packet drops trigger the
> UDP cascading-error problem. If you are running NFS over UDP, I'd
> strongly suggest switching to NFS over TCP.
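>
> (For example, a rough sketch of what that looks like from a Solaris
> client; the filer name and paths here are made up:
>
> mount -F nfs -o vers=3,proto=tcp filer1:/vol/vol0/home /mnt/home
>
> HP-UX and older Solaris releases differ in option names and defaults,
> so check the NFS mount man page on each platform.)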
>
> I believe this is a buffering issue, where the Gig interface is
> "overloading" a buffer somewhere on its way to a 100 Mb interface, and
> packets get dropped. When that happens, the entire UDP datagram gets
> resent. Also check rsize/wsize on the client; it may work better set to
> 8 KB rather than the (default?) 32 KB.
>
> This problem is related to network hardware; I've seen it most often on
> Cisco gear, but it happens with others too. A quick nfsstat on the
> client(s) can show the problem: look for the retrans stats.
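>
> (On a Solaris client, something along these lines shows the relevant
> counters; exact output format varies by OS release:
>
> nfsstat -c
> nfsstat -m
>
> In the "-c" RPC stats, a retrans count that keeps climbing points at
> drops somewhere in the path, and "-m" shows the read/write sizes each
> mount actually ended up with.)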
> ---------
> We had a similar problem and solved it by changing the UDP transfer size.
> Have a look at bug ID 29146, which includes the following:
>
> "In Data ONTAP 5.3.4 and earlier releases the default UDP transfer size
> is
>
> 8192. In 5.3.5 and later releases the default is 32768. This larger value
> improves performance in some situations, but may cause problems in others.
>
> Problem 1 - 100-Mbit client is reading from Gigabit interface With the
> larger
> default, problems may result when the Gigabit interface on the filer is
> sending
> to a 100-Mbit client. Switches can drop packets on the outbound 100-Mbit
> port because too many packets become queued due to the difference in line
> speeds
> and the larger transfer size. Few complete 32K UDP datagrams are received
> by the client, and the client assumes the filer has gone away and
> increases its delay between retries. Poor performance may be seen. If a
> client does not specify
> the rsize and wsize parameters in the mount command, the filer's default
> UDP
> transfer size is used. If a client does specify rsize and wsize, and the
> values are larger than the default, then the default is used. This means
> that you
> may see problems after you upgrade from 5.3.4 (or earlier release) to
> 5.3.5 (or
> later release) if the resulting transfer size is 32768. Problem 2 -
> Multiple
> clients are reading from 100-Mbit interface When the UDP transfer size is
> 32768 and multiple clients are reading from a 100-Mbit interface, the
> driver
> transmit queue on the filer can overflow more readily and packets may be
> discarded.
> This can lead to erratic or poor performance on the clients."
>
> ---
> The first thing to check with GigE is typically the flow control
> settings on the switch, filer, and client side. It's usually best to
> have both transmit and receive flow control turned on; on the filer,
> that's called full flowcontrol. That isn't the default for some switch
> vendors and OS NICs.
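>
> (How you check this varies by gear. On the filer, ifconfig shows the
> per-interface settings, and on the Cisco side the command depends on
> CatOS vs. IOS; these are sketches, not exact syntax:
>
> ifconfig e10
> show port flowcontrol        (CatOS)
> show interfaces flowcontrol  (IOS)
>
> The goal is simply to confirm that what the switch advertises and what
> the filer and clients negotiate actually match.)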
>
> It's at least a start. Like most performance issues, you may
> have to work through several rounds of changes to find the culprit.
> Before you open a case with NTAP, which I strongly suggest, you may
> need to quantify your performance in some way; MB/sec is a good measure.
> By putting a number on your performance it's easier to figure out
> which areas could be wrong, and also easier to recognize when
> you can't get any more out of your environment. That's the thing
> with Gigabit: most clients can't max it out, so you will hit a ceiling
> for a single client that is less than the capacity of the wire.
> Slower forms of Ethernet were easier in this way, because you knew
> pretty much where the ceiling was.
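>
> (One crude way to put a MB/sec number on a client, assuming a large file
> already sitting on the filer mount; the path and size are made up:
>
> time dd if=/mnt/filer/bigfile of=/dev/null bs=32k
>
> Divide the file size by the elapsed time, and repeat the run to see how
> much client-side caching changes the picture.)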
>
> ---
>
> When you're using NFS over UDP with disparate speeds, i.e. gigabit to
> the switch, then fast ethernet to the clients, you can address
> performance problems in one of two ways:
>
> 1) If all of your client machines are on fast ethernet (yep, that means
> NO gigabit clients), the simple approach is to set the filer option
>
> options nfs.udp.xfersize 8192
>
> If you do have gigabit clients, doing the above will impact performance
> to them, in which case you have to do the more management-intensive
> step two...
>
> 2) If you have some gigabit client machines, add the following options
> to the mount command on each 100-Mbit client machine:
>
> wsize=8192,rsize=8192
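>
> (On a Solaris client, those options can be made permanent in /etc/vfstab
> with an entry along these lines; the filer name and paths are made up:
>
> filer1:/vol/vol0/proj  -  /proj  nfs  -  yes  rsize=8192,wsize=8192
>
> or in the automounter maps, so that the smaller transfer size is only
> applied to the fast-ethernet machines.)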
>
>
>
Summary: Gig Enet and Filer NFS problems [ In reply to ]
> Now that you have mentioned running Sun boxes, try the following
> in the /etc/system file on one of them and see if you perceive
> a performance improvement.
>
> set sq_max_size=2048
> set nstrpush = 90
> set nfs:nfs3_max_threads = 24
> set nfs:nfs3_nra = 10
> set nfs:nfs_max_threads = 24
> set nfs:nfs_nra = 10
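>
> (For reference: the nfs:nfs3_max_threads/nfs:nfs_max_threads and
> nfs:nfs3_nra/nfs:nfs_nra lines raise the per-mount async I/O thread
> count and read-ahead for NFSv3 and NFSv2 respectively, while
> sq_max_size and nstrpush are general STREAMS tuning. /etc/system
> changes only take effect after a reboot.)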