Mailing List Archive

Re: Directional network performance issues with Neutron + OpenvSwitch
Hi Daniel,

I followed that page; my Instances' MTU is lowered by the DHCP Agent but, same
result: poor network performance (both internally between Instances and when
trying to reach the Internet).

No matter whether I use "dnsmasq_config_file=/etc/neutron/dnsmasq-neutron.conf"
plus "dhcp-option-force=26,1400" for my Neutron DHCP agent, or not (i.e. MTU =
1500), the result is almost the same.
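
For completeness, this is roughly what that DHCP-based MTU setup looks like
here (file locations assume the stock Ubuntu/Havana packaging; adjust them if
yours differ):

# /etc/neutron/dhcp_agent.ini
[DEFAULT]
dnsmasq_config_file = /etc/neutron/dnsmasq-neutron.conf

# /etc/neutron/dnsmasq-neutron.conf
# DHCP option 26 = interface MTU; 1400 leaves room for the GRE overhead
dhcp-option-force=26,1400

(After a change like this, the DHCP agent needs a restart and the Instances
have to renew their leases before the new MTU shows up in "ip link".)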

I'll try VXLAN (or just VLANs) this weekend to see if I can get better
results...

Thanks!
Thiago




On 24 October 2013 17:38, Speichert,Daniel <djs428@drexel.edu> wrote:

> We managed to bring the upload speed back to maximum on the instances
> through the use of this guide:
>
> http://docs.openstack.org/trunk/openstack-network/admin/content/openvswitch_plugin.html
>
> Basically, the MTU needs to be lowered for GRE tunnels. It can be done
> with DHCP as explained in the new trunk manual.
>
> Regards,
> Daniel
Re: Directional network performance issues with Neutron + OpenvSwitch
Hi Thiago,

you have configured DHCP to push out an MTU of 1400. Can you confirm that the 1400 MTU is actually getting out to the instances by running 'ip link' on them?

There is an open problem where the veth used to connect the OVS and Linux bridges causes a performance drop on some kernels - https://bugs.launchpad.net/nova-project/+bug/1223267 . If you are using the LibvirtHybridOVSBridgeDriver VIF driver, can you try changing to LibvirtOpenVswitchDriver and repeating the iperf test between instances on different compute nodes?

What NICs (maker+model) are you using? You could try disabling any off-load functionality - 'ethtool -k <iface-used-for-gre>'.

What kernel are you using: 'uname -a'?
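
Concretely, something along these lines on an instance and on the compute
nodes would tell us (eth1 below is only a placeholder for whatever interface
carries the GRE endpoint):

# on an instance: check the MTU that DHCP actually applied
ip link show eth0

# on each compute node: list the offload settings of the GRE-facing NIC
ethtool -k eth1

# then try switching the usual suspects off one at a time and re-run iperf
ethtool -K eth1 gro off
ethtool -K eth1 tso off gso off

# and record the kernel for comparison between the nodes
uname -a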

Re, Darragh.

> Hi Daniel,

>
> I followed that page, my Instances MTU is lowered by DHCP Agent but, same
> result: poor network performance (internal between Instances and when
> trying to reach the Internet).
>
> No matter if I use "dnsmasq_config_file=/etc/neutron/dnsmasq-neutron.conf +
> "dhcp-option-force=26,1400"" for my Neutron DHCP agent, or not (i.e. MTU =
> 1500), the result is almost the same.
>
> I'll try VXLAN (or just VLANs) this weekend to see if I can get better
> results...
>
> Thanks!
> Thiago

Re: Directional network performance issues with Neutron + OpenvSwitch
Thiago,
It looks like you have a slightly different problem. I didn't have any slowdown in the connection between instances.

You might want to try this: https://ask.openstack.org/en/question/6140/quantum-neutron-gre-slow-performance/?answer=6320#post-id-6320

Regards,
Daniel

From: Martinx - ジェームズ [mailto:thiagocmartinsc@gmail.com]
Sent: Thursday, October 24, 2013 11:59 PM
To: Speichert,Daniel
Cc: Anne Gentle; openstack@lists.openstack.org
Subject: Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch

Hi Daniel,

I followed that page, my Instances MTU is lowered by DHCP Agent but, same result: poor network performance (internal between Instances and when trying to reach the Internet).

No matter if I use "dnsmasq_config_file=/etc/neutron/dnsmasq-neutron.conf + "dhcp-option-force=26,1400"" for my Neutron DHCP agent, or not (i.e. MTU = 1500), the result is almost the same.

I'll try VXLAN (or just VLANs) this weekend to see if I can get better results...

Thanks!
Thiago



On 24 October 2013 17:38, Speichert,Daniel <djs428@drexel.edu> wrote:
We managed to bring the upload speed back to maximum on the instances through the use of this guide:
http://docs.openstack.org/trunk/openstack-network/admin/content/openvswitch_plugin.html

Basically, the MTU needs to be lowered for GRE tunnels. It can be done with DHCP as explained in the new trunk manual.

Regards,
Daniel

From: annegentle@justwriteclick.com [mailto:annegentle@justwriteclick.com] On Behalf Of Anne Gentle
Sent: Thursday, October 24, 2013 12:08 PM
To: Martinx - ジェームズ
Cc: Speichert,Daniel; openstack@lists.openstack.org

Subject: Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch



On Thu, Oct 24, 2013 at 10:37 AM, Martinx - ジェームズ <thiagocmartinsc@gmail.com> wrote:
Precisely!

The doc currently says to disable Namespaces when using GRE; I never did this before, look:

http://docs.openstack.org/trunk/install-guide/install/apt/content/install-neutron.install-plugin.ovs.gre.html

But on this very same doc, they say to enable it... Who knows?! =P

http://docs.openstack.org/trunk/install-guide/install/apt/content/section_networking-routers-with-private-networks.html

I stick with Namespace enabled...


Just a reminder, /trunk/ links are works in progress, thanks for bringing the mismatch to our attention, and we already have a doc bug filed:

https://bugs.launchpad.net/openstack-manuals/+bug/1241056

Review this patch: https://review.openstack.org/#/c/53380/

Anne



Let me ask you something: when you enable ovs_use_veth, do Metadata and DHCP still work?!

Cheers!
Thiago

On 24 October 2013 12:22, Speichert,Daniel <djs428@drexel.edu> wrote:
Hello everyone,

It seems we also ran into the same issue.

We are running Ubuntu Saucy with OpenStack Havana from Ubuntu Cloud archives (precise-updates).

The download speed to the VMs increased from 5 Mbps to maximum after enabling ovs_use_veth. Upload speed from the VMs is still terrible (max 1 Mbps, usually 0.04 Mbps).

Here is the iperf between the instance and L3 agent (network node) inside namespace.

root@cloud:~# ip netns exec qrouter-a29e0200-d390-40d1-8cf7-7ac1cef5863a iperf -c 10.1.0.24 -r
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 10.1.0.24, TCP port 5001
TCP window size: 585 KByte (default)
------------------------------------------------------------
[ 7] local 10.1.0.1 port 37520 connected with 10.1.0.24 port 5001
[ ID] Interval Transfer Bandwidth
[ 7] 0.0-10.0 sec 845 MBytes 708 Mbits/sec
[ 6] local 10.1.0.1 port 5001 connected with 10.1.0.24 port 53006
[ 6] 0.0-31.4 sec 256 KBytes 66.7 Kbits/sec

We are using Neutron OpenVSwitch with GRE and namespaces.

A side question: the documentation says to disable namespaces with GRE and enable them with VLANs. It was always working well for us on Grizzly with GRE and namespaces and we could never get it to work without namespaces. Is there any specific reason why the documentation is advising to disable it?

Regards,
Daniel

From: Martinx - ジェームズ [mailto:thiagocmartinsc@gmail.com]
Sent: Thursday, October 24, 2013 3:58 AM
To: Aaron Rosen
Cc: openstack@lists.openstack.org

Subject: Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch

Hi Aaron,

Thanks for answering! =)

Lets work...

---

TEST #1 - iperf between Network Node and its Uplink router (Data Center's gateway "Internet") - OVS br-ex / eth2

# Tenant Namespace route table

root@net-node-1:~# ip netns exec qrouter-46cb8f7a-a3c5-4da7-ad69-4de63f7c34f1 ip route
default via 172.16.0.1 dev qg-50b615b7-c2
172.16.0.0/20 dev qg-50b615b7-c2 proto kernel scope link src 172.16.0.2
192.168.210.0/24 dev qr-a1376f61-05 proto kernel scope link src 192.168.210.1

# there is a "iperf -s" running at 172.16.0.1 "Internet", testing it

root@net-node-1:~# ip netns exec qrouter-46cb8f7a-a3c5-4da7-ad69-4de63f7c34f1 iperf -c 172.16.0.1
------------------------------------------------------------
Client connecting to 172.16.0.1, TCP port 5001
TCP window size: 22.9 KByte (default)
------------------------------------------------------------
[ 5] local 172.16.0.2 port 58342 connected with 172.16.0.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-10.0 sec 668 MBytes 559 Mbits/sec
---

---

TEST #2 - iperf on one instance to the Namespace of the L3 agent + uplink router

# iperf server running within Tenant's Namespace router

root@net-node-1:~# ip netns exec qrouter-46cb8f7a-a3c5-4da7-ad69-4de63f7c34f1 iperf -s

-

# from instance-1

ubuntu@instance-1:~$ ip route
default via 192.168.210.1 dev eth0 metric 100
192.168.210.0/24 dev eth0 proto kernel scope link src 192.168.210.2

# instance-1 performing tests against net-node-1 Namespace above

ubuntu@instance-1:~$ iperf -c 192.168.210.1
------------------------------------------------------------
Client connecting to 192.168.210.1, TCP port 5001
TCP window size: 21.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.210.2 port 43739 connected with 192.168.210.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 484 MBytes 406 Mbits/sec

# still on instance-1, now against "External IP" of its own Namespace / Router

ubuntu@instance-1:~$ iperf -c 172.16.0.2
------------------------------------------------------------
Client connecting to 172.16.0.2, TCP port 5001
TCP window size: 21.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.210.2 port 34703 connected with 172.16.0.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 520 MBytes 436 Mbits/sec

# still on instance-1, now against the Data Center UpLink Router

ubuntu@instance-1:~$ iperf -c 172.16.0.1
------------------------------------------------------------
Client connecting to 172.16.0.1, TCP port 5001
TCP window size: 21.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.210.4 port 38401 connected with 172.16.0.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 324 MBytes 271 Mbits/sec
---

This latest test shows only 271 Mbits/s! I think it should be at least 400~430 Mbits/s... Right?!

---

TEST #3 - Two instances on the same hypervisor

# iperf server

ubuntu@instance-2:~$ ip route
default via 192.168.210.1 dev eth0 metric 100
192.168.210.0/24 dev eth0 proto kernel scope link src 192.168.210.4

ubuntu@instance-2:~$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 192.168.210.4 port 5001 connected with 192.168.210.2 port 45800
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 4.61 GBytes 3.96 Gbits/sec

# iperf client

ubuntu@instance-1:~$ iperf -c 192.168.210.4
------------------------------------------------------------
Client connecting to 192.168.210.4, TCP port 5001
TCP window size: 21.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.210.2 port 45800 connected with 192.168.210.4 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 4.61 GBytes 3.96 Gbits/sec
---

---

TEST #4 - Two instances on different hypervisors - over GRE

root@instance-2:~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 192.168.210.4 port 5001 connected with 192.168.210.2 port 34640
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 237 MBytes 198 Mbits/sec


root@instance-1:~# iperf -c 192.168.210.4
------------------------------------------------------------
Client connecting to 192.168.210.4, TCP port 5001
TCP window size: 21.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.210.2 port 34640 connected with 192.168.210.4 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 237 MBytes 198 Mbits/sec
---

I just realized how slow my intra-cloud (VM-to-VM) communication is... :-/

---

TEST #5 - Two hypervisors - "GRE TUNNEL LAN" - OVS local_ip / remote_ip

# Same path of "TEST #4" but, testing the physical GRE path (where GRE traffic flows)

root@hypervisor-2:~$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 10.20.2.57 port 5001 connected with 10.20.2.53 port 51694
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 1.09 GBytes 939 Mbits/sec

root@hypervisor-1:~# iperf -c 10.20.2.57
------------------------------------------------------------
Client connecting to 10.20.2.57, TCP port 5001
TCP window size: 22.9 KByte (default)
------------------------------------------------------------
[ 3] local 10.20.2.53 port 51694 connected with 10.20.2.57 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.09 GBytes 939 Mbits/sec
---

About Test #5, I don't know why the GRE traffic (Test #4) doesn't reach 1Gbit/sec (only ~200Mbit/s ?), since its physical path is much faster (GIGALan). Plus, Test #3 shows a pretty fast speed when traffic flows only within a hypervisor (3.96Gbit/sec).

Tomorrow, I'll redo these tests with netperf.

NOTE: I'm using Open vSwitch 1.11.0, compiled for Ubuntu 12.04.3, via "dpkg-buildpackage" and installed via "Debian / Ubuntu way". If I downgrade to 1.10.2 from Havana Cloud Archive, same results... I can downgrade it, if you guys tell me to do so.

BTW, I'll install another "Region", based on Havana on Ubuntu 13.10, with exactly the same configuration as my current Havana + Ubuntu 12.04.3, on top of the same hardware, to see if the problem still persists.

Regards,
Thiago

On 23 October 2013 22:40, Aaron Rosen <arosen@nicira.com> wrote:


On Mon, Oct 21, 2013 at 11:52 PM, Martinx - ジェームズ <thiagocmartinsc@gmail.com> wrote:
James,

I think I'm hitting this problem.

I'm using "Per-Tenant Routers with Private Networks", GRE tunnels and L3+DHCP Network Node.

The connectivity from behind my Instances is very slow. It takes an eternity to finish "apt-get update".


I'm curious if you can do the following tests to help pinpoint the bottleneck:

Run iperf or netperf between:
two instances on the same hypervisor - this will determine if it's a virtualization driver issue if the performance is bad.
two instances on different hypervisors.
one instance to the namespace of the l3 agent.






If I run "apt-get update" from within tenant's Namespace, it goes fine.

If I enable "ovs_use_veth", Metadata (and/or DHCP) stops working and I am unable to start new Ubuntu Instances and log in to them... Look:

--
cloud-init start running: Tue, 22 Oct 2013 05:57:39 +0000. up 4.01 seconds
2013-10-22 06:01:42,989 - util.py[WARNING]: 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [3/120s]: url error [[Errno 113] No route to host]
2013-10-22 06:01:45,988 - util.py[WARNING]: 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [6/120s]: url error [[Errno 113] No route to host]
--


Do you see anything interesting in the neutron-metadata-agent log? Or it looks like your instance doesn't have a route to the default gw?


Is this problem still around?!

Should I stay away from GRE tunnels when with Havana + Ubuntu 12.04.3?

Is it possible to re-enable Metadata when ovs_use_veth = true ?

Thanks!
Thiago

On 3 October 2013 06:27, James Page <james.page@ubuntu.com> wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
On 02/10/13 22:49, James Page wrote:
>> sudo ip netns exec qrouter-d3baf1b1-55ee-42cb-a3f6-9629288e3221
>>> traceroute -n 10.5.0.2 -p 44444 --mtu traceroute to 10.5.0.2
>>> (10.5.0.2), 30 hops max, 65000 byte packets 1 10.5.0.2 0.950
>>> ms F=1500 0.598 ms 0.566 ms
>>>
>>> The PMTU from the l3 gateway to the instance looks OK to me.
> I spent a bit more time debugging this; performance from within
> the router netns on the L3 gateway node looks good in both
> directions when accessing via the tenant network (10.5.0.2) over
> the qr-XXXXX interface, but when accessing through the external
> network from within the netns I see the same performance choke
> upstream into the tenant network.
>
> Which would indicate that my problem lies somewhere around the
> qg-XXXXX interface in the router netns - just trying to figure out
> exactly what - maybe iptables is doing something wonky?
OK - I found a fix but I'm not sure why this makes a difference;
neither my l3-agent nor dhcp-agent configuration had 'ovs_use_veth =
True'; I switched this on, cleared everything down, rebooted, and now
I see symmetric good performance across all neutron routers.

This would point to some sort of underlying bug when ovs_use_veth = False.


- --
James Page
Ubuntu and Debian Developer
james.page@ubuntu.com
jamespage@debian.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
iQIcBAEBCAAGBQJSTTh6AAoJEL/srsug59jDmpEP/jaB5/yn9+Xm12XrVu0Q3IV5
fLGOuBboUgykVVsfkWccI/oygNlBaXIcDuak/E4jxPcoRhLAdY1zpX8MQ8wSsGKd
CjSeuW8xxnXubdfzmsCKSs3FCIBhDkSYzyiJd/raLvCfflyy8Cl7KN2x22mGHJ6z
qZ9APcYfm9qCVbEssA3BHcUL+st1iqMJ0YhVZBk03+QEXaWu3FFbjpjwx3X1ZvV5
Vbac7enqy7Lr4DSAIJVldeVuRURfv3YE3iJZTIXjaoUCCVTQLm5OmP9TrwBNHLsA
7W+LceQri+Vh0s4dHPKx5MiHsV3RCydcXkSQFYhx7390CXypMQ6WwXEY/a8Egssg
SuxXByHwEcQFa+9sCwPQ+RXCmC0O6kUi8EPmwadjI5Gc1LoKw5Wov/SEen86fDUW
P9pRXonseYyWN9I4MT4aG1ez8Dqq/SiZyWBHtcITxKI2smD92G9CwWGo4L9oGqJJ
UcHRwQaTHgzy3yETPO25hjax8ZWZGNccHBixMCZKegr9p2dhR+7qF8G7mRtRQLxL
0fgOAExn/SX59ZT4RaYi9fI6Gng13RtSyI87CJC/50vfTmqoraUUK1aoSjIY4Dt+
DYEMMLp205uLEj2IyaNTzykR0yh3t6dvfpCCcRA/xPT9slfa0a7P8LafyiWa4/5c
jkJM4Y1BUV+2L5Rrf3sc
=4lO4
-----END PGP SIGNATURE-----

Re: Directional network performance issues with Neutron + OpenvSwitch
Hi Darragh,

Yes, Instances are getting MTU 1400.

I'm using LibvirtHybridOVSBridgeDriver at my Compute Nodes. I'll check bug
1223267 right now!


The LibvirtOpenVswitchDriver doesn't work, look:

http://paste.openstack.org/show/49709/

http://paste.openstack.org/show/49710/
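
For the record, the only thing I changed to test it was the VIF driver option
in nova.conf on the compute nodes - roughly this, assuming I have the Havana
option name right:

# /etc/nova/nova.conf (compute nodes)
# was: libvirt_vif_driver = nova.virt.libvirt.vif.LibvirtHybridOVSBridgeDriver
libvirt_vif_driver = nova.virt.libvirt.vif.LibvirtOpenVswitchDriver

(nova-compute needs a restart after that change, of course.)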


My NICs are "RTL8111/8168/8411 PCI Express Gigabit Ethernet", Hypervisors
motherboard are MSI-890FXA-GD70.

The command "ethtool -K eth1 gro off" did not have any effect on the
communication between instances on different hypervisors; it is still poor,
around 248 Mbit/sec, even though its physical path reaches 1 Gbit/s (where
the GRE tunnel is built).

My Linux version is "Linux hypervisor-1 3.8.0-32-generic
#47~precise1-Ubuntu", with the same kernel on the Network Node and the other
nodes too (Ubuntu 12.04.3 installed from scratch for this Havana deployment).

The only difference I can see right now between my two hypervisors is that
the second one is just a spare machine with a slow CPU, but I don't think it
has a negative impact on the network throughput, since I have only 1 Instance
running on it (plus a qemu-nbd process eating 90% of its CPU). I'll replace
this CPU tomorrow and redo these tests, but I don't think that this is the
source of my problem. The motherboards of the two hypervisors are identical,
with a single 3Com (managed) switch connecting the two.

Thanks!
Thiago


On 25 October 2013 07:15, Darragh O'Reilly <dara2002-openstack@yahoo.com> wrote:

> Hi Thiago,
>
> you have configured DHCP to push out a MTU of 1400. Can you confirm that
> the 1400 MTU is actually getting out to the instances by running 'ip link'
> on them?
>
> There is an open problem where the veth used to connect the OVS and Linux
> bridges causes a performance drop on some kernels -
> https://bugs.launchpad.net/nova-project/+bug/1223267 . If you are using
> the LibvirtHybridOVSBridgeDriver VIF driver, can you try changing to
> LibvirtOpenVswitchDriver and repeat the iperf test between instances on
> different compute-nodes.
>
> What NICs (maker+model) are you using? You could try disabling any
> off-load functionality - 'ethtool -k <iface-used-for-gre>'.
>
> What kernal are you using: 'uname -a'?
>
> Re, Darragh.
>
> > Hi Daniel,
>
> >
> > I followed that page, my Instances MTU is lowered by DHCP Agent but, same
> > result: poor network performance (internal between Instances and when
> > trying to reach the Internet).
> >
> > No matter if I use
> "dnsmasq_config_file=/etc/neutron/dnsmasq-neutron.conf +
> > "dhcp-option-force=26,1400"" for my Neutron DHCP agent, or not (i.e. MTU
> =
> > 1500), the result is almost the same.
> >
> > I'll try VXLAN (or just VLANs) this weekend to see if I can get better
> > results...
> >
> > Thanks!
> > Thiago
>
Re: Directional network performance issues with Neutron + OpenvSwitch
Hi Thiago,

for the VIF error: you will need to change qemu.conf as described here:
http://openvswitch.org/openstack/documentation/
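
From memory, the relevant part is extending cgroup_device_acl in
/etc/libvirt/qemu.conf so qemu is allowed to open the tap device - something
like the following, but do check that page in case it has changed:

# /etc/libvirt/qemu.conf
cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm", "/dev/kqemu",
    "/dev/rtc", "/dev/hpet", "/dev/net/tun"
]

Then restart libvirt-bin (and nova-compute) on the compute node.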

Re, Darragh.




Re: Directional network performance issues with Neutron + OpenvSwitch
Daniel,

Honestly, I think I have two problems. The first one is related to "Instances
trying to reach the Internet": that traffic passes through the Network Node
(L3 + Namespace) and is very, very slow. It is impossible to run "apt-get
update" from within an Instance, for example; it takes an eternity to finish.
No MTU problems were detected with tcpdump at the L3 agent, so it must be
something else.

The second problem is related to the communication between two Instances on
different hypervisors, which I only realized after doing more tests.

Do you think that those two problems are, in fact, the same (or related)?

Thanks!
Thiago

On 25 October 2013 10:51, Speichert,Daniel <djs428@drexel.edu> wrote:

> Thiago,
>
> It looks like you have a slightly different problem. I didn't have any
> slowdown in the connection between instances.
>
> You might want to try this:
> https://ask.openstack.org/en/question/6140/quantum-neutron-gre-slow-performance/?answer=6320#post-id-6320
>
> Regards,
> Daniel
>
> *From:* Martinx - ジェームズ [mailto:thiagocmartinsc@gmail.com]
> *Sent:* Thursday, October 24, 2013 11:59 PM
> *To:* Speichert,Daniel
> *Cc:* Anne Gentle; openstack@lists.openstack.org
>
> *Subject:* Re: [Openstack] Directional network performance issues with
> Neutron + OpenvSwitch****
>
> ** **
>
> Hi Daniel,****
>
> ** **
>
> I followed that page, my Instances MTU is lowered by DHCP Agent but, same
> result: poor network performance (internal between Instances and when
> trying to reach the Internet).****
>
> ** **
>
> No matter if I use "dnsmasq_config_file=/etc/neutron/dnsmasq-neutron.conf
> + "dhcp-option-force=26,1400"" for my Neutron DHCP agent, or not (i.e. MTU
> = 1500), the result is almost the same.****
>
> ** **
>
> I'll try VXLAN (or just VLANs) this weekend to see if I can get better
> results...****
>
> ** **
>
> Thanks!****
>
> Thiago****
>
> ** **
>
> ** **
>
> ** **
>
> On 24 October 2013 17:38, Speichert,Daniel <djs428@drexel.edu> wrote:****
>
> We managed to bring the upload speed back to maximum on the instances
> through the use of this guide:****
>
>
> http://docs.openstack.org/trunk/openstack-network/admin/content/openvswitch_plugin.html
> ****
>
> ****
>
> Basically, the MTU needs to be lowered for GRE tunnels. It can be done
> with DHCP as explained in the new trunk manual.****
>
> ****
>
> Regards,****
>
> Daniel****
>
> ****
>
> *From:* annegentle@justwriteclick.com [mailto:
> annegentle@justwriteclick.com] *On Behalf Of *Anne Gentle
> *Sent:* Thursday, October 24, 2013 12:08 PM
> *To:* Martinx - ジェームズ
> *Cc:* Speichert,Daniel; openstack@lists.openstack.org****
>
>
> *Subject:* Re: [Openstack] Directional network performance issues with
> Neutron + OpenvSwitch****
>
> ****
>
> ****
>
> ****
>
> On Thu, Oct 24, 2013 at 10:37 AM, Martinx - ジェームズ <
> thiagocmartinsc@gmail.com> wrote:****
>
> Precisely!****
>
> ****
>
> The doc currently says to disable Namespace when using GRE, never did this
> before, look:****
>
> ****
>
>
> http://docs.openstack.org/trunk/install-guide/install/apt/content/install-neutron.install-plugin.ovs.gre.html
> ****
>
> ****
>
> But on this very same doc, they say to enable it... Who knows?! =P****
>
> ****
>
>
> http://docs.openstack.org/trunk/install-guide/install/apt/content/section_networking-routers-with-private-networks.html
> ****
>
> ****
>
> I stick with Namespace enabled...****
>
> ****
>
> ****
>
> Just a reminder, /trunk/ links are works in progress, thanks for bringing
> the mismatch to our attention, and we already have a doc bug filed:****
>
> ****
>
> https://bugs.launchpad.net/openstack-manuals/+bug/1241056****
>
> ****
>
> Review this patch: https://review.openstack.org/#/c/53380/****
>
> ****
>
> Anne****
>
> ****
>
> ****
>
> ****
>
> Let me ask you something, when you enable ovs_use_veth, que Metadata and
> DHCP still works?!****
>
> ****
>
> Cheers!****
>
> Thiago****
>
> ****
>
> On 24 October 2013 12:22, Speichert,Daniel <djs428@drexel.edu> wrote:****
>
> Hello everyone,****
>
> ****
>
> It seems we also ran into the same issue.****
>
> ****
>
> We are running Ubuntu Saucy with OpenStack Havana from Ubuntu Cloud
> archives (precise-updates).****
>
> ****
>
> The download speed to the VMs increased from 5 Mbps to maximum after
> enabling ovs_use_veth. Upload speed from the VMs is still terrible (max 1
> Mbps, usually 0.04 Mbps).****
>
> ****
>
> Here is the iperf between the instance and L3 agent (network node) inside
> namespace.****
>
> ****
>
> root@cloud:~# ip netns exec qrouter-a29e0200-d390-40d1-8cf7-7ac1cef5863a
> iperf -c 10.1.0.24 -r****
>
> ------------------------------------------------------------****
>
> Server listening on TCP port 5001****
>
> TCP window size: 85.3 KByte (default)****
>
> ------------------------------------------------------------****
>
> ------------------------------------------------------------****
>
> Client connecting to 10.1.0.24, TCP port 5001****
>
> TCP window size: 585 KByte (default)****
>
> ------------------------------------------------------------****
>
> [ 7] local 10.1.0.1 port 37520 connected with 10.1.0.24 port 5001****
>
> [ ID] Interval Transfer Bandwidth****
>
> [ 7] 0.0-10.0 sec 845 MBytes 708 Mbits/sec****
>
> [ 6] local 10.1.0.1 port 5001 connected with 10.1.0.24 port 53006****
>
> [ 6] 0.0-31.4 sec 256 KBytes 66.7 Kbits/sec****
>
> ****
>
> We are using Neutron OpenVSwitch with GRE and namespaces.****
>
>
> A side question: the documentation says to disable namespaces with GRE and
> enable them with VLANs. It was always working well for us on Grizzly with
> GRE and namespaces and we could never get it to work without namespaces. Is
> there any specific reason why the documentation is advising to disable it?
> ****
>
> ****
>
> Regards,****
>
> Daniel****
>
> ****
>
> *From:* Martinx - ジェームズ [mailto:thiagocmartinsc@gmail.com]
> *Sent:* Thursday, October 24, 2013 3:58 AM
> *To:* Aaron Rosen
> *Cc:* openstack@lists.openstack.org****
>
>
> *Subject:* Re: [Openstack] Directional network performance issues with
> Neutron + OpenvSwitch****
>
> ****
>
> Hi Aaron,****
>
> ****
>
> Thanks for answering! =)****
>
> ****
>
> Lets work...****
>
> ****
>
> ---****
>
> ****
>
> TEST #1 - iperf between Network Node and its Uplink router (Data Center's
> gateway "Internet") - OVS br-ex / eth2****
>
> ****
>
> # Tenant Namespace route table****
>
> ****
>
> root@net-node-1:~# ip netns exec
> qrouter-46cb8f7a-a3c5-4da7-ad69-4de63f7c34f1 ip route****
>
> default via 172.16.0.1 dev qg-50b615b7-c2 ****
>
> 172.16.0.0/20 dev qg-50b615b7-c2 proto kernel scope link src
> 172.16.0.2 ****
>
> 192.168.210.0/24 dev qr-a1376f61-05 proto kernel scope link src
> 192.168.210.1 ****
>
> ****
>
> # there is a "iperf -s" running at 172.16.0.1 "Internet", testing it****
>
> ****
>
> root@net-node-1:~# ip netns exec
> qrouter-46cb8f7a-a3c5-4da7-ad69-4de63f7c34f1 iperf -c 172.16.0.1****
>
> ------------------------------------------------------------****
>
> Client connecting to 172.16.0.1, TCP port 5001****
>
> TCP window size: 22.9 KByte (default)****
>
> ------------------------------------------------------------****
>
> [ 5] local 172.16.0.2 port 58342 connected with 172.16.0.1 port 5001****
>
> [ ID] Interval Transfer Bandwidth****
>
> [ 5] 0.0-10.0 sec 668 MBytes 559 Mbits/sec****
>
> ---****
>
> ****
>
> ---****
>
> ****
>
> TEST #2 - iperf on one instance to the Namespace of the L3 agent + uplink
> router****
>
> ****
>
> # iperf server running within Tenant's Namespace router****
>
> ****
>
> root@net-node-1:~# ip netns exec
> qrouter-46cb8f7a-a3c5-4da7-ad69-4de63f7c34f1 iperf -s****
>
> ****
>
> -****
>
> ****
>
> # from instance-1****
>
> ****
>
> ubuntu@instance-1:~$ ip route****
>
> default via 192.168.210.1 dev eth0 metric 100 ****
>
> 192.168.210.0/24 dev eth0 proto kernel scope link src 192.168.210.2 ***
> *
>
> ****
>
> # instance-1 performing tests against net-node-1 Namespace above****
>
> ****
>
> ubuntu@instance-1:~$ iperf -c 192.168.210.1****
>
> ------------------------------------------------------------****
>
> Client connecting to 192.168.210.1, TCP port 5001****
>
> TCP window size: 21.0 KByte (default)****
>
> ------------------------------------------------------------****
>
> [ 3] local 192.168.210.2 port 43739 connected with 192.168.210.1 port
> 5001****
>
> [ ID] Interval Transfer Bandwidth****
>
> [ 3] 0.0-10.0 sec 484 MBytes 406 Mbits/sec****
>
> ****
>
> # still on instance-1, now against "External IP" of its own Namespace /
> Router****
>
> ****
>
> ubuntu@instance-1:~$ iperf -c 172.16.0.2****
>
> ------------------------------------------------------------****
>
> Client connecting to 172.16.0.2, TCP port 5001****
>
> TCP window size: 21.0 KByte (default)****
>
> ------------------------------------------------------------****
>
> [ 3] local 192.168.210.2 port 34703 connected with 172.16.0.2 port 5001**
> **
>
> [ ID] Interval Transfer Bandwidth****
>
> [ 3] 0.0-10.0 sec 520 MBytes 436 Mbits/sec****
>
> ****
>
> # still on instance-1, now against the Data Center UpLink Router****
>
> ****
>
> ubuntu@instance-1:~$ iperf -c 172.16.0.1****
>
> ------------------------------------------------------------****
>
> Client connecting to 172.16.0.1, TCP port 5001****
>
> TCP window size: 21.0 KByte (default)****
>
> ------------------------------------------------------------****
>
> [ 3] local 192.168.210.4 port 38401 connected with 172.16.0.1 port 5001**
> **
>
> [ ID] Interval Transfer Bandwidth****
>
> [ 3] 0.0-10.0 sec * 324 MBytes 271 Mbits/sec*****
>
> ---****
>
> ****
>
> This latest test shows only 271 Mbits/s! I think it should be at least,
> 400~430 MBits/s... Right?!****
>
> ****
>
> ---****
>
> ****
>
> TEST #3 - Two instances on the same hypervisor****
>
> ****
>
> # iperf server****
>
> ****
>
> ubuntu@instance-2:~$ ip route****
>
> default via 192.168.210.1 dev eth0 metric 100 ****
>
> 192.168.210.0/24 dev eth0 proto kernel scope link src 192.168.210.4 ***
> *
>
> ****
>
> ubuntu@instance-2:~$ iperf -s****
>
> ------------------------------------------------------------****
>
> Server listening on TCP port 5001****
>
> TCP window size: 85.3 KByte (default)****
>
> ------------------------------------------------------------****
>
> [ 4] local 192.168.210.4 port 5001 connected with 192.168.210.2 port
> 45800****
>
> [ ID] Interval Transfer Bandwidth****
>
> [ 4] 0.0-10.0 sec 4.61 GBytes 3.96 Gbits/sec****
>
> ****
>
> # iperf client****
>
> ****
>
> ubuntu@instance-1:~$ iperf -c 192.168.210.4****
>
> ------------------------------------------------------------****
>
> Client connecting to 192.168.210.4, TCP port 5001****
>
> TCP window size: 21.0 KByte (default)****
>
> ------------------------------------------------------------****
>
> [ 3] local 192.168.210.2 port 45800 connected with 192.168.210.4 port
> 5001****
>
> [ ID] Interval Transfer Bandwidth****
>
> [ 3] 0.0-10.0 sec 4.61 GBytes 3.96 Gbits/sec****
>
> ---****
>
> ****
>
> ---****
>
> ****
>
> TEST #4 - Two instances on different hypervisors - over GRE****
>
> ****
>
> root@instance-2:~# iperf -s****
>
> ------------------------------------------------------------****
>
> Server listening on TCP port 5001****
>
> TCP window size: 85.3 KByte (default)****
>
> ------------------------------------------------------------****
>
> [ 4] local 192.168.210.4 port 5001 connected with 192.168.210.2 port
> 34640****
>
> [ ID] Interval Transfer Bandwidth****
>
> [ 4] 0.0-10.0 sec 237 MBytes 198 Mbits/sec****
>
> ****
>
> ****
>
> root@instance-1:~# iperf -c 192.168.210.4****
>
> ------------------------------------------------------------****
>
> Client connecting to 192.168.210.4, TCP port 5001****
>
> TCP window size: 21.0 KByte (default)****
>
> ------------------------------------------------------------****
>
> [ 3] local 192.168.210.2 port 34640 connected with 192.168.210.4 port
> 5001****
>
> [ ID] Interval Transfer Bandwidth****
>
> [ 3] 0.0-10.0 sec 237 MBytes 198 Mbits/sec****
>
> ---****
>
> ****
>
> I just realized how slow is my intra-cloud (intra-VM) communication...
> :-/****
>
> ****
>
> ---****
>
> ****
>
> TEST #5 - Two hypervisors - "GRE TUNNEL LAN" - OVS local_ip / remote_ip***
> *
>
> ****
>
> # Same path of "TEST #4" but, testing the physical GRE path (where GRE
> traffic flows)****
>
> ****
>
> root@hypervisor-2:~$ iperf -s****
>
> ------------------------------------------------------------****
>
> Server listening on TCP port 5001****
>
> TCP window size: 85.3 KByte (default)****
>
> ------------------------------------------------------------****
>
> n[ 4] local 10.20.2.57 port 5001 connected with 10.20.2.53 port 51694****
>
> [ ID] Interval Transfer Bandwidth****
>
> [ 4] 0.0-10.0 sec 1.09 GBytes 939 Mbits/sec****
>
> ****
>
> root@hypervisor-1:~# iperf -c 10.20.2.57****
>
> ------------------------------------------------------------****
>
> Client connecting to 10.20.2.57, TCP port 5001****
>
> TCP window size: 22.9 KByte (default)****
>
> ------------------------------------------------------------****
>
> [ 3] local 10.20.2.53 port 51694 connected with 10.20.2.57 port 5001****
>
> [ ID] Interval Transfer Bandwidth****
>
> [ 3] 0.0-10.0 sec 1.09 GBytes 939 Mbits/sec****
>
> ---****
>
> ****
>
> About Test #5: I don't know why the GRE traffic (Test #4) doesn't reach
> 1 Gbit/sec (only ~200 Mbit/s?), since its physical path is much faster
> (Gigabit LAN). Plus, Test #3 shows a pretty fast speed when traffic flows only
> within a hypervisor (3.96 Gbit/sec).
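> (As a rough sanity check on the encapsulation overhead itself - an
> assumption-laden sketch, not something measured here: each GRE packet only
> adds an outer IP header (20 bytes) plus a GRE header (4-8 bytes), so with a
> 1400-byte inner MTU the extra headers cost on the order of 2-4% of line
> rate. That is nowhere near the gap between 939 Mbit/s (Test #5) and ~200
> Mbit/s (Test #4), which suggests the loss comes from per-packet processing
> cost (e.g. lost NIC offloads) rather than from the tunnel bytes themselves.)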
>
> ****
>
> Tomorrow, I'll do these tests with netperf.
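> (A minimal netperf pair, just as a sketch - it reuses instance-2's address
> from the tests above, and netperf's control connection uses TCP port 12865
> by default:
>
> # on the server instance
> ubuntu@instance-2:~$ netserver
>
> # on the client instance
> ubuntu@instance-1:~$ netperf -H 192.168.210.4 -t TCP_STREAM -l 10
> )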
>
> ****
>
> NOTE: I'm using Open vSwitch 1.11.0, compiled for Ubuntu 12.04.3, via
> "dpkg-buildpackage" and installed via "Debian / Ubuntu way". If I downgrade
> to 1.10.2 from Havana Cloud Archive, same results... I can downgrade it, if
> you guys tell me to do so.****
>
> ****
>
> BTW, I'll install another "Region", based on Havana on Ubuntu 13.10, with
> exactly the same configurations as my current Havana + Ubuntu 12.04.3, on
> top of the same hardware, to see if the problem still persists.
>
> ****
>
> Regards,****
>
> Thiago****
>
> ****
>
> On 23 October 2013 22:40, Aaron Rosen <arosen@nicira.com> wrote:****
>
> ****
>
> ****
>
> On Mon, Oct 21, 2013 at 11:52 PM, Martinx - ジェームズ <
> thiagocmartinsc@gmail.com> wrote:****
>
> James,****
>
> ****
>
> I think I'm hitting this problem.****
>
> ****
>
> I'm using "Per-Tenant Routers with Private Networks", GRE tunnels and
> L3+DHCP Network Node.****
>
> ****
>
> The connectivity from behind my Instances is very slow. It takes an
> eternity to finish "apt-get update".****
>
> ****
>
> ****
>
> I'm curious if you can do the following tests to help pinpoint the bottleneck:
>
> ****
>
> Run iperf or netperf between:****
>
> two instances on the same hypervisor - if the performance is bad here, it
> points to a virtualization driver issue.
>
> two instances on different hypervisors.****
>
> one instance to the namespace of the l3 agent. ****
>
> ****
>
> ****
>
> ****
>
> ****
>
> ****
>
> ****
>
> If I run "apt-get update" from within the tenant's Namespace, it goes fine.
>
> ****
>
> If I enable "ovs_use_veth", Metadata (and/or DHCP) stops working and I am
> unable to start new Ubuntu Instances and log into them... Look:
>
> ****
>
> --****
>
> cloud-init start running: Tue, 22 Oct 2013 05:57:39 +0000. up 4.01 seconds
> ****
>
> 2013-10-22 06:01:42,989 - util.py[WARNING]: '
> http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [3/120s]:
> url error [[Errno 113] No route to host]****
>
> 2013-10-22 06:01:45,988 - util.py[WARNING]: '
> http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [6/120s]:
> url error [[Errno 113] No route to host]****
>
> --****
>
> ****
>
> ****
>
> Do you see anything interesting in the neutron-metadata-agent log? Or does it
> look like your instance doesn't have a route to the default gw?
>
> ****
>
> ****
>
> Is this problem still around?!****
>
> ****
>
> Should I stay away from GRE tunnels when using Havana + Ubuntu 12.04.3?
>
> ****
>
> Is it possible to re-enable Metadata when ovs_use_veth = true ?****
>
> ****
>
> Thanks!****
>
> Thiago****
>
> ****
>
> On 3 October 2013 06:27, James Page <james.page@ubuntu.com> wrote:****
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256****
>
> On 02/10/13 22:49, James Page wrote:
> >> sudo ip netns exec qrouter-d3baf1b1-55ee-42cb-a3f6-9629288e3221
> >>> traceroute -n 10.5.0.2 -p 44444 --mtu traceroute to 10.5.0.2
> >>> (10.5.0.2), 30 hops max, 65000 byte packets 1 10.5.0.2 0.950
> >>> ms F=1500 0.598 ms 0.566 ms
> >>>
> >>> The PMTU from the l3 gateway to the instance looks OK to me.
> > I spent a bit more time debugging this; performance from within
> > the router netns on the L3 gateway node looks good in both
> > directions when accessing via the tenant network (10.5.0.2) over
> > the qr-XXXXX interface, but when accessing through the external
> > network from within the netns I see the same performance choke
> > upstream into the tenant network.
> >
> > Which would indicate that my problem lies somewhere around the
> > qg-XXXXX interface in the router netns - just trying to figure out
> > exactly what - maybe iptables is doing something wonky?****
>
> OK - I found a fix but I'm not sure why this makes a difference;
> neither my l3-agent nor dhcp-agent configuration had 'ovs_use_veth =
> True'; I switched this on, cleared everything down, rebooted and now
> I see symmetric, good performance across all neutron routers.
>
> This would point to some sort of underlying bug when ovs_use_veth = False.
> ****
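> (For reference, that flag lives in the L3 and DHCP agent configuration; the
> file paths below are the usual Ubuntu ones, so treat them as an assumption:
>
> # /etc/neutron/l3_agent.ini and /etc/neutron/dhcp_agent.ini
> ovs_use_veth = True
>
> followed by a restart of neutron-l3-agent and neutron-dhcp-agent.)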
>
>
>
> - --
> James Page
> Ubuntu and Debian Developer
> james.page@ubuntu.com
> jamespage@debian.org
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.14 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/****
>
> iQIcBAEBCAAGBQJSTTh6AAoJEL/srsug59jDmpEP/jaB5/yn9+Xm12XrVu0Q3IV5
> fLGOuBboUgykVVsfkWccI/oygNlBaXIcDuak/E4jxPcoRhLAdY1zpX8MQ8wSsGKd
> CjSeuW8xxnXubdfzmsCKSs3FCIBhDkSYzyiJd/raLvCfflyy8Cl7KN2x22mGHJ6z
> qZ9APcYfm9qCVbEssA3BHcUL+st1iqMJ0YhVZBk03+QEXaWu3FFbjpjwx3X1ZvV5
> Vbac7enqy7Lr4DSAIJVldeVuRURfv3YE3iJZTIXjaoUCCVTQLm5OmP9TrwBNHLsA
> 7W+LceQri+Vh0s4dHPKx5MiHsV3RCydcXkSQFYhx7390CXypMQ6WwXEY/a8Egssg
> SuxXByHwEcQFa+9sCwPQ+RXCmC0O6kUi8EPmwadjI5Gc1LoKw5Wov/SEen86fDUW
> P9pRXonseYyWN9I4MT4aG1ez8Dqq/SiZyWBHtcITxKI2smD92G9CwWGo4L9oGqJJ
> UcHRwQaTHgzy3yETPO25hjax8ZWZGNccHBixMCZKegr9p2dhR+7qF8G7mRtRQLxL
> 0fgOAExn/SX59ZT4RaYi9fI6Gng13RtSyI87CJC/50vfTmqoraUUK1aoSjIY4Dt+
> DYEMMLp205uLEj2IyaNTzykR0yh3t6dvfpCCcRA/xPT9slfa0a7P8LafyiWa4/5c
> jkJM4Y1BUV+2L5Rrf3sc
> =4lO4****
>
> -----END PGP SIGNATURE-----
>
> _______________________________________________
> Mailing list:
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
> Post to : openstack@lists.openstack.org
> Unsubscribe :
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack****
>
> ** **
>
Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
I think I can say... "YAY!!" :-D

With "LibvirtOpenVswitchDriver" my internal communication is the double
now! It goes from ~200 (with LibvirtHybridOVSBridgeDriver) to
*400Mbit/s*(with LibvirtOpenVswitchDriver)! Still far from 1Gbit/s (my
physical path
limit) but, more acceptable now.

The command "ethtool -K eth1 gro off" still makes no difference.

So, there is only 1 remaining problem: when traffic passes through L3 /
Namespace, it is still useless. Even the SSH connection into my Instances,
via their Floating IPs, is slow as hell; sometimes it just stops responding
for a few seconds, and comes back online again "out-of-nothing"...

I just detected a weird "behavior": when I run "apt-get update" from
instance-1, it is slow as I said, plus its ssh connection (where I'm
running apt-get update) stops responding right after I run "apt-get
update" AND *all my other ssh connections also stop working too!* For a
few seconds... This means that when I run "apt-get update" from within
instance-1, the SSH session of instance-2 is affected too!! There is
something pretty bad going on at L3 / Namespace.

BTW, do you think that ~400 Mbit/s intra-VM communication (GRE tunnel)
on top of a 1 Gbit Ethernet is acceptable?! It is still less than half...

Thank you!
Thiago

On 25 October 2013 12:28, Darragh O'Reilly <dara2002-openstack@yahoo.com>wrote:

> Hi Thiago,
>
> for the VIF error: you will need to change qemu.conf as described here:
> http://openvswitch.org/openstack/documentation/
>
> Re, Darragh.
>
>
> On Friday, 25 October 2013, 15:14, Martinx - ジェームズ <
> thiagocmartinsc@gmail.com> wrote:
>
> Hi Darragh,
>
> Yes, Instances are getting MTU 1400.
>
> I'm using LibvirtHybridOVSBridgeDriver at my Compute Nodes. I'll check BG
> 1223267 right now!
>
>
> The LibvirtOpenVswitchDriver doesn't work, look:
>
> http://paste.openstack.org/show/49709/
>
> http://paste.openstack.org/show/49710/
>
>
> My NICs are "RTL8111/8168/8411 PCI Express Gigabit Ethernet", Hypervisors
> motherboard are MSI-890FXA-GD70.
>
> The command "ethtool -K eth1 gro off" did not had any effect on the
> communication between instances on different hypervisors, still poor,
> around 248Mbit/sec, when its physical path reach 1Gbit/s (where GRE is
> built).
>
> My Linux version is "Linux hypervisor-1 3.8.0-32-generic
> #47~precise1-Ubuntu", same kernel on Network Node" and others nodes too
> (Ubuntu 12.04.3 installed from scratch for this Havana deployment).
>
> The only difference I can see right now, between my two hypervisors, is
> that my second is just a spare machine, with a slow CPU but, I don't think
> it will have a negative impact at the network throughput, since I have only
> 1 Instance running into it (plus a qemu-nbd process eating 90% of its CPU).
> I'll replace this CPU tomorrow, to redo this tests again but, I don't think
> that this is the source of my problem. The MOBOs of two hypervisors
> are identical, 1 3Com (manageable) switch connecting the two.
>
> Thanks!
> Thiago
>
>
> On 25 October 2013 07:15, Darragh O'Reilly <dara2002-openstack@yahoo.com>wrote:
>
> Hi Thiago,
>
> you have configured DHCP to push out a MTU of 1400. Can you confirm that
> the 1400 MTU is actually getting out to the instances by running 'ip link'
> on them?
>
> There is an open problem where the veth used to connect the OVS and Linux
> bridges causes a performance drop on some kernels -
> https://bugs.launchpad.net/nova-project/+bug/1223267 . If you are using
> the LibvirtHybridOVSBridgeDriver VIF driver, can you try changing to
> LibvirtOpenVswitchDriver and repeat the iperf test between instances on
> different compute-nodes.
>
> What NICs (maker+model) are you using? You could try disabling any
> off-load functionality - 'ethtool -k <iface-used-for-gre>'.
>
> What kernal are you using: 'uname -a'?
>
> Re, Darragh.
>
> > Hi Daniel,
>
> >
> > I followed that page, my Instances MTU is lowered by DHCP Agent but, same
> > result: poor network performance (internal between Instances and when
> > trying to reach the Internet).
> >
> > No matter if I use
> "dnsmasq_config_file=/etc/neutron/dnsmasq-neutron.conf +
> > "dhcp-option-force=26,1400"" for my Neutron DHCP agent, or not (i.e. MTU
> =
> > 1500), the result is almost the same.
> >
> > I'll try VXLAN (or just VLANs) this weekend to see if I can get better
> > results...
> >
> > Thanks!
> > Thiago
>
> _______________________________________________
> Mailing list:
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
> Post to : openstack@lists.openstack.org
> Unsubscribe :
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>
>
>
>
>
Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
On 10/25/2013 08:19 AM, Martinx - ジェームズ wrote:
> I think can say... "YAY!!" :-D
>
> With "LibvirtOpenVswitchDriver" my internal communication is the double
> now! It goes from ~200 (with LibvirtHybridOVSBridgeDriver) to
> *_400Mbit/s_* (with LibvirtOpenVswitchDriver)! Still far from 1Gbit/s
> (my physical path limit) but, more acceptable now.
>
> The command "ethtool -K eth1 gro off" still makes no difference.

Does GRO happen if there isn't RX CKO on the NIC? Can your NIC
peer-into a GRE tunnel (?) to do CKO on the encapsulated traffic?

> So, there is only 1 remain problem, when traffic pass trough L3 /
> Namespace, it is still useless. Even the SSH connection into my
> Instances, via its Floating IPs, is slow as hell, sometimes it just
> stops responding for a few seconds, and becomes online again
> "out-of-nothing"...
>
> I just detect a weird "behavior", when I run "apt-get update" from
> instance-1, it is slow as I said plus, its ssh connection (where I'm
> running apt-get update), stops responding right after I run "apt-get
> update" AND, _all my others ssh connections also stops working too!_ For
> a few seconds... This means that when I run "apt-get update" from within
> instance-1, the SSH session of instance-2 is affected too!! There is
> something pretty bad going on at L3 / Namespace.
>
> BTW, do you think that a ~400MBit/sec intra-vm-communication (GRE
> tunnel) on top of a 1Gbit ethernet is acceptable?! It is still less than
> a half...

I would suggest checking for individual CPUs maxing-out during the 400
Mbit/s transfers.
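(For example - generic Linux tools, not something prescribed in this thread -
run "mpstat -P ALL 1" from the sysstat package, or "top" and press 1, on both
hypervisors and on the network node during the iperf run, and watch for any
single core pinned near 100%, softirq time included.)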

rick jones

_______________________________________________
Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Post to : openstack@lists.openstack.org
Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
the uneven ssh performance is strange - maybe learning on the tunnel mesh is not stabilizing. It is easy to mess it up by giving a wrong local_ip in the ovs-plugin config file. Check the tunnel ports on br-tun with 'ovs-vsctl show'. Is each one using the correct IPs? Br-tun should have N-1 gre-x ports - no more! Maybe you can put 'ovs-vsctl show' from the nodes on paste.openstack if there are not too many?
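(A concrete version of that check, as a sketch - the port names and IPs below
are illustrative, reusing the 10.20.2.x tunnel addresses that appear elsewhere
in this thread:

  # on each node
  ovs-vsctl list-ports br-tun
  ovs-vsctl show

On, say, hypervisor-1 you would expect br-tun to carry one gre-x port per
remote peer, something like:

    Port "gre-1"
        Interface "gre-1"
            type: gre
            options: {in_key=flow, local_ip="10.20.2.53", out_key=flow, remote_ip="10.20.2.52"}
    Port "gre-2"
        Interface "gre-2"
            type: gre
            options: {in_key=flow, local_ip="10.20.2.53", out_key=flow, remote_ip="10.20.2.57"}

with no stale or duplicated gre-x entries.)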

Re, Darragh.




On Friday, 25 October 2013, 16:20, Martinx - ジェームズ <thiagocmartinsc@gmail.com> wrote:

I think can say... "YAY!!"    :-D
>
>
>With "LibvirtOpenVswitchDriver" my internal communication is the double now! It goes from ~200 (with LibvirtHybridOVSBridgeDriver) to 400Mbit/s (with LibvirtOpenVswitchDriver)! Still far from 1Gbit/s (my physical path limit) but, more acceptable now.
>
>
>The command "ethtool -K eth1 gro off" still makes no difference.
>
>
>So, there is only 1 remain problem, when traffic pass trough L3 / Namespace, it is still useless. Even the SSH connection into my Instances, via its Floating IPs, is slow as hell, sometimes it just stops responding for a few seconds, and becomes online again "out-of-nothing"...
>
>
>I just detect a weird "behavior", when I run "apt-get update" from instance-1, it is slow as I said plus, its ssh connection (where I'm running apt-get update), stops responding right after I run "apt-get update" AND, all my others ssh connections also stops working too! For a few seconds... This means that when I run "apt-get update" from within instance-1, the SSH session of instance-2 is affected too!! There is something pretty bad going on at L3 / Namespace.
>
>
>BTW, do you think that a ~400MBit/sec intra-vm-communication (GRE tunnel) on top of a 1Gbit ethernet is acceptable?! It is still less than a half...
>
>
>Thank you!
>Thiago
>
>
>On 25 October 2013 12:28, Darragh O'Reilly <dara2002-openstack@yahoo.com> wrote:
>
>Hi Thiago,
>>
>>
>>for the VIF error: you will need to change qemu.conf as described here:
>>http://openvswitch.org/openstack/documentation/
>>
>>
>>Re, Darragh.
>>
>>
>>
>>
>>On Friday, 25 October 2013, 15:14, Martinx - ジェームズ <thiagocmartinsc@gmail.com> wrote:
>>
>>Hi Darragh,
>>>
>>>
>>>Yes, Instances are getting MTU 1400.
>>>
>>>
>>>I'm using LibvirtHybridOVSBridgeDriver at my Compute Nodes. I'll check BG 1223267 right now! 
>>>
>>>
>>>
>>>
>>>The LibvirtOpenVswitchDriver doesn't work, look:
>>>
>>>
>>>http://paste.openstack.org/show/49709/
>>>
>>>
>>>
>>>http://paste.openstack.org/show/49710/
>>>
>>>
>>>
>>>
>>>
>>>My NICs are "RTL8111/8168/8411 PCI Express Gigabit Ethernet", Hypervisors motherboard are MSI-890FXA-GD70.
>>>
>>>
>>>The command "ethtool -K eth1 gro off" did not had any effect on the communication between instances on different hypervisors, still poor, around 248Mbit/sec, when its physical path reach 1Gbit/s (where GRE is built).
>>>
>>>
>>>My Linux version is "Linux hypervisor-1 3.8.0-32-generic #47~precise1-Ubuntu", same kernel on Network Node" and others nodes too (Ubuntu 12.04.3 installed from scratch for this Havana deployment).
>>>
>>>
>>>The only difference I can see right now, between my two hypervisors, is that my second is just a spare machine, with a slow CPU but, I don't think it will have a negative impact at the network throughput, since I have only 1 Instance running into it (plus a qemu-nbd process eating 90% of its CPU). I'll replace this CPU tomorrow, to redo this tests again but, I don't think that this is the source of my problem. The MOBOs of two hypervisors are identical, 1 3Com (manageable) switch connecting the two.
>>>
>>>
>>>Thanks!
>>>Thiago
>>>
>>>
>>>
>>>On 25 October 2013 07:15, Darragh O'Reilly <dara2002-openstack@yahoo.com> wrote:
>>>
>>>Hi Thiago,
>>>>
>>>>you have configured DHCP to push out a MTU of 1400. Can you confirm that the 1400 MTU is actually getting out to the instances by running 'ip link' on them?
>>>>
>>>>There is an open problem where the veth used to connect the OVS and Linux bridges causes a performance drop on some kernels - https://bugs.launchpad.net/nova-project/+bug/1223267 .  If you are using the LibvirtHybridOVSBridgeDriver VIF driver, can you try changing to LibvirtOpenVswitchDriver and repeat the iperf test between instances on different compute-nodes.
>>>>
>>>>What NICs (maker+model) are you using? You could try disabling any off-load functionality - 'ethtool -k <iface-used-for-gre>'.
>>>>
>>>>What kernal are you using: 'uname -a'?
>>>>
>>>>Re, Darragh.
>>>>
>>>>
>>>>> Hi Daniel,
>>>>
>>>>>
>>>>> I followed that page, my Instances MTU is lowered by DHCP Agent but, same
>>>>> result: poor network performance (internal between Instances and when
>>>>> trying to reach the Internet).
>>>>>
>>>>> No matter if I use "dnsmasq_config_file=/etc/neutron/dnsmasq-neutron.conf +
>>>>> "dhcp-option-force=26,1400"" for my Neutron DHCP agent, or not (i.e. MTU =
>>>>> 1500), the result is almost the same.
>>>>>
>>>>> I'll try VXLAN (or just VLANs) this weekend to see if I can get better
>>>>> results...
>>>>>
>>>>> Thanks!
>>>>> Thiago
>>>>
>>>>
>>>>_______________________________________________
>>>>Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>>>>Post to     : openstack@lists.openstack.org
>>>>Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>>>>
>>>
>>>
>>>
>
>
>
Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
Here we go:

---
root@net-node-1:~# grep local_ip
/etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
local_ip = 10.20.2.52

root@net-node-1:~# ip r | grep 10.\20
10.20.2.0/24 dev eth1 proto kernel scope link src 10.20.2.52
---

---
root@hypervisor-1:~# grep local_ip
/etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
local_ip = 10.20.2.53

root@hypervisor-1:~# ip r | grep 10.\20
10.20.2.0/24 dev eth1 proto kernel scope link src 10.20.2.53
---

---
root@hypervisor-2:~# grep local_ip
/etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
local_ip = 10.20.2.57

root@hypervisor-2:~# ip r | grep 10.\20
10.20.2.0/24 dev eth1 proto kernel scope link src 10.20.2.57
---

Each "ovs-vsctl show":

net-node-1: http://paste.openstack.org/show/49727/

hypervisor-1: http://paste.openstack.org/show/49728/

hypervisor-2: http://paste.openstack.org/show/49729/


Best,
Thiago


On 25 October 2013 14:11, Darragh O'Reilly <dara2002-openstack@yahoo.com>wrote:

>
> the uneven ssh performance is strange - maybe learning on the tunnel mesh
> is not stablizing. It is easy to mess it up by giving a wrong local_ip in
> the ovs-plugin config file. Check the tunnels ports on br-tun with
> 'ovs-vsctl show'. Is each one using the correct IPs? Br-tun should have N-1
> gre-x ports - no more! Maybe you can put 'ovs-vsctl show' from the nodes on
> paste.openstack if there are not to many?
>
> Re, Darragh.
>
>
> On Friday, 25 October 2013, 16:20, Martinx - ジェームズ <
> thiagocmartinsc@gmail.com> wrote:
>
> I think can say... "YAY!!" :-D
>
> With "LibvirtOpenVswitchDriver" my internal communication is the double
> now! It goes from ~200 (with LibvirtHybridOVSBridgeDriver) to *400Mbit/s*(with LibvirtOpenVswitchDriver)! Still far from 1Gbit/s (my physical path
> limit) but, more acceptable now.
>
> The command "ethtool -K eth1 gro off" still makes no difference.
>
> So, there is only 1 remain problem, when traffic pass trough L3 /
> Namespace, it is still useless. Even the SSH connection into my Instances,
> via its Floating IPs, is slow as hell, sometimes it just stops responding
> for a few seconds, and becomes online again "out-of-nothing"...
>
> I just detect a weird "behavior", when I run "apt-get update" from
> instance-1, it is slow as I said plus, its ssh connection (where I'm
> running apt-get update), stops responding right after I run "apt-get
> update" AND, *all my others ssh connections also stops working too!* For
> a few seconds... This means that when I run "apt-get update" from within
> instance-1, the SSH session of instance-2 is affected too!! There is
> something pretty bad going on at L3 / Namespace.
>
> BTW, do you think that a ~400MBit/sec intra-vm-communication (GRE tunnel)
> on top of a 1Gbit ethernet is acceptable?! It is still less than a half...
>
> Thank you!
> Thiago
>
> On 25 October 2013 12:28, Darragh O'Reilly <dara2002-openstack@yahoo.com>wrote:
>
> Hi Thiago,
>
> for the VIF error: you will need to change qemu.conf as described here:
> http://openvswitch.org/openstack/documentation/
>
> Re, Darragh.
>
>
> On Friday, 25 October 2013, 15:14, Martinx - ジェームズ <
> thiagocmartinsc@gmail.com> wrote:
>
> Hi Darragh,
>
> Yes, Instances are getting MTU 1400.
>
> I'm using LibvirtHybridOVSBridgeDriver at my Compute Nodes. I'll check BG
> 1223267 right now!
>
>
> The LibvirtOpenVswitchDriver doesn't work, look:
>
> http://paste.openstack.org/show/49709/
>
> http://paste.openstack.org/show/49710/
>
>
> My NICs are "RTL8111/8168/8411 PCI Express Gigabit Ethernet", Hypervisors
> motherboard are MSI-890FXA-GD70.
>
> The command "ethtool -K eth1 gro off" did not had any effect on the
> communication between instances on different hypervisors, still poor,
> around 248Mbit/sec, when its physical path reach 1Gbit/s (where GRE is
> built).
>
> My Linux version is "Linux hypervisor-1 3.8.0-32-generic
> #47~precise1-Ubuntu", same kernel on Network Node" and others nodes too
> (Ubuntu 12.04.3 installed from scratch for this Havana deployment).
>
> The only difference I can see right now, between my two hypervisors, is
> that my second is just a spare machine, with a slow CPU but, I don't think
> it will have a negative impact at the network throughput, since I have only
> 1 Instance running into it (plus a qemu-nbd process eating 90% of its CPU).
> I'll replace this CPU tomorrow, to redo this tests again but, I don't think
> that this is the source of my problem. The MOBOs of two hypervisors
> are identical, 1 3Com (manageable) switch connecting the two.
>
> Thanks!
> Thiago
>
>
> On 25 October 2013 07:15, Darragh O'Reilly <dara2002-openstack@yahoo.com>wrote:
>
> Hi Thiago,
>
> you have configured DHCP to push out a MTU of 1400. Can you confirm that
> the 1400 MTU is actually getting out to the instances by running 'ip link'
> on them?
>
> There is an open problem where the veth used to connect the OVS and Linux
> bridges causes a performance drop on some kernels -
> https://bugs.launchpad.net/nova-project/+bug/1223267 . If you are using
> the LibvirtHybridOVSBridgeDriver VIF driver, can you try changing to
> LibvirtOpenVswitchDriver and repeat the iperf test between instances on
> different compute-nodes.
>
> What NICs (maker+model) are you using? You could try disabling any
> off-load functionality - 'ethtool -k <iface-used-for-gre>'.
>
> What kernal are you using: 'uname -a'?
>
> Re, Darragh.
>
> > Hi Daniel,
>
> >
> > I followed that page, my Instances MTU is lowered by DHCP Agent but, same
> > result: poor network performance (internal between Instances and when
> > trying to reach the Internet).
> >
> > No matter if I use
> "dnsmasq_config_file=/etc/neutron/dnsmasq-neutron.conf +
> > "dhcp-option-force=26,1400"" for my Neutron DHCP agent, or not (i.e. MTU
> =
> > 1500), the result is almost the same.
> >
> > I'll try VXLAN (or just VLANs) this weekend to see if I can get better
> > results...
> >
> > Thanks!
> > Thiago
>
> _______________________________________________
> Mailing list:
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
> Post to : openstack@lists.openstack.org
> Unsubscribe :
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>
>
>
>
>
>
>
>
Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
ok, the tunnels look fine. One thing that looks funny on the network node is these untagged tap* devices. I guess you switched to using veths and then switched back to not using them. I don't know if they matter, but you should clean them up by stopping everything, running neutron-ovs-cleanup (check bridges empty) and rebooting.

    Bridge br-int
        Port "tapa1376f61-05"
            Interface "tapa1376f61-05"
        ...
        Port "qr-a1376f61-05"
            tag: 1
            Interface "qr-a1376f61-05"
                type: internal
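(A rough outline of that cleanup on the network node - a sketch, assuming the
standard Ubuntu Havana service names:

  # stop the agents that plug ports into OVS
  service neutron-l3-agent stop
  service neutron-dhcp-agent stop
  service neutron-plugin-openvswitch-agent stop

  # remove stale neutron-created ports
  neutron-ovs-cleanup

  # confirm the bridges no longer carry tap/qr/qg ports
  ovs-vsctl list-ports br-int
  ovs-vsctl list-ports br-tun

  reboot
)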

Re, Darragh.




On Friday, 25 October 2013, 17:28, Martinx - ジェームズ <thiagocmartinsc@gmail.com> wrote:

Here we go:
>
>
>---
>root@net-node-1:~# grep local_ip /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini 
>local_ip = 10.20.2.52
>
>
>root@net-node-1:~# ip r | grep 10.\20
>10.20.2.0/24 dev eth1  proto kernel  scope link  src 10.20.2.52 
>---
>
>
>---
>root@hypervisor-1:~# grep local_ip /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
>local_ip = 10.20.2.53
>
>
>root@hypervisor-1:~# ip r | grep 10.\20
>10.20.2.0/24 dev eth1  proto kernel  scope link  src 10.20.2.53 
>---
>
>
>---
>root@hypervisor-2:~# grep local_ip /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
>local_ip = 10.20.2.57
>
>
>root@hypervisor-2:~# ip r | grep 10.\20
>10.20.2.0/24 dev eth1  proto kernel  scope link  src 10.20.2.57
>---
>
>
>Each "ovs-vsctl show":
>
>
>net-node-1: http://paste.openstack.org/show/49727/
>
>
>hypervisor-1: http://paste.openstack.org/show/49728/
>
>
>hypervisor-2: http://paste.openstack.org/show/49729/
>
>
>
>
>
>Best,
>Thiago
>
>
>
>On 25 October 2013 14:11, Darragh O'Reilly <dara2002-openstack@yahoo.com> wrote:
>
>
>>
>>the uneven ssh performance is strange - maybe learning on the tunnel mesh is not stablizing. It is easy to mess it up by giving a wrong local_ip in the ovs-plugin config file. Check the tunnels ports on br-tun with 'ovs-vsctl show'. Is each one using the correct IPs? Br-tun should have N-1 gre-x ports - no more! Maybe you can put 'ovs-vsctl show' from the nodes on paste.openstack if there are not to many?
>>
>>
>>Re, Darragh.
>>
>>
>>
>>
>>On Friday, 25 October 2013, 16:20, Martinx - ジェームズ <thiagocmartinsc@gmail.com> wrote:
>>
>>I think can say... "YAY!!"    :-D
>>>
>>>
>>>With "LibvirtOpenVswitchDriver" my internal communication is the double now! It goes from ~200 (with LibvirtHybridOVSBridgeDriver) to 400Mbit/s (with LibvirtOpenVswitchDriver)! Still far from 1Gbit/s (my physical path limit) but, more acceptable now.
>>>
>>>
>>>The command "ethtool -K eth1 gro off" still makes no difference.
>>>
>>>
>>>So, there is only 1 remain problem, when traffic pass trough L3 / Namespace, it is still useless. Even the SSH connection into my Instances, via its Floating IPs, is slow as hell, sometimes it just stops responding for a few seconds, and becomes online again "out-of-nothing"...
>>>
>>>
>>>I just detect a weird "behavior", when I run "apt-get update" from instance-1, it is slow as I said plus, its ssh connection (where I'm running apt-get update), stops responding right after I run "apt-get update" AND, all my others ssh connections also stops working too! For a few seconds... This means that when I run "apt-get update" from within instance-1, the SSH session of instance-2 is affected too!! There is something pretty bad going on at L3 / Namespace.
>>>
>>>
>>>BTW, do you think that a ~400MBit/sec intra-vm-communication (GRE tunnel) on top of a 1Gbit ethernet is acceptable?! It is still less than a half...
>>>
>>>
>>>Thank you!
>>>Thiago
>>>
>>>
>>>On 25 October 2013 12:28, Darragh O'Reilly <dara2002-openstack@yahoo.com> wrote:
>>>
>>>Hi Thiago,
>>>>
>>>>
>>>>for the VIF error: you will need to change qemu.conf as described here:
>>>>http://openvswitch.org/openstack/documentation/
>>>>
>>>>
>>>>Re, Darragh.
>>>>
>>>>
>>>>
>>>>
>>>>On Friday, 25 October 2013, 15:14, Martinx - ジェームズ <thiagocmartinsc@gmail.com> wrote:
>>>>
>>>>Hi Darragh,
>>>>>
>>>>>
>>>>>Yes, Instances are getting MTU 1400.
>>>>>
>>>>>
>>>>>I'm using LibvirtHybridOVSBridgeDriver at my Compute Nodes. I'll check BG 1223267 right now! 
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>The LibvirtOpenVswitchDriver doesn't work, look:
>>>>>
>>>>>
>>>>>http://paste.openstack.org/show/49709/
>>>>>
>>>>>
>>>>>
>>>>>http://paste.openstack.org/show/49710/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>My NICs are "RTL8111/8168/8411 PCI Express Gigabit Ethernet", Hypervisors motherboard are MSI-890FXA-GD70.
>>>>>
>>>>>
>>>>>The command "ethtool -K eth1 gro off" did not had any effect on the communication between instances on different hypervisors, still poor, around 248Mbit/sec, when its physical path reach 1Gbit/s (where GRE is built).
>>>>>
>>>>>
>>>>>My Linux version is "Linux hypervisor-1 3.8.0-32-generic #47~precise1-Ubuntu", same kernel on Network Node" and others nodes too (Ubuntu 12.04.3 installed from scratch for this Havana deployment).
>>>>>
>>>>>
>>>>>The only difference I can see right now, between my two hypervisors, is that my second is just a spare machine, with a slow CPU but, I don't think it will have a negative impact at the network throughput, since I have only 1 Instance running into it (plus a qemu-nbd process eating 90% of its CPU). I'll replace this CPU tomorrow, to redo this tests again but, I don't think that this is the source of my problem. The MOBOs of two hypervisors are identical, 1 3Com (manageable) switch connecting the two.
>>>>>
>>>>>
>>>>>Thanks!
>>>>>Thiago
>>>>>
>>>>>
>>>>>
>>>>>On 25 October 2013 07:15, Darragh O'Reilly <dara2002-openstack@yahoo.com> wrote:
>>>>>
>>>>>Hi Thiago,
>>>>>>
>>>>>>you have configured DHCP to push out a MTU of 1400. Can you confirm that the 1400 MTU is actually getting out to the instances by running 'ip link' on them?
>>>>>>
>>>>>>There is an open problem where the veth used to connect the OVS and Linux bridges causes a performance drop on some kernels - https://bugs.launchpad.net/nova-project/+bug/1223267 .  If you are using the LibvirtHybridOVSBridgeDriver VIF driver, can you try changing to LibvirtOpenVswitchDriver and repeat the iperf test between instances on different compute-nodes.
>>>>>>
>>>>>>What NICs (maker+model) are you using? You could try disabling any off-load functionality - 'ethtool -k <iface-used-for-gre>'.
>>>>>>
>>>>>>What kernal are you using: 'uname -a'?
>>>>>>
>>>>>>Re, Darragh.
>>>>>>
>>>>>>
>>>>>>> Hi Daniel,
>>>>>>
>>>>>>>
>>>>>>> I followed that page, my Instances MTU is lowered by DHCP Agent but, same
>>>>>>> result: poor network performance (internal between Instances and when
>>>>>>> trying to reach the Internet).
>>>>>>>
>>>>>>> No matter if I use "dnsmasq_config_file=/etc/neutron/dnsmasq-neutron.conf +
>>>>>>> "dhcp-option-force=26,1400"" for my Neutron DHCP agent, or not (i.e. MTU =
>>>>>>> 1500), the result is almost the same.
>>>>>>>
>>>>>>> I'll try VXLAN (or just VLANs) this weekend to see if I can get better
>>>>>>> results...
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Thiago
>>>>>>
>>>>>>
>>>>>>_______________________________________________
>>>>>>Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>>>>>>Post to     : openstack@lists.openstack.org
>>>>>>Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>
>
>
Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
Okay, cool!

tap** removed, neutron-ovs-cleanup ok, bridges empty, all nodes rebooted.

BUT, still poor performance when reaching the "External" network from within an
Instance (plus SSH lags)...

I'll install a new Network Node, on different hardware, to test it more...
The weird thing is, my Grizzly Network Node works perfectly on this very same
hardware (same OpenStack Network topology, of course)...

Hardware of my current "net-node-1":

* Grizzly - Okay
* Havana - Fails... ;-(

Best,
Thiago


On 25 October 2013 15:28, Darragh O'Reilly <dara2002-openstack@yahoo.com>wrote:

>
> ok, the tunnels look fine. One thing that looks funny on the network node
> are these untagged tap* devices. I guess you switched to using veths and
> then switched back to not using them. I don't know if they matter, but you
> should clean them up by stopping everthing, running neutron-ovs-cleanup
> (check bridges empty) and reboot.
>
> Bridge br-int
> Port "tapa1376f61-05"
> Interface "tapa1376f61-05"
> ...
> Port "qr-a1376f61-05"
> tag: 1
> Interface "qr-a1376f61-05"
> type: internal
>
> Re, Darragh.
>
>
>
> On Friday, 25 October 2013, 17:28, Martinx - ジェームズ <
> thiagocmartinsc@gmail.com> wrote:
>
> Here we go:
>
> ---
> root@net-node-1:~# grep local_ip
> /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
> local_ip = 10.20.2.52
>
> root@net-node-1:~# ip r | grep 10.\20
> 10.20.2.0/24 dev eth1 proto kernel scope link src 10.20.2.52
> ---
>
> ---
> root@hypervisor-1:~# grep local_ip
> /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
> local_ip = 10.20.2.53
>
> root@hypervisor-1:~# ip r | grep 10.\20
> 10.20.2.0/24 dev eth1 proto kernel scope link src 10.20.2.53
> ---
>
> ---
> root@hypervisor-2:~# grep local_ip
> /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
> local_ip = 10.20.2.57
>
> root@hypervisor-2:~# ip r | grep 10.\20
> 10.20.2.0/24 dev eth1 proto kernel scope link src 10.20.2.57
> ---
>
> Each "ovs-vsctl show":
>
> net-node-1: http://paste.openstack.org/show/49727/
>
> hypervisor-1: http://paste.openstack.org/show/49728/
>
> hypervisor-2: http://paste.openstack.org/show/49729/
>
>
> Best,
> Thiago
>
>
> On 25 October 2013 14:11, Darragh O'Reilly <dara2002-openstack@yahoo.com>wrote:
>
>
> the uneven ssh performance is strange - maybe learning on the tunnel mesh
> is not stablizing. It is easy to mess it up by giving a wrong local_ip in
> the ovs-plugin config file. Check the tunnels ports on br-tun with
> 'ovs-vsctl show'. Is each one using the correct IPs? Br-tun should have N-1
> gre-x ports - no more! Maybe you can put 'ovs-vsctl show' from the nodes on
> paste.openstack if there are not to many?
>
> Re, Darragh.
>
>
> On Friday, 25 October 2013, 16:20, Martinx - ジェームズ <
> thiagocmartinsc@gmail.com> wrote:
>
> I think can say... "YAY!!" :-D
>
> With "LibvirtOpenVswitchDriver" my internal communication is the double
> now! It goes from ~200 (with LibvirtHybridOVSBridgeDriver) to *400Mbit/s*(with LibvirtOpenVswitchDriver)! Still far from 1Gbit/s (my physical path
> limit) but, more acceptable now.
>
> The command "ethtool -K eth1 gro off" still makes no difference.
>
> So, there is only 1 remain problem, when traffic pass trough L3 /
> Namespace, it is still useless. Even the SSH connection into my Instances,
> via its Floating IPs, is slow as hell, sometimes it just stops responding
> for a few seconds, and becomes online again "out-of-nothing"...
>
> I just detect a weird "behavior", when I run "apt-get update" from
> instance-1, it is slow as I said plus, its ssh connection (where I'm
> running apt-get update), stops responding right after I run "apt-get
> update" AND, *all my others ssh connections also stops working too!* For
> a few seconds... This means that when I run "apt-get update" from within
> instance-1, the SSH session of instance-2 is affected too!! There is
> something pretty bad going on at L3 / Namespace.
>
> BTW, do you think that a ~400MBit/sec intra-vm-communication (GRE tunnel)
> on top of a 1Gbit ethernet is acceptable?! It is still less than a half...
>
> Thank you!
> Thiago
>
> On 25 October 2013 12:28, Darragh O'Reilly <dara2002-openstack@yahoo.com>wrote:
>
> Hi Thiago,
>
> for the VIF error: you will need to change qemu.conf as described here:
> http://openvswitch.org/openstack/documentation/
>
> Re, Darragh.
>
>
> On Friday, 25 October 2013, 15:14, Martinx - ジェームズ <
> thiagocmartinsc@gmail.com> wrote:
>
> Hi Darragh,
>
> Yes, Instances are getting MTU 1400.
>
> I'm using LibvirtHybridOVSBridgeDriver at my Compute Nodes. I'll check BG
> 1223267 right now!
>
>
> The LibvirtOpenVswitchDriver doesn't work, look:
>
> http://paste.openstack.org/show/49709/
>
> http://paste.openstack.org/show/49710/
>
>
> My NICs are "RTL8111/8168/8411 PCI Express Gigabit Ethernet", Hypervisors
> motherboard are MSI-890FXA-GD70.
>
> The command "ethtool -K eth1 gro off" did not had any effect on the
> communication between instances on different hypervisors, still poor,
> around 248Mbit/sec, when its physical path reach 1Gbit/s (where GRE is
> built).
>
> My Linux version is "Linux hypervisor-1 3.8.0-32-generic
> #47~precise1-Ubuntu", same kernel on Network Node" and others nodes too
> (Ubuntu 12.04.3 installed from scratch for this Havana deployment).
>
> The only difference I can see right now, between my two hypervisors, is
> that my second is just a spare machine, with a slow CPU but, I don't think
> it will have a negative impact at the network throughput, since I have only
> 1 Instance running into it (plus a qemu-nbd process eating 90% of its CPU).
> I'll replace this CPU tomorrow, to redo this tests again but, I don't think
> that this is the source of my problem. The MOBOs of two hypervisors
> are identical, 1 3Com (manageable) switch connecting the two.
>
> Thanks!
> Thiago
>
>
> On 25 October 2013 07:15, Darragh O'Reilly <dara2002-openstack@yahoo.com>wrote:
>
> Hi Thiago,
>
> you have configured DHCP to push out a MTU of 1400. Can you confirm that
> the 1400 MTU is actually getting out to the instances by running 'ip link'
> on them?
>
> There is an open problem where the veth used to connect the OVS and Linux
> bridges causes a performance drop on some kernels -
> https://bugs.launchpad.net/nova-project/+bug/1223267 . If you are using
> the LibvirtHybridOVSBridgeDriver VIF driver, can you try changing to
> LibvirtOpenVswitchDriver and repeat the iperf test between instances on
> different compute-nodes.
>
> What NICs (maker+model) are you using? You could try disabling any
> off-load functionality - 'ethtool -k <iface-used-for-gre>'.
>
> What kernal are you using: 'uname -a'?
>
> Re, Darragh.
>
> > Hi Daniel,
>
> >
> > I followed that page, my Instances MTU is lowered by DHCP Agent but, same
> > result: poor network performance (internal between Instances and when
> > trying to reach the Internet).
> >
> > No matter if I use
> "dnsmasq_config_file=/etc/neutron/dnsmasq-neutron.conf +
> > "dhcp-option-force=26,1400"" for my Neutron DHCP agent, or not (i.e. MTU
> =
> > 1500), the result is almost the same.
> >
> > I'll try VXLAN (or just VLANs) this weekend to see if I can get better
> > results...
> >
> > Thanks!
> > Thiago
>
> _______________________________________________
> Mailing list:
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
> Post to : openstack@lists.openstack.org
> Unsubscribe :
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>
>
>
>
>
>
>
>
>
>
>
Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
Hi Rick,

On 25 October 2013 13:44, Rick Jones <rick.jones2@hp.com> wrote:

> On 10/25/2013 08:19 AM, Martinx - ジェームズ wrote:
>
>> I think can say... "YAY!!" :-D
>>
>> With "LibvirtOpenVswitchDriver" my internal communication is the double
>> now! It goes from ~200 (with LibvirtHybridOVSBridgeDriver) to
>> *_400Mbit/s_* (with LibvirtOpenVswitchDriver)! Still far from 1Gbit/s
>>
>> (my physical path limit) but, more acceptable now.
>>
>> The command "ethtool -K eth1 gro off" still makes no difference.
>>
>
> Does GRO happen if there isn't RX CKO on the NIC?



Ouch! I missed that lesson... hehe

No idea, how can I check / test this?

If I "disable RX CKO" (using ethtool?) on the NIC, how can I verify if the
GRO is actually happening or not?

Anyway, I'm googling all this stuff right now. Thanks for pointing it
out!

Refs:

* JLS2009: Generic receive offload - http://lwn.net/Articles/358910/


Can your NIC peer-into a GRE tunnel (?) to do CKO on the encapsulated
> traffic?
>


Again, no idea... No idea... :-/

Listen, maybe this sounds too dumb on my part, but it is the first time
I'm talking about this stuff (like "NIC peer-into GRE"?, or GRO / CKO...)

GRE tunnels sound too damn complex and problematic... I guess it is time
to try VXLAN (or NVP?)...

If you guys say VXLAN is a completely different beast (i.e. it does not
touch ANY GRE tunnel) and it works smoothly (without GRO / CKO / MTU /
lag / low-speed troubles and issues), I'll move to it right now (are the
VXLAN docs ready?).

NOTE: I don't want to hijack this thread with other problems in my
OpenStack environment (internal communication VS the "Directional network
performance issues with Neutron + OpenvSwitch" thread subject); please let
me know if this becomes a problem for you guys.



> So, there is only 1 remain problem, when traffic pass trough L3 /
>> Namespace, it is still useless. Even the SSH connection into my
>> Instances, via its Floating IPs, is slow as hell, sometimes it just
>> stops responding for a few seconds, and becomes online again
>> "out-of-nothing"...
>>
>> I just detect a weird "behavior", when I run "apt-get update" from
>> instance-1, it is slow as I said plus, its ssh connection (where I'm
>> running apt-get update), stops responding right after I run "apt-get
>> update" AND, _all my others ssh connections also stops working too!_ For
>>
>> a few seconds... This means that when I run "apt-get update" from within
>> instance-1, the SSH session of instance-2 is affected too!! There is
>> something pretty bad going on at L3 / Namespace.
>>
>> BTW, do you think that a ~400MBit/sec intra-vm-communication (GRE
>> tunnel) on top of a 1Gbit ethernet is acceptable?! It is still less than
>> a half...
>>
>
> I would suggest checking for individual CPUs maxing-out during the 400
> Mbit/s transfers.


Okay, I'll.


>
>
> rick jones
>

Thiago
Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
You can use "ethtool -k eth0" to view the setting and use "ethtool -K eth0
gro off" to turn off GRO.


On Fri, Oct 25, 2013 at 3:03 PM, Martinx - ジェームズ
<thiagocmartinsc@gmail.com> wrote:

> Hi Rick,
>
> On 25 October 2013 13:44, Rick Jones <rick.jones2@hp.com> wrote:
>
>> On 10/25/2013 08:19 AM, Martinx - ジェームズ wrote:
>>
>>> I think can say... "YAY!!" :-D
>>>
>>> With "LibvirtOpenVswitchDriver" my internal communication is the double
>>> now! It goes from ~200 (with LibvirtHybridOVSBridgeDriver) to
>>> *_400Mbit/s_* (with LibvirtOpenVswitchDriver)! Still far from 1Gbit/s
>>>
>>> (my physical path limit) but, more acceptable now.
>>>
>>> The command "ethtool -K eth1 gro off" still makes no difference.
>>>
>>
>> Does GRO happen if there isn't RX CKO on the NIC?
>
>
>
> Ouch! I missed that lesson... hehe
>
> No idea, how can I check / test this?
>
> If I "disable RX CKO" (using ethtool?) on the NIC, how can I verify if the
> GRO is actually happening or not?
>
> Anyway, I'm goggling about all this stuff right now. Thanks for pointing
> it out!
>
> Refs:
>
> * JLS2009: Generic receive offload - http://lwn.net/Articles/358910/
>
>
> Can your NIC peer-into a GRE tunnel (?) to do CKO on the encapsulated
>> traffic?
>>
>
>
> Again, no idea... No idea... :-/
>
> Listen, maybe this sounds too dumb from my part but, it is the first time
> I'm talking about this stuff (like "NIC peer-into GRE" ?, or GRO / CKO...
>
> GRE tunnels sounds too damn complex and problematic... I guess it is time
> to try VXLAN (or NVP ?)...
>
> If you guys say: VXLAN is a completely different beast (i.e. it does not
> touch with ANY GRE tunnel), and it works smoothly (without GRO / CKO / MTU
> / lags / low speed troubles and issues), I'll move to it right now (is
> VXLAN docs ready?).
>
> NOTE: I don't want to hijack this thread because of other (internal
> communication VS "Directional network performance issues with Neutron +
> OpenvSwitch" thread subject) problems with my OpenStack environment,
> please, let me know if this becomes a problem for you guys.
>
>
>
>> So, there is only 1 remain problem, when traffic pass trough L3 /
>>> Namespace, it is still useless. Even the SSH connection into my
>>> Instances, via its Floating IPs, is slow as hell, sometimes it just
>>> stops responding for a few seconds, and becomes online again
>>> "out-of-nothing"...
>>>
>>> I just detect a weird "behavior", when I run "apt-get update" from
>>> instance-1, it is slow as I said plus, its ssh connection (where I'm
>>> running apt-get update), stops responding right after I run "apt-get
>>> update" AND, _all my others ssh connections also stops working too!_ For
>>>
>>> a few seconds... This means that when I run "apt-get update" from within
>>> instance-1, the SSH session of instance-2 is affected too!! There is
>>> something pretty bad going on at L3 / Namespace.
>>>
>>> BTW, do you think that a ~400MBit/sec intra-vm-communication (GRE
>>> tunnel) on top of a 1Gbit ethernet is acceptable?! It is still less than
>>> a half...
>>>
>>
>> I would suggest checking for individual CPUs maxing-out during the 400
>> Mbit/s transfers.
>
>
> Okay, I'll.
>
>
>>
>>
>> rick jones
>>
>
> Thiago
>
> _______________________________________________
> Mailing list:
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
> Post to : openstack@lists.openstack.org
> Unsubscribe :
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>
>
Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
> Listen, maybe this sounds too dumb from my part but, it is the first
> time I'm talking about this stuff (like "NIC peer-into GRE" ?, or GRO
> / CKO...

No worries.

So, a slightly brief history of stateless offloads in NICs. It may be
too basic, and I may get some details wrong, but it should give the gist.

Go back to the "old days" - 10 Mbit/s Ethernet was "it" (all you Token
Ring fans can keep quiet :). Systems got faster than 10 Mbit/s. By a
fair margin. 100 BT came out and it wasn't all that long before systems
were faster than that, but things like interrupt rates were starting to
get to be an issue for performance, so 100 BT NICs started implementing
interrupt avoidance heuristics. The next bump in network speed to 1000
Mbit/s managed to get well out ahead of the systems. All this time,
while the link speeds were increasing, the IEEE was doing little to
nothing to make sending and receiving Ethernet traffic any easier on the
end stations (eg increasing the MTU). It was taking just as many CPU
cycles to send/receive a frame over 1000BT as it did over 100BT as it
did over 10BT.

<insert segue about how FDDI was doing things to make life easier, as
well as what the FDDI NIC vendors were doing to enable copy-free
networking, here>

So the Ethernet NIC vendors started getting creative and started
borrowing some techniques from FDDI. The base of it all is CKO -
ChecKsum Offload. Offloading the checksum calculation for the TCP and
UDP checksums. In broad handwaving terms, for inbound packets, the NIC
is made either smart enough to recognize an incoming frame as TCP
segment (UDP datagram) or it performs the Internet Checksum across the
entire frame and leaves it to the driver to fixup. For outbound
traffic, the stack, via the driver, tells the NIC a starting value
(perhaps), where to start computing the checksum, how far to go, and
where to stick it...

So, we can save the CPU cycles used calculating/verifying the checksums.
In rough terms, in the presence of copies, that is perhaps 10% or 15%
savings. Systems still needed more. It was just as many trips up and
down the protocol stack in the host to send a MB of data as it was
before - the IEEE hanging-on to the 1500 byte MTU. So, some NIC vendors
came-up with Jumbo Frames - I think the first may have been Alteon and
their AceNICs and switches. A 9000 byte MTU allows one to send bulk
data across the network in ~1/6 the number of trips up and down the
protocol stack. But that has problems - in particular you have to have
support for Jumbo Frames from end to end.

So someone, I don't recall who, had the flash of inspiration - What
If... the NIC could perform the TCP segmentation on behalf of the
stack? When sending a big chunk of data over TCP in one direction, the
only things which change from TCP segment to TCP segment are the
sequence number, and the checksum <insert some handwaving about the IP
datagram ID here>. The NIC already knows how to compute the checksum,
so let's teach it how to very simply increment the TCP sequence number.
Now we can give it A Lot of Data (tm) in one trip down the protocol
stack and save even more CPU cycles than Jumbo Frames. Now the NIC has
to know a little bit more about the traffic - it has to know that it is
TCP so it can know where the TCP sequence number goes. We also tell it
the MSS to use when it is doing the segmentation on our behalf. Thus
was born TCP Segmentation Offload, aka TSO or "Poor Man's Jumbo Frames"

That works pretty well for servers at the time - they tend to send more
data than they receive. The clients receiving the data don't need to be
able to keep up at 1000 Mbit/s and the server can be sending to multiple
clients. However, we get another order of magnitude bump in link
speeds, to 10000 Mbit/s. Now people need/want to receive at the higher
speeds too. So some 10 Gbit/s NIC vendors come up with the mirror image
of TSO and call it LRO - Large Receive Offload. The LRO NIC will
coalesce several consecutive TCP segments into one uber segment and
hand that to the host. There are some "issues" with LRO though - for
example when a system is acting as a router, so in Linux, and perhaps
other stacks, LRO is taken out of the hands of the NIC and given to the
stack in the form of 'GRO" - Generic Receive Offload. GRO operates
above the NIC/driver, but below IP. It detects the consecutive
segments and coalesces them before passing them further up the stack. It
becomes possible to receive data at link-rate over 10 GbE. All is
happiness and joy.

OK, so now we have all these "stateless" offloads that know about the
basic traffic flow. They are all built on the foundation of CKO. They
are all dealing with *un* encapsulated traffic. (They also don't to
anything for small packets.)

Now, toss-in some encapsulation. Take your pick, in the abstract it
doesn't really matter which I suspect, at least for a little longer.
What is arriving at the NIC on inbound is no longer a TCP segment in an
IP datagram in an Ethernet frame, it is all that wrapped-up in the
encapsulation protocol. Unless the NIC knows about the encapsulation
protocol, all the NIC knows it has is some slightly alien packet. It
will probably know it is IP, but it won't know more than that.

It could, perhaps, simply compute an Internet Checksum across the entire
IP datagram and leave it to the driver to fix-up. It could simply punt
and not perform any CKO at all. But CKO is the foundation of the
stateless offloads. So, certainly no LRO and (I think but could be
wrong) no GRO. (At least not until the Linux stack learns how to look
beyond the encapsulation headers.)

Similarly, consider the outbound path. We could change the constants we
tell the NIC for doing CKO perhaps, but unless it knows about the
encapsulation protocol, we cannot ask it to do the TCP segmentation of
TSO - it would have to start replicating not only the TCP and IP
headers, but also the headers of the encapsulation protocol. So, there
goes TSO.

In essence, using an encapsulation protocol takes us all the way back to
the days of 100BT in so far as stateless offloads are concerned.
Perhaps to the early days of 1000BT.

We do have a bit more CPU grunt these days, but for the last several
years that has come primarily in the form of more cores per processor,
not in the form of processors with higher and higher frequencies. In
broad handwaving terms, single-threaded performance is not growing all
that much. If at all.

That is why we have things like multiple queues per NIC port now and
Receive Side Scaling (RSS) or Receive Packet Scaling/Receive Flow
Scaling in Linux (or Inbound Packet Scheduling/Thread Optimized Packet
Scheduling in HP-UX etc etc). RSS works by having the NIC compute a
hash over selected headers of the arriving packet - perhaps the source
and destination MAC addresses, perhaps the source and destination IP
addresses, and perhaps the source and destination TCP ports. But now
the arriving traffic is all wrapped up in this encapsulation protocol
that the NIC might not know about. Over what should the NIC compute the
hash with which to pick the queue that then picks the CPU to interrupt?
It may just punt and send all the traffic up one queue.
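(A quick way to see which situation you are in - a sketch using generic tools;
note that the Realtek NICs mentioned earlier in this thread most likely expose
a single queue anyway:

  # how many RX/TX queues the NIC exposes (may report "Operation not supported"
  # on single-queue NICs/drivers)
  ethtool -l eth1

  # how the NIC's interrupts are spread across CPUs
  grep eth1 /proc/interrupts

If everything lands on one queue and one CPU, that core becomes the ceiling for
tunnelled throughput.)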

There are similar sorts of hashes being computed at either end of a
bond/aggregate/trunk. And the switches or bonding drivers making those
calculations may not know about the encapsulation protocol, so they may
not be able to spread traffic across multiple links. The information
they used to use is now hidden from them by the encapsulation protocol.

That then is what I was getting at when talking about NICs peering into GRE.

rick jones
All I want for Christmas is a 32 bit VLAN ID and NICs and switches which
understand it... :)

_______________________________________________
Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Post to : openstack@lists.openstack.org
Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
WOW!! Thank you for your time Rick! Awesome answer!! =D

I'll do these tests (with ethtool GRO / CKO) tonight, but do you think that
this is the main root of the problem?!


I mean, I'm seeing two distinct problems here:

1- Slow connectivity to the External network plus SSH lags all over the
cloud (everything that passes through L3 / Namespace is problematic), and;

2- Communication between two Instances on different hypervisors (i.e. maybe
it is related to this GRO / CKO thing).


So, two different problems, right?!

Thanks!
Thiago


Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]

One or two problems I cannot say. Certainly if one got the benefit of
stateless offloads in one direction and not the other, one could see
different performance limits in each direction.
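
One way to check each direction independently, for what it is worth (the
router UUID and addresses below are placeholders):

iperf -s                                                           # on the instance
ip netns exec qrouter-<router-uuid> iperf -c <instance-ip> -t 30   # namespace -> instance
# then swap the -s / -c roles to measure the opposite direction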

All I can really say is I liked it better when we were called Quantum,
because then I could refer to it as "Spooky networking at a distance."
Sadly, describing Neutron as "Networking with no inherent charge"
doesn't work as well :)

rick jones


Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
LOL... One day, Internet via "Quantum Entanglement"! Oops, Neutron! =P

I'll ignore the problems related to the "performance between two instances
on different hypervisors" for now. My priority is the connectivity issue
with the External networks... At least the internal traffic, while slow, works.

I'm about to remove the L3 Agent / Namespaces entirely from my topology...
It is a shame because it is pretty cool! With Grizzly I had no problems at
all. Plus, I need to put Havana into production ASAP! :-/

Why am I giving up on L3 / NS for now? Because I tried:

The option "tenant_network_type" with gre, vxlan and vlan (range
physnet1:206:256 configured at the 3Com switch as tagged).

From the instances, the connection to the External network *is always slow*,
no matter whether I choose GRE, VXLAN or VLAN for the Tenants.

For example, right now, I'm using VLAN, same problem.

Don't you guys think this could be a problem with the bridge "br-ex" and
its internals? I have swapped the "Tenant Network Type" 3 times, same
result... But I still have not removed br-ex from the scene.
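
A few ways to peek at br-ex itself, if anyone wants to compare notes (a
sketch; bridge and port names may differ on your setup):

ovs-vsctl list-ports br-ex     # ports attached to the external bridge
ovs-ofctl dump-flows br-ex     # OpenFlow rules programmed on it
ovs-dpctl show                 # kernel datapath view (ports, hits/misses)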

If someone wants to debug it, I can give the root password, no problem, it
is just a lab... =)

Thanks!
Thiago

Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
I was able to enable "ovs_use_veth" and start Instances (VXLAN / DHCP /
Metadata okay)... But the same problem remains when accessing the External network.

BTW, I have valid "Floating IPs" and easy access to the Internet from the
Network Node; if someone wants to debug, just ping me a message.
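
For the record, the change amounts to this (a sketch, assuming the stock
Ubuntu config paths):

# in /etc/neutron/l3_agent.ini and /etc/neutron/dhcp_agent.ini
ovs_use_veth = True
# then restart neutron-l3-agent and neutron-dhcp-agent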


Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
Hi Thiago,

so just to confirm - on the same netnode machine, with the same OS, kernel and OVS versions - Grizzly is ok and Havana is not?

Also, on the network node, are there any errors in the neutron logs, the syslog, or /var/log/openvswitch/* ?
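
Something along these lines, assuming the usual Ubuntu log locations:

grep -iE 'error|trace' /var/log/neutron/*.log
tail -n 100 /var/log/openvswitch/ovs-vswitchd.log
ovs-vsctl show    # quick sanity check of bridges and ports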


Re, Darragh.




Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
Hi Darragh,

Yes, on the same net-node machine, Grizzly works, Havana doesn't... But, for
Grizzly, I have Ubuntu 12.04 with Linux 3.2 and OVS 1.4.0-1ubuntu1.6.

If I replace the Havana net-node hardware entirely, the problem persists
(i.e. it "follows" the Havana net-node), so I think it cannot be related to
the hardware.

I tried Havana with both OVS 1.10.2 (from Cloud Archive) and with OVS
1.11.0 (compiled and installed by myself using dpkg-buildpackage / dpkg).
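
For completeness, the versions in play can be read off like this (a sketch):

uname -r                                 # running kernel
ovs-vsctl --version                      # OVS userspace
modinfo openvswitch | grep -i ^version   # OVS kernel datapath module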

My logs (including Open vSwitch) right after starting an Instance (nothing
in the OVS logs):

http://paste.openstack.org/show/49870/

I tried everything, including installing the Network Node on top of a KVM
virtual machine or directly on a dedicated server, same result: the problem
follows the Havana node (virtual or physical). The Grizzly Network Node works
both on a KVM VM and on a dedicated server.

Regards,
Thiago


Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
Stackers,

I have a small report from my latest tests.


Tests:

* Namespace (br-ex) *<->* Internet - OK

* Namespace (vxlan,gre,vlan) *<->* Tenant - OK

* Tenant *<->* Namespace *<->* Internet - *NOT-OK* (Very slow / Unstable /
Intermittent)


Since the connectivity from the Tenant to its Namespace is fine AND from its
Namespace to the Internet is also fine, it came to my mind: Hey, why
not run Squid WITHIN the Tenant Namespace as a workaround?!

And... Voilà! There, I "Fixed" It! =P


New Test:

Tenant *<->* *Namespace with Squid* *<->* Internet - OK!
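
Roughly what the workaround looks like (a sketch; the router UUID is a
placeholder and the Squid paths/port are the Ubuntu defaults I assumed):

ip netns exec qrouter-<router-uuid> squid3 -f /etc/squid3/squid.conf
# then point the instances at the router's internal IP on port 3128, e.g. for apt:
# echo 'Acquire::http::Proxy "http://<router-internal-ip>:3128";' > /etc/apt/apt.conf.d/01proxy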


*NOTE:* I'm sure that the entire ethernet path (without L3, Namespace, OVS,
VXLANs, GREs, or Linux bridges, just plain Linux + IPs), *from the
hypervisor to the Internet*, *passing through the same Network Node hardware
/ path*, is working smoothly. I mean, I tested the entire path BEFORE
installing OpenStack Havana... So, it cannot be an "infrastructure /
hardware" issue; it must be something else, located at the software layer
running within the Network Node itself.

I'm about to send more info about this problem.

Thanks!
Thiago

Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
Thiago,

some more answers below.

Btw: I saw the problem with a "qemu-nbd -c" process using all the CPU on the compute node. It happened just once - must be a bug in it. You can disable libvirt injection if you don't want it by setting "libvirt_inject_partition = -2" in nova.conf.
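
That is, something like this on the compute node (path assumed to be the Ubuntu default):

# /etc/nova/nova.conf
[DEFAULT]
libvirt_inject_partition = -2
# then restart nova-compute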


On Saturday, 26 October 2013, 16:58, Martinx - ジェームズ <thiagocmartinsc@gmail.com> wrote:

> Hi Darragh,
>
> Yes, on the same net-node machine, Grizzly works, Havana don't... But, for
> Grizzly, I have Ubuntu 12.04 with Linux 3.2 and OVS 1.4.0-1ubuntu1.6.


so we don't know if the problem is due to Neutron, the Ubuntu kernel or OVS. I suspect the kernel, as it implements the routing/natting, interfaces and namespaces. I don't think Neutron Havana changes how these things are set up too much.

Can you try running Havana on a network node with the Linux 3.2 kernel?
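
On 12.04 that would mean booting the stock Precise kernel rather than an HWE
one; a sketch, with package names assumed:

apt-get install linux-image-generic   # the 3.2 series on Ubuntu 12.04
# reboot into it and confirm with: uname -r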



Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
Hello Stackers!

Sorry for not getting back to this topic last week, too many things to do...

So, instead of trying this and that, reply this, reply again... I made a
video about this problem, I hope that helps more than those e-mails that
I'm writing! =P

Honestly, I don't know the source of this problem, if it is with OpenStack
/ Neutron, or with "Linux / Namespace / OVS"... It would be great to test
it alone, Ubuntu Linux + Namespace + OVS (without Neutron), to see if the
problem persists but, I have no idea how to set everything up just
like Neutron does. Maybe I just need to reproduce the "Namespace and OVS
bridges / ports / VXLAN - as is", without Neutron?! I can try that...
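
A bare-bones version of that experiment might look like this (all names and
addresses below are made up):

ip netns add testns
ovs-vsctl add-br br-test
ip link add veth0 type veth peer name veth1
ip link set veth1 netns testns
ovs-vsctl add-port br-test veth0
ip link set veth0 up
ip netns exec testns ip link set veth1 up
ip netns exec testns ip addr add 10.99.0.2/24 dev veth1
# add a VXLAN port / second host on br-test and run iperf through the namespace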

Also, my Grizzly setup is gone, I deleted it... Sorry about that... I know
it worked because this is the first time I'm seeing this problem... I had used
Grizzly for ~5 months with only 1 problem (related to MTU 1400) but, this
problem with Havana is totally different...


Video:

OpenStack Havana L3 Router problem - Ubuntu 12.04.3 LTS:
http://www.youtube.com/watch?v=jVjiphMuuzM


* After 5 minutes, I inserted a new video, showing how I "fixed" it by
running Squid within the Tenant router. You guys can see that, using the
default Tenant router (10:30), it will take about 1 hour to finish the
"apt-get" download and, with Squid (09:27), it goes down to about 3 minutes
(no, it is still not cached; I clear it for each test).


Sorry about the size of the video, it is about 12 minutes and high-res (to
see the screen details) but, it is a serious problem and I think it is worth
watching...

NOTE: Sorry about my English! It is very hard to "speak" a non-native
language, handling an Android phone and typing the keyboard... :-)

Best!
Thiago



Re: Directional network performance issues with Neutron + OpenvSwitch [ In reply to ]
Guys,

This problem is kind of a "deal breaker"... I was counting on OpenStack
Havana (and with Ubuntu) for my first public cloud that I'm (was) about to
announce / launch but, this problem changed everything.

I cannot put Havana with Ubuntu LTS into production because of this
network issue. This is a very serious problem for me... All sites, and even
SSH connections, that pass through the "Floating IPs" into the tenants'
subnets are very slow, and all the connections freeze for seconds, every
minute.

Again, I'm seeing that there is no way to put Havana into production (using
Per-Tenant Routers with Private Networks), *because the Network Node is
broken*. At least with Ubuntu... I'll try it with Debian 7 or CentOS
(I don't like it), just to see if the problem persists, but I have preferred
the Ubuntu distro since Warty Warthog... :-/

So, what is being done to fix it? I already tried everything I could,
without any kind of success...

Also, I followed this doc (to triple-check my env, over and over):
http://docs.openstack.org/havana/install-guide/install/apt/content/section_networking-routers-with-private-networks.html
but it does not work as expected.

BTW, I can give you guys full access to my environment, no problem...
I can build a lab from scratch following your instructions, and I can also
give root access to OpenStack experts... Just let me know... =)

Thanks!
Thiago

