Mailing List Archive: Best kernel options for openvswitch on network nodes on a large setup

Best kernel options for openvswitch on network nodes on a large setup

jp.methot at planethoster

Sep 25, 2018, 4:26 PM

Post #1 of 9 (964 views)

Hi,

Are there any recommended kernel settings for openvswitch? We’ve just been hit by what we believe may be an attack of a kind we have never seen before, and we’re wondering if there’s a way to optimize our network nodes’ kernel for openvswitch operation and thus minimize the impact of such an attack, or whatever it was.

Best regards,

Jean-Philippe Méthot
Openstack system administrator
Administrateur système Openstack
PlanetHoster inc.

Re: Best kernel options for openvswitch on network nodes on a large setup

emccormick at cirrusseven

Sep 25, 2018, 4:37 PM

Post #2 of 9 (964 views)

Are you getting any particular log messages that lead you to conclude your
issue lies with OVS? I've hit lots of kernel limits under those conditions
before OVS itself ever noticed. Anything of interest in dmesg, the journal
or the neutron logs?


Re: Best kernel options for openvswitch on network nodes on a large setup

jp.methot at planethoster

Sep 25, 2018, 4:49 PM

Post #3 of 9 (964 views)

This particular message makes it sound as if openvswitch is getting overloaded.

Sep 23 03:54:08 network1 ovsdb-server: ovs|01253|reconnect|ERR|tcp:127.0.0.1:50814: no response to inactivity probe after 5.01 seconds, disconnecting

A lot of those keep appearing, though openvswitch always reconnects almost instantly. I’ve done some research on that particular message, but it didn’t give me anything I can use to fix it.
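
To see how often these disconnects occur, a minimal Python sketch along the following lines could tally them per minute. The regular expression is modelled only on the single log line quoted above, so treat the format as an assumption; `PROBE_RE` and `count_probe_timeouts` are illustrative names, not part of Open vSwitch.

```python
import re
from collections import Counter

# Assumption: every relevant line looks like the one quoted in this post.
# Other OVS or syslog versions may format the message differently.
PROBE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+ovsdb-server:.*"
    r"\|reconnect\|ERR\|(?P<peer>\S+): no response to inactivity probe "
    r"after (?P<secs>[\d.]+) seconds, disconnecting"
)


def count_probe_timeouts(lines):
    """Return a Counter mapping 'Mon DD HH:MM' to the number of probe timeouts."""
    per_minute = Counter()
    for line in lines:
        match = PROBE_RE.search(line)
        if match:
            per_minute[match.group("stamp")] += 1
    return per_minute
```

Feeding it the lines of the syslog file (the path depends on the distribution) gives a per-minute count that can be lined up against monitoring alerts.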

Jean-Philippe Méthot
Openstack system administrator
Administrateur système Openstack
PlanetHoster inc.


Re: Best kernel options for openvswitch on network nodes on a large setup

simon.leinen at switch

Sep 26, 2018, 8:48 AM

Post #4 of 9 (961 views)

Jean-Philippe Méthot writes:
> This particular message makes it sound as if openvswitch is getting overloaded.
> Sep 23 03:54:08 network1 ovsdb-server: ovs|01253|reconnect|ERR|tcp:127.0.0.1:50814: no response to inactivity probe after 5.01 seconds, disconnecting

We get these as well :-(

> A lot of those keep appear, and openvswitch always reconnects almost
> instantly though. I’ve done some research about that particular
> message, but it didn’t give me anything I can use to fix it.

Would be interested in solutions as well. But I'm sceptical whether
kernel settings can help here, because the timeout/slowness seems to be
located in the user-space/control-plane parts of Open vSwitch,
i.e. OVSDB.
--
Simon.

_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

Re: Best kernel options for openvswitch on network nodes on a large setup

jp.methot at planethoster

Sep 26, 2018, 12:16 PM

Post #5 of 9 (961 views)

Yes, I notice that every time that message appears, at least a few packets get dropped and some of our instances pop up in nagios, even though they are reachable 1 or 2 seconds after. It’s really causing us some issues as we can’t ensure proper network quality for our customers. Have you noticed the same?

At this point I think it may be best to contact the openvswitch developers directly, since it seems to be an issue with their component. I am about to do that and hope I don’t get sent back to the openstack mailing list. I would really like to know what this probe is and why it disconnects constantly under load.

Jean-Philippe Méthot
Openstack system administrator
Administrateur système Openstack
PlanetHoster inc.


Re: Best kernel options for openvswitch on network nodes on a large setup

jp.methot at planethoster

Sep 27, 2018, 2:05 PM

Post #6 of 9 (961 views)

I got some answers from the openvswitch mailing list, essentially indicating the issue is in the connection between neutron-openvswitch-agent and ovs.

Here’s the output of ovs-vsctl list controller:

_uuid               : ff2dca74-9628-43c8-b89c-8d2f1242dd3f
connection_mode     : out-of-band
controller_burst_limit: []
controller_rate_limit: []
enable_async_messages: []
external_ids        : {}
inactivity_probe    : []
is_connected        : false
local_gateway       : []
local_ip            : []
local_netmask       : []
max_backoff         : []
other_config        : {}
role                : other
status              : {last_error="Connection timed out", sec_since_connect="22", sec_since_disconnect="1", state=BACKOFF}
target              : "tcp:127.0.0.1:6633"

So OVS is still working, but the connection between neutron-openvswitch-agent and OVS gets interrupted somehow. It may also be linked to the HA VRRP switching hosts at random as the connection between both network nodes gets severed. We also see SSH lagging momentarily. I’m starting to think that some kind of limit in Linux is being reached, preventing connections from being established. However, I don’t think it’s the max open files limit, since the number of open files is nowhere close to the value I’ve set.
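
A check like the one above can be automated. The following is a minimal Python sketch, assuming the `key : value` layout shown in the output above and assuming that multiple records are separated by blank lines; `parse_records`, `disconnected_controllers` and `list_controllers` are hypothetical helper names, not part of any OVS tooling.

```python
import subprocess


def parse_records(text):
    """Parse `ovs-vsctl list <table>` style output into a list of dicts.

    Values are kept as raw strings; the split is on the first colon only,
    so targets such as "tcp:127.0.0.1:6633" stay intact.
    """
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # Assumption: a blank line ends the current record.
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def disconnected_controllers(text):
    """Return the records whose is_connected column is not 'true'."""
    return [r for r in parse_records(text) if r.get("is_connected") != "true"]


def list_controllers():
    """Run the command quoted in this post and return its output."""
    result = subprocess.run(
        ["ovs-vsctl", "list", "controller"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

On a network node, `disconnected_controllers(list_controllers())` would return the rows to look at, with the `status` column showing the last error and backoff state.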

Ideas?

Jean-Philippe Méthot
Openstack system administrator
Administrateur système Openstack
PlanetHoster inc.


Re: Best kernel options for openvswitch on network nodes on a large setup

skaplons at redhat

Sep 28, 2018, 12:03 AM

Post #7 of 9 (961 views)

Hi,

What version of Neutron and ovsdbapp are you using? IIRC there was such an issue somewhere around the Pike release; we saw it quite often in functional tests. But I think this problem was somehow solved later with a newer ovsdbapp version.
Maybe try a newer version of ovsdbapp and check whether it gets better.


—
Slawek Kaplonski
Senior software engineer
Red Hat


Re: Best kernel options for openvswitch on network nodes on a large setup

jp.methot at planethoster

Sep 28, 2018, 7:53 AM

Post #8 of 9 (958 views)

Thank you, I will try it next week (since today is Friday) and update this thread on whether it fixes my issues. We are indeed using the latest RDO Pike, so ovsdbapp 0.4.3.1.

Jean-Philippe Méthot
Openstack system administrator
Administrateur système Openstack
PlanetHoster inc.


Re: Best kernel options for openvswitch on network nodes on a large setup

jp.methot at planethoster

Oct 1, 2018, 2:51 PM

Post #9 of 9 (950 views)

So, after some testing, we finally fixed our issue of lost connections to instances. The actual issue was that the ARP table on the network node was constantly hitting its limit and thus discarding legitimate entries. This caused our connections to flap and the HA routers to switch nodes without warning. Increasing the net.ipv4.neigh.default.gc_thresh1, net.ipv4.neigh.default.gc_thresh2 and net.ipv4.neigh.default.gc_thresh3 kernel values ended up fixing the issue.
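
As a rough way to check whether a node is in the same situation, the sketch below compares the number of ARP entries with the three thresholds named above. It is an illustration under assumptions, not the procedure used in this thread: the function names and labels are made up, and `/proc/net/arp` only shows IPv4 entries visible in the current network namespace, so it may understate the total on a node hosting many router namespaces.

```python
from pathlib import Path

# Standard procfs location of the sysctls named in the post.
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")


def read_thresholds(base=NEIGH_DIR):
    """Return (gc_thresh1, gc_thresh2, gc_thresh3) as integers."""
    return tuple(int((base / f"gc_thresh{i}").read_text()) for i in (1, 2, 3))


def count_arp_entries(arp_text):
    """Count entries in /proc/net/arp style text; the first line is a header."""
    return sum(1 for line in arp_text.splitlines()[1:] if line.strip())


def classify(entries, thresholds):
    """Label the table pressure; the labels are illustrative, not kernel terms."""
    thresh1, thresh2, thresh3 = thresholds
    if entries >= thresh3:
        return "hard"   # at or above the hard maximum
    if entries >= thresh2:
        return "soft"   # above the soft maximum
    if entries >= thresh1:
        return "gc"     # garbage collection may run
    return "ok"         # below gc_thresh1


def report():
    """Print the current entry count, thresholds and pressure label."""
    entries = count_arp_entries(Path("/proc/net/arp").read_text())
    thresholds = read_thresholds()
    print(entries, thresholds, classify(entries, thresholds))
```

Calling `report()` on the network node during an incident would show whether the count is close to gc_thresh3. With the usual Linux defaults of 128, 512 and 1024, a node with many ports and routers can get there fairly easily.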

Jean-Philippe Méthot
Openstack system administrator
Administrateur système Openstack
PlanetHoster inc.
