Mailing List Archive

Re: [PATCH][RFC]: followup ...
Hi Julian,

Basically I have to agree with your statements; the problem
is that we simply see the network setup of an LVS cluster
differently. I tend to use a solid firewall architecture that
assures me that the load balancer will not get hit too badly
in case of a [D]DoS attack. You assume the worst case, where
the load balancer sits directly on the Internet and fights
alone against all the maliciously forged packets. For me the
LVS is part of the firewall design; I'd personally never build
a network with only a load balancer and without some filter
and/or proxy. Your mileage may vary.

> > Don't laugh, I've been doing this lately. On a heavy loaded server
> > you get a shift of +-150 connections. :)
>
> Yep, may be we need an universal netlink transport to get data
> from LVS but this must be carefully analyzed.

I'm currently talking with laforge (Harald Welte) from the
iptables team about the universal netlink interface. He's
trying to synchronize the state tables via the netlink
interface as well. An intelligent netlink framework, maybe
even as some kind of API, could help people a lot.
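
As a minimal illustration of such a transport, a user space sketch
that subscribes to a netlink multicast group; note that NETLINK_LVS
is an assumed protocol number here, not an existing kernel constant:

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Hypothetical protocol number for an LVS state channel; an assumption
 * of this sketch only, not an existing kernel definition. */
#define NETLINK_LVS 23

/* Open a netlink socket and join multicast group 1, on which a kernel
 * module could broadcast connection/state updates. */
int open_lvs_netlink(void)
{
    struct sockaddr_nl addr;
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_LVS);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_groups = 1;

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}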

> The problem is that LVS has another view for the real server
> load. The director sees one number of connections the real server

Yes, if (amount of forged packets << correctly crafted requests).

> sees another one. And under attack we observe big gap between the
> active/inactive counters and the used threshold values. In this case
> we just exclude all real servers. This is the reason I prefer the
> more informed approach of using agents.

Then set the threshold to a value just high enough that the
real server will not die.

> No, we have two choices:
>
> - use SYN cookies and much memory for open requests, accept more
> valid requests
>
> - don't use SYN cookies, drop the requests exceeding the backlog length,
> drop many valid requests but the real servers are not overloaded

So what you're saying is that in the beginning you set the value
for the backlog queue high, and if you experience a high request
rate that never completes the 3-way handshake you reduce the
backlog number? IMO you should enable both SYN cookies and a
well-chosen backlog queue size, because if you disable SYN cookies
and exceed the number of allowed connections in the backlog queue,
new TCP SYN requests will be dropped no matter what. If you enable
SYN cookies, however, and the number of SYN_RECV states exceeds
the backlog queue size, new incoming requests trying to finish the
3-way handshake will still get the chance to do so.
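
Just to make the combination concrete, a tiny sketch that enables
both; the /proc paths are the standard Linux sysctl files, and the
backlog value is only an example, not a recommendation:

#include <stdio.h>

/* Write a single value into a /proc sysctl file. */
static int write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    fprintf(f, "%s\n", value);
    return fclose(f);
}

int main(void)
{
    /* enable SYN cookies and pick a generous SYN backlog */
    write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1");
    write_sysctl("/proc/sys/net/ipv4/tcp_max_syn_backlog", "2048");
    return 0;
}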

> In both cases the listeners don't see requests until the handshake is
> completed (Linux).

Yep, correct.

> Yes, nobody claims the defense strategies guard the real
> servers. This is not their goal. They keep the director with more
> free memory and nothing more :) Only drop_packet can control the
> request rate but only for the new requests.

Hmm, so why can't I enable drop_packet and set thresholds?
Excuse my ignorance, but drop_packet would randomly drop some
connections and the threshold would guard against the server
crashing. I just provided an additional feature that doesn't
affect the functionality or flow control of the rest. Do you
want the RS to be exhausted (with drop_packet enabled) or do
you want the possibility to act before this happens? If you
let the RS get flooded by forged packets it's IMHO the same
as if you set their value to zero. Existing connections in the
transition table will still be able to communicate with the
server; I just want to prevent the LB from sending another
request to the RS. Whether this has to be done in kernel space
is a question in itself.
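
A minimal sketch of the check I have in mind; dest->activeconns and
dest->inactconns exist in LVS, while the u_threshold/l_threshold
fields and the helper itself are assumptions for illustration only:

/* Sketch only: assumes the LVS ip_vs_dest definition from its headers.
 * activeconns/inactconns are the existing atomic counters; the
 * threshold fields are hypothetical. */
static inline int ip_vs_dest_above_limit(struct ip_vs_dest *dest)
{
    int conns = atomic_read(&dest->activeconns) +
                atomic_read(&dest->inactconns);

    /* u_threshold == 0 means "no limit configured".  A full hysteresis
     * between u_threshold and l_threshold would also need an
     * "overloaded" flag on the dest; omitted in this sketch. */
    return dest->u_threshold && conns >= dest->u_threshold;
}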

> Yes but the clients can't work, you exclude all servers
> in this case because the LVS spreads the requests to all servers
> and the rain becomes deluge :)

No, in quiesced mode, all existing connections will still be
handled. If (dest->inactconns + dest->activeconns) for
this RS drops below the lower threshold it will get new requests.
I don't see the deluge. In case the LVS kernel sets all the
RS weights to 0 it will act as a sink by dropping all connections
not in the template state table. Of course no new correct
connection will be processed, but in your case where you randomly
drop packets you can have the same amount of loss of 'good'
packets. Am I wrong? :)

> Yes, we have to find a way to configure all these features
> that will be implemented further. The main problem here is how
> we can provide one binary ipvsadm tool to all Linux distributions.

Huh? Adjusting the ipvsadm tool to the distros is the
maintainers' own problem. We just provide the functionality
to be able to handle all the tasks we support. I have built my
own distribution too, and I also have to maintain my diffs to
the LVS code and the ipvsadm app. I'd disagree with ipvsadm
becoming distro-specific.

> We know how many problems the users have when they use the ipvsadm
> supplied in their distrib.

So the distributions can handle it. It can't be our task to
adjust the binary tool to every distro; it's our task to keep
it clean and independent of any distro.

> > Hmm, I just want to limit the amount of concurrent connections
> > per realserver and in the future maybe per service. This saved
> > me quite some lines of code in my userspace healthchecking
> > daemon.
>
> Yes, you vote for moving some features from user to the
> kernel space. We must find the right balance: what can be done in
> LVS and what must be implemented in the user space tools.

I absolutely agree, even if you consider the fact that in
the future (although I don't think so) this very clean patch
could be part of the mainstream kernel. I also disagree with
putting every cool feature one thinks he needs into the kernel
just because it's faster and saves him 2500+ lines of code. My
ugly patch doesn't, if implemented in the correct way, affect
the normal kernel control path in case you don't use the
feature. Anyway, we will find a cool solution to the problem,
because admittedly both solutions are not the best of all worlds.
I would also like to hear from other people what experiences
they've had with DDoS and the way the LVS was working under
an attack. So far I've not seen more than an academic proof
(doing some stress tests not reflecting real-world examples)
of the designed defense strategies. I think Anoush was working
on something too but I haven't heard from him in ages ;)

> May be we can limit the SYN rate. Of course, that not covers
> all cases, so my thought was to limit the packet rate for all states
> or per connection, not sure, this is an open topic. It is easy to open

Whew! That's strong stuff. You can screw up quite a lot if you
start making modifications in all states. But we should discuss
such an approach because it sounds challenging.

> a connection through the director (especially in LVS-DR) and then
> to flood with packets this connection. This is one of the cases where
> LVS can really guard the real servers from packet floods. If we

True.

> combine this with the other kind of attacks, the distributed ones,
> we have better control. Of course, some QoS implementations can
> cover such problems, not sure. And this can be a simple implementation,
> of course, nobody wants to invent the wheel :)

Yes, I doubt however that existing QoS schedulers would already
provide such functionality.

> Yes, I know that this is a working solution. But see, you
> exclude all real servers :) You are giving up. My idea is we to find

No. :) :) I allow all existing connections for a template entered
into the transition table to finish their requests. It's just the
quiesced mode. I was very happy when Wensong introduced it ;) The
RS just won't get new requests until at least one of them falls
below the lower threshold.

> a state when we can drop some of the requests and to keep the
> real servers busy but responsive. This can be a difficult task but
> not when we have the help from our agents. We expect that many
> valid requests can be dropped but if we keep the real server in
> good health we can handle some valid requests because nobody knows
> when the flood will stop. The link is busy but it contains valid
> requests. And the service does not see the invalid ones.

This is the biggest problem with LVS in DR mode: the control of
the states and the packets. We just don't yet have a reliable
way of weighting an incoming connection, and IMHO that may even
be impossible.

> > TUX 2.0 and try to overflow it. If you can, apply the zero-copy
> > patches of DaveM. No way you will find such a fast (88MBit/s
> > requests!!) Link to saturate the server.
>
> Nobody overflows the service :) You need so many clients
> for this. The easiest way the attackers use is to flood the link.
> And they prefer to reach the service because this makes more
> troubles. More hops reached, more links busy, more troubles.

I can't follow you here, sorry ;)

> The requests are not meaningful, we care how much load they
> introduce and we report this load to the director. It can look, for
> example, as one value (weight) for the real host that can be set
> for all real services running on this host. We don't need to generate
> 10 weights for the 10 real services running in our real host. And

I don't know, this could be desirable unless we have an
intelligent enough scheduler. In lots of projects I've seen
or implemented, the application or database behind such an LVS
cluster was crap or the tier architecture was extremely clumsy,
so that already after a day I had a huge load imbalance even
with wlc and non-persistence.

> we change the weight on each 2 seconds for example. We need two
> syscalls (lseek and read) to get most of the values from /proc fs.
> But may be from 2-3 files. This is in Linux, of course. Not sure
> how this behaves under attack. We will see it :)

Are you going for it?

> The only problem we have with this scheme is the ipvsadm
> binary. It must be changed (the user structure in the kernel :))

This is not your only problem :)

> The last change is dated from 0.9.10 and this is a big period :)
> But you know what means a change in the user structures :)

Indeed.

> Yes, the picture is complex and there are so many details
> we can consider. IMO, there is no simple solution :) But if we
> combine all useful ideas in a user space software, I think, we can
> have an useful tool.

Definitely true; you already started with a very promising
user space tool which is extremely easy to extend.

> > A packetfilter, the router (most of use do have a CISCO, don't the?)
>
> Yes, the question is how Cisco will know what packet rate
> overloads the real servers :)

:) The router in my example is just configured to drop non-net-related
packets, and these are already enough (judging by the huge logfile that
comes in every day).

> No, no :) I'm never happy, always look for better ideas (a
> joke :)) May be I'm thinking for too complex things. And the time is
> always not enough :)

Well, I'm happy to hear this, so I know we're both pulling on the
same rope. I'm also not happy until a proper solution I
can live with is implemented. That's the way the IT business should
work (it doesn't, however).

Thank you again for the interesting comments and thoughts,
Roberto Nibali, ratz

--
mailto: `echo NrOatSz@tPacA.cMh | sed 's/[NOSPAM]//g'`
Re: [PATCH][RFC]: followup ...
Hello Ratz,

On Sun, 18 Feb 2001, Roberto Nibali wrote:

> Hi Julian,
>
> Basically I have to agree with your statements, the problem
> is, that we just see the network setup of a LVS-cluster
> differently. I tend to use a good firewall architecture that
> assures me that the loadbalancer will not get hit too badly
> in case of a [D]DoS attack. You assume the worst case, where
> the loadbalancer is standing right in the Internet and is
> fighting alone against all the malicious forged packets. For
> me the LVS is a part of the firewall design, I'd personally
> never build a net only with a load balancer and without some
> filter and/or proxy. Your mileage may vary.

I agree, some firewalling can be done before the balancer,
but when the normal-looking traffic comes in, only the balancer knows
about open/closed ports, related ICMP, etc. The main things
you can do before the balancer are to avoid source address spoofing,
some bad packets, maybe some ICMP types? But the balancer can be
attacked even with normal traffic. The request rate can be limited
before the balancer, but it is good for this to be related to such
things as the virtual service, the real server load, etc.

> > Yep, may be we need an universal netlink transport to get data
> > from LVS but this must be carefully analyzed.
>
> I'm currently talking with laforge (Harald Welte) from the
> iptables team about the universal netlink interface. He's
> trying to synchronize the state tables too via the netlink
> interface. A intelligent netlink framework, maybe even as
> some kind of API could help people a lot.

Yes, we need a NETLINK_LVS kernel socket or similar. I don't
think it will be easy for netfilter, but for LVS it can be easier. If
we use full state replication (yes, Netfilter has "real stateful
connection tracking") we can flood the internal links. There
are ideas for the state replication to be implemented only for
long-living connections. And yes, we can use this universal transport
for many things, not only for connection state replication.

> > sees another one. And under attack we observe big gap between the
> > active/inactive counters and the used threshold values. In this case
> > we just exclude all real servers. This is the reason I prefer the
> > more informed approach of using agents.
>
> Then set the threshold to a high enough value just as much
> that the realserver will not die.

We will need an admin to sit there and change the range :)
Of course, the range you propose can be tuned once and left until the
parameters have to be changed under attack. OK, the user space tool can
change the values under attack :)

> > No, we have two choices:
> >
> > - use SYN cookies and much memory for open requests, accept more
> > valid requests
> >
> > - don't use SYN cookies, drop the requests exceeding the backlog length,
> > drop many valid requests but the real servers are not overloaded
>
> So, what you're saying is, that in the beginning you set the value
> for the backlog queue high and if you experience high request rate
> that never completes 3-Way handshaking you reduce the backlog number?
> IMO you should enable both, SYN cookies and a well chosen backlog
> queue number because if you disable SYN cookies and exceed the amount
> of allowed connections in the backlog queue, new TCP_SYN requests
> will be dropped no matter what. If you however enabled SYN cookies
> and the amount of SYN_RECV states exceeds the backlog queue number
> new incoming requests trying to finish the 3-Way handshake will still
> get the possibility to do so.

Yes, the user must select a backlog size value according to the
connection rate; we don't want dropped requests even while not under
attack. Of course, the SYN cookies help, for the OSes that support
them. Not very much if our link is full of invalid requests, because
we can flood our output pipe too. But I don't know how often DDoS
SYN attacks happen these days.

> > Yes, nobody claims the defense strategies guard the real
> > servers. This is not their goal. They keep the director with more
> > free memory and nothing more :) Only drop_packet can control the
> > request rate but only for the new requests.
>
> Hmm, so why can't I enable the drop_packet and set thresholds?
> Excuse my ignorance but drop_packet would randomly drop some
> connections and the threshold would guard that the server
> doesn't crash. I just provided an additional feature that doesn't
> affect the functionality or flow control of the rest. Do you

Agreed. drop_packet and RS limits are different things.
The question is how effective the RS limits will be, but if they
are an option the users can select, I don't see a problem. That can
be an option just like the way people use wlc, for example - no
guarantee for the real server load :) But while under attack
wlc is not affected (except if the flood is over one connection),
whereas the RS limits are. And this is the problem I see.

> want the RS to be exhausted (with enabled drop_packet) or do
> you want to have the possibilty to act before this happens. If
> you let the RS get flooded by forged packets it's IMHO the same
> as if you set their value to zero. Existing connection in the
> transition table will still be able communicate with the server
> I just want to prevent the LB to sent another request to the
> RS. If this has to be done in kernel space is a question per se.

Yes, these RS limits are a simple control we can add.
And of course it will be used by many users. My doubts are related
to the moment when all the real servers disappear and will not
accept new connections. How fast will we increase these
limits or start scheduling connections to these real servers again?
It again appears to be a user space problem :)

> > Yes but the clients can't work, you exclude all servers
> > in this case because the LVS spreads the requests to all servers
> > and the rain becomes deluge :)
>
> No, in quiesced mode, all existing connections will still be
> handled. If the (dest->inactconns + dest->activeconns) for
> this RS drop below the lower threshold he will get new requests.
> I don't see the deluge. In case the LVS kernel sets all the
> RS weights 0 it will act as a sink by dropping all connections
> not in the template state table. Of course no new correct
> connection will be processed but in your case where you drop
> randomly packets you can have the same amount of loss of 'good'
> packets. Am I wrong ? :)

Yes, but drop_packet can be activated when we see a very
big connection rate that will occupy all the memory for connections
in the director. If we don't run other user space software we
can simply ignore the defense strategies and leave the packets
to be dropped after a memory allocation error.
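
A rough sketch of such a rate-based early drop - one of every N new
requests is dropped while the director runs short on memory; the
variable names only mirror the discussion and are not the actual
LVS code:

/* ip_vs_drop_rate: assumed to be set by the defense logic when free
 * memory runs low (0 = defense inactive).  Illustrative only. */
static int ip_vs_drop_rate;
static int ip_vs_drop_counter;

static int ip_vs_should_drop(void)
{
    if (!ip_vs_drop_rate)
        return 0;                       /* defense not active */

    if (++ip_vs_drop_counter >= ip_vs_drop_rate) {
        ip_vs_drop_counter = 0;
        return 1;                       /* drop this new request */
    }
    return 0;
}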

> > Yes, we have to find a way to configure all these features
> > that will be implemented further. The main problem here is how
> > we can provide one binary ipvsadm tool to all Linux distributions.
>
> Huh? To adjust the ipvsadm tool to the distros is the problem
> of the maintainers itself. We just provide the functionality
> to be able to maintain all tasks we support. I have built an
> own distribution too and I too have to maintain my diffs to
> the LVS-code and the ipvsadm app. I'd disagree if ipvsadm would
> become distro-related.

Yes, maybe we can implement a better mechanism that will
allow the different options to be supported without hurting all
users. Who knows, maybe we can create more sockopts? But the
things can become very complex. BTW, netfilter has such shared libs,
for example.

> > We know how many problems the users have when they use the ipvsadm
> > supplied in their distrib.
>
> So the distributions can handle it. It can't be our task to
> adjust the binary tool to every distro it's our task to keep
> it clean and independant of any distro.

This is true, but does it mean they have to put all features in?
Currently, for LVS we have the following methods in hand:

- create new scheduler

That's a total of one method for adding new, separate features (maybe
I'm missing something). Things can become very complex if one new
feature wants to touch some parts of the functions in the fast path
or in the user space structures. What can be the solution? Putting
hooks inside LVS? IMO, we must already think about such needs.

> > > Hmm, I just want to limit the amount of concurrent connections
> > > per realserver and in the future maybe per service. This saved
> > > me quite some lines of code in my userspace healthchecking
> > > daemon.
> >
> > Yes, you vote for moving some features from user to the
> > kernel space. We must find the right balance: what can be done in
> > LVS and what must be implemented in the user space tools.
>
> I absolutely agree and even if you consider the fact that in
> the future (although I don't think so) this very clean patch
> could be part of the mainstream kernel. I also disagree putting
> every cool feature one thinks he needs into the kernel just
> because it's faster and it saves him 2500+ lines of code. My

No doubt, there will be some nice features that can't be
done in user space. And exactly these features are not used by
other users. The example is the cp->fwmark support proposed by
Henrik Nordstrom: we have a feature that is difficult to call
user space material, but it touches two parts: internal functions,
and it adds another hook that can delay the processing for some
users. I'm not sure what will happen if we start to think in
"hooks" just like netfilter. Even if that looks good in user space,
I'm not sure we can say the same for the kernel space. Any
ideas here, maybe for a new topic?

> ugly patch doesn't, if implemented in the correct way, affect
> the normal kernel control path in case you don't use the
> feature. Anyway, we will find a cool solution to the problem
> because admittedly both solutions are not the best of all worlds.
> I also would like to hear from other people what experiences
> they've made with DDoS and the way the LVS was working under
> an attack. So far I've not seen more than an akademic proof
> (doing some stress tests not reflecting real world example)
> to the designed defense strategies. I think Anoush was working
> on something too but I haven't heard of him since ages ;)

Hm, it seems nobody has such problems :)))

> > May be we can limit the SYN rate. Of course, that not covers
> > all cases, so my thought was to limit the packet rate for all states
> > or per connection, not sure, this is an open topic. It is easy to open
>
> Uih! This is strong tobacco. You can screw quite alot if you
> start doing modifications in all states. But we should discuss
> such an approach because it sounds challenging.

No, a counter which is reset on state change. But this is
another issue and I haven't started to think more about such things.
Maybe I won't :)

> > combine this with the other kind of attacks, the distributed ones,
> > we have better control. Of course, some QoS implementations can
> > cover such problems, not sure. And this can be a simple implementation,
> > of course, nobody wants to invent the wheel :)
>
> Yes, I doubt however that existing QoS schedulers would already
> bring such a functionality.

Yes, that defense can be connection-state related; LVS is
a connection scheduler, though, not a packet scheduler.

> > a state when we can drop some of the requests and to keep the
> > real servers busy but responsive. This can be a difficult task but
> > not when we have the help from our agents. We expect that many
> > valid requests can be dropped but if we keep the real server in
> > good health we can handle some valid requests because nobody knows
> > when the flood will stop. The link is busy but it contains valid
> > requests. And the service does not see the invalid ones.
>
> This is the biggest problem with LVS in DR-mode. The control of
> the states and the packets. We just don't have yet a reliable
> way of weighting an incoming connection and this is IMHO also
> impossible.

Yes, a job for the agents: to represent the real server load
as weights.

> > The requests are not meaningful, we care how much load they
> > introduce and we report this load to the director. It can look, for
> > example, as one value (weight) for the real host that can be set
> > for all real services running on this host. We don't need to generate
> > 10 weights for the 10 real services running in our real host. And
>
> I don't know, this could be desirable unless we have an
> intelligent enough scheduler. In lots of projects I've seen
> or implemented the application or database behind such a LVS
> cluster was crap or the Tier-architecture was extremly clumsy
> so that already after a day I had huge load imbalance even
> with wlc and non-persistency.

Yes, wlc is not my preferred scheduler when it comes to
connections dealing with a database :)

I don't think we need an intelligent scheduler if we
are talking about the current set of information used by the LVS
schedulers. Only the users know what kind of connections are
scheduled, and they can instruct a user space tool how to set the
WRR weights according to the load.

> > we change the weight on each 2 seconds for example. We need two
> > syscalls (lseek and read) to get most of the values from /proc fs.
> > But may be from 2-3 files. This is in Linux, of course. Not sure
> > how this behaves under attack. We will see it :)
>
> Are you going for it?

Yes, when my user space libs are ready we will test them
for different setups and services.
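
A toy version of such an agent loop, assuming the load is read from
/proc/loadavg with lseek()+read() every 2 seconds; the weight formula
is made up, and the report-to-director step is left out:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char buf[64];
    int fd = open("/proc/loadavg", O_RDONLY);

    if (fd < 0)
        return 1;

    for (;;) {
        ssize_t n;

        lseek(fd, 0, SEEK_SET);          /* rewind: lseek + read, as above */
        n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) {
            double load1;
            int weight;

            buf[n] = '\0';
            load1 = atof(buf);            /* first field: 1-minute load */
            weight = load1 >= 10.0 ? 1 : (int)(10.0 - load1);
            printf("proposed WRR weight: %d\n", weight);
            /* here the agent would push the weight to the director */
        }
        sleep(2);                         /* "each 2 seconds" */
    }
}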

> > The only problem we have with this scheme is the ipvsadm
> > binary. It must be changed (the user structure in the kernel :))
>
> This is not your only problem :)
>
> > The last change is dated from 0.9.10 and this is a big period :)
> > But you know what means a change in the user structures :)
>
> Indeed.
>
> > Yes, the picture is complex and there are so many details
> > we can consider. IMO, there is no simple solution :) But if we
> > combine all useful ideas in a user space software, I think, we can
> > have an useful tool.
>
> Definitely true, you already started with a very promising
> user space tool which is extremely open to extend.
>
> > > A packetfilter, the router (most of use do have a CISCO, don't the?)
> >
> > Yes, the question is how Cisco will know what packet rate
> > overloads the real servers :)
>
> :) The router is in my example just configured to drop non-net related
> packets and these are already enough (seeing the huge logfile that
> comes every day.

Yes, there are packets with sources from the private networks
too :)

> > No, no :) I'm never happy, always look for better ideas (a
> > joke :)) May be I'm thinking for too complex things. And the time is
> > always not enough :)
>
> Well, I'm happy to hear this so I know we're both pulling on the
> same rope. I'm also not happy as long as the proper solution I
> can live with is implemented. That's the way the IT business should
> work (it doesn't however).

I hope other people will express their ideas about this
topic. Maybe I'm too pedantic in some cases :) And now I'm talking
without "showing the code" :) I hope things will change soon :)

> Thank you again for the interesting comments and thoughts,
> Roberto Nibali, ratz


Regards

--
Julian Anastasov <ja@ssi.bg>
Re: [PATCH][RFC]: followup ...
Hi Julian,

> I agree, some firewalling can be done before the balancer
> but when the normally looking traffic comes only the balancer knows
> for open/closed ports, related ICMP, etc. The main things

Unless you put a proxying firewall ;)

> you can do before the balancer are to avoid source address spoofing,
> some bad packets, may be some ICMP types? But the balancer can be
> attacked even with normal traffic. The request rate can be limited

OK, Julian, let's do a real example. I'll set up an LVS
cluster with a webserver and a normally configured firewall, and you
try to flood it in a way that the service can no longer be delivered
normally :). Any ISP that wants to give me temporary access
to its backbone?

> Yes, we need NETLINK_LVS kernel socket or similar. I don't
> think that for netfilter will be easy but for LVS can be easier. If

The architecture he proposed to me was rather simple; actually he
had the same idea. He's doing it as a module that hooks into
conntrack. There you have roughly the same template structures for
incoming connections, just more of them ;)

> we use full state (yes, Netfilter has "Real statefull connection
> tracking") replication we can flood the the internal links. There

Then we'd have to split the LVS code into two source trees,
because doing connection tracking and replication is too much
to implement in kernel space for 2.2.x.

> are ideas the state replication to be implemented only for long
> living connections. And yes, we can use this universal transport
> for many things, not only for connection state replication.

[OT] I proposed to him that we make a general framework so that we
don't have to reinvent the wheel. I thought it should be possible
to register, via a device, all the template tables you want to
have synced, and the module itself would be responsible for
creating the appropriate NETLINK packets and for starting/resetting
the timers in kernel space. [/OT]

> We will need one admin to stay and to change the range :)
> Of course, the range you propose can be tuned once until the
> parameters are changed under attack. OK, the user space tool can
> change the values under attack :)

Grr, yes, I know you're right, I just don't want to accept the
fact. :)

> Yes, the user must select a backlog size value according to the
> connection rate, we don't want dropped requests even while not under

Oh, this sounds very reasonable. How and where do you think this can
be implemented?

> attack. Of course, the SYN cookies help, for the OSes that support
> them. Not very much if our link is full with invalid requests because
> we can flood our output pipe too. But I don't know how often DDoS
> SYN attacks happen these days.

It's an O(N^3) proportion to the popularity :) I'd love to see
the snort logfiles of nasa.gov or nsa.com or some *.mil site. Over here
we have this stupid "Big Brother" stuff broadcast through some
ACEdirector3 load balancers. Two hours after launch the RS were not
reachable anymore.

> Agreed. drop_packet and RS limits are different things.
> The question is how efficient will be the RS limits but if they
> are option the users can select, I don't see a problem. That can

Good. That's what I did, see my example when announcing it. ;)

> be an option just like the people use wlc for example - no
> guarantee for the real server load :) But while under attack the
> wlc is not affected (except if the flood is over one connection),
> the RS limits are. And this is the problem I see.

Yep, this is the problem. I have to do some more testing and
real-life examples with existing customer projects (I'm happy
to have an LVS cluster that works and not some unconfigurable
ACEdirector-X patchwork) and in my lab (just finished this
weekend).

> Yes, these RS limits are a simple control we can add.
> And of course it will be used from many users. My doubts are related
> to the moment where all real server will disappear and will not
> accept more new connections. How fast we will increase these

I will investigate this. Could you just give me some proposals
on how to make different test setups, please? With enough time
I'll prepare some kernels with different options enabled and will
do some penetration tests.

> limits or will start scheduling connections to these real servers.
> It again appears to be a user space problem :)

Yes, this is definitely a user space problem if you want to
make it dynamic. I proposed the static approach. If I
do it dynamically, we have to introduce some more setsockopts,
don't we?

> Yes but drop_packet can be activated when we see a very
> big connection rate that will occupy all the memory for connections
> in the director. If we don't run other user space software we
> can simply ignore the defense strategies and to leave the packets
> to be dropped after memory allocation error.

I have no experience with this approach. Do I understand you
correctly when I say: the defense level is set by the amount
of kmalloc'able pages in the kernel per skb?

> Yes, may be we can imlpement a better mechanism that will
> allow the different options to be supported without hurting all
> users. Who knows, may be we can create more sockops? But the

Isn't that the case right now? The functionality provided by ipvsadm
is very sparse.

> > So the distributions can handle it. It can't be our task to
> > adjust the binary tool to every distro it's our task to keep
> > it clean and independant of any distro.
>
> This is true but it means thay have to put all features in?

Not exactly; if there is a framework proposed by some distributor
that can be of use for everyone and that doesn't affect the rest
of the LVS flow, it should be possible to include it.

> Currently, for LVS we have the following methods in hand:
>
> - create new scheduler

I could think of a method for "defense strategies". Do you know about
the OOM-killer framework for kernel 2.4.x? There we have a general
hook, like the one for creating a new scheduler, and everybody who
thinks he has a great idea to improve the functionality of the
structure can add his code (like, e.g., Thomas Proell did with the
hashing scheduler). A lot of people have already proposed patches
for the OOM killer, so I could imagine a hook in LVS where you can
register your own defense strategy, so we can test them under
different penetration tests.

> Total 1 methods to add new separated features (may be I'm missing
> something). The things can be very complex if one new feature wants
> to touch some parts of the functions in the fast path or in the user
> space structures. What can be the solution? Putting hooks inside LVS?

Yes, but I don't think Wensong likes that idea :)

> IMO, we already must think for such needs.

Yes, the project got larger and gained more reputation than some of us
initially thought. The code is very clear and stable; it's time
to enhance it. The only very big problem that I see is that it
looks like we're going to have two separate code paths: one patch
for 2.2.x kernels and one for 2.4.x.

> No doubts, there will be some nice features that can't be
> done in user space. And exactly these features are not used from
> other users. The example is the cp->fwmark support proposed from
> Henrik Nordstrom: we have a feature that is difficult to say it
> is for user space but that touches two parts: internal functions
> and adds another hook that can delay the processing for some

The problem with his patch is:
static struct nf_hook_ops ip_vs_in_ops = {
        { NULL, NULL },
-       ip_vs_in, PF_INET, NF_IP_LOCAL_IN, 100
+       ip_vs_in, PF_INET, NF_IP_LOCAL_IN, -10
+};

> users. I'm not sure what will happen if we start to think in
> "hooks" just like netfilter. If that looks good in user space
> I'm not sure we can tell the same for the kernel space. Any
> ideas here, may be for new topic?

See above about hooks for defense strategies. But you're right
IMHO, there is not a lot you can put into kernel space since
most of the stuff has to be done in userspace.

> > I also would like to hear from other people what experiences
> > they've made with DDoS and the way the LVS was working under
> > an attack. So far I've not seen more than an akademic proof
> > (doing some stress tests not reflecting real world example)
> > to the designed defense strategies. I think Anoush was working
> > on something too but I haven't heard of him since ages ;)
>
> Hm, it seems nobody has such problems :)))

:) no comment.

> No, counter which is reset on state change. But this is
> another issue and I didn't started to think more about such things.
> May be will not :)

Isn't that the case for 2.4.x and conntrack already?

> Yes, that defense can be connection state related, LVS is
> connection scheduler, though, not a packet scheduler.

Not yet ;)

> Yes, job for the agents to represent the real server load
> in weights.

The biggest problem I see here is that maybe the user space daemons
don't get enough scheduling time to be accurate enough.

> Yes, wlc is not my preferred scheduler when it comes to
> connections dealing with database :)

Tell me, which scheduler should I take? None of the existing ones
currently gives me good enough results with persistence. We have
to accept the fact that 3-tier application programmers don't
know about load balancing or clustering; they mostly use Java, and
that is just about the end of trying to load balance the application
smoothly.

> I don't think we need intelligent scheduler if we
> are talking about current set of information used from the LVS
> schedulers. Only the users know what kind of connections are
> scheduled and they can instruct an user space tool how to set the
> WRR weights according to the load.

See, the time period between setting the weights and the resulting
load rebalance is just a ratio of 1:100. If you try to adjust the
weights dynamically, you will see (for an average e-biz application
framework with webserver and database) that you can never balance
it right in time. The good thing is that even commercial load
balancers can't do it.

> > > > A packetfilter, the router (most of use do have a CISCO, don't the?)
> > >
> > > Yes, the question is how Cisco will know what packet rate
> > > overloads the real servers :)
> >
> > :) The router is in my example just configured to drop non-net related
> > packets and these are already enough (seeing the huge logfile that
> > comes every day.
>
> Yes, there are packets with sources from the private networks
> too :)

They are masqueraded, and their net entity belongs to an interface
which of course will not drop the packets :)

> I hope other people will express their ideas about this
> topic. May be I'm too pedantic in some cases :) And now I'm talking
> without "showing the code" :) I hope the things will change soon :)

No, no, I also hope some other people join the discussion, since
we both could be completely wrong (well, in your case I doubt it ...)

Best regards,
Roberto Nibali, ratz

--
mailto: `echo NrOatSz@tPacA.cMh | sed 's/[NOSPAM]//g'`
Re: [PATCH][RFC]: followup ...
Hello Ratz,

On Mon, 19 Feb 2001, Roberto Nibali wrote:

> Hi Julian,
>
> > I agree, some firewalling can be done before the balancer
> > but when the normally looking traffic comes only the balancer knows
> > for open/closed ports, related ICMP, etc. The main things
>
> Unless you put a proxying firewall ;)
>
> > you can do before the balancer are to avoid source address spoofing,
> > some bad packets, may be some ICMP types? But the balancer can be
> > attacked even with normal traffic. The request rate can be limited
>
> Ok, Julian, let's make a real example. I try to set up some LVS
> cluster with webserver and a normally configured firewall and you
> try to flood it in a way that the service cannot be delivered
> anymore normally :). Any ISP that wants to give me temporary access
> to his backbone?

:)

> > Yes, we need NETLINK_LVS kernel socket or similar. I don't
> > think that for netfilter will be easy but for LVS can be easier. If
>
> The architecture he proposed to me was rather simple, actually he
> had the same idea. He's doing it as a module that hooks into
> conntrack. There you have quite the same template structures for
> incoming connections just more of them ;)

Hm, I'm not sure how the details will look.

> > we use full state (yes, Netfilter has "Real statefull connection
> > tracking") replication we can flood the the internal links. There
>
> There we'd have to split the LVS-code into two sourcetrees.
> Because doing connection tracking and replication is too much
> to implement in kernel space for 2.2.x.

We already have two source trees (for 2.2 and 2.4). I don't
see a very big difference in the replication requirements between 2.2
and 2.4. For Netfilter the picture can look different, and the other
thing is that I don't know how the replication is going to be
implemented there.

> > are ideas the state replication to be implemented only for long
> > living connections. And yes, we can use this universal transport
> > for many things, not only for connection state replication.
>
> [OT] I proposed him to make a general framework so that we don't
> have to reinvent the wheel. I though that it should be possible
> to register via a device all the template tables you want to
> have synch'd and the module itself would be responsible to
> create the appropriate NETLINK packets and to start/reset the
> timers in kernel space. [/OT]

When LVS and Netfilter have different connection tables
we must first implement the replication separately and then see what
the common code is. Or at least sync.

> > Yes, the user must select a backlog size value according to the
> > connection rate, we don't want dropped requests even while not under
>
> Oh, this sound very reasonable. How and where do you think this can
> be implemented?

This can be automated (controlled from user space), but I
talked about the simple case where the user looks in /proc/net/netstat
or in the log for generated SYN cookies.

> > attack. Of course, the SYN cookies help, for the OSes that support
> > them. Not very much if our link is full with invalid requests because
> > we can flood our output pipe too. But I don't know how often DDoS
> > SYN attacks happen these days.
>
> It's a O(N^3) proportion to the popularity :) I'd love to see
> the snort logfiles of nasa.gov or nsa.com or some *.mil? Over here
> we have this stupid "big brother" stuff broadcasted trough some
> ACEdirector3 loadbalancers. Two hours after launch the RS were not
> reachable anymore.

:)

> > Agreed. drop_packet and RS limits are different things.
> > The question is how efficient will be the RS limits but if they
> > are option the users can select, I don't see a problem. That can
>
> Good. That's what I did, see my example when announing it. ;)
>
> > Yes, these RS limits are a simple control we can add.
> > And of course it will be used from many users. My doubts are related
> > to the moment where all real server will disappear and will not
> > accept more new connections. How fast we will increase these
>
> I will investigate this. Could you just give me some proposals
> on how to make different test setups, please? With enough time
> I prepare some kernel with different options enabled and will
> do some penetration tests.

Maybe testlvs is enough to hit the upper connection
limits for all real servers. And it seems deleting and then adding
the real servers (some LVS users can do that with user space tools) can
lead to higher active/inactive numbers, for example, in LVS-DR.

> > limits or will start scheduling connections to these real servers.
> > It again appears to be a user space problem :)
>
> Yes, this is definitely a user space problem, if you want to
> make it dynamically. I proposed the statical approach. If I
> do it dynamically, we have to introduce some more setsockopts,
> don't we?

Not sure, isn't the SET_EDITDEST sockopt enough for these two
limits?

> > Yes but drop_packet can be activated when we see a very
> > big connection rate that will occupy all the memory for connections
> > in the director. If we don't run other user space software we
> > can simply ignore the defense strategies and to leave the packets
> > to be dropped after memory allocation error.
>
> I have no experiences with this approach. Do I understand you
> correctly when I say: The defense level is set by the amount
> of kmalloc'able pages in the kernel per skb?

Yes, currently LVS uses the free memory value as key for
manipulating the defense strategies. No skbs involved or I don't
understand the question.
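
For illustration, a simplified sketch of a free-memory-driven defense
level; the sysctl name follows the LVS convention, but the code below
is an assumption of this sketch, not the actual implementation:

/* Simplified illustration: derive a defense decision from the amount
 * of free memory. nr_free_pages() is the kernel's free page count;
 * the sysctl and the drop flag are assumptions following the
 * discussion above. */
extern unsigned int nr_free_pages(void);

static int sysctl_ip_vs_amemthresh = 1024;   /* pages */
static int ip_vs_dropentry_enabled;

static void update_defense_level(void)
{
    if (nr_free_pages() < sysctl_ip_vs_amemthresh)
        ip_vs_dropentry_enabled = 1;   /* start randomly dropping entries */
    else
        ip_vs_dropentry_enabled = 0;
}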

> > Yes, may be we can imlpement a better mechanism that will
> > allow the different options to be supported without hurting all
> > users. Who knows, may be we can create more sockops? But the
>
> Isn't that the case right now? The provided function of ipvsadm
> is very sparse.

Yes, in 2.4 there are many, but in 2.2 there is one. It seems
it is easier to add more sockopts in 2.4.

> > > So the distributions can handle it. It can't be our task to
> > > adjust the binary tool to every distro it's our task to keep
> > > it clean and independant of any distro.
> >
> > This is true but it means thay have to put all features in?
>
> No exactly, if there is a framework proposed by some distributor
> that can be of use for everyone and that doesn't affect the rest
> of the flow of LVS it should possible to include it.
>
> > Currently, for LVS we have the following methods in hand:
> >
> > - create new scheduler
>
> I could think of a method for "defense strategies". Do you know about
> the OOM-killer framework for kernel-2.4.x? There we have a general
> hook like for creating a new scheduler and everybody that thinks he
> has a great idea to improve the functionality of the structure can
> add his code (like f.e. Thomas Proell did with the hashing scheduler).
> A lot of people already proposed some patches for the OOM-killer and
> so I could imagine a hook into LVS where you can register your own
> defense strategy, so we can test them under different penetration
> tests.

What can I answer; we have to analyze every case separately
because it can touch many parts of the code. I'm not sure whether
the current structure allows such hooks.

> > Total 1 methods to add new separated features (may be I'm missing
> > something). The things can be very complex if one new feature wants
> > to touch some parts of the functions in the fast path or in the user
> > space structures. What can be the solution? Putting hooks inside LVS?
>
> Yes, but I don't think Wensong likes that idea :)

Because this idea is not clear :)

> > IMO, we already must think for such needs.
>
> Yes, the project got larger and more reputation than some of us
> initially thought. The code is very clear and stable, it's time
> to enhance it. The only very big problem that I see is that it
> looks like we're going to have to separate code paths one patch
> for 2.2.x kernels and one for 2.4.x.

Yes, this is the reality. We can try to keep things from
looking different to the user space.

> > No doubts, there will be some nice features that can't be
> > done in user space. And exactly these features are not used from
> > other users. The example is the cp->fwmark support proposed from
> > Henrik Nordstrom: we have a feature that is difficult to say it
> > is for user space but that touches two parts: internal functions
> > and adds another hook that can delay the processing for some
>
> The problem with his patch is:
> static struct nf_hook_ops ip_vs_in_ops = {
> { NULL, NULL },
> - ip_vs_in, PF_INET, NF_IP_LOCAL_IN, 100
> + ip_vs_in, PF_INET, NF_IP_LOCAL_IN, -10
> +};

Yes, we discussed it. But this is another issue, not
related to the cp->fwmark support. Of course, the users have to
choose how they will use fwmark: for routing, for QoS, for
fwmark-based services. What happens when you want to do QoS and use
fwmark to classify the input traffic with different fwmarks, but
return the traffic using source routing? Currently, the LVS and the
MASQ code can't do source routing for the outgoing traffic (after
NAT in the in->out direction) - ip_forward.c:ip_forward() is not
ready for such games. LVS is ready because we know what the saddr
will be after the packet is masqueraded. This is for 2.2.
In 2.4 the cp->fwmark can help, but this is not a complete and
universal solution. Maybe there is another solution, and LVS can
use the route functions to select the right outdev and gateway
after the packet is masqueraded. Maybe we can simply change
skb->dst (in 2.2) and have ip_forward call ip_send with the
right (new) device and forward the packet to the right gw?

> > users. I'm not sure what will happen if we start to think in
> > "hooks" just like netfilter. If that looks good in user space
> > I'm not sure we can tell the same for the kernel space. Any
> > ideas here, may be for new topic?
>
> See above about hooks for defense strategies. But you're right
> IMHO, there is not a lot you can put into kernel space since
> most of the stuff has to be done in userspace.
>
> > No, counter which is reset on state change. But this is
> > another issue and I didn't started to think more about such things.
> > May be will not :)
>
> Isn't that the case for 2.4.x and conntrack already?

Maybe, yes. But LVS has separate connection tracking.

> > Yes, that defense can be connection state related, LVS is
> > connection scheduler, though, not a packet scheduler.
>
> Not yet ;)
>
> > Yes, job for the agents to represent the real server load
> > in weights.
>
> The biggest problem I see here is that maybe the user space daemons
> don't get enough scheduling time to be accurate enough.

That is definitely true. When the CPU(s) are busy
transferring packets the processes can be delayed. So, the director
had better not spend many cycles in user space. This is the reason I
prefer all these health checks to run on the real servers, but this
is not always good/possible.

> > Yes, wlc is not my preferred scheduler when it comes to
> > connections dealing with database :)
>
> Tell me, which scheduler should I take? None of the existing ones
> gives me good enough results currently with persistency. We have
> to accept the fact, that 3-Tier application programmers don't
> know about loadbalancing or clustering, mostly using Java and this
> is just about the end of trying to load balance the application
> smoothly.

WRR + load-informed cluster software. But I'm not sure
that persistence can do very bad things. Maybe yes, but only
for a small number of clients or when wlc is used (we're not talking
about the other dumb schedulers).

> > I don't think we need intelligent scheduler if we
> > are talking about current set of information used from the LVS
> > schedulers. Only the users know what kind of connections are
> > scheduled and they can instruct an user space tool how to set the
> > WRR weights according to the load.
>
> See, the timeperiod of setting the weights and the resulting load
> rebalance is just a relation of 1:100. If you try to adjust the
> weights dynamically, you will see (for an average e-buiz application
> framework with webserver and database) that you can never balance
> it right in time. The good thing is, that even commercial load
> balancer can't do it.

Of course, there can be some peaks, but we are not going to
react only to the load generated by the client requests. There can
be load not related to the clients, for example a program allocating
memory or spending some CPU cycles. Such load is not visible to wlc
and dramatic things can happen. Very often weird things happen,
for example, with CGI bins that work with databases. The
simple fact of allocated memory is a bad symptom. Of course,
everything is application specific.

> > Yes, there are packets with sources from the private networks
> > too :)
>
> They are masqueraded and their netentity belongs to an interface
> which of course will not drop the packets :)

The problem is that they are not masqueraded :) And they can
reach us. Not every ISP drops spoofed packets at the place where they
are generated. But in most cases this is not fatal.

> > I hope other people will express their ideas about this
> > topic. May be I'm too pedantic in some cases :) And now I'm talking
> > without "showing the code" :) I hope the things will change soon :)
>
> No, no, I also hope some other people join the discussion since
> we both could be completely wrong (well, in your case I doubt ...)

Yep, maybe we are just reinventing the wheel :))

> Best regards,
> Roberto Nibali, ratz


Regards

--
Julian Anastasov <ja@ssi.bg>
Re: [PATCH][RFC]: followup ...
Roberto Nibali wrote:

> > I agree, some firewalling can be done before the balancer
> > but when the normally looking traffic comes only the balancer knows
> > for open/closed ports, related ICMP, etc. The main things
>
> Unless you put a proxying firewall ;)

In which case you should not need LVS for load balancing...

--
Henrik Nordstrom
Safecore Technologies
Re: [PATCH][RFC]: followup ...
Henrik Nordstrom wrote:
>
> Roberto Nibali wrote:
>
> > > I agree, some firewalling can be done before the balancer
> > > but when the normally looking traffic comes only the balancer knows
> > > for open/closed ports, related ICMP, etc. The main things
> >
> > Unless you put a proxying firewall ;)
>
> In which case you should not need LVS for load balancing...

Could you please elaborate on this statement?

Best regards,
Roberto Nibali, ratz

--
mailto: `echo NrOatSz@tPacA.cMh | sed 's/[NOSPAM]//g'`
Re: [PATCH][RFC]: followup ...
Roberto Nibali wrote:

> > In which case you should not need LVS for load balancing...
>
> Could you please elaborate on this statement?

If you already have a "reverse" proxy which accepts all requests at the
application level and forwards them to the server, the load balancing
function is better implemented in the proxy, with all of the benefits and
none of the drawbacks of NAT-based load balancing.

--
Henrik Nordstrom
Re: [PATCH][RFC]: followup ...
Hi,

[...]

> > There we'd have to split the LVS-code into two sourcetrees.
> > Because doing connection tracking and replication is too much
> > to implement in kernel space for 2.2.x.
>
> We already have two source trees (for 2.2 and 2.4). I don't
> see very big difference for the replication requirements in 2.2 and
> 2.4. For Netfilter the picture can look different and the other
> think is that I don't know how the replication is going to be
> implemented there.

Yes, but until now we had more or less the same functionality for
the 2.2.x kernel series and for the 2.4.x kernels. Now I see them
drifting apart. This is not a problem, however. I actually also don't
know how Laforge is going to do it, but he promised to send me his
patches as soon as he's got something working.

> When LVS and Netfilter have different connection tables
> we must first make the replication separately and then to see what
> is the common code. Or at least to sync.

Correct.

> > > Yes, the user must select a backlog size value according to the
> > > connection rate, we don't want dropped requests even while not under
> >
> > Oh, this sound very reasonable. How and where do you think this can
> > be implemented?
>
> This can be automated (cotrolled from user space) but I
> talked about the simple case where the user looks in /proc/net/netstat
> or in the log for generated SYN cookies.

This may change extremely fast!

> > I will investigate this. Could you just give me some proposals
> > on how to make different test setups, please? With enough time
> > I prepare some kernel with different options enabled and will
> > do some penetration tests.
>
> May be testlvs is enough to hit the upper connection
> limits for all real servers. And it seems deleting and then adding
> the real servers (some LVS user can do that with user space tools) can
> lead to higher active/inactive numbers, for example, in LVS-DR.

Yep, maybe this weekend I might get a test setup ready.

> > > limits or will start scheduling connections to these real servers.
> > > It again appears to be a user space problem :)
> >
> > Yes, this is definitely a user space problem, if you want to
> > make it dynamically. I proposed the statical approach. If I
> > do it dynamically, we have to introduce some more setsockopts,
> > don't we?
>
> Not sure, isn't the SET_EDITDEST sockopt enough for these two
> limits?

Of course.

> > I have no experiences with this approach. Do I understand you
> > correctly when I say: The defense level is set by the amount
> > of kmalloc'able pages in the kernel per skb?
>
> Yes, currently LVS uses the free memory value as key for
> manipulating the defense strategies. No skbs involved or I don't
> understand the question.

Ok, saw it in the code now, and yes, no skbs ;)

> > I could think of a method for "defense strategies". Do you know about
> > the OOM-killer framework for kernel-2.4.x? There we have a general
> > hook like for creating a new scheduler and everybody that thinks he
> > has a great idea to improve the functionality of the structure can
> > add his code (like f.e. Thomas Proell did with the hashing scheduler).
> > A lot of people already proposed some patches for the OOM-killer and
> > so I could imagine a hook into LVS where you can register your own
> > defense strategy, so we can test them under different penetration
> > tests.
>
> What to answer, we have to analyze every case separately
> because it can touch many parts from the code. Not sure whether
> the current structure allows such hooks.

Agreed. Talking about stuff and generating ideas is simple but only
code will show if it's feasible.

> > > Total 1 methods to add new separated features (may be I'm missing
> > > something). The things can be very complex if one new feature wants
> > > to touch some parts of the functions in the fast path or in the user
> > > space structures. What can be the solution? Putting hooks inside LVS?
> >
> > Yes, but I don't think Wensong likes that idea :)
>
> Because this idea is not clear :)

Maybe. But I see that the defense_level is triggered via a sysctl
and invoked in the sltimer_handler, as is the *_dropentry. If
we push those functions one level higher and introduce a metalayer
that registers the defense_strategy, selectable via sysctl and
currently containing update_defense_level, we would have the
possibility to register other defense strategies, e.g. a limiting
threshold. Is this feasible? I mean, instead of calling
update_defense_level() and ip_vs_random_dropentry() in the
sltimer_handler we just call the registered
defense_strategy[sysctl_read] function. In the existing case
defense_strategy[0] = update_defense_level(), which also merges in
the ip_vs_dropentry. Do I make myself sound stupid? ;)
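
To make it concrete, a rough sketch of the metalayer (the table, the
sysctl variable and the second strategy are made-up names; only the
two functions called in strategy 0 exist today):

extern void update_defense_level(void);
extern void ip_vs_random_dropentry(void);

typedef void (*defense_strategy_t)(void);

/* strategy 0: roughly what sltimer_handler does today (the real code
 * only drops entries when the defense level asks for it) */
static void defense_classic(void)
{
	update_defense_level();
	ip_vs_random_dropentry();
}

/* strategy 1: e.g. a threshold-based limiter could be plugged in here */
static void defense_threshold(void)
{
	/* walk the real servers and act on their connection limits */
}

static defense_strategy_t defense_strategy[] = {
	defense_classic,
	defense_threshold,
};

/* hypothetical sysctl selecting the strategy, range-checked elsewhere */
static int sysctl_ip_vs_defense_strategy = 0;

/* sltimer_handler would call this instead of the hardcoded functions */
static void run_defense_strategy(void)
{
	defense_strategy[sysctl_ip_vs_defense_strategy]();
}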

> > Yes, the project got larger and more reputation than some of us
> > initially thought. The code is very clear and stable, it's time
> > to enhance it. The only very big problem that I see is that it
> > looks like we're going to have to separate code paths one patch
> > for 2.2.x kernels and one for 2.4.x.
>
> Yes, this is the reality. We can try to keep the things not
> to look different for the user space.

This would be a pain in the ass if we had two ipvsadm. IMHO the
userspace tools should recognize (at compile time) which kernel they
are working with and enable the feature set accordingly. This will
of course bloat them up in the future, the more feature differences
we get between the 2.2.x and 2.4.x series.
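
Something like this at the top of ipvsadm is what I have in mind
(just a sketch; the macro names are invented, and in reality more
than one constant differs between the two interfaces):

#include <linux/version.h>

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,3,0)
#  define IPVSADM_FOR_24 1	/* netfilter-based LVS sockopts */
#else
#  define IPVSADM_FOR_22 1	/* masq-based LVS sockopts */
#endif

void send_rules(int sockfd)
{
#ifdef IPVSADM_FOR_24
	/* fill and pass the 2.4 user/kernel structures to setsockopt() */
#else
	/* fill and pass the 2.2 user/kernel structures to setsockopt() */
#endif
}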

> Yes, we discussed it. But this is another issue, not
> related to the cp->fwmark support. Of course, the users have to
> choose how they will use fwmark: for routing, for QoS, for fwmark
> based services. What happens when you want to do QoS and use fwmark
> to classify the input traffic with different fwmarks but to
> return the traffic using source routing. Currently, the LVS and the
> MASQ code can't do source routing for the outgoing traffic (after
> NAT in the in->out direction) - ip_forward.c:ip_forward() is not
> ready for such games. LVS is ready because we know what will be
> the saddr after the packet is masqueraded. This is for 2.2.

I believe you, although I don't understand it :)

> In 2.4 the cp->fwmark can help but this is not a complete and
> universal solution. May be there is another solution and LVS can
> use the route functions to select the right outdev and gateway
> after the packet is masqueraded. May be we can simply change
> skb->dst (in 2.2) and ip_forward to call ip_send with the
> right (new) device and to forward the packet to the right gw?

Could you point me to a sketch where I could try to see what the
control path for a packet looks like in kernel 2.4? I mean something
like I would do for 2.2.x kernels:

----------------------------------------------------------------
| ACCEPT/ lo interface |
v REDIRECT _______ |
--> C --> S --> ______ --> D --> ~~~~~~~~ -->|forward|----> _______ -->
h a |input | e {Routing } |Chain | |output |ACCEPT
e n |Chain | m {Decision} |_______| --->|Chain |
c i |______| a ~~~~~~~~ | | ->|_______|
k t | s | | | | |
s y | q | v | | |
u | v e v DENY/ | | v
m | DENY/ r Local Process REJECT | | DENY/
| v REJECT a | | | REJECT
| DENY d --------------------- |
v e -----------------------------
DENY

> > The biggest problem I see here is that maybe the user space daemons
> > don't get enough scheduling time to be accurate enough.
>
> That is definitely true. When the CPU(s) are busy
> transferring packets the processes can be delayed. So, the director
> better not spend many cycles in user space. This is the reason I
> prefer all these health checks to run in the real servers but this
> is not always good/possible.

No, considering the fact that not all RS are running Linux. We would
need to port the healthchecks to every possible RS architecture.

> > Tell me, which scheduler should I take? None of the existing ones
> > gives me good enough results currently with persistency. We have
> > to accept the fact, that 3-Tier application programmers don't
> > know about loadbalancing or clustering, mostly using Java and this
> > is just about the end of trying to load balance the application
> > smoothly.
>
> WRR + load informed cluster software. But I'm not sure in
> the fact the persistency can do very bad things. May be yes, but
> for small number of clients or when wlc is used (we don't talk for
> the other dumb schedulers).

I currently get some values via a daemon coded in Perl on the RS,
started via xinetd. The LB connects to the healthcheck port and
gets some prepared results. It then puts this stuff into a db and
starts calculating the next steps to reconfigure the LVS-cluster to
smooth out the imbalance. The longer you let it run, the more data
you get and the fewer adjustments you have to make. I reckon some
guy who showed up on this list once had this idea, going in the
direction of fuzzy logic. Hey Julian, maybe we should accept the
fact that the wlc scheduler also isn't a very advanced one:
loh = atomic_read(&least->activeconns)*50+atomic_read(&least->inactconns);
What do you think would change if we made this 50 dynamic?
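
Just to sketch what I mean (the tunable name is made up; today it is
the hardcoded constant above):

static int wlc_active_weight = 50;	/* could be exported via /proc */

loh = atomic_read(&least->activeconns) * wlc_active_weight
      + atomic_read(&least->inactconns);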

> Of course, there can be some peaks but we are not going to
> react only on the load generated from the client requests. There can

Of course not; this is just an additional factor in calculating the
next steps for choosing the RS for delivery.

> be a load not related to the clients, for example, program allocating
> memory or spending some CPU cycles. Such load is not visible to wlc
> and dramatic things can happen. Very often some weird things can
> happen, for example, with a cgi bins that work with databases. The
> simple fact of allocated memory is a bad symptom. Of course,
> everything is application specific.

That's the challenge.

> > > Yes, there are packets with sources from the private networks
> > > too :)
> >
> > They are masqueraded and their netentity belongs to an interface
> > which of course will not drop the packets :)
>
> The problem is that they are not masqueraded :) And they can
> reach us. Not every ISP drops spoofed packets at the place they are
> generated. But in most of the cases this is not fatal.

Broken network design, IMO.

> > > I hope other people will express their ideas about this
> > > topic. May be I'm too pedantic in some cases :) And now I'm talking
> > > without "showing the code" :) I hope the things will change soon :)
> >
> > No, no, I also hope some other people join the discussion since
> > we both could be completely wrong (well, in your case I doubt ...)
>
> Yep, may be we are just inventing the wheel :))

We'll see ...

Later,
Roberto Nibali, ratz

--
mailto: `echo NrOatSz@tPacA.cMh | sed 's/[NOSPAM]//g'`
Re: [PATCH][RFC]: followup ...
Henrik Nordstrom wrote:
>
> Roberto Nibali wrote:
>
> > > In which case you should not need LVS for load balancing...
> >
> > Could you please elaborate on this statement?
>
> If you already have a "reverse" proxy which accepts all requests at the
> application level and forwards them to the server, the load balancing
> function is better implemented in the proxy, with all of the benefits and
> none of the drawbacks from NAT based load balancing.

Agreed, if you have the source code. But tell me, how many proxies
out there have load-balancing capability built in? Yes, if you write
your own proxy for your own application I would also consider
including load balancing. Thank you for the comment.

Regards,
Roberto Nibali, ratz

--
mailto: `echo NrOatSz@tPacA.cMh | sed 's/[NOSPAM]//g'`
Re: [PATCH][RFC]: followup ...
Hello Ratz,

On Thu, 22 Feb 2001, Roberto Nibali wrote:

> > > > Total 1 methods to add new separated features (may be I'm missing
> > > > something). The things can be very complex if one new feature wants
> > > > to touch some parts of the functions in the fast path or in the user
> > > > space structures. What can be the solution? Putting hooks inside LVS?
> > >
> > > Yes, but I don't think Wensong likes that idea :)
> >
> > Because this idea is not clear :)
>
> Maybe. But I see that the defense_level is triggered via a sysctl
> and invoked in the sltimer_handler, as is the *_dropentry. If
> we push those functions one level higher and introduce a metalayer
> that registers the defense_strategy, selectable via sysctl and
> currently containing update_defense_level, we would have the
> possibility to register other defense strategies, e.g. a limiting
> threshold. Is this feasible? I mean, instead of calling
> update_defense_level() and ip_vs_random_dropentry() in the
> sltimer_handler we just call the registered
> defense_strategy[sysctl_read] function. In the existing case
> defense_strategy[0] = update_defense_level(), which also merges in
> the ip_vs_dropentry. Do I make myself sound stupid? ;)

The different strategies work in different places and it is
difficult to use one hook. The current implementation allows them to
work together. But maybe there is another solution, considering how
LVS is called: to drop packets or to drop entries. There are not many
places for such hooks, so maybe something can be done.
But first let's see what kind of other defense strategies will come.

> > > Yes, the project got larger and more reputation than some of us
> > > initially thought. The code is very clear and stable, it's time
> > > to enhance it. The only very big problem that I see is that it
> > > looks like we're going to have to separate code paths one patch
> > > for 2.2.x kernels and one for 2.4.x.
> >
> > Yes, this is the reality. We can try to keep the things not
> > to look different for the user space.
>
> This would be a pain in the ass if we had two ipvsadm. IMHO the
> userspace tools should recognize (at compile time) which kernel they
> are working with and enable the feature set accordingly. This will
> of course bloat them up in the future, the more feature differences
> we get between the 2.2.x and 2.4.x series.

Not possible, the sockopts are different in 2.4.

> Could you point me to a sketch where I could try to see what the
> control path for a packet looks like in kernel 2.4? I mean something
> like I would do for 2.2.x kernels:
>
> ----------------------------------------------------------------
> | ACCEPT/ lo interface |
> v REDIRECT _______ |
> --> C --> S --> ______ --> D --> ~~~~~~~~ -->|forward|----> _______ -->
> h a |input | e {Routing } |Chain | |output |ACCEPT
> e n |Chain | m {Decision} |_______| --->|Chain |
> c i |______| a ~~~~~~~~ | | ->|_______|
> k t | s | | | | |
> s y | q | v | | |
> u | v e v DENY/ | | v
> m | DENY/ r Local Process REJECT | | DENY/
> | v REJECT a | | | REJECT
> | DENY d --------------------- |
> v e -----------------------------
> DENY


Here is some info I maintain (maybe not up to date, the new ICMP
hooks are missing). Look for "LVS" to see where LVS is placed.

The Netfilter hooks:

Priorities:
NF_IP_PRI_FIRST = INT_MIN,
NF_IP_PRI_CONNTRACK = -200,
NF_IP_PRI_MANGLE = -150,
NF_IP_PRI_NAT_DST = -100,
NF_IP_PRI_FILTER = 0,
NF_IP_PRI_NAT_SRC = 100,
NF_IP_PRI_LAST = INT_MAX,


PRE_ROUTING (ip_input.c:ip_rcv):
CONNTRACK=-200, ip_conntrack_core.c:ip_conntrack_in
MANGLE=-150, iptable_mangle.c:ipt_hook
NAT_DST=-100, ip_nat_standalone.c:ip_nat_fn
FILTER=0, ip_fw_compat.c:fw_in, defrag, firewall, demasq, redirect
FILTER+1=1, net/sched/sch_ingress.c:ing_hook

LOCAL_IN (ip_input.c:ip_local_deliver):
FILTER=0, iptable_filter.c:ipt_hook
LVS=100, ip_vs_in
LAST-1, ip_fw_compat.c:fw_confirm
CONNTRACK=LAST-1, ip_conntrack_standalone.c:ip_confirm

FORWARD (ip_forward.c:ip_forward):
FILTER=0, iptable_filter.c:ipt_hook
FILTER=0, ip_fw_compat.c:fw_in, firewall, LVS:check_for_ip_vs_out,
masquerade
LVS=100, ip_vs_out

LOCAL_OUT (ip_output.c):
CONNTRACK=-200, ip_conntrack_standalone.c:ip_conntrack_local
MANGLE=-150, iptable_mangle.c:ipt_local_out_hook
NAT_DST=-100, ip_nat_standalone.c:ip_nat_local_fn
FILTER=0, iptable_filter.c:ipt_local_out_hook

POST_ROUTING (ip_output.c:ip_finish_output):
FILTER=0, ip_fw_compat.c:fw_in, firewall, unredirect,
mangle ICMP replies
LVS=NAT_SRC-1, ip_vs_post_routing
NAT_SRC=100, ip_nat_standalone.c:ip_nat_out
CONNTRACK=LAST, ip_conntrack_standalone.c:ip_refrag


CONNTRACK:
PRE_ROUTING, LOCAL_IN, LOCAL_OUT, POST_ROUTING

FILTER:
LOCAL_IN, FORWARD, LOCAL_OUT

MANGLE:
PRE_ROUTING, LOCAL_OUT

NAT:
PRE_ROUTING, LOCAL_OUT, POST_ROUTING


Running variants:

1. Only lvs - the fastest
2. lvs + ipfw NAT
3. lvs + iptables NAT

Where is LVS placed:

LOCAL_IN:100 ip_vs_in

FORWARD:100 ip_vs_out

POST_ROUTING:NF_IP_PRI_NAT_SRC-1 ip_vs_post_routing



The chains:

The out->in LVS packets (for any forwarding method) walk:

pre_routing -> LOCAL_IN -> ip_route_output or dst cache -> POST_ROUTING


LOCAL_IN
ip_vs_in -> ip_route_output/dst cache
-> set skb->nfmark with special value
-> ip_send -> POST_ROUTING

POST_ROUTING
ip_vs_post_routing
- check skb->nfmark and exit from the
chain


The in->out LVS packets (for LVS/NAT) walk:

pre_routing -> FORWARD -> POST_ROUTING

FORWARD
ip_vs_out -> NAT -> NF_ACCEPT

POST_ROUTING
ip_vs_post_routing
- check skb->nfmark and exit from the
chain

I hope in the netfilter docs there is a nice ascii diagram.
But I hope the above info is more useful if you already know what
each hook means.
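
To show how these hook points are used, here is a rough sketch from
memory of attaching a function at LOCAL_IN with priority 100 (the
same slot as ip_vs_in above); the function and ops names are made up
and this is not the actual ip_vs_in registration:

#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>

static unsigned int my_local_in_hook(unsigned int hooknum,
				     struct sk_buff **skb_p,
				     const struct net_device *in,
				     const struct net_device *out,
				     int (*okfn)(struct sk_buff *))
{
	/* inspect (*skb_p) here, like ip_vs_in does */
	return NF_ACCEPT;
}

static struct nf_hook_ops my_local_in_ops = {
	{ NULL, NULL },		/* list, set up by nf_register_hook */
	my_local_in_hook,	/* hook */
	PF_INET,		/* pf */
	NF_IP_LOCAL_IN,		/* hooknum */
	100			/* priority, the LVS=100 slot above */
};

/* module init:    nf_register_hook(&my_local_in_ops);   */
/* module cleanup: nf_unregister_hook(&my_local_in_ops); */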


> > > The biggest problem I see here is that maybe the user space daemons
> > > don't get enough scheduling time to be accurate enough.
> >
> > That is definitely true. When the CPU(s) are busy
> > transferring packets the processes can be delayed. So, the director
> > better not spend many cycles in user space. This is the reason I
> > prefer all these health checks to run in the real servers but this
> > is not always good/possible.
>
> No, considering the fact that not all RS are running Linux. We would
> need to port the healthchecks to every possible RS architecture.

Yes, this is a drawback.

> > > Tell me, which scheduler should I take? None of the existing ones
> > > gives me good enough results currently with persistency. We have
> > > to accept the fact, that 3-Tier application programmers don't
> > > know about loadbalancing or clustering, mostly using Java and this
> > > is just about the end of trying to load balance the application
> > > smoothly.
> >
> > WRR + load informed cluster software. But I'm not sure in
> > the fact the persistency can do very bad things. May be yes, but
> > for small number of clients or when wlc is used (we don't talk for
> > the other dumb schedulers).
>
> I currently get some values via a daemon coded in Perl on the RS,
> started via xinetd. The LB connects to the healthcheck port and
> gets some prepared results. It then puts this stuff into a db and
> starts calculating the next steps to reconfigure the LVS-cluster to
> smooth out the imbalance. The longer you let it run, the more data
> you get and the fewer adjustments you have to make. I reckon some
> guy who showed up on this list once had this idea, going in the
> direction of fuzzy logic. Hey Julian, maybe we should accept the
> fact that the wlc scheduler also isn't a very advanced one:
> loh = atomic_read(&least->activeconns)*50+atomic_read(&least->inactconns);
> What do you think would change if we made this 50 dynamic?

Not sure :) I don't have results from experiments with wlc :)
You can put it in /proc and make different experiments, for example :)
But warning, ip_vs_wlc can be a module; check how the lblc* schedulers
register their /proc vars.
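
The lblc* pattern is roughly this (from memory; the leaf name and the
NET_IPV4_VS_WLC_WEIGHT number are invented for the example):

#include <linux/sysctl.h>

static int sysctl_ip_vs_wlc_active_weight = 50;
static struct ctl_table_header *wlc_sysctl_header;

static ctl_table wlc_vars_table[] = {
	{ NET_IPV4_VS_WLC_WEIGHT, "wlc_active_weight",
	  &sysctl_ip_vs_wlc_active_weight, sizeof(int),
	  0644, NULL, &proc_dointvec },
	{ 0 }
};

static ctl_table wlc_vs_table[] = {
	{ NET_IPV4_VS, "vs", NULL, 0, 0555, wlc_vars_table },
	{ 0 }
};

static ctl_table wlc_ipv4_table[] = {
	{ NET_IPV4, "ipv4", NULL, 0, 0555, wlc_vs_table },
	{ 0 }
};

static ctl_table wlc_net_table[] = {
	{ CTL_NET, "net", NULL, 0, 0555, wlc_ipv4_table },
	{ 0 }
};

/* module init:    wlc_sysctl_header = register_sysctl_table(wlc_net_table, 0); */
/* module cleanup: unregister_sysctl_table(wlc_sysctl_header);                  */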

> Later,
> Roberto Nibali, ratz


Regards

--
Julian Anastasov <ja@ssi.bg>