Mailing List Archive

Re: [quagga-users 491] Re: Hardware Spec for a route server
Paul Jakma wrote:

> look at the call graphs for zebra_client_read [3] and its children,
> zread_ipv4_{add,delete} ([6] and [12] respectively).

(a lot of profiling output I can't read)

What's that output? Any legend / manpage I can refer to? What and how is
it measuring a process' performance, wrt user time, kernel time and
overall time?


> the _delete path is similar, it hits rib_process via rib_delete_ipv4.
> the majority of the time is spent in and under rib_process. However,
> strangely, the cumulative time for adds and deletes appears to be
> equal - which is strange, because wall-clock time for the delete case
> is /much/ higher than for add if memory serves me right. hmm...

Doesn't seem too strange to me: since we're not deleting any addresses /
connected routes, rib_update is not involved, hence add/del operations
should be rather equivalent, don't you think? (also, see below)


> time spent in zread_* and stream_* is pretty low, taking about 0.10
> to 0.15s in total for each path. and anyway, /adding/ the feed is
> pretty quick - and it's roughly the same amount of data that has to be
> read and marshaled for the delete case.

Yes, but you should also consider the context switch for each read
operation, the kernel overhead (buffer copying, socket management, etc.),
and so on; and all of that only because we don't use a "killall" message
instead...


> after that the main time spent is in the route table, especially
> prefix_match/check_bit. and i don't see a way to speed those up
> (they're pretty fast already, and i can't think of a better way to
> index prefixes that preserves parent/child relationships that wouldn't
> require occasional expensive rebalancing - but i know nothing :) ).

Well, bear in mind that using a bulk deletion is much faster in that
sense: you don't do a prefix_match for each route (something like O(log n)
each, O(n log n) in total) but rather scan the whole tree once (i.e. O(n)
total); so there's some benefit in that respect as well.
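
To make the complexity argument concrete, here is a small self-contained
sketch of the one-pass flush; the structure and names are invented for
illustration (zebra's real table is a radix tree, not a list), the point
is only that each entry is visited once instead of paying a prefix_match
lookup per route:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Toy stand-in for a RIB entry; zebra's real structures (struct
 * route_node / struct rib) are richer, this only illustrates the
 * single O(n) pass of a bulk delete-by-tag. */
struct toy_rib_entry
{
  uint32_t prefix;              /* pretend IPv4 prefix           */
  uint8_t prefixlen;
  uint32_t tag;                 /* hypothetical per-client tag   */
  struct toy_rib_entry *next;
};

/* One scan of the table: no per-prefix lookup, each entry is visited
 * exactly once and unlinked if its tag matches the one being flushed. */
static unsigned int
bulk_delete_by_tag (struct toy_rib_entry **head, uint32_t tag)
{
  unsigned int removed = 0;
  struct toy_rib_entry **pp = head;

  while (*pp)
    {
      if ((*pp)->tag == tag)
        {
          struct toy_rib_entry *dead = *pp;
          *pp = dead->next;     /* unlink without re-walking the table */
          free (dead);
          removed++;
        }
      else
        pp = &(*pp)->next;
    }
  return removed;
}

int
main (void)
{
  struct toy_rib_entry *rib = NULL;

  /* build a few fake routes "owned" by two different peers */
  for (uint32_t i = 0; i < 10; i++)
    {
      struct toy_rib_entry *e = calloc (1, sizeof (*e));
      e->prefix = 0x0a000000u + (i << 8);
      e->prefixlen = 24;
      e->tag = (i % 2) ? 1 : 2;
      e->next = rib;
      rib = e;
    }

  printf ("flushed %u routes tagged 1\n", bulk_delete_by_tag (&rib, 1));
  return 0;
}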

>>several improvements come to mind, like (1) extending the zebra
>>protocol in order to support "route tagging", that is allow
>>protocol daemons to mark each route with some tag (for example, the
>>peer-id for bgpd instances, neighbor-id for ospfd, etc) and provide
>>a "delete routes with tag X" message;
>
>
> interesting.. so a client could express with one message what
> otherwise might take 120k? yes, could be very useful.

Yes, it should be very easy to do in zebra, less easy in the protocol
daemons (it requires a specialized event for these cases, e.g. peer drop
in bgpd, neighbor timeout in ospfd, etc.).
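
For what it's worth, a sketch of how small such a "delete routes with tag
X" message could be on the wire; the command number, the header layout
and the route-type/tag values below are all invented for illustration,
nothing like this exists in the zebra protocol today:

#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <arpa/inet.h>

/* Hypothetical "delete everything I tagged X" message: one short
 * message instead of ~120k individual per-prefix delete messages.
 * Wire layout assumed here: 2-byte length, 1-byte command, 1-byte
 * route type (which daemon owns the routes), 4-byte tag. */
#define ZEBRA_ROUTE_DELETE_BY_TAG 99    /* invented command number */

static size_t
encode_delete_by_tag (uint8_t *buf, uint8_t route_type, uint32_t tag)
{
  uint16_t len = htons (2 + 1 + 1 + 4);
  uint32_t ntag = htonl (tag);

  memcpy (buf, &len, 2);
  buf[2] = ZEBRA_ROUTE_DELETE_BY_TAG;
  buf[3] = route_type;
  memcpy (buf + 4, &ntag, 4);
  return 2 + 1 + 1 + 4;
}

int
main (void)
{
  uint8_t buf[16];
  size_t n = encode_delete_by_tag (buf, 1 /* placeholder route type */,
                                   4242 /* e.g. a bgpd peer id */);
  printf ("%zu bytes on the wire for the whole flush\n", n);
  return 0;
}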


>>or (2) add classification capabilities to "route delete" messages,
>>thus allowing for bulk removal of all routes that match a
>>combination of destination prefix, nexthop, distance, and so on.
>>Note that all bulk deletions should be restricted to the routes of
>>the very client (protocol type + client-id) that issued them, and
>>so we're safe with their impact.
>
>
> yep. why 'or' though? really you want 1 and 2, no? ie 1 to allow a
> client to specify a class (via the tag) and 2 being the most natural
> operation that would operate on a tag/class. (_LOOKUP could use it
> too i guess).

The 'or' is there because option (2) is far too general, and neither
very feasible nor really required, IMO... (mainly I put it in to better
support my point ;-> but really, I don't think it's worth designing and
implementing, not at the moment)


> sounds like a good idea - less work for the client anyway, if not
> for zebra. (it could maintain lists of tags too - to save on full
> route table lookups).

Good point (about the tag lists).

...
> yes, and i wonder, is it simply the kernel which is less efficient at
> doing deletes? (see above about how deletion within zebra /seemingly/
> takes as much time as adds - could be profiling problems though.)

As I said, the profiling conclusions make sense to me (that is, in and
out are the same), so you may have a point here: I guess it could be
easily tested with a simple script that adds some 10-20K static routes
via iproute2, then delete them, with proper timestamps. What do you say?
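
Something along these lines, perhaps -- written in C rather than as a
shell script, with a placeholder next hop and route count to adjust; and
since it pays one fork/exec of "ip" per route, it only gives a relative
add-vs-delete comparison, not an absolute kernel cost:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N_ROUTES 10000
#define GATEWAY  "192.0.2.1"    /* placeholder next hop */

static double
now (void)
{
  struct timeval tv;
  gettimeofday (&tv, NULL);
  return tv.tv_sec + tv.tv_usec / 1e6;
}

static void
run_phase (const char *verb)
{
  char cmd[128];
  double t0 = now ();

  for (int i = 0; i < N_ROUTES; i++)
    {
      /* 10.<hi>.<lo>.0/24 keeps every prefix unique */
      snprintf (cmd, sizeof (cmd),
                "ip route %s 10.%d.%d.0/24 via %s 2>/dev/null",
                verb, (i >> 8) & 0xff, i & 0xff, GATEWAY);
      if (system (cmd) != 0)
        fprintf (stderr, "failed: %s\n", cmd);
    }
  printf ("%s of %d routes: %.2f s\n", verb, N_ROUTES, now () - t0);
}

int
main (void)
{
  run_phase ("add");
  run_phase ("del");
  return 0;
}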

...
> Hmm.. for breaking up the sending of 100k deletions, the current
> patch already does that - it allows bgpd to quickly transfer the 120k
> deletes to zebra without getting bogged down by zebra (because zebra
> queues up the work rather than doing it synchronously - unix stream
> sockets block the writer if the socket is full, don't they?). Ie the
> deletes go to a pending rib_process work queue and are dealt with at
> zebra's leisure rather than being processed there and then (as zebra
> does now).

Okay, so that's not a problem (bgpd doesn't drop peers anymore? very nice!)


> To deal with acks, hmmm... i guess you could create a work queue for
> sending messages to the kernel, then have the worker function (which
> would send the message to the kernel) save the queue item's opaque
> data (ie the details) to a list along with the netlink sequence number
> (bonus points if it's a sparse array indexed by the sequence number :)
> ) - then when receiving acks, you can reconcile against that list.
> occasionally scan the list for items which are 'old' and if you find
> any just add them back to the work queue. is that roughly what you
> mean?
>
> but what about dependencies? eg say you:
>
> add prefix y via x, sequence 1.
> delete prefix y, seq. 2
> add prefix y via z, seq 3.
> <ack arrives for 1>
> reconcile 1
> <ack 3>
> reconcile 3.
> .
> .
> .
> scan pending-acks, find seq 2 unacked and due for retransmit.
> delete prefix y, seq. 4.
>
> prefix y is gone.

Good example, although this isn't the real problem, IMO: the vanishing
of prefix y could be prevented rather easily if we used explicit
pointers to RIB elements as the callback info -- when the handler
receives an ack/nack, all it needs to do is raise the 'fib' flag of the
corresponding nexthop elements. Nonetheless, it is probably required
that any further modifications to that route node's routes should be
delayed until that ack arrives, and how this is to be implemented is a
good question -- locally, as a per-node waiting list (probably most
efficient, in terms of utilizing kernel delays), or globally, as a
RIB-wide waiting list (simpler, yet inefficient)? In other words, there
needs to be some locking and queueing for non-acked routes.
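
As a rough sketch of the bookkeeping being discussed here (every name
below is invented, none of this is existing zebra code): each message
sent to the kernel is remembered under its netlink sequence number, an
ack hands the route node back so the 'fib' flag can be raised on its
nexthops, and a periodic scan requeues anything that has gone
unacknowledged for too long -- which is exactly where the reordering
hazard in Paul's example lives, hence the need for some per-node (or
RIB-wide) serialization:

#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Invented pending-ack bookkeeping, keyed by netlink sequence number. */
struct pending_ack
{
  uint32_t seq;                 /* netlink sequence number          */
  void *route_node;             /* opaque pointer back into the RIB */
  time_t sent_at;
  struct pending_ack *next;
};

static struct pending_ack *pending_list;

/* Remember a message we just sent to the kernel. */
void
pending_register (uint32_t seq, void *route_node)
{
  struct pending_ack *p = calloc (1, sizeof (*p));
  p->seq = seq;
  p->route_node = route_node;
  p->sent_at = time (NULL);
  p->next = pending_list;
  pending_list = p;
}

/* Called from the netlink read path when the ack/nack for `seq'
 * arrives: unlink the entry and return the route node so the caller
 * can set (or clear) the FIB flag on the matching nexthops. */
void *
pending_reconcile (uint32_t seq)
{
  struct pending_ack **pp;

  for (pp = &pending_list; *pp; pp = &(*pp)->next)
    if ((*pp)->seq == seq)
      {
        struct pending_ack *p = *pp;
        void *rn = p->route_node;
        *pp = p->next;
        free (p);
        return rn;
      }
  return NULL;                  /* unknown sequence number */
}

/* Periodic scan: anything older than `max_age' seconds is treated as
 * lost and handed to `requeue' (e.g. pushed back onto the kernel work
 * queue).  A retransmitted delete must not be allowed to overtake a
 * newer add for the same prefix. */
unsigned int
pending_scan_stale (time_t max_age, void (*requeue) (void *route_node))
{
  unsigned int n = 0;
  time_t now = time (NULL);
  struct pending_ack **pp = &pending_list;

  while (*pp)
    {
      if (now - (*pp)->sent_at > max_age)
        {
          struct pending_ack *p = *pp;
          *pp = p->next;
          requeue (p->route_node);
          free (p);
          n++;
        }
      else
        pp = &(*pp)->next;
    }
  return n;
}

Whether the waiting list is per node or RIB-wide then mostly decides
where pending_reconcile() has to go to unblock further work for that
route node.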


> is that possible? does zebra specify the 'via' and hence avoid this
> problem? if not, how do you handle the dependency?

See above (hope this is what you were asking).


> queues: from the zebra rib POV - by the time zebra gets to the point
> where it sends a delete or add message to the kernel, all rib
> processing is finished, isn't it? (ie there is no further rib
> processing to be done /after/ we get the ack?)

No, just process the next route update... ;-> (maybe I misunderstood
the question?)

...
> interesting thoughts :)

As mentioned, Monday morning after a long weekend.

Gilad
Re: [quagga-users 491] Re: Hardware Spec for a route server
On Mon, 29 Sep 2003, Gilad Arnold wrote:

> (a lot of profiling output I can't read)
>
> What's that output?

gprof.

> Any legend / manpage I can refer to?

hmm.. best thing to consult is the gprof info page -> pinfo gprof.

> What and how is it measuring a process' performance, wrt user time,
> kernel time and overall time?

it's solely user time, afaik, so it doesn't account for kernel or
wall-clock time.

> Doesn't seem too strange to me: since we're not deleting any
> addresses / connected routes, rib_update is not involved, hence
> add/del operations should be rather equivalent, don't you think?

guess so :)

> Yes, but you should also consider the context switch for read
> operation, the kernel overhead (buffer copying, socket management,
> etc),

yes, but it should be fairly efficient - unix sockets are /very/
efficient (e.g. they're what the X Window System protocol uses).

> and so on; and all that, just because we don't use a "killall"
> message instead...

yeah, that would be useful.

> Well, bear in mind that using a bulk deletion is much faster in
> that sense: you don't do a prefix_match for each route (something
> like O(log n) each, O(n log n) in total) but rather scan the whole
> tree once (i.e. O(n) total); so there's some benefit in that
> respect as well.

yes. that would save a good bit of work in itself.

> Yes, it should be very easy to do in zebra, less easy in the protocol
> daemons (it requires a specialized event for these cases, e.g. peer
> drop in bgpd, neighbor timeout in ospfd, etc.).

those are the most stressful events. in normal operation, bgpd and
zebra don't use /too/ much cpu: zebra can hover around 5 to 10% due
to the steadyish update/withdrawal traffic found in the DFZ, bgpd
around 1 to 4%. peer events are what cause problems, so that's where
this could /really/ be of use.

> The 'or' is there because option (2) is far too general, and neither
> very feasible nor really required, IMO... (mainly I put it in to
> better support my point ;-> but really, I don't think it's worth
> designing and implementing, not at the moment)

k.

> As I said, the profiling conclusions make sense to me (that is, in
> and out are the same), so you may have a point here: I guess it
> could be easily tested with a simple script that adds some 10-20K
> static routes via iproute2, then delete them, with proper
> timestamps. What do you say?

i'll give it a go.

> Okay, so that's not a problem (bgpd doesn't drop peers anymore?
> very nice!)

I didn't say that :)

My bgpd patch is still WIP - but i'd like some feedback on the
workqueue patch and some testing of the zebra rib_process -> work
queue patch. Some very rough testing suggests it works (at least it
survived a day of running with a bgpd client that had a live feed),
but i'd like to make sure the wq side of things is solid - no one
else has tested it yet.

but that patch provides the framework by which we can eliminate
bgpd's peer-drop problem, yes.

another nasty bgpd problem (i /suspect/): type 'show ip bgp' so that
you have the beginnings of your large bgp table displayed to you in a
pager. now leave it that way. I /suspect/ this blocks bgpd.

> > add prefix y via x, sequence 1.
> > delete prefix y, seq. 2
> > add prefix y via z, seq 3.
> > <ack arrives for 1>
> > reconcile 1
> > <ack 3>
> > reconcile 3.
> > .
> > .
> > .
> > scan pending-acks, find seq 2 unacked and due for retransmit.
> > delete prefix y, seq. 4.
> >
> > prefix y is gone.
>

> Good example, although this isn't the real problem, IMO: the
> vanishing of prefix y could be prevented rather easily if we used
> explicit pointers to RIB elements as the callback info -- when the
> handler receives an ack/nack, all it needs to do is raise the 'fib'
> flag of the corresponding nexthop elements.

i don't follow..

> Nonetheless, it is
> probably required that any further modifications to that route
> node's routes should be delayed until that ack arrives,

probably.

> and how this is to be implemented is a good question -- locally, as
> a per-node waiting list (probably most efficient, in terms of
> utilizing kernel delays), or globally, as a RIB-wide waiting list
> (simpler, yet inefficient)? In other words, there needs to be some
> locking and queueing for non-acked routes.

well, the work queue supports actions. so the 'work function' (the
function supplied by whoever set up the queue to process each item in
the queue) can return REQUEUE to tell the work queue layer to simply
requeue this item at the back of the queue. so the work function
(which processes each 'queue item') would first look at the node; if
an ack is pending -> return REQUEUE and the work queue code takes
care of requeueing. Alternatively, it can return ERROR and the work
queue code can run the error handler which was specified for the queue.
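
A sketch of the shape such a work function could take; the type and
constant names are made up here, they are not the actual API of the
work-queue patch:

/* Invented return codes and item type, mirroring the behaviour
 * described above: REQUEUE sends the item to the back of the queue,
 * ERROR invokes the queue's error handler. */
enum wq_ret { WQ_OK, WQ_REQUEUE, WQ_ERROR };

struct rib_wq_item
{
  void *route_node;             /* node whose RIB changes are queued */
  int ack_pending;              /* still waiting on a kernel ack?    */
};

/* Per-item work function: if the node is still waiting for a netlink
 * ack, punt it to the back of the queue and let the work-queue layer
 * retry later; otherwise do the real processing now. */
enum wq_ret
rib_process_wq_item (struct rib_wq_item *item)
{
  if (item == NULL)
    return WQ_ERROR;            /* queue's error handler takes over  */

  if (item->ack_pending)
    return WQ_REQUEUE;          /* work-queue code requeues the item */

  /* ... run the actual rib_process () work for this node here ... */

  return WQ_OK;
}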

> No, just process the next route update... ;-> (maybe I misunderstood
> the question?)

that was the question :)

> As mentioned, Monday morning after a long weekend.

hehe.

> Gilad

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
Nothing is rich but the inexhaustible wealth of nature.
She shows us only surfaces, but she is a million fathoms deep.
-- Ralph Waldo Emerson
Re: [quagga-users 491] Re: Hardware Spec for a route server
On Mon, 29 Sep 2003, Gilad Arnold wrote:

> As I said, the profiling conclusions make sense to me (that is, in and
> out are the same), so you may have a point here: I guess it could be
> easily tested with a simple script that adds some 10-20K static routes
> via iproute2, then delete them, with proper timestamps. What do you say?

just wondering, netlink can have multipart messages. does zebra make
use of this? or does it send one rtnetlink message per netlink
datagramme? and if not, could we (hint hint gilad) make use of
multipart messages? (and does that imply queuing and coalescing
support? ie rib_process queue -> message queue -> coalesce)
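
For reference, a sketch of the coalescing half of that idea (not how
zebra's rt_netlink.c works today, and the buffer size is arbitrary):
several already-built rtnetlink requests, each with its own nlmsghdr,
are packed back to back into one buffer and flushed with a single
send(). Each message still carries its own sequence number, so
per-message acks keep working:

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#define COALESCE_BUF_SIZE 8192  /* arbitrary batch size */

struct nl_batch
{
  char buf[COALESCE_BUF_SIZE];
  size_t len;
};

/* Append one complete, already-built netlink message; returns 0, or
 * -1 if the batch is full and must be flushed first. */
int
nl_batch_add (struct nl_batch *b, const struct nlmsghdr *nlh)
{
  size_t need = NLMSG_ALIGN (nlh->nlmsg_len);

  if (b->len + need > sizeof (b->buf))
    return -1;
  memcpy (b->buf + b->len, nlh, nlh->nlmsg_len);
  /* zero the alignment padding between messages */
  memset (b->buf + b->len + nlh->nlmsg_len, 0, need - nlh->nlmsg_len);
  b->len += need;
  return 0;
}

/* One system call (and one context switch) for the whole batch. */
ssize_t
nl_batch_flush (int nl_sock, struct nl_batch *b)
{
  ssize_t ret = send (nl_sock, b->buf, b->len, 0);

  b->len = 0;
  return ret;
}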

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
The reward for working hard is more hard work.
Re: [quagga-users 491] Re: Hardware Spec for a route server
I'm only responding to one item in your reply...


Paul Jakma wrote:

> another nasty bgpd problem (i /suspect/): type 'show ip bgp' so that
> you have the beginnings of your large bgp table displayed to you in a
> pager. now leave it that way. I /suspect/ this blocks bgpd.

I don't believe it blocks bgpd: what I've experienced in zebra itself
('show ip route'), and what I see in bgpd/bgp_route.c as well, is that
when the lines limit is reached, the current route node is recorded to
the vty's output_rn member (together with some other parameters, like
the callback function, etc) and the process is released until further
input is received from the user, be it 'continues' or 'break'. The only
lock I can think of is the refcount of the current route node, however
this shouldn't block anything, at least not as long as the route table
is intact, of course (can that happen?! what happens upon 'no router
bgp'? worth checking!... ;-> )

Gilad
Re: [quagga-users 491] Re: Hardware Spec for a route server
Paul Jakma wrote:

> just wondering, netlink can have multipart messages. does zebra make
> use of this? or does it send one rtnetlink message per netlink
> datagramme? and if not, could we (hint hint gilad) make use of
> multipart messages? (and does that imply queuing and coalescing
> support? ie rib_process queue -> message queue -> coalesce)

No, it doesn't use it right now, for good reasons: work is generally
synchronous in zebra, wrt kernel management, namely zebra expects
that a returned kernel call has accomplished its action (that's the main
reason for using blocking read calls to get netlink acks, an otherwise
intolerable practice in zebra). I assume that, in the presence of kernel
work queues (somewhere around Q2-04... ;-> ), this could be implemented,
but then again -- only for the netlink interface (is it worthwhile? not sure)

Gilad
Re: [quagga-users 491] Re: Hardware Spec for a route server
On Mon, 29 Sep 2003, Gilad Arnold wrote:

> No it doesn't use it right now, for good reasons: work is generally
> synchronous in zebra,

yes, but thats bad :)

> wrt kernel management, namely zebra expects that a returned
> kernel call has accomplished its action

yep.

> (that's the main reason for using blocking read calls to get
> netlink acks, an otherwise intolerable practice in zebra).

indeed - but what does zebra do currently if it gets an error back
from the kernel, or if it gets nothing? (hmmm.. it must get something).

> I assume that, in the presence of kernel work queues (somewhere
> around Q2-04... ;-> ),

2.6?

> this could be implemented, but then again -- only for the netlink
> interface (is it worthwhile? not sure)

if nothing else it would save a wee bit via fewer context switches for
netlink comms. and if the kernel is smart, it could save it work too.
but i don't know either.

> Gilad

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
A statesman is a politician who's been dead 10 or 15 years.
-- Harry S. Truman
Re: [quagga-users 491] Re: Hardware Spec for a route server
On Mon, 29 Sep 2003, Gilad Arnold wrote:

> I don't believe it blocks bgpd:

yeah, i didn't think it could either, but i had an unexplained 4-day
outage (friday to monday - we hadn't set up monitoring for that bgpd
and didn't notice) on a bgpd, and the only thing i can pin it on is
having left a 'show ip bgp' sitting in the pager. when i hit q on the
pager on tuesday morning it apparently went back to normal and
immediately reset the session and all was well.

I don't know though if it was a problem with all peers or just the ISP
who got in touch. That bgpd had one other upstream peer who are
really good about monitoring and usually raise all sorts of alarms
within ten minutes max of there being a problem, but they hadn't.

maybe it was coincidence.. i don't know.

> what I've experienced in zebra itself ('show ip route'), and what I
> see in bgpd/bgp_route.c as well, is that when the lines limit is
> reached, the current route node is recorded to the vty's output_rn
> member (together with some other parameters, like the callback
> function, etc) and the process is released until further input is
> received from the user, be it 'continues' or 'break'.

yep. i don't know how it would block bgpd.

> The only lock I can think of is the refcount of the current route
> node, however this shouldn't block anything, at least not as long
> as the route table is intact, of course (can that happen?! what
> happens upon 'no router bgp'? worth checking!... ;-> )

how do you mean? while vty output is suspended in the pager?

anyway, i don't know how, and maybe it was pure coincidence, perhaps
some other problem altogether, i don't know - but it's something on my
'must investigate' list anyway.

> Gilad

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
try again