Paul Jakma wrote:
> look at the call graphs for zebra_client_read [3] and its children,
> zread_ipv4_{add,delete} ([6] and [12] respectively).
(a lot of profiling output I can't read)
What's that output? Any legend / manpage I can refer to? What is it
measuring about a process's performance, and how, w.r.t. user time,
kernel time and overall time?
> the _delete path is similar, it hits rib_process via rib_delete_ipv4.
> majority of the time is spent and under rib_process. However,
> strangely, the cumulative time for adds and deletes appears to be
> equal - which is strange, because wall-clock time for the delete case
> is /much/ higher than for add if memory serves me right. hmm...
Doesn't seem too strange to me: since we're not deleting any addresses /
connected routes, rib_update is not involved, hence add/del operations
should be rather equivalent, don't you think? (also, see below)
> time spent in zread_* and stream_* are pretty low taking about 0.10
> to 0.15s in total for each path. and anyway, /adding/ the feed is
> pretty quick - and its roughly the same amount of data that has to be
> read and marshaled for the delete case.
Yes, but you should also consider the context switches for the read
operations, the kernel overhead (buffer copying, socket management, etc.),
and so on; and all that, just because we don't use a "killall" message instead...
> after that the main time spent is in the route table, especially
> prefix_match/check_bit. and i dont see a way to speed those up
> (they're pretty fast already, and i cant think of a better way to
> index prefixes that preserves parent/child relationships that wouldnt
> require occasional expensive rebalancing - but i know nothing :) ).
Well, bear in mind that a bulk deletion is much faster in that sense: you
don't do a prefix_match per route (roughly O(log n) each, O(n log n) in
total) but rather scan the whole tree once (O(n) total); so there's some
benefit in that respect as well.
>>several improvements come to mind, like (1) extending the zebra
>>protocol in order to support "route tagging", that is allow
>>protocol daemons to mark each route with some tag (for example, the
>>peer-id for bgpd instances, neighbor-id for ospfd, etc) and provide
>>a "delete routes with tag X" message;
>
>
> interesting.. so a client could express with one message what
> otherwise might take 120k? yes, could be very useful.
Yes, it should be very easy to do in zebra, less easy in the protocol
daemons (it requires a specialized event for these cases, e.g. peer drop
in bgpd, neighbor timeout in ospfd, etc).
>>or (2) add classification capabilities to "route delete" messages,
>>thus allowing for bulk removal of all routes that match a
>>combination of destination prefix, nexthop, distance, and so on.
>>Note that all bulk deletions should be restricted to the routes of
>>the very client (protocol type + client-id) that issued them, and
>>so we're safe with their impact.
>
>
> yep. why 'or' though? really you want 1 and 2, no? ie 1 to allow a
> client to specify a class (via the tag) and 2 being the most natural
> operation that would operate on a tag/class. (_LOOKUP could use it
> too i guess).
The 'or' is there because option (2) is far too general, and neither as
feasible nor really required, IMO... (mainly I put it in to better
support my point ;-> but really, I don't think it's worth designing and
implementing, not at the moment)
> sounds like a good idea anyway, less work for the client anyway, if
> not for zebra. (it could maintain lists of tags too - to save on
> full route table lookups).
Good point (about the tag lists).
...
> yes, and i wonder, is it simply the kernel which is less efficient at
> doing deletes (see above about how deletion within zebra /seemingly/
> taking as much time as adds - could be profiling problems though.)
As I said, the profiling conclusions make sense to me (that is, in and
out are the same), so you may have a point here: I guess it could be
easily tested with a simple script that adds some 10-20K static routes
via iproute2, then deletes them, with proper timestamps. What do you say?
...
> Hmm.. for breaking up the sending of 100k deletions, current patch
> already does that - it allows bgpd to quickly transfer the 120k
> deletes to zebra without getting bogged down by zebra (cause zebra
> queues up the work rather than doing it synchronously - unix stream
> sockets block writer if socket is full, dont they?). Ie the deletes
> go to a pending rib_process work queue and are dealt with at zebra's
> leisure rather than being processed there and then (as zebra does
> now).
Okay, so that's not a problem (bgpd doesn't drop peers anymore? very nice!)
> To deal with acks, hmmm... i guess you could create a work queue for
> sending messages to the kernel, then have the worker function (which
> would send the message to the kernel) save the queue item opaque data
> (ie the details) to a list along with the netlink sequence number
> (bonus points if its a sparse array indexed by the sequence number :)
> ) - then when receiving acks, you can reconcile against that list.
> occasionally scan the list for items which are 'old' and if you find
> any just add them back to the work queue. is that roughly what you
> mean?
>
> but what about dependencies? eg say you:
>
> add prefix y via x, sequence 1.
> delete prefix y, seq. 2
> add prefix y via z, seq 3.
> <ack arrives for 1>
> reconcile 1
> <ack 3>
> reconcile 3.
> .
> .
> .
> scan pending-acks, find seq 2 unacked and due for retransmit.
> delete prefix y, seq. 4.
>
> prefix y is gone.
Good example, although this isn't the real problem, IMO: the vanishing
of prefix y could be prevented rather easily if we used explicit
pointers to RIB elements as the callback info -- when the handler
receives an ack/nack, all it needs to do is raise the 'fib' flag of the
corresponding nexthop elements. Nonetheless, any further modifications
to that route node's routes should probably be delayed until the ack
arrives, and how this is to be implemented is a good question --
locally, as a per-node waiting list (probably most efficient, in terms
of exploiting kernel delays), or globally, as a RIB-wide waiting list
(simpler, yet less efficient)? In other words, there needs to be some
locking and queueing for non-acked routes.
> is that possible? does zebra specify the 'via' and hence avoid this
> problem? if not, how do you handle the dependency?
See above (hope this is what you were asking).
> queues: from the zebra rib POV - by the time zebra gets to the point
> where it sends a delete or add message to the kernel, all rib
> processing is finished, isnt it? (ie there is further no rib
> processing to be done /after/ we get ack?)
No, just process the next route update... ;-> (maybe I misunderstood
the question?)
...
> interesting thoughts :)
As mentioned, Monday morning after a long weekend.
Gilad