Mailing List Archive

Any clues to source of this delay?
Ok, as I've noted in some earlier posts, I have a nasty tendency of
prototyping protocols in Python for occasional implementation in more
'robust', 'commercially feasible' languages. I've run into an oddity this
time, though:

Background:

The protocol involves encapsulating data as 'packets' on a TCP
connection, somewhat like the Record layer of TLSv1. The simple protocol
runs on top of TCP; this, in itself, is nothing particularly new or
fancy. The _problem_, however, seems to be.

Problem:

On Linux, single connections hit a 'glass ceiling' of 50 roundtrips/second;
no matter what I kludge, it seems to stay there. The only way to fix this
was to send 'empty' packets after processed ones, thus increasing
throughput by about 5X.

On NT, the glass ceiling is the same, just 1/10 of Linux's (5
roundtrips/second).

I _assume_ this is some TCP/Python-related feature; no traffic other than
that mentioned occurs during the test period. Additionally, for pure
TCP, much higher throughput with small packets can be achieved:

P/TCP:snd+rcv : 2110.1495/sec [1.39s] (473.90us/call)

However, with the protocol around it, the times suddenly die, literally:

Single echo(c:Up,exit:Normal) : 23.6462/sec [4.06s] (42.290ms/call)

And the CPU use is about 2%, so the delays are .. somewhere. Any
insights on where? The Linux performance is sufficient, but the NT
performance most definitely is _NOT_.
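The setup being described can be reproduced with a small self-contained
sketch (illustrative only, not the original test harness; the framing
format and all names are assumptions): an echo server on localhost and a
client that counts round trips of tiny length-prefixed 'packets'.

```python
import socket
import struct
import threading
import time

def recv_exact(conn, n):
    """Read exactly n bytes (or fewer only on EOF)."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            break
        buf += chunk
    return buf

def echo_server(listener):
    conn, _ = listener.accept()
    with conn:
        while True:
            hdr = recv_exact(conn, 2)          # 2-byte length prefix
            if len(hdr) < 2:
                break
            (n,) = struct.unpack("!H", hdr)
            conn.sendall(hdr + recv_exact(conn, n))

def measure_roundtrips(n_rounds=200):
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    threading.Thread(target=echo_server, args=(listener,), daemon=True).start()

    cli = socket.create_connection(listener.getsockname())
    packet = struct.pack("!H", 4) + b"ping"    # tiny TLS-Record-style 'packet'
    start = time.perf_counter()
    for _ in range(n_rounds):
        cli.sendall(packet)
        recv_exact(cli, len(packet))           # wait for the echoed packet
    elapsed = time.perf_counter() - start
    cli.close()
    listener.close()
    return n_rounds / elapsed
```

On loopback the numbers will be far better than over a real link, but the
same harness pointed at a remote host exposes the per-roundtrip ceiling.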

For reference, here's where it blocks in the Linux case:

+---------+-------------------+-------+--------------------------------------+
|Function | Time spent|# Calls| Percent|
+---------+-------------------+-------+--------------------------------------+
|recv | 79ms and 223us| 510|* |
|send | 354ms| 512|** |
|select |9s, 564ms and 750us| 530|***************95.08%*************** |
+---------+-------------------+-------+--------------------------------------+
+-------+--------------------------------------------------------------------+
|Threads| Coverage|
+-------+--------------------------------------------------------------------+
|1 |***************************90.11%**************************** |
+-------+--------------------------------------------------------------------+

Therefore it seems as if it just waits for the pending data in select
about 90% of the time. That does not, by my definition, seem 'good'.

That reminds me, is there a tool for blocking coverage? CPU-use coverage is
rarely as interesting as 'where the code spends its time', to me at any rate.
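In the absence of such a tool, a crude one can be improvised (a sketch
with invented names, not an existing utility): wrap select() and
accumulate wall-clock time spent blocked, then compare that against total
runtime for a "blocking coverage" percentage.

```python
import select
import time

class BlockProfiler:
    """Accumulates wall-clock time spent blocked inside select()."""

    def __init__(self):
        self.blocked = 0.0   # total seconds spent waiting in select()
        self.calls = 0

    def select(self, rlist, wlist, xlist, timeout=None):
        t0 = time.perf_counter()
        result = select.select(rlist, wlist, xlist, timeout)
        self.blocked += time.perf_counter() - t0
        self.calls += 1
        return result

# Usage: route the event loop's select() calls through a BlockProfiler
# instance, then report prof.blocked / total_runtime at exit.
```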

-Markus Stenberg

--
Running Windows on a Pentium is like having a brand new Porsche but
only being able to drive backwards with the handbrake on.
(Unknown source)
Any clues to source of this delay? [ In reply to ]
Markus Stenberg wrote:

> Ok, as I've noted in some earlier posts, I have a nasty tendency of
> prototyping protocols in Python for occasional implementation in more
> 'robust', 'commercially feasible' languages. I've run into an oddity
> this time, though:
>
> Background:
>
> The protocol involves encapsulating data as 'packets' on a TCP
> connection, somewhat like the Record layer of TLSv1. The simple
> protocol runs on top of TCP; this, in itself, is nothing particularly
> new or fancy. The _problem_, however, seems to be.
>
> Problem:
>
> On Linux, single connections hit a 'glass ceiling' of 50
> roundtrips/second; no matter what I kludge, it seems to stay there.
> The only way to fix this was to send 'empty' packets after processed
> ones, thus increasing throughput by about 5X.
>
> On NT, the glass ceiling is the same, just 1/10 of Linux's (5
> roundtrips/second).
>
> I _assume_ this is some TCP/Python-related feature; no other traffic
> than the one mentioned occurs during the test period.

I have also run into similar (though not identical) oddities. (In my
case I was stuck with the protocol, but had to get the servers
working better.) I found that Linux (AMD K6, 200MHz) would max out
at 90 "messages"/sec, and NT (P100) at 50/sec.

As a first cut, I wrote a pure Python client and server, and ignored
their protocol. I was maxing out the LAN connection from any box.

Then I used their protocol and their client (which can't send more
than 40 / sec on the Linux box) and ran into these limits. Then I
rewrote in C. The amount of CPU used dropped, but the speed did not
increase.

So, at least in my case, there's no blame on Python. I still don't
understand what it is about their protocol that causes this slowdown.
It appears that select just takes longer than it "should".

- Gordon
Any clues to source of this delay? [ In reply to ]
In article <al8vhaxtg1y.fsf@sirppi.helsinki.fi>,
Markus Stenberg <mstenber@cc.Helsinki.FI> wrote:
>
>On Linux, single connections have 'glass ceiling' of 50 roundtrips/second;
>no matter what I kludge, it seems to stay there. Only way to fix this was
>to send 'empty' packets after processed ones, thus increasing throughput by
>about 5X.
>
>On NT, the glass ceiling is same, just 1/10 of Linux's (5
>roundtrips/second).

I can probably explain part of the NT problem: I bet you're using
Workstation instead of Server. Workstation has artificial limits.
--
--- Aahz (@netcom.com)

Androgynous poly kinky vanilla queer het <*> http://www.rahul.net/aahz/
Hugs and backrubs -- I break Rule 6 (if you want to know, do some research)
Any clues to source of this delay? [ In reply to ]
Markus> Problem:

Markus> On Linux, single connections have 'glass ceiling' of 50
Markus> roundtrips/second; no matter what I kludge, it seems to stay
Markus> there. Only way to fix this was to send 'empty' packets after
Markus> processed ones, thus increasing throughput by about 5X.

Markus> On NT, the glass ceiling is same, just 1/10 of Linux's (5
Markus> roundtrips/second).

TCP's slow start algorithm, perhaps? If you can figure out a way to reuse
the connection for multiple messages, you may find that subsequent messages
are sent more quickly. A quick check at Google suggests that slow start
details can be found in RFCs 2001 and 2581.

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/~skip/
847-475-3758
Any clues to source of this delay? [ In reply to ]
Skip Montanaro <skip@mojam.com> writes:
> Markus> Problem:
>
> Markus> On Linux, single connections have 'glass ceiling' of 50
> Markus> roundtrips/second; no matter what I kludge, it seems to stay
> Markus> there. Only way to fix this was to send 'empty' packets after
> Markus> processed ones, thus increasing throughput by about 5X.
>
> Markus> On NT, the glass ceiling is same, just 1/10 of Linux's (5
> Markus> roundtrips/second).
> TCP's slow start algorithm, perhaps? If you can figure a way to reuse the
> connection for multiple messages you may find that subsequent messages are
> sent more quickly. A quick check at Google suggests that slow start details
> might be found in RFC's 2001 and 2581.

Oh, sorry, I wasn't being explicit; it isn't slow start, as the ceiling is
on a _single_ connection, however long it stays up. By opening multiple
connections, the server's total handling capacity goes up, but a single
connection's speed stays constant.

-Markus

>
> Skip Montanaro | http://www.mojam.com/
> skip@mojam.com | http://www.musi-cal.com/~skip/
> 847-475-3758
>

--
Markus Stenberg
Any clues to source of this delay? [ In reply to ]
> Oh, sorry, wasn't being explicit; it isn't slow start, as the ceiling
> is on _single_ connection, however long it stays up. By opening
> multiple connections, the server's total handling capacity goes up but
> single connections' speed stays constant.

At the risk of being completely wrong: your problem might be caused by the
fact that TCP has a sliding window of 64KB. This means that there can never
be more than 64KB of outstanding, unacknowledged data on a single TCP
connection. When TCP reaches that limit, it stops sending data until an
acknowledgement from the receiver makes the window 'slide' to the right. The
larger the packets you send, the sooner the window fills up and the fewer
roundtrips per second you will get.
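If that window guess were right, the knob to try would be the socket
buffer sizes, which on most stacks also govern the TCP window the
receiver advertises. A hedged sketch in modern Python (the function name
is invented for illustration; the kernel may round or cap the values):

```python
import socket

def make_big_window_socket(bufsize=256 * 1024):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set before connect()/listen() so the enlarged window can be
    # advertised from the start of the connection.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bufsize)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bufsize)
    return s
```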

Hoping this guess is not totally off,

Robin.
Any clues to source of this delay? [ In reply to ]
"Robin Boerdijk" <robin.boerdijk@nl.origin-it.com> writes:
> > Oh, sorry, wasn't being explicit; it isn't slow start, as the ceiling
> > is on _single_ connection, however long it stays up. By opening
> > multiple connections, the server's total handling capacity goes up
> > but single connections' speed stays constant.
> At the risk of being completely wrong: your problem might be caused by
> the fact that TCP has a sliding window of 64KB. This means that there
> can never be more than 64KB of outstanding, unacknowledged data on a
> single TCP connection. When TCP reaches that limit, it stops sending
> data until an acknowledgement from the receiver makes the window 'slide'
> to the right. The larger the packets you send, the sooner the window
> fills up and the fewer roundtrips per second you will get.

The data transferred was minimal.. _apparently_ it was mainly caused by
delayed ack, although I doubt I'll ever know for sure (too lazy to reread
the RFCs; delayed ack would fit the picture, though, as would slow start
without the window size growing).
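Delayed ACK interacting with Nagle's algorithm would indeed fit a fixed
per-connection roundtrip ceiling: the sender holds small packets waiting
for an ACK that the receiver is deliberately delaying, so every exchange
pays the delayed-ACK timer. A hedged sketch of the usual mitigations in
modern Python (function name invented for illustration):

```python
import socket

def tune_for_small_roundtrips(sock):
    # TCP_NODELAY (disable Nagle on the sending side) is widely available.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # TCP_QUICKACK (ask the receiving side to ACK immediately) is
    # Linux-specific, and the flag resets after some stack events, so
    # real code re-sets it around each recv().
    if hasattr(socket, "TCP_QUICKACK"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
    return sock
```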

> Hoping this guess is not totally off,
>
> Robin.

-Markus

--

Only in the silence voice, Only in the darkness light,
Only in dying life; The hawk's bright flight on the empty sky
-- Ursula Le Guin / (the) Earthsea Quartet
Any clues to source of this delay? [ In reply to ]
This isn't unique to Python. It has to do with the way TCP buffers data
within a time window by default. The data is sent when some threshold size
is surpassed or when the time window expires (and I guess you saw that
sending zero bytes can force a flush on some platforms). For many
applications this buffering gives a performance improvement, but in your
case (the kind of application that wants to do lots of round-trips per
second with a small amount of data in each request), it kills performance.

To optimize a socket for this kind of conversation, you can use the socket
option TCP_NODELAY, but this is not a portable solution; I think it's a bit
different on Windows than on Linux, and it is not available on all
platforms. I found no reference to TCP_NODELAY in the Python reference,
which is probably for the best given its non-portability. If you are
targeting a platform that supports it, you can probably do something like:

import os
import socket

if (os.name == 'nt'):
    TCP_NODELAY = 1  # option value on Windows; not exported everywhere
    theSocket.setsockopt(socket.IPPROTO_TCP, TCP_NODELAY, 1)

Depending on the nature of the conversation, one or both sides of your
connection may want to do that. Protecting the nonstandard NODELAY option
with a platform test means the code will still work on other platforms,
although it will still have a "glass ceiling" on platforms that you haven't
planned for.

Bruce
Any clues to source of this delay? [ In reply to ]
Windows does indeed support this socket option.
Here's what MSDN has to say about it (in order to fill in all of the
icky details):

TCP_NODELAY
The TCP_NODELAY option is specific to TCP/IP service providers. The Nagle
algorithm is disabled if the TCP_NODELAY option is enabled (and vice versa).
The process involves buffering send data when there is unacknowledged data
already in flight or buffering send data until a full-size packet can be
sent. It is highly recommended that TCP/IP service providers enable the
Nagle Algorithm by default, and for the vast majority of application
protocols the Nagle Algorithm can deliver significant performance
enhancements. However, for some applications this algorithm can impede
performance, and TCP_NODELAY can be used to turn it off. These are
applications where many small messages are sent, and the time delays between
the messages are maintained. Application writers should not set TCP_NODELAY
unless the impact of doing so is well-understood and desired because setting
TCP_NODELAY can have a significant negative impact on network and
application performance.

Bill


> -----Original Message-----
> From: Bruce Dodson [mailto:bruce_dodson@bigfoot.com]
>
> This isn't unique to Python. It has to do with the way TCP buffers data
> within a time window by default. The data is sent when some threshold
> size is surpassed or when the time window expires (and I guess you saw
> that sending zero bytes can force it to flush on some platforms). For
> many applications this buffering gives a performance improvement, but
> in your case (the kind of application that wants to do lots of
> round-trips per second with small amounts of data in each request), it
> kills performance.
>
> To optimize a socket for this kind of conversation, you can use socket
> option TCP_NODELAY, but this is not a portable solution; I think it's a
> bit different on Windows than on Linux, and it is not available on all
> platforms. I found no reference to TCP_NODELAY in the Python reference,
> which is probably for the best given its non-portability. If you are
> targeting a platform that supports it, you can probably do something
> like:
>
> if (os.name == 'nt'):
>     TCP_NODELAY = 1
>     theSocket.setsockopt(IPPROTO_TCP, TCP_NODELAY, 1)
>
> Depending on the nature of the conversation, one or both sides of your
> connection may want to do that. Protecting the nonstandard NODELAY
> option with a platform test means the code will still work on other
> platforms, although it will still have a "glass ceiling" on platforms
> that you haven't planned for.
>
> Bruce
Any clues to source of this delay? [ In reply to ]
>Windows does indeed support this socket option.


I didn't say that it did not. Take a closer look at my example code:

>> if (os.name == 'nt') ...