Mailing List Archive

question about performance
Wikipedia is often extremely slow.
What's the bottleneck?
* network I/O?
* database performance?
* wikipedia script performance?
* something else?
Re: question about performance
Tomasz Wegrzanowski wrote:

>Wikipedia is often extremely slow.
>What's the bottleneck?
>* network I/O?
>* database performance?
>* wikipedia script performance?
>* something else?
>
No-one really knows. Another possibility would be
* dodgy hardware
Perhaps a disk, or a network switch, is intermittently hanging.

One interesting observation is that even when the English-language
Wikipedia is jammed up, the international ones, which I believe run on
the same server, are often working OK. This suggests a software, not a
hardware, problem.

The new software is currently under test: it's being filled with test
articles and exercised using test scripts.

Network I/O is not a big bottleneck: the new server has been clocked
doing 2.8 hits/sec sustained, and that was limited by the 512 kbit/s
bandwidth of the ADSL link to the testing machine. Further tests are
being carried out, and I hope to have some results soon.

Assuming Bomis has 10 Mbit/s of bandwidth available, such a link could
support roughly 20 x 2.8 = 56 hits/sec, or about 4.8 million hits/day,
assuming an evenly distributed load.
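
Spelling that arithmetic out as a quick Python sketch (the 10 Mbit/s
figure is an assumption, as above):

# Back-of-the-envelope capacity estimate from the test figures above.
test_link_kbps = 512        # ADSL link used for the load test
test_hits_per_sec = 2.8     # sustained rate observed over that link
server_link_kbps = 10_000   # assumed bandwidth available at Bomis

scale = server_link_kbps / test_link_kbps   # ~19.5, call it 20
hits_per_sec = scale * test_hits_per_sec    # ~55-56 hits/sec
hits_per_day = hits_per_sec * 86_400        # ~4.8 million hits/day
print(f"{hits_per_sec:.0f} hits/sec, {hits_per_day / 1e6:.1f}M hits/day")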

Database performance is likely to be constrained by two things:
* locking
* disk I/O

Locking is a problem because it serializes accesses, reducing
opportunities for parallel processing, and creating bottlenecks on the
locked resources.
Locking can be made better by (see the sketch after this list):
* locking for as short a time as possible
* locking with the finest grain possible
* using a database which supports concurrent transactions with reduced
locking
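
To illustrate the first two points, here's a minimal Python sketch of
short-held, fine-grained locks; the names and structure are invented
for the example, not taken from the actual codebase:

import threading

# One lock per article instead of a single global lock, so edits to
# different articles can proceed in parallel.
article_locks = {}              # article title -> its own lock
locks_guard = threading.Lock()  # protects the dict itself

def lock_for(title):
    # Hold the coarse lock only long enough to fetch or create the
    # fine-grained one; the slow work happens under the latter.
    with locks_guard:
        return article_locks.setdefault(title, threading.Lock())

def save_article(store, title, text):
    with lock_for(title):       # finest practical grain: per-article
        store[title] = text     # only writers to this article serialize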

Disk I/O can be made faster by:
* using disks which spin fast (reducing rotational latency)
* putting them in a big RAID with lots of spindles and a high-speed
attachment
* using an operating system which multi-threads I/O properly

Wikipedia script performance is unlikely to be the bottleneck. We now
have the opportunity to load the test system heavily and measure CPU
load, so that we can estimate this factor accurately.
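
As a rough idea of what such an exercise script can look like, here's a
minimal load generator in Python; the target URL and worker count are
placeholders, not the actual test setup:

import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://beta.wikipedia.com/"  # hypothetical test target
WORKERS = 10                        # concurrent simulated clients
REQUESTS = 200                      # total fetches

def fetch(_):
    # Download one page in full, as a client on a fast link would.
    with urllib.request.urlopen(URL, timeout=30) as response:
        response.read()

start = time.time()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    list(pool.map(fetch, range(REQUESTS)))
print(f"{REQUESTS / (time.time() - start):.1f} hits/sec sustained")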

Something else could be:
* Memory hogging

This is a little-known nasty factor in server programming. Here, the
problem is worker threads being tied up by slow or malfunctioning
clients, such as those on modems, or with high packet loss, or both.
Say a worker thread consumes W Mbytes of store, and an access transfers
50k bytes (400 kbits) of data.
Then a really slow link at say 20 kbps will take 20 seconds to download
this page. In doing so, it locks W Mbytes in store for that entire time.
If we have X megabytes of store, and slow clients are the dominating
factor, then we can only accommodate X/W concurrent workers, serving
(1/20)*X/W pages per second.

For X = 256, W=2, that's 6.4 hits per second. Therefore, a server needs
to have lots of RAM to prevent slow clients from blocking it. Hmm...
increasing the OS socket buffer size to > 50k might be a win here.
Fortunately, the new server has lots of RAM.
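
The arithmetic above, written out with the same assumed numbers:

# Throughput ceiling when slow clients pin worker memory.
# X = total store in Mbytes, W = Mbytes per worker; a 50 kbyte page
# (400 kbits) over a 20 kbps link ties up a worker for 20 seconds.
def slow_client_ceiling(X, W, page_kbytes=50, link_kbps=20):
    seconds_held = page_kbytes * 8 / link_kbps  # 400 kbit / 20 kbps = 20 s
    max_workers = X / W                         # workers that fit in store
    return max_workers / seconds_held           # pages served per second

print(slow_client_ceiling(X=256, W=2))          # -> 6.4 hits per second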

* Swapping
Once you are doing VM swapping on a webserver or database, performance
plummets. Memory leaks somewhere could be bloating processes, causing
the server to swap.

* Congestion collapse
Whatever the mechanism for going slow, once a system is overloaded it
can enter a state known as 'congestion collapse', where it's slow
because it's overloaded, and stays overloaded because the slowness
itself generates extra work. This is made worse if users keep on
retrying their requests. A system may take some time to recover from
congestion collapse, but after recovery it will be back to normal as if
the collapse had never happened. This resembles the current state of
affairs.
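
A toy model of that feedback loop, with made-up numbers, purely to show
the shape of the runaway:

# Demand slightly exceeds capacity; once the backlog makes responses
# slow, impatient users retry, which adds load and deepens the backlog.
capacity = 50.0  # requests/sec the server can complete
demand = 55.0    # fresh requests/sec arriving
queue = 0.0
for t in range(25):
    latency = queue / capacity                        # rough delay, seconds
    retries = 0.5 * demand if latency > 1.0 else 0.0  # users re-request
    queue = max(queue + demand + retries - capacity, 0.0)
    print(f"t={t:2d}s latency={latency:5.1f}s backlog={queue:6.0f}")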

* Cracking
Someone may have cracked the server, and be using it for malicious
purposes. Yow! I've just checked, and wikipedia.com is still running
Apache/1.3.23 (Unix) PHP/4.0.6 mod_fastcgi/2.2.12, which may well be
vulnerable to the chunked-transfer-encoding bug unless it's been patched.

Beta.wikipedia.com is running Apache/1.3.26 (Unix) PHP/4.2.1, so it's
not vulnerable to this bug.
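
For anyone who wants to repeat the check, reading the Server header is
enough; a quick Python sketch (point it at whichever host you want to
inspect):

import http.client

# Fetch only the response headers and print the advertised Server banner.
conn = http.client.HTTPConnection("wikipedia.com", 80, timeout=10)
conn.request("HEAD", "/")
print(conn.getresponse().getheader("Server"))
conn.close()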

* Denial-of-service attacks
SYN floods, that sort of thing.

Well, that's a list of the first few things off the top of my head.

Neil
Re: question about performance
Hi all,

After watching various wikipedia mailing lists silently for a while, it
might be time for my first two cents.

Neil Harris wrote:
> One interesting observation is that even when the English-language
> Wikipedia is jammed up, the international ones, which I believe run on
> the same server, are often working OK.

I have noticed that too, but also the other way round: e.g. the German
Wikipedia is jammed while the English version is still working. Since I
have often been looking at pages from the German version (de), the
German test version (test-de), and the English version, I think I have
seen all possible combinations of the three wikis working or jamming.
So I don't think the English version is anything special with regard to
this problem. But usually when one goes on strike, the others follow
within a few minutes. Maybe they're all in the same labor union ... ;-)

Sven (Ben-Zin)

Btw ... has anyone noticed yet that the automatically added address
Wikitech-l@ross.bomis.com at the bottom doesn't seem to work? The mail
server told me something about "User unknown". Maybe someone can update
this address.
Re: question about performance
On Thu, Jul 11, 2002 at 05:21:40PM +0100, Neil Harris wrote:
> One interesting observation is that even when the English-language
> Wikipedia is jammed up,
> the international ones, which I believe run on the same server, are
> often working OK. This suggests a software, not a hardware, problem.

I've seen both English-down-others-working, and all-down.

> Database performance is likely to be constrained by two things:
> * locking
> * disk I/O
>
> Locking is a problem because it serializes accesses, reducing
> opportunities for parallel processing, and creating bottlenecks on the
> locked resources.
> Locking can be made better by:
> * locking for as short a time as possible
> * locking with the finest grain possible
> * using a database which supports concurrent transactions with reduced
> locking

What about switching to Postgres?
It is said to have better locking.

> Disk I/O can be made faster by
> * using disks which spin fast (rotational latency is reduced), and
> * putting them in a big RAID with lots of spindles and a high-speed
> attachment
> * using an operating system which multi-threads I/O properly
>
> Wikipedia script performance is unlikely to be the bottleneck. We now
> have the opportunity to
> load the test system heavily and measure CPU load, to be able to
> estimate this factor accurately.

Even if it is far from consuming 100% CPU, if it's slow, it occupies
memory for a longer time. Or it may simply be using too much memory per
thread.

> Something else could be:
> * Memory hogging
>
> This is a little-known nasty factor in server programming. Here, the
> problem is worker threads being tied up by slow or malfunctioning
> clients, such as those on modems, or with high packet loss, or both.
> Say a worker thread consumes W Mbytes of store, and an access transfers
> 50k bytes (400 kbits) of data.
> Then a really slow link at say 20 kbps will take 20 seconds to download
> this page. In doing so, it locks W Mbytes in store for that entire time.
> If we have X megabytes of store, and slow clients are the dominating
> factor, then we can only accommodate X/W concurrent workers, serving
> (1/20)*X/W pages per second.
>
> For X = 256, W=2, that's 6.4 hits per second. Therefore, a server needs
> to have lots of RAM to prevent slow clients from blocking it. Hmm...
> increasing the OS socket buffer size to > 50k might be a win here.
> Fortunately, the new server has lots of RAM.

2 megabytes of non-shared memory per thread?
That would be enormous.

What's the real value like?

Also, if the thread is up, it may be unnecessarily holding a database
connection. But that's not likely to be a major problem.

> * Swapping
> Once you are doing VM swapping on a webserver or database, performance
> plummets. Memory leaks somewhere could be bloating processes, causing
> the server to swap.

Swapping isn't a problem, it's a symptom.

Heavy apache or mysql bloat is unlikely, and Wikipedia threads are too
short-lived to have a chance of bloating much.
Re: question about performance
Tomasz Wegrzanowski wrote:

>On Thu, Jul 11, 2002 at 05:21:40PM +0100, Neil Harris wrote:
>
[snip]

>>
>>Locking is a problem because it serializes accesses, reducing
>>opportunities for parallel processing, and creating bottlenecks on the
>>locked resources.
>>Locking can be made better by:
>>* locking for as short a time as possible
>>* locking with the finest grain possible
>>* using a database which supports concurrent transactions with reduced
>>locking
>
>What about switching to Postgres?
>It is said to have better locking.
>
Indeed so. I was going to get to that later. ;-)

>
>>Disk I/O can be made faster by
>>* using disks which spin fast (rotational latency is reduced), and
>>* putting them in a big RAID with lots of spindles and a high-speed
>>attachment
>>* using an operating system which multi-threads I/O properly
>>
>>Wikipedia script performance is unlikely to be the bottleneck. We now
>>have the opportunity to
>>load the test system heavily and measure CPU load, to be able to
>>estimate this factor accurately.
>
>Even if it is far from consuming 100% CPU, if it's slow, it occupies
>memory for a longer time. Or it may simply be using too much memory per
>thread.
>
>>Something else could be:
>>* Memory hogging
>>
>>This is a little-known nasty factor in server programming. Here, the
>>problem is worker threads being tied up by slow or malfunctioning
>>clients, such as those on modems, or with high packet loss, or both.
>>Say a worker thread consumes W Mbytes of store, and an access transfers
>>50k bytes (400 kbits) of data.
>>Then a really slow link at say 20 kbps will take 20 seconds to download
>>this page. In doing so, it locks W Mbytes in store for that entire time.
>>If we have X megabytes of store, and slow clients are the dominating
>>factor, then we can only accommodate X/W concurrent workers, serving
>>(1/20)*X/W pages per second.
>>
>>For X = 256, W=2, that's 6.4 hits per second. Therefore, a server needs
>>to have lots of RAM to prevent slow clients from blocking it. Hmm...
>>increasing the OS socket buffer size to > 50k might be a win here.
>>Fortunately, the new server has lots of RAM.
>
>2 megabytes of non-shared memory per thread?
>That would be enormous.
>
>What's the real value like?
>
>Also, if the thread is up, it may be unnecessarily holding a database
>connection. But that's not likely to be a major problem.
>
Think in terms of a C stack, and a complete set of kernel data
structures, sockets etc. OK, maybe more like 1 Mbyte+.

Of course, a lightweight thread takes significantly fewer resources.
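
For a rough feel for the per-thread stack reservation on a Unix box,
a small Python sketch (what it reports is platform-dependent):

import resource
import threading

# RLIMIT_STACK is the process stack ceiling; pthread implementations
# commonly reserve a default stack of this order per thread (virtual,
# not all resident), plus kernel structures and socket buffers on top.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("stack limit:",
      "unlimited" if soft == resource.RLIM_INFINITY
      else f"{soft / 2**20:.0f} MB")
print("threading stack size:", threading.stack_size(), "(0 = platform default)")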

>
>>* Swapping
>>Once you are doing VM swapping on a webserver or database, performance
>>plummets. Memory leaks somewhere could be bloating processes, causing
>>the server to swap.
>
>Swapping isn't a problem, it's a symptom.
>
>Heavy apache or mysql bloat is unlikely, and Wikipedia threads are too
>short-lived to have a chance of bloating much.
>
True. But something weird's got to be happening. Maybe the different
national 'pedias are too big to all fit in the working set at once.
