Tomasz Wegrzanowski wrote:
>Wikipedia is often extremely slow.
>What's the bottleneck ?
>* network i/o ?
>* database performance ?
>* wikipedia script performance ?
>* something else ?
>
No-one really knows. Another possibility would be
* dodgy hardware
Perhaps a disk, or a network switch, is intermittently hanging.
One interesting observation is that even when the English-language
Wikipedia is jammed up, the international ones, which I believe run on
the same server, are often working OK. This suggests a software problem
rather than a hardware one.
The new software is currently under test, and it's being filled up with
test articles, and exercised using test scripts.
Network I/O is not a big bottleneck: the new server has been clocked
doing 2.8 hits/sec sustained, and that was limited by the 512 kbit/s
bandwidth of the ADSL link to the testing machine. Further tests are
being carried out, and I hope to have some results soon.
Assuming Bomis has 10 Mbit/s of bandwidth available, such a link could
support roughly 20 x 2.8 = 56 hits/sec, or about 4.8 million hits/day
assuming evenly distributed load.
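The arithmetic above can be checked with a short script (the 10 Mbit/s
figure is, as stated, an assumption about the Bomis link, not a measurement):

```python
# Back-of-envelope throughput estimate from the ADSL test figures.
adsl_kbps = 512          # bandwidth of the ADSL test link, kbit/s
hits_per_sec_adsl = 2.8  # sustained hit rate measured over that link

scale = 20               # assumed 10 Mbit/s at Bomis is ~20x the ADSL link
hits_per_sec = scale * hits_per_sec_adsl     # ~56 hits/sec
hits_per_day = hits_per_sec * 86_400         # ~4.8 million, if load is even

print(round(hits_per_sec), round(hits_per_day))
```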
Database performance is likely to be constrained by two things:
* locking
* disk I/O
Locking is a problem because it serializes accesses, reducing
opportunities for parallel processing, and creating bottlenecks on the
locked resources.
Locking can be made better by:
* locking for as short a time as possible
* locking with the finest grain possible
* using a database which supports concurrent transactions with reduced
locking
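To illustrate lock granularity (a generic Python sketch, not what the
PHP scripts or the database actually do): one lock per article lets
writes to unrelated articles proceed in parallel, where a single global
lock would serialize every write behind every other.

```python
import threading

articles = {}
titles = ("Foo", "Bar", "Baz")

# Fine grain: one lock per article title, so writers to different
# articles never contend. A single coarse global lock would instead
# serialize all three threads below.
article_locks = {t: threading.Lock() for t in titles}

def save_article(title, text):
    # Hold the lock only for the shared-state update itself -- not for
    # parsing, rendering, or network I/O -- so it is held for as short
    # a time as possible.
    with article_locks[title]:
        articles[title] = text

threads = [threading.Thread(target=save_article, args=(t, "body text"))
           for t in titles]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(articles))   # -> ['Bar', 'Baz', 'Foo']
```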
Disk I/O can be made faster by:
* using disks which spin fast (rotational latency is reduced), and
* putting them in a big RAID with lots of spindles and a high-speed
attachment
* using an operating system which multi-threads I/O properly
Wikipedia script performance is unlikely to be the bottleneck. We now
have the opportunity to load the test system heavily and measure CPU
load, so we can estimate this factor accurately.
Something else could be:
* Memory hogging
This is a little-known nasty factor in server programming. Here, the
problem is worker threads being tied up by slow or malfunctioning
clients, such as those on modems, or with high packet loss, or both.
Say a worker thread consumes W Mbytes of store, and an access transfers
50k bytes (400 kbits) of data.
Then a really slow link at say 20 kbps will take 20 seconds to download
this page. In doing so, it locks W Mbytes in store for that entire time.
If we have X megabytes of store, and slow clients are the dominating
factor, then we can only accommodate X/W concurrent workers, serving
(1/20)*X/W pages per second.
For X = 256, W=2, that's 6.4 hits per second. Therefore, a server needs
to have lots of RAM to prevent slow clients from blocking it. Hmm...
increasing the OS socket buffer size to > 50k might be a win here.
Fortunately, the new server has lots of RAM.
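Plugging the numbers from the paragraph above into a few lines of Python:

```python
# Memory-hogging estimate: slow clients pin worker memory for the
# whole download.
X = 256             # megabytes of RAM available for workers
W = 2               # megabytes consumed per worker thread
page_kbytes = 50    # bytes per page view: 50 kB = 400 kbit
link_kbps = 20      # a really slow modem link, kbit/s

seconds_per_page = page_kbytes * 8 / link_kbps  # 400 kbit / 20 kbps = 20 s
workers = X / W                                 # 128 workers fit in RAM
hits_per_sec = workers / seconds_per_page       # the (1/20)*X/W figure

print(workers, hits_per_sec)   # -> 128.0 6.4
```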
* Swapping
Once you are doing VM swapping on a webserver or database, performance
plummets. Memory leaks somewhere could be bloating processes, causing
the server to swap.
* Congestion collapse
Whatever the mechanism for going slow, once a system is overloaded it
can enter a state known as 'congestion collapse', where it's slow
because it's overloaded, and overloaded because it's slow. This is made
worse if users keep on retrying their requests. A system may take some
time to recover from congestion collapse, and after recovery it will be
back to normal as if the collapse had never happened. This resembles the
current state of affairs.
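A toy simulation (my own illustration, not a model of the actual
servers) shows the retry feedback at work: while demand exceeds
capacity the backlog grows, and even after demand drops back below
capacity, the server stays slow until the accumulated retries drain
away -- after which it is back to normal as if nothing had happened.

```python
capacity = 50    # requests/sec the server can complete

def simulate(demand_per_tick):
    """Track the backlog of unserved requests, tick by tick."""
    backlog = 0
    history = []
    for demand in demand_per_tick:
        offered = demand + backlog        # new requests plus user retries
        served = min(offered, capacity)
        backlog = offered - served        # the rest retry next second
        history.append(backlog)
    return history

# 10 s of overload (60 > 50), then demand falls to 40 (< capacity).
load = [60] * 10 + [40] * 15
h = simulate(load)
print(h[9], h[10], h[-1])   # -> 100 90 0
```

Note that at tick 10 the demand (40/sec) is already below capacity,
yet the server is still congested; it takes another ten seconds of
slack to work off the backlog.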
* Cracking
Someone may have cracked the server, and be using it for malicious
purposes. Yow! I've just checked, and wikipedia.com is still running
Apache/1.3.23 (Unix) PHP/4.0.6 mod_fastcgi/2.2.12, which may well be
vulnerable to the chunked-transfer-encoding bug unless it's been patched.
Beta.wikipedia.com is running Apache/1.3.26 (Unix) PHP/4.2.1, so it's
not vulnerable to this bug.
* Denial-of-service attacks
SYN floods, that sort of thing.
Well, that's a list of the first few things off the top of my head.
Neil