I have been thinking about the performance of Wikipedia, and how it
might be improved.
Before I go off and investigate in detail, I'd just like to check my
basic concept of how the code works,
(based on reading this list -- I haven't pulled down the CVS to look at
it yet).
=== Total guesswork follows ===
Am I right in thinking that, for each ordinary page request,
* the raw text is pulled out of the database
* the text is parsed and reformatted
* links are looked up to see whether their target pages exist, and rendered accordingly
* the final HTML page is generated, with decorations added as per the theme
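To check I've got those four steps straight, here's a toy restatement
in Python (purely illustrative -- the real code is PHP, and every name
below is invented rather than taken from the actual codebase):

    import re

    PAGES = {"Main Page": "Welcome to [[Wikipedia]]!"}  # stand-in database

    def serve_page(title):
        wikitext = PAGES[title]  # 1. pull the raw text out of the database
        html = wikitext          # 2. parse and reformat (a no-op here)

        # 3. look up each [[link]] target to see whether the page exists;
        #    the "?" suffix for missing pages is just a stand-in style
        def linkify(match):
            target = match.group(1)
            marker = "" if target in PAGES else "?"
            return f'<a href="/wiki/{target}">{target}</a>{marker}'

        html = re.sub(r"\[\[(.+?)\]\]", linkify, html)

        # 4. final HTML generation, with page decorations as per the theme
        return f"<html><body><h1>{title}</h1><p>{html}</p></body></html>"

    print(serve_page("Main Page"))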
My general impressions of the activity rate are:
* about 100 pages per day are created or deleted
* roughly one edit every 30 seconds
* roughly one page hit every second
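Taking those guesses at face value, one hit per second is about 86,400
page views per day against roughly 2,880 edits per day (one per 30
seconds) -- call it 30 reads for every write, which is exactly the kind
of ratio where caching rendered pages should pay off.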
Packet loss seems negligible, so you don't seem to be running out of
bandwidth.
Although I guesstimate the hit rate at around one per second, pages seem
to be taking around 5 seconds to serve, suggesting that the system is
probably running at a load average of around 5.
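(Sanity check: by Little's law, concurrent requests = arrival rate x
service time = 1/s x 5 s = 5, which matches a load average of about 5
if requests are mostly CPU-bound rather than waiting on disk.)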
My best guess is that the parsing and lookups on regular pages are
currently the main load, not editing or exotic database queries -- is
this right?
Jimbo has mentioned that the machine has a lot of RAM, so disk I/O is
unlikely to be the bottleneck: it's more likely to be CPU and
inter-process locking problems.
If so, I think careful page content caching could greatly improve
performance, by reducing the number of page parsings, renderings and
link lookups, at the price of slightly more work on page deletion and
creation. By freeing up the resources those operations consume,
performance should improve across the board.
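To sketch the idea (again in Python as pseudocode; render and links_to
are hypothetical hooks, not existing functions): cache the rendered
HTML per page, invalidate a page on edit, and on creation or deletion
also invalidate every page linking to it, since their link rendering
goes stale. That last step is where the slight extra cost on create
and delete comes from.

    class RenderCache:
        def __init__(self, render, links_to):
            self._render = render      # title -> HTML (the slow parse path)
            self._links_to = links_to  # title -> titles of pages linking here
            self._html = {}            # title -> cached rendered HTML

        def get(self, title):
            # Fast path: a hit skips parsing and link lookups entirely.
            if title not in self._html:
                self._html[title] = self._render(title)
            return self._html[title]

        def on_edit(self, title):
            # An ordinary edit invalidates only the edited page itself.
            self._html.pop(title, None)

        def on_create_or_delete(self, title):
            # Creation/deletion also invalidates pages linking here, since
            # whether their links point at an existing page has changed.
            self._html.pop(title, None)
            for linker in self._links_to(title):
                self._html.pop(linker, None)

Invalidation stays cheap because only the changed page and its
reverse-link set are touched; everything else keeps serving from cache.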
If I'm right, I think suitably intelligent caching could be applied not
only to ordinary pages, but also to some special pages, without any
major redesign or excessive complexity.
Before I start to look at things in more detail, could anyone confirm
whether I am even vaguely making sense?
-- Neil