Mailing List Archive

Wikipedia performance and caching of responses -- some questions
I have been thinking about the performance of Wikipedia, and how it
might be improved.

Before I go off and investigate in detail, I'd just like to check my
basic understanding of how the code works (based on reading this list
-- I haven't pulled down the CVS to look at it yet).

=== Total guesswork follows ===

Am I right in thinking that, for each ordinary page request,

* the raw text is pulled out of the database
* the text is parsed and reformatted
* wiki links are looked up to check whether their target pages exist, and are styled accordingly
* the final HTML page is generated, with page decorations added as per the theme (sketch below)
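
Here's what I imagine that looks like, as a toy Python sketch with
invented names (PAGES, render_page and so on) -- not the actual PHP,
just the shape of the request path I have in mind:

    import re

    # Toy stand-ins for the wiki database and theme layer -- purely
    # illustrative, with invented names.
    PAGES = {
        "Main Page": "See [[Physics]] and [[Not Yet Written]].",
        "Physics": "Physics is the study of matter and energy.",
    }

    def render_page(title):
        # 1. Pull the raw text "out of the database".
        raw = PAGES[title]

        # 2./3. Parse the wikitext; for each internal link, check whether
        # the target page exists and style the link accordingly.
        def link_html(match):
            target = match.group(1)
            css = "existing" if target in PAGES else "missing"
            return '<a class="%s" href="/wiki/%s">%s</a>' % (
                css, target.replace(" ", "_"), target)
        body = re.sub(r"\[\[(.+?)\]\]", link_html, raw)

        # 4. Final page generation to HTML, with decorations per the theme.
        return "<html><body><h1>%s</h1>\n<p>%s</p></body></html>" % (title, body)

    print(render_page("Main Page"))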

My general impression of the activity rate is:

* about 100 pages per day are created or deleted
* roughly one edit every 30 seconds
* roughly one page hit every second

Packet loss seems negligible, so you don't seem to be running out of
bandwidth.

Although I guesstimate the hit rate at around one per second, pages
seem to be taking around 5 seconds to serve, suggesting that the
system is probably running at a load average of around 5.
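
As a back-of-the-envelope check (using my guessed figures, so treat
the numbers as illustrative only): requests in flight is roughly
arrival rate times service time, so 1 hit/s * 5 s/hit gives about 5
concurrent requests, i.e. a load average near 5. The same figures say
page views outnumber edits by roughly 30 to 1, which is what makes
caching look attractive in the first place.

    # Back-of-envelope arithmetic from the guesses above -- not measurements.
    hits_per_sec = 1.0            # ~1 page view per second
    secs_per_hit = 5.0            # ~5 seconds to serve a page
    edits_per_day = 86400 / 30.0  # one edit every 30 seconds

    print("requests in flight ~ %.0f" % (hits_per_sec * secs_per_hit))           # ~5
    print("page views per edit ~ %.0f" % (hits_per_sec * 86400 / edits_per_day)) # ~30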

My best guess is that the parsing and lookups on regular pages are
currently the main load, not editing or exotic database queries -- is
this right?

Jimbo has mentioned that the machine has a lot of RAM, so disk I/O is
unlikely to be the bottleneck: it's more likely to be CPU and
inter-process locking problems.

If so, I think careful page content caching could greatly improve
performance by reducing the number of page parsings, renderings and
link lookups, at the cost of making page deletion and creation
slightly more expensive. By freeing up resources, it should improve
performance across the board.

If I'm right, I think suitably intelligent caching could be applied not
only to ordinary pages, but also to some special pages, without any
major redesign or excessive complexity.
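
To make the idea concrete, here's a minimal sketch of the sort of
thing I mean, again in Python with invented names (and reusing the toy
render_page() from the sketch further up): keep the finished HTML
keyed by title, serve views straight from that cache, and invalidate
entries on edit and on creation/deletion -- where creation and
deletion also have to touch the pages that link to the affected title,
since their link styling changes.

    # Toy sketch of a rendered-page cache with invalidation -- illustrative only.
    rendered_cache = {}   # title -> finished HTML

    def view(title):
        # Serve from the cache when possible; render and fill it otherwise.
        html = rendered_cache.get(title)
        if html is None:
            html = render_page(title)   # the full parse/lookup/decorate path
            rendered_cache[title] = html
        return html

    def on_edit(title):
        # An ordinary edit only invalidates the edited page itself.
        rendered_cache.pop(title, None)

    def on_create_or_delete(title, linking_titles):
        # Creating or deleting a page also invalidates pages that link to it,
        # since their link styling (existing vs. missing) changes.
        rendered_cache.pop(title, None)
        for other in linking_titles:
            rendered_cache.pop(other, None)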

Before I start to look at things in more detail, could anyone confirm
whether I am even vaguely making sense?

-- Neil
Re: Wikipedia performance and caching of responses -- some questions [ In reply to ]
On mer, 2002-04-10 at 05:10, Neil Harris wrote:
> My best guess is that the parsing and lookups on regular pages are
> currently the main load, not editing or exotic database queries -- is
> this right?

Not a clue. Initially, the database certainly was the main load, but I
haven't heard any newer figures. Jimbo?

> Jimbo has mentioned that the machine has a lot of RAM, so disk I/O is
> unlikely to be the bottleneck: it's more likely to be CPU and
> inter-process locking problems.
>
> If so, I think careful page content caching could greatly improve
> performance by reducing the number of page parsings, renderings and
> link lookups, at the cost of making page deletion and creation
> slightly more expensive. By freeing up resources, it should improve
> performance across the board.

We used to cache rendered articles, but Jimbo disabled this feature some
time ago, claiming he was unable to find a performance advantage. (See
mailing list archives circa February 13.)

Personally, I've always found that idea suspicious; caching is definitely
faster on my test machine, and is going to be a particularly big help
with, say, long pages full of HTML tables! But then, my test machine has
a much, much lower load to deal with than the real Wikipedia. :)
Nonetheless, if caching really isn't helping, that's because something
isn't being done right. The problem should be found and fixed, and
caching re-enabled.

(There were also side issues with the caching -- the meta keyword tags
and the interlanguage links didn't get filled out when viewing a cached
page. But again, these should be fixed, and aren't a reason for
disabling caching altogether.)

> If I'm right, I think suitably intelligent caching could be applied not
> only to ordinary pages, but also to some special pages, without any
> major redesign or excessive complexity.

For a brief time we cached the contents of RecentChanges when using the
default settings, and I believe the Orphans page was manually refreshed;
but these were removed after the queries were made more efficient.

-- brion vibber (brion @ pobox.com)
Re: Wikipedia performance and caching of responses -- some questions [ In reply to ]
Jim accidentally sent this just to me; I'm sending it back to the list:

On mer, 2002-04-10 at 18:27, Jimmy Wales wrote:
> Brion L. VIBBER wrote:
> > > My best guess is that the parsing and lookups on regular pages are
> > > currently the main load, not editing or exotic database queries -- is
> > > this right?
> >
> > Not a clue. Initially, the database certainly was the main load, but I
> > haven't heard any newer figures. Jimbo?
>
> I'll reset the slow-query log and make a new version available after a few
> hours of data collection.
>
> > We used to cache rendered articles, but Jimbo disabled this feature some
> > time ago, claiming he was unable to find a performance advantage. (See
> > mailing list archives circa February 13.)
>
> But, I'm willing to try it again.
>
> > Personally, I've always found that idea suspicious; caching is definitely
> > faster on my test machine, and is going to be a particularly big help
> > with, say, long pages full of HTML tables! But then, my test machine has
> > a much, much lower load to deal with than the real Wikipedia. :)
> > Nonetheless, if caching really isn't helping, that's because something
> > isn't being done right. The problem should be found and fixed, and
> > caching re-enabled.
>
> I would say that I agree with that.
>
> Here's a question for everyone.
>
> Let's say we have some portion of the page pre-calculated and cached.
> Is it faster to keep that cached text *in the database*, or *on the
> hard drive*?
>
> I'm very strongly biased towards thinking that keeping it on the hard
> drive is faster, and by a significant margin, but only because I've
> never tested it and because I know (from long experience at Bomis) that
> opening up a text file on disk and spitting it out can be *really* fast,
> if the machine has enough RAM that the filesystem can cache lots of
> popular files in memory.
>
> But, everything I read about MySQL talks about how screamingly fast it
> allegedly is, so...
>
> --Jimbo
>
Re: Wikipedia performance and caching of responses -- some questions [ In reply to ]
On mer, 2002-04-10 at 18:27, Jimmy Wales wrote:
> Here's a question for everyone.
>
> Let's say we have some portion of the page pre-calculated and cached.
> Is it faster to keep that cached text *in the database*, or *on the
> hard drive*?
>
> I'm very strongly biased towards thinking that keeping it on the hard
> drive is faster, and by a significant margin, but only because I've
> never tested it and because I know (from long experience at Bomis) that
> opening up a text file on disk and spitting it out can be *really* fast,
> if the machine has enough RAM that the filesystem can cache lots of
> popular files in memory.

That's a good question, and one which I haven't made any attempt to
test. As it is, we'll already be digging into the database to check
things like the user settings, page view count, last edited date, and
language links and meta-tag keywords (these last two gleaned from the
list of links during parsing, and thus left out altogether when using
the existing cache code). So it's probably not significantly slower to
grab the stored HTMLized article while we're there.
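
If someone does want to put a number on Jimbo's disk-vs-database
question, a throwaway test along these lines would do it -- note that
the table, column, file path and connection details below are all
invented, so it would need adjusting to the real schema before the
numbers mean anything:

    import time
    import MySQLdb  # assumes the MySQL-python bindings are installed

    N = 1000
    conn = MySQLdb.connect(host="localhost", user="wiki",
                           passwd="secret", db="wikidb")
    cur = conn.cursor()

    # Time N fetches of a cached article body from MySQL.
    t0 = time.time()
    for i in range(N):
        cur.execute("SELECT cached_html FROM cur_cache WHERE title = %s",
                    ("Physics",))
        html = cur.fetchone()[0]
    t1 = time.time()

    # Time N reads of the same body from a plain file on disk.
    for i in range(N):
        f = open("/var/cache/wiki/Physics.html")
        html = f.read()
        f.close()
    t2 = time.time()

    print("mysql: %.2f ms/fetch" % ((t1 - t0) * 1000.0 / N))
    print("file:  %.2f ms/fetch" % ((t2 - t1) * 1000.0 / N))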

On the other hand, some of this stuff (except for the page view count
and user settings) could also be stored in a cache file and plunked
ready-made into the output along with the HTML. User settings perhaps
could be stored in a session cookie, refreshed only when the user first
visits/logs in/changes preferences, saving a little extra on database
access as well.
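
As a rough illustration of the cookie idea (the field names and the
signing scheme are my own invention, not anything in the current
code): pack the preferences into a signed value at login or whenever
preferences change, and on ordinary page views read them straight back
out of the cookie instead of hitting the database.

    import hashlib, hmac, json

    SECRET = b"change-me"   # server-side secret so the cookie can't be forged

    def prefs_to_cookie_value(prefs):
        # Issued at login or when preferences change, e.g. in a
        # "wiki_prefs=<value>" Set-Cookie header (URL-encoding omitted here).
        payload = json.dumps(prefs, sort_keys=True)
        sig = hmac.new(SECRET, payload.encode(), hashlib.sha1).hexdigest()
        return sig + "." + payload

    def prefs_from_cookie_value(value):
        # On an ordinary page view, recover the preferences with no database
        # query; fall back to the database if the signature doesn't verify.
        sig, payload = value.split(".", 1)
        expected = hmac.new(SECRET, payload.encode(), hashlib.sha1).hexdigest()
        if not hmac.compare_digest(sig, expected):
            return None
        return json.loads(payload)

    value = prefs_to_cookie_value({"skin": "standard", "underline_links": True})
    print(prefs_from_cookie_value(value))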

Worth it? No idea. But, hey, it's a suggestion.

-- brion vibber (brion @ pobox.com)