Mailing List Archive

RecentChanges Performance
Hi,

as promised, I have looked a bit into the RC code for the purpose of
possibly implementing a filter. Since Ram-Man seems to be finished with
his script, this is no longer an urgent issue, so we might want to take
a step back and look at the design of SpecialRecentchanges.php etc.
before any further changes.

I have taken the liberty to import the Wikipedia database locally to
find out which operations are fast and which ones are slow even locally.
Long RC queries are quite slow on my machine, in spite of cur_timestamp
being indexed. Even with an index, searching a 350 MB+ table with fairly
random row order may be heavy. (Is there any way to determine if the
index is used, BTW? MySQL has some weird conditions under which it
ignores indexes.) It certainly is slow on my system.

Since this is one of the most commonly accessed functions, we should
really do some performance tuning here, especially as we will sooner or
later have to do stuff like the aforementioned bot-filtering, making
queries even more complex. I noticed that there is already a
recentchanges table that new rows are inserted into upon edits. Is a
move away from SELECTs on the CUR table already in the works? If so,
what is its current status and who's doing it?

Regards,

Erik Moeller
--
FOKUS - Fraunhofer Insitute for Open Communication Systems
Project BerliOS - http://www.berlios.de
Re: RecentChanges Performance [ In reply to ]
Erik Moeller wrote:
> (Is there any way to determine if the index is used, BTW? MySQL has
> some weird conditions under which it ignores indexes.)

Stick "EXPLAIN " in front of a query; it comes back and tells you if
it's using indexes, how many rows it turns up, yada yada.

> I noticed that there is already a
> recentchanges table that new rows are inserted into upon edits. Is a
> move away from SELECTs on the CUR table already in the works? If so,
> what is its current status and who's doing it?

I had been under the impression we were already using that table, but,
well, we're not! I dashed off a quick change to use it (so a single
query to the recentchanges table instead of two separate ones to cur and
old), it seems workable. Committed to CVS, not yet installed on the live
server.

Unlike the current code, initial page creation events are still marked
as new even when there's a later revision. Yay!

Caveat; various things are _not_ currently registered in the
recentchanges table:
* Updates to deletion, upload logs
* Creation of image description pages on file upload
* Deleted pages remain listed as though they existed

Those should be fixed. Any other problems?

-- brion vibber (brion @ pobox.com)
Re: RecentChanges Performance [ In reply to ]
Brion VIBBER wrote:

> Caveat; various things are _not_ currently registered in the
> recentchanges table:
> * Updates to deletion, upload logs
> * Creation of image description pages on file upload
> * Deleted pages remain listed as though they existed
>
> Those should be fixed. Any other problems?

How about the move function?
Re: RecentChanges Performance [ In reply to ]
Am Mit, 2002-10-30 um 13.28 schrieb Brion VIBBER:

> Those should be fixed. Any other problems?

There should be an INDEX on the rc_timestamp column.

I think we should adopt a principle similar to KeptPages for RC logs,
that is, to maintain a consistent log for 14 days or so, no matter how
many entries happen in those days, instead of just saying "the last 5000
changes". So I guess we should just DELETE FROM the RecentChanges table
all entries that are older than n days regularly.

Regards,

Erik
--
FOKUS - Fraunhofer Insitute for Open Communication Systems
Project BerliOS - http://www.berlios.de
Re: Re: RecentChanges Performance [ In reply to ]
> I think we should adopt a principle similar to KeptPages for
> RC logs, that is, to maintain a consistent log for 14 days or
> so, no matter how many entries happen in those days, instead
> of just saying "the last 5000 changes". So I guess we should
> just DELETE FROM the RecentChanges table all entries that are
> older than n days regularly.

Either should be fine, but keeping it at a constant 5000 should
make it faster. I created the table originally in an attempt to
speed up the RC page, and also to generate the new formats, which
gathered multiple changes to the same page. Alas, no matter how
many methods I tried for that, none of them had acceptable
performance, so I stepped back to think about it for a while, and
never got around to actually using the table even for the present
RC format. It should be simple to do that; I'll test Magnus's
fix and install it if all is well.
Re: RecentChanges Performance [ In reply to ]
> Am Mit, 2002-10-30 um 13.28 schrieb Brion VIBBER:
>>Those should be fixed. Any other problems?

I tweaked GlobalFunctions.php, Article.php, and SpecialMovepage.php to
update the recentchanges database on page deletion, image upload, and
page rename. The RC behavior on renames over existing redirect pages is
slightly different, and may or may not be more user-friendly than the
current behavior.

My little old PII Linux box isn't bad-ass enough for me to dare putting
the huuuge English Wikipedia database on it, so I just did some quick
tests with the smaller Danish and German dbs. Using the recentchanges
table seems to be consistently a few percent faster than reading
straight from cur and old, but not significantly so. (This may just be
the difference in overhead between querying two tables versus one.)
Under load and with a larger database, there may or may not be a more
pronounced difference.

I haven't installed any of this on the live server; somebody please
check out the updates in CVS and look 'em over to make sure nothing else
is broken.

Erik Moeller wrote:
> There should be an INDEX on the rc_timestamp column.

Hmm... I'm not sure how to go about doing this. EXPLAIN already
indicates that it's using rc_timestamp as the key, and creating an index
vai CREATE INDEX or ALTER TABLE ADD INDEX doesn't seem to change
anything except to add more to the list of 'possible keys'.

However it's also teling me "Using filesort", which I get the impression
is a bad thing.

> I think we should adopt a principle similar to KeptPages for RC logs,
> that is, to maintain a consistent log for 14 days or so, no matter how
> many entries happen in those days, instead of just saying "the last 5000
> changes". So I guess we should just DELETE FROM the RecentChanges table
> all entries that are older than n days regularly.

At ~2000 edits a day, that's a lot of entries...

-- brion vibber (brion @ pobox.com)
Re: RecentChanges Performance [ In reply to ]
Am Fre, 2002-11-01 um 15.11 schrieb Brion VIBBER:

> My little old PII Linux box isn't bad-ass enough for me to dare putting
> the huuuge English Wikipedia database on it, so I just did some quick
> tests with the smaller Danish and German dbs. Using the recentchanges
> table seems to be consistently a few percent faster than reading
> straight from cur and old, but not significantly so. (This may just be
> the difference in overhead between querying two tables versus one.)
> Under load and with a larger database, there may or may not be a more
> pronounced difference.

I'll try this with the full DB on Monday, I can't do it immediately
because I have to build the RC table first, which will probably take a
while.

> Erik Moeller wrote:
> > There should be an INDEX on the rc_timestamp column.

This is not done in the default table build, but the index is created in
the recentchanges build script. Since I haven't run that yet, my table
still lacks the index. Sorry about the confusion.

> However it's also teling me "Using filesort", which I get the impression
> is a bad thing.

The MySQL manual has a page on this:
http://www.mysql.com/doc/en/ORDER_BY_optimisation.html
I don't see how any of the conditions they list for when the index
cannot be used to resolve ORDER BY is true in our query, though. In any
case, my guess is that this doesn't significantly affect performance as
we are dealing with fairly small sets of data that need to be sorted.

Regards,
Erik
--
FOKUS - Fraunhofer Insitute for Open Communication Systems
Project BerliOS - http://www.berlios.de