
Optimizing and tuning
There are two very good IRC channels on irc.freenode.net:

#php and #mysql

There are lots of knowledgeable people there who are eager to help open
source projects; many have already heard of Wikipedia. I learned a lot
from just being there for half an hour or so.

Here are some things we need to look into:

1) Composite indexes and index use in general. I do not know which indexes
are currently used on the live Wikipedia. However, after the *default*
database generation script, there is just a single composite index, namely
in the table watchlist. All other indexes are on a single field. If I
understand correctly, this means that when we do a query with multiple
comparisons, e.g. WHERE foo=bar AND baz=bat, only one index is used. At
least, that is what this article claims:

http://www.linux-mag.com/cgi-bin/printer.pl?issue=2001-06&article=mysql
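
To illustrate the difference (hypothetical table and column names,
untested):

-- With separate single-column indexes on foo and baz, MySQL picks just
-- one of them and scans the matching rows for the other condition:
EXPLAIN SELECT * FROM t WHERE foo = 1 AND baz = 2;
-- A composite index can satisfy both comparisons at once:
ALTER TABLE t ADD INDEX foo_baz (foo, baz);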

There are also a couple of tables with no indexes (including ARCHIVE,
which may cause Special:Undelete to create high server load) and some
inexplicable ones (site_stats has an index, but only one row). We really
need to clean up our indexes. I can't help much with this without server
access because I don't know if the table structures have been altered.

2) The HANDLER command. HANDLER, which works with both InnoDB and MyISAM,
provides low-level access to tables *without locking*. You get reads, but
you don't get consistency, which may cause problems when you're dealing
with critical, heavily modified data. But it seems perfect for our archive
stuff. The HANDLER syntax is explained here:
http://www.mysql.com/doc/en/HANDLER.html

We definitely should look into this.
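
To give an idea of what this looks like (table and index names are made
up, untested):

HANDLER old OPEN;
-- Read straight through an index, without a consistent-read snapshot
-- and without waiting for table locks:
HANDLER old READ name_title = (0, 'Foo_Bar') LIMIT 5;
HANDLER old READ name_title NEXT;  -- continue from the last position
HANDLER old CLOSE;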

3) Upgrading to MySQL4. MySQL4 supports query caching (also subqueries,
but I haven't looked into that), which means that heavily queried pages
will load faster. When someone mentioned query caching, a lot of other
people made comments like "query caching rocks", so I assume it provides
considerable performance benefits in practice.
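
Enabling it should just be a couple of lines in my.cnf (values below are
guesses, not tuned recommendations):

[mysqld]
query_cache_type = 1      # cache all cacheable SELECT results
query_cache_size = 32M    # memory set aside for cached result sets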

4) Tuning my.cnf. According to one IRC resident, upping table_cache can
greatly increase performance with lots of UPDATEs. If we create new
indexes, we may also need to raise index_cache (there's a formula to
calculate your index cache efficiency, currently we are at 99.75%, which
is pretty good).
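
For the record, the efficiency figure comes from the key counters (the
buffer itself is sized with key_buffer_size, if I have the variable name
right):

SHOW STATUS LIKE 'Key_read%';
-- efficiency = 1 - (Key_reads / Key_read_requests)
-- e.g. 1 - 2500/1000000 = 0.9975, i.e. 99.75% of index reads are
-- served from memory rather than disk.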

5) Caching our article index. I have mentioned this idea before, I think:
We could just keep our entire article index in memory *persistently*
(currently the linkcache is filled for each request), which should make
link lookup almost instant. There are several ways to accomplish this:

- We could ask one of our resident C programmers to help. There's a
specified interface to access persistent resources in MySQL, which is used
by functions such as mysql_pconnect. This page describes how it works:
http://www.php-center.de/en-html-manual/phpdevel.html

- We could use session variables for the purpose. There is supposed to be
an option to keep session stuff in shared memory. This might help with
other caching problems as well.

- We could just put a database on a ramdisk for this purpose.

The order of this list reflects the priority with which I think these
different questions need to be addressed. Getting our indexes to work
properly should IMHO be of greatest importance. Even if EXPLAIN shows the
use of an index, we may frequently still require large table scans because
we do not use composite indexes.

Regards,

Erik
Re: Optimizing and tuning
Correction:

> - We could ask one of our resident C programmers to help. There's a
> specified interface to access persistent resources in MySQL

Resources in PHP, not MySQL. The MySQL pconnect functions are an example
that uses this interface.

Regards,

Erik
Re: Optimizing and tuning
On Fri, 2003-01-31 at 17:15, Erik Moeller wrote:
> 1) Composite indexes and index use in general.... If I
> understand correctly, this means that when we do a query with multiple
> comparisons, e.g. WHERE foo=bar AND baz=bat, only one index is used.

Might want to check into that, yes...

> There are also a couple of tables with no indexes (including ARCHIVE,
> which may cause Special:Undelete to create high server load) and some
> inexplicable ones (site_stats has an index, but only one row).

Archive is very rarely used. But it probably should have an index on the
namespace & title fields at least.
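
Something like this, presumably (column names from memory, so possibly
off):

ALTER TABLE archive ADD INDEX name_title (ar_namespace, ar_title);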

> We really
> need to clean up our indexes. I can't help much with this without server
> access because I don't know if the table structures have been altered.

They should all be the same...

> 2) The HANDLER command. HANDLER, which works with both InnoDB and MyISAM,
> provides low-level access to tables *without locking*. You get reads, but
> you don't get consistency, which may cause problems when you're dealing
> with critical, heavily modified data.

Now you're scaring me. :)

> 3) Upgrading to MySQL4. MySQL4 supports query caching (also subqueries,
> but I haven't looked into that), which means that heavily queried pages
> will load faster. When someone mentioned query caching, a lot of other
> people made comments like "query caching rocks", so I assume it provides
> considerable performance benefits in practice.

I'm planning to do that this weekend. (Weekend nights are relatively low
traffic. Relatively. ;)

> 4) Tuning my.cnf. According to one IRC resident, upping table_cache can
> greatly increase performance with lots of UPDATEs. If we create new
> indexes, we may also need to raise index_cache (there's a formula to
> calculate your index cache efficiency, currently we are at 99.75%, which
> is pretty good).

Attached are the current my.cnf file and SHOW STATUS and SHOW VARIABLES
output. More tweaking suggestions are welcome.

> 5) Caching our article index. I have mentioned this idea before, I think:
> We could just keep our entire article index in memory *persistently*
> (currently the linkcache is filled for each request), which should make
> link lookup almost instant. There are several ways to accomplish this:

I'm still dubious about consistency, but if someone can figure it out,
great!

> - We could use session variables for the purpose. There is supposed to be
> an option to keep session stuff in shared memory. This might help with
> other caching problems as well.

It's been suggested that cookies/session vars be used for user prefs.

> - We could just put a database on a ramdisk for this purpose.

MySQL has a 'HEAP' table type which is in-memory; we wouldn't have to
set up a separate ramdisk. It's not persistent, so we'd either have to
pre-fill the whole thing or fall back to checking 'cur' on demand.
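
A rough sketch of what that might look like (names made up):

CREATE TABLE title_cache (
  t_namespace tinyint NOT NULL default 0,
  t_title varchar(255) binary NOT NULL default '',
  INDEX (t_namespace, t_title)
) TYPE=HEAP;
-- Lives entirely in RAM; contents are lost on server restart.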

-- brion vibber (brion @ pobox.com)
Re: Optimizing and tuning
On Sat, Feb 01, 2003 at 02:15:00AM +0100, Erik Moeller wrote:
> There are two very good IRC channels on irc.freenode.net:
>
> #php and #mysql
>
> There are lots of knowledgeable people there who are eager to help open
> source projects; many have already heard of Wikipedia. I learned a lot
> from just being there for half an hour or so.
>
> Here are some things we need to look into:
>
> 1) Composite indexes and index use in general. I do not know which indexes
> are currently used on the live Wikipedia. However, after the *default*
> database generation script, there is just a single composite index, namely
> in the table watchlist. All other indexes are on a single field. If I
> understand correctly, this means that when we do a query with multiple
> comparisons, e.g. WHERE foo=bar AND baz=bat, only one index is used.

That is correct, and you could be right about this (I don't know exactly
what the current database schema looks like, but last time I was involved
it was still ok). But keep in mind that a composite index also adds a lot
of overhead for updates, takes up more room in the index cache, and doesn't
really add much speed if the extra columns don't have a high selectivity,
i.e., filter out many unwanted records.
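
To illustrate (table and column names are just assumptions):

-- Pays off: within a namespace, the title narrows things down to a
-- handful of rows.
ALTER TABLE cur ADD INDEX ns_title (cur_namespace, cur_title);
-- Barely pays off: a flag column filters out almost nothing, so the
-- extra column mostly adds update overhead.
ALTER TABLE cur ADD INDEX ns_minor (cur_namespace, cur_minor_edit);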

> 2) The HANDLER command. HANDLER, which works with both InnoDB and MyISAM,
> provides low-level access to tables *without locking*. You get reads, but
> you don't get consistency, which may cause problems when you're dealing
> with critical, heavily modified data. But it seems perfect for our archive
> stuff.

If there are not many updates then this accomplishes next to nothing, except
that you have made the software more complicated.

> 3) Upgrading to MySQL4.

Actually I was once in favour of that but Lee convinced me that as long as
the developers of MySQL don't think it is ready for production, we shouldn't
either. There is only one thing more important than speed and that is
stability.

> 5) Caching our article index. I have mentioned this idea before, I think:
> We could just keep our entire article index in memory *persistently*
> (currently the linkcache is filled for each request), which should make
> link lookup almost instant.

What makes you think it isn't already? I suspect the biggest part of the
involved tables is in the cache anyway, so it is probably pretty fast
already. So again, please make sure that this is really worth making the
software harder to understand. Keeping the software straightforward and
understandable for newcomers who also want to help with programming is a
big priority for any open source project.

-- Jan Hidders

.---------------------------------------------------------------------.
| Post-doctoral researcher e-mail: jan.hidders@ua.ac.be |
| Dept. Math. & Computer Science tel: (+32) 3 820 2427 |
| University of Antwerp, room J1.06 fax: (+32) 3 820 2421 |
| (UIA) Universiteitsplein 1, B-2610 Wilrijk - Antwerpen (Belgium) |
`---------------------------------------------------------------------'
Re: Optimizing and tuning
> That is correct, and you could be right about this (I don't know exactly
> what the current database schema looks like, but last time I was involved
> it was still ok). But keep in mind that a composite index also adds a lot
> of overhead for updates, takes up more room in the index cache, and doesn't
> really add much speed if the extra columns don't have a high selectivity,
> i.e., filter out many unwanted records.

True. However, I suspect with some of our queries, our existing single-
field index only has very low selectivity, which makes stuff like "Most
wanted" and "History" so slow. Not using any composite indexes seems like
a bad idea to me, especially given the fact that some of our queries are
run on huge random-sequence tables.

>> 2) The HANDLER command. HANDLER, which works with both InnoDB and MyISAM,
>> provides low-level access to tables *without locking*. You get reads, but
>> you don't get consistency, which may cause problems when you're dealing
>> with critical, heavily modified data. But it seems perfect for our archive
>> stuff.

> If there are not many updates then this accomplishes next to nothing, except
> that you have made the software more complicated.

HANDLER works almost exactly the same as SELECT from a syntax POV. We have
many updates on the OLD table, and it is still MyISAM, meaning table-level
locking. So unless we upgrade that particular table to InnoDB (which Brion
seems reluctant to do because of the long downtime), HANDLER may be worth
a try.

However, your point regarding complexity is well taken. We could abstract
the HANDLER function in wfQuery using a "consistent read" parameter that
defaults to true.

>> 3) Upgrading to MySQL4.

> Actually I was once in favour of that but Lee convinced me that as long as
> the developers of MySQL don't think it is ready for production, we shouldn't
> either. There is only one thing more important than speed and that is
> stability.

They are currently labeling it as a "gamma". I think it might be worth
a try; we can always downgrade if it creates problems.

>> 5) Caching our article index. I have mentioned this idea before, I think:
>> We could just keep our entire article index in memory *persistently*
>> (currently the linkcache is filled for each request), which should make
>> link lookup almost instant.

> What makes you think it isn't already?

I have done tests on the link lookup with a local database and it's very
slow with link-heavy pages. This is also what our profiling has shown.
Again, that may have something to do with our index use and we may want to
look into that. Note that I haven't yet benchmarked the new prefill
mechanism.

> I suspect the biggest part of the
> involved tables is in the cache anyway,

I don't trust automatic caching, especially since our CUR table is huge
and it is only a very small subset of it (the titles) which we need in
memory.

> Keeping the software straightforward and
> understandable and for newcomers that also want to help with programming is
> a big priority for any open source project.

The current link caching mechanism is far from easy to understand, and it
may well be considerably less efficient than the alternatives.

Regards,

Erik
Re: Optimizing and tuning
On Sat, Feb 01, 2003 at 01:14:00PM +0100, Erik Moeller wrote:
> > That is correct, and you could be right about this (I don't know
> > exactly what the current database schema looks like, but last time I
> > was involved it was still ok). But keep in mind that a composite index
> > also adds a lot of overhead for updates, takes up more room in the index
> > cache, and doesn't really add much speed if the extra columns don't have
> > a high selectivity, i.e., filter out many unwanted records.
>
> True. However, I suspect with some of our queries, our existing single-
> field index only has very low selectivity, which makes stuff like "Most
> wanted" and "History" so slow.

Adding composite indexes won't help there. What you need is an extra
redundant table that contains the number of wanted links.

> >> 2) The HANDLER command. HANDLER, which works with both InnoDB and MyISAM,
> >> provides low-level access to tables *without locking*. You get reads, but
> >> you don't get consistency, which may cause problems when you're dealing
> >> with critical, heavily modified data. But it seems perfect for our archive
> >> stuff.
>
> > If there are not many updates then this accomplishes next to nothing, except
> > that you have made the software more complicated.
>
> HANDLER works almost exactly the same as SELECT from a syntax POV.

That's only part of the problem. If anyone wants to mess with this they have
to understand what it does and why. That doesn't go away if you have similar
syntaxes.

> >> 3) Upgrading to MySQL4.
>
> > Actually I was once in favour of that but Lee convinced me that as long as
> > the developers of MySQL don't think it is ready for production, we shouldn't
> > either. There is only one thing more important than speed and that is
> > stability.
>
> They are currently labeling it as a "gamma". I think it might be worth
> a try; we can always downgrade if it creates problems.

Not if we start using MySQL4 stuff and the problems only start to appear
after a while. I agree that the risk is very small, but still.

> >> 5) Caching our article index. I have mentioned this idea before, I think:
> >> We could just keep our entire article index in memory *persistently*
> >> (currently the linkcache is filled for each request), which should make
> >> link lookup almost instant.
>
> > What makes you think it isn't already?
>
> I have done tests on the link lookup with a local database and it's very
> slow with link-heavy pages. This is also what our profiling has shown.

Could you be a bit more specific about that?

> > I suspect the biggest part of the involved tables is in the cache
> > anyway,
>
> I don't trust automatic caching, especially since our CUR table is huge
> and it is only a very small subset of it (the titles) which we need in
> memory.

For an existence query it only needs the index, not the table itself. Did
you check if we are doing something silly here so that the database has to
access the table anyway? What does EXPLAIN say here?
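
For example, assuming there is a composite index on (cur_namespace,
cur_title), something like this should report "Using index" in the Extra
column:

EXPLAIN SELECT cur_title FROM cur
WHERE cur_namespace = 0 AND cur_title = 'Foobar';
-- "Using index" means the probe is answered from the index alone;
-- without it, every lookup also touches the data file.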

-- Jan Hidders

.---------------------------------------------------------------------.
| Post-doctoral researcher e-mail: jan.hidders@ua.ac.be |
| Dept. Math. & Computer Science tel: (+32) 3 820 2427 |
| University of Antwerp, room J1.06 fax: (+32) 3 820 2421 |
| (UIA) Universiteitsplein 1, B-2610 Wilrijk - Antwerpen (Belgium) |
`---------------------------------------------------------------------'
Re: Optimizing and tuning
> Adding composite indexes won't help there. What you need is an extra
> redundant table that contains the number of wanted links.

We have a table that contains broken links, which is used in the Most
Wanted query. I have no information about the size of these tables in our
current database, but I suspect that, again, indexes are not properly
used; otherwise the bad performance of this query (which has prompted
Magnus to cache the whole page and Brion to create a "miser" mode in which
it cannot be loaded) would be inexplicable.

>> HANDLER works almost exactly the same as SELECT from a syntax POV.

> That's only part of the problem. If anyone wants to mess with this they have
> to understand what it does and why. That doesn't go away if you have similar
> syntaxes.

I think your arguments against HANDLER are misguided. We clearly have
situations where we do not need consistent reads, and in these situations,
HANDLER may be a great help. As I said, we can abstract the functionality
to reduce complexity.

>> They are currently labeling it as a "gamma". I think it might be worth
>> a try; we can always downgrade if it creates problems.

> Not if we start using MySQL4 stuff and the problems only start to appear
> after a while. I agree that the risk is very small, but still.

How about using it on meta only? It is reasonably active, but not mission
critical.

>>>> 5) Caching our article index. I have mentioned this idea before, I think:
>>>> We could just keep our entire article index in memory *persistently*
>>>> (currently the linkcache is filled for each request), which should make
>>>> link lookup almost instant.
>>
>>> What makes you think it isn't already?
>>
>> I have done tests on the link lookup with a local database and it's very
>> slow with link-heavy pages. This is also what our profiling has shown.

> Could you be a bit more specific about that?

see
http://www.wikipedia.org/pipermail/wikitech-l/2002-November/001166.html
http://www.wikipedia.org/pipermail/wikitech-l/2003-January/002238.html

Regards,

Erik
Re: Optimizing and tuning
On Sat, Feb 01, 2003 at 04:44:00PM +0100, Erik Moeller wrote:
> > Adding composite indexes won't help there. What you need is an extra
> > redundant table that contains the number of wanted links.
>
> We have a table that contains broken links, which is used in the Most
> Wanted query.

I know. I was the one who originally introduced that table. :-)

> I have no information about the size of these tables in our current
> database, but I suspect that, again, indexes are not properly used;
> otherwise the bad performance of this query (which has prompted Magnus to
> cache the whole page and Brion to create a "miser" mode in which it cannot
> be loaded) would be inexplicable.

I can assure you that the indices are well-defined there and used in the
proper way. Some queries are simply hard to compute and can only be improved
by precomputing stuff in redundant tables.

> >> HANDLER works almost exactly the same as SELECT from a syntax POV.
>
> > That's only part of the problem. If anyone wants to mess with this they
> > have to understand what it does and why. That doesn't go away if you
> > have similar syntaxes.
>
> I think your arguments against HANDLER are misguided. We clearly have
> situations where we do not need consistent reads, and in these situations,
> HANDLER may be a great help. As I said, we can abstract the functionality
> to reduce complexity.

As I said, syntactically hiding the complexity doesn't make it go away.

> >> They are currently labeling it as a "gamma". I think it might be worth
> >> a try; we can always downgrade if it creates problems.
>
> > Not if we start using MySQL4 stuff and the problems only start to appear
> > after a while. I agree that the risk is very small, but still.
>
> How about using it on meta only? It is reasonably active, but not mission
> critical.

Good idea, but I assumed that Brion was going to test it on the test server
anyway.

> >> I have done tests on the link lookup with a local database and it's
> >> very slow with link-heavy pages. This is also what our profiling has
> >> shown.
>
> > Could you be a bit more specific about that?
>
> see
> http://www.wikipedia.org/pipermail/wikitech-l/2002-November/001166.html
> http://www.wikipedia.org/pipermail/wikitech-l/2003-January/002238.html

Ok, I already commented on that in my other mail. Btw., did you see Lee's
original comments in linkcache.doc in /docs on why the caching is done the
way it is done?

-- Jan Hidders

.---------------------------------------------------------------------.
| Post-doctoral researcher e-mail: jan.hidders@ua.ac.be |
| Dept. Math. & Computer Science tel: (+32) 3 820 2427 |
| University of Antwerp, room J1.06 fax: (+32) 3 820 2421 |
| (UIA) Universiteitsplein 1, B-2610 Wilrijk - Antwerpen (Belgium) |
`---------------------------------------------------------------------'
Re: Optimizing and tuning
On Sat, 2003-02-01 at 07:44, Erik Moeller wrote:
> > Adding composite indexes won't help there. What you need is an extra
> > redundant table that contains the number of wanted links.
>
> We have a table that contains broken links, which is used in the Most
> Wanted query. I have no information about the size of these tables in our
> current database, but I suspect that, again, indexes are not properly
> used; otherwise the bad performance of this query (which has prompted
> Magnus to cache the whole page and Brion to create a "miser" mode in which
> it cannot be loaded) would be inexplicable.

EXPLAIN
SELECT bl_to, COUNT( DISTINCT bl_from ) as nlinks
FROM brokenlinks
GROUP BY bl_to
HAVING nlinks > 1
ORDER BY nlinks DESC
LIMIT 0, 50;
+-------------+-------+---------------+-------+---------+------+--------+---------------------------------+
| table       | type  | possible_keys | key   | key_len | ref  | rows   | Extra                           |
+-------------+-------+---------------+-------+---------+------+--------+---------------------------------+
| brokenlinks | index | NULL          | bl_to | 255     | NULL | 319493 | Using temporary; Using filesort |
+-------------+-------+---------------+-------+---------+------+--------+---------------------------------+

The problem is that it has to count up *every* row, then sort based on
the counts just to return the top 50. The grouping may be sped up by the
index, but it's still going to be a bit of work. (Just tried it, took
37.59 seconds.)

Jan has suggested having a summary table that keeps track of the counts,
so we can access them directly. I'm inclined to agree; we could update
it on page saves (and deletes and renames and undeletes!) to save the
bother of the group/count/sort behemoth. Something like:

CREATE TABLE wantedpages (
  wanted_title varchar(255) binary NOT NULL default '',
  wanted_count int unsigned NOT NULL default 0,
  UNIQUE KEY (wanted_title),
  INDEX (wanted_count)
);

Index on the count, certainly, to speed the sort for the most wanted
display. Index on the title should speed picking the right ones on
updates, right?

Anyway, this would turn the most wanted query into:

SELECT wanted_title,wanted_count
FROM wantedpages
WHERE wanted_count > 1
ORDER BY wanted_count DESC
LIMIT 0, 50;

which should be quite a bit faster.
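
If the optimizer picks up the index as hoped, the plan should look
nothing like the monster above (untested, of course):

EXPLAIN SELECT wanted_title, wanted_count
FROM wantedpages
WHERE wanted_count > 1
ORDER BY wanted_count DESC
LIMIT 0, 50;
-- Hoped-for plan: key = wanted_count, no "Using temporary; Using
-- filesort"; the index is walked in descending order and reading
-- stops after 50 rows.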


I suspect we can be smarter about how we update links and brokenlinks as
well; currently we just delete every row pertaining to the page we're
dealing with and re-insert it.

-- brion vibber (brion @ pobox.com)
Re: Optimizing and tuning
On Sat, Feb 01, 2003 at 09:29:24AM -0800, Brion Vibber wrote:
>
> Jan has suggested having a summary table that keeps track of the counts,
> so we can access them directly. I'm inclined to agree; we could update
> it on page saves (and deletes and renames and undeletes!) to save the
> bother of the group/count/sort behemoth.

Yes. Every time you update the brokenlinks table you also have to update the
wantedpages table, so that shouldn't be too hard to find in the code.

> Something like:
>
> CREATE TABLE wantedpages (
>   wanted_title varchar(255) binary NOT NULL default '',
>   wanted_count int unsigned NOT NULL default 0,
>   UNIQUE KEY (wanted_title),
>   INDEX (wanted_count)
> );
>
> Index on the count, certainly, to speed the sort for the most wanted
> display. Index on the title should speed picking the right ones on
> updates, right?

Exactly. Btw., shouldn't namespace be in there somewhere or is that still a
part of the title?

> I suspect we can be smarter about how we update links and brokenlinks as
> well; currently we just delete every row pertaining to the page we're
> dealing with and re-insert it.

Probably, because when a page is saved you have to retrieve the "old" broken
links from the DB anyway, because there is no other way of knowing them, and
you need them to update the wantedpages table. So you might as well compare
them to the new list of broken links and remove/add only what has changed,
which is usually only a few links anyway.
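
A sketch of the incremental update, assuming bl_from holds the id of the
referring page (the exact column semantics may differ):

-- Page 1234 no longer links to the missing page 'Foo':
DELETE FROM brokenlinks WHERE bl_from = 1234 AND bl_to = 'Foo';
UPDATE wantedpages SET wanted_count = wanted_count - 1
WHERE wanted_title = 'Foo';
-- Page 1234 now links to the missing page 'Bar':
INSERT INTO brokenlinks (bl_from, bl_to) VALUES (1234, 'Bar');
UPDATE wantedpages SET wanted_count = wanted_count + 1
WHERE wanted_title = 'Bar';
-- (if that UPDATE matches no row, insert ('Bar', 1) instead)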

-- Jan Hidders
Re: Optimizing and tuning
On Sat, 2003-02-01 at 10:31, Jan Hidders wrote:
> Btw., shouldn't namespace be in there somewhere or is that still a
> part of the title?

On the link tables, the title strings include the (text) namespace. That
might not be the best way to go about it, though.

(For that matter, the links table should probably just be
cur_id-to-cur_id, so we can grab the titles & namespaces on a join. Or
is it useful to store it separately?)

> > I suspect we can be smarter about how we update links and brokenlinks as
> > well; currently we just delete every row pertaining to the page we're
> > dealing with and re-insert it.
>
> Probably, because when a page is saved you have to retrieve the "old" broken
> links from the DB anyway, because there is no other way of knowing them, and
> you need them to update the wantedpages table. So you might as well compare
> them to the new list of broken links and remove/add only what has changed,
> which is usually only a few links anyway.

Sounds reasonable.

-- brion vibber (brion @ pobox.com)