There are two very good IRC channels on irc.freenode.net:
#php and #mysql
There are lots of knowledgeable people there who are eager to help open
source projects; many of them have already heard of Wikipedia. I learned
a lot just from being there for half an hour or so.
Here are some things we need to look into:
1) Composite indexes and index use in general. I do not know which indexes
are currently used on the live Wikipedia. However, after the *default*
database generation script, there is just a single composite index, namely
in the table watchlist. All other indexes are on a single field. If I
understand correctly, this means that when we do a query with multiple
comparisons, e.g. WHERE foo=bar AND baz=bat, only one of those indexes
can be used -- MySQL picks at most one index per table per query. At
least, that is what this article claims:
http://www.linux-mag.com/cgi-bin/printer.pl?issue=2001-06&article=mysql
There are also a couple of tables with no indexes (including ARCHIVE,
which may cause Special:Undelete to create high server load) and some
inexplicable ones (site_stats has an index, but only one row). We really
need to clean up our indexes. I can't help much with this without server
access because I don't know if the table structures have been altered.
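To illustrate what a fix might look like, here is a sketch of adding a
composite index; the table and column names are hypothetical, since I
don't know the live schema:

```sql
-- Hypothetical table of links between articles. With only
-- single-column indexes, a query filtering on both columns can
-- use at most one of them.
CREATE TABLE links (
  l_from INT NOT NULL,
  l_to INT NOT NULL
);

-- One composite index serves both comparisons at once:
ALTER TABLE links ADD INDEX l_from_to (l_from, l_to);

-- EXPLAIN should then show l_from_to in its "key" column:
EXPLAIN SELECT * FROM links WHERE l_from = 1 AND l_to = 2;
```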
2) The HANDLER command. HANDLER, which works with both InnoDB and MyISAM,
provides low-level access to tables *without locking*. You get reads, but
you don't get consistency, which may cause problems when you're dealing
with critical, heavily modified data. But it seems perfect for our archive
stuff. The HANDLER syntax is explained here:
http://www.mysql.com/doc/en/HANDLER.html
We definitely should look into this.
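For reference, a HANDLER session on a hypothetical archive table would
look roughly like this (a sketch based on the manual page above; the
table and index names are made up):

```sql
-- Open a low-level handle on the table (no consistent snapshot,
-- no locking -- acceptable for rarely modified archive data).
HANDLER archive OPEN;

-- Read rows via an index, then walk forward through it:
HANDLER archive READ ar_title FIRST;
HANDLER archive READ ar_title NEXT;

-- Always close the handle when done:
HANDLER archive CLOSE;
```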
3) Upgrading to MySQL4. MySQL4 supports query caching (also subqueries,
but I haven't looked into that), which means that heavily queried pages
will load faster. When someone mentioned query caching, a lot of other
people made comments like "query caching rocks", so I assume it provides
substantial performance benefits in practice.
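For when we upgrade, here is a sketch of how the cache would be enabled
and monitored (the size is a placeholder, not a tuned value):

```sql
-- In my.cnf one would set something like:
--   query_cache_size = 16M
--   query_cache_type = 1    (cache all cacheable SELECTs)
-- The hit rate can then be watched at runtime:
SHOW VARIABLES LIKE 'query_cache%';
SHOW STATUS LIKE 'Qcache%';  -- compare Qcache_hits vs. Qcache_inserts
```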
4) Tuning my.cnf. According to one IRC resident, upping table_cache can
greatly increase performance with lots of UPDATEs. If we create new
indexes, we may also need to raise the key buffer (the variable is
key_buffer_size; there's a formula to calculate your key cache
efficiency, and currently we are at 99.75%, which is pretty good).
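The efficiency formula works off two status counters; a sketch of
checking it:

```sql
-- Key cache efficiency = 1 - (Key_reads / Key_read_requests),
-- i.e. the fraction of index lookups served from memory rather
-- than from disk. Our 99.75% means roughly 1 in 400 hits disk.
SHOW STATUS LIKE 'Key_read%';
-- If new indexes push this down, raise key_buffer_size in my.cnf.
```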
5) Caching our article index. I have mentioned this idea before, I think:
We could just keep our entire article index in memory *persistently*
(currently the linkcache is filled for each request), which should make
link lookup almost instant. There are several ways to accomplish this:
- We could ask one of our resident C programmers to help. PHP has a
documented interface for persistent resources, which is used by
functions such as mysql_pconnect. This page describes how it works:
http://www.php-center.de/en-html-manual/phpdevel.html
- We could use session variables for the purpose. There is supposed to be
an option to keep session stuff in shared memory. This might help with
other caching problems as well.
- We could just put a database on a ramdisk for this purpose.
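As a variation on the ramdisk idea, MySQL itself can hold a table
entirely in memory via HEAP tables; a sketch (the schema here is
hypothetical):

```sql
-- HEAP tables live entirely in RAM; contents are lost on server
-- restart, so this only suits rebuildable caches such as the
-- article index.
CREATE TABLE linkcache_mem (
  lc_title VARCHAR(255) NOT NULL,
  lc_exists TINYINT NOT NULL,
  PRIMARY KEY (lc_title)
) TYPE=HEAP;
```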
The order of this list reflects the priority in which I think these
questions need to be addressed. Getting our indexes to work properly
should IMHO be of greatest importance. Even if EXPLAIN shows that an
index is used, we may frequently still require large table scans
because we do not use composite indexes.
Regards,
Erik