Mailing List Archive

Chat about Wikipedia performance?
How about organizing a chat this week about the ongoing Wikipedia
performance crisis and how to solve it? Talking to people can provide
additional motivation for getting things done, and help us organize our
priorities. It might also reduce some frustration. If we do this, all the
relevant people should be present:

- Jimbo
- Jason
- Brion
- Lee
- Magnus
- ...

It might be best to meet on the weekend, so that work does not interfere.
My suggestion would be Saturday, 20:00 UTC.

What do you think?

Regards,

Erik
Re: Chat about Wikipedia performance? [ In reply to ]
> (Erik Moeller <erik_moeller@gmx.de>):
> How about organizing a chat this week about the ongoing Wikipedia
> performance crisis and how to solve it? Talking to people can provide
> additional motivation for getting things done, and help us organize our
> priorities. It might also reduce some frustration. If we do this, all the
> relevant people should be present:
>
> - Jimbo
> - Jason
> - Brion
> - Lee
> - Magnus
> - ...
>
> It might be best to meet on the weekend, so that work does not interfere.
> My suggestion would be Saturday, 20:00 UTC.
>
> What do you think?

I'll have some more performance numbers by then, and I'm happy
to participate in whatever the group wants to do. But personally,
I've never been big on online chats, and I don't see that this
one could accomplish anything that wouldn't be better accomplished
here on wikitech-l.

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: Chat about Wikipedia performance? [ In reply to ]
> I'll have some more performance numbers by then, and I'm happy
> to participate in whatever the group wants to do. But personally,
> I've never been big on online chats, and I don't see that this
> one could accomplish anything that wouldn't be better accomplished
> here on wikitech-l.

Cool. The problem with mailing list discussions is that they can die
quickly, for many reasons, which can delay things unnecessarily. I've seen
many situations where a mailing list was used to report a serious problem,
but the post (in spite of hundreds of members) was ignored.

We all know that the performance issue is one of our most pressing
problems right now -- many people can't use the site anymore, and the
international Wikipedians are getting a bit irritated. So I think the best
way to address this *on time* is to sit down (virtually) and go through an
agenda.

Regards,

Erik
Re: Chat about Wikipedia performance? [ In reply to ]
If everyone decides that this is an important thing to do, I am
willing to attend. I'd prefer not to, though. I cherish my
weekends...

Jason

Erik Moeller wrote:

> How about organizing a chat this week about the ongoing Wikipedia
> performance crisis and how to solve it? Talking to people can provide
> additional motivation for getting things done, and help us organize our
> priorities. It might also reduce some frustration. If we do this, all the
> relevant people should be present:
>
> - Jimbo
> - Jason
> - Brion
> - Lee
> - Magnus
> - ...
>
> It might be best to meet on the weekend, so that work does not interfere.
> My suggestion would be Saturday, 20:00 UTC.
>
> What do you think?
>
> Regards,
>
> Erik

--
"Jason C. Richey" <jasonr@bomis.com>
Re: Chat about Wikipedia performance? [ In reply to ]
On Mon, 2003-04-28 at 14:14, Erik Moeller wrote:
> The problem with mailing list discussions is that they can die
> quickly, for many reasons, which can delay things unnecessarily. I've seen
> many situations where a mailing list was used to report a serious problem,
> but the post (in spite of hundreds of members) was ignored.

Reporting a serious problem is all well and good, but isn't the same as
_fixing_ it.

> We all know that the performance issue is one of our most pressing
> problems right now -- many people can't use the site anymore, and the
> international Wikipedians are getting a bit irritated. So I think the best
> way to address this *on time* is to sit down (virtually) and go through an
> agenda.

We don't need to sit and chat. We need *code* and we need a second
server to divide the "must-do-fast" web work and the "chug-chug-chug"
database labor.

Here are some things you can work on if you've got time to spend on
Wikipedia coding:

* Page viewing is still kinda inefficient. Rendering everything on every
view is not so good... Caching can save both the processing time of
conversion to HTML and various database accesses (checking link
tables, etc.) with their associated potential locking overhead.

We need to either be able to cache the HTML of entire pages (followed by
insertion of user-specific data/links or simple options through style
sheet selection or string replacement) or to cache just the generated
HTML of the wiki pages for insertion into the page structure (plus
associated data, like interlanguage links, would need to be accessible
without parsing the page).

We need to tell which pages are or aren't cacheable (not a diff, not a
special page, not a history revision, not a user with really weird
display options -- or on the other hand, maybe we _could_ cache those,
if only we can distinguish them), we need to be able to generate and
save the cached material appropriately, we need to make sure it's
invalidated properly, and we need to be able to do mass invalidation
when, for instance, the software is upgraded. Cached pages may be kept
in files, rather than the database.

I should point out that while there are several possible choices here,
any of them is better than what we're running now. We need living,
running _code_, which can then be improved upon later.
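
To make that concrete, here's a very rough sketch of what a file-backed
page cache might look like; the function names, the $wgCacheDirectory
setting, and the use of cur_touched (as a unix timestamp) for
invalidation are all illustrative, not anything that exists in CVS:

# Rough sketch only -- $wgCacheDirectory and these functions are made up.
function wfCacheFileName( $title ) {
    global $wgCacheDirectory;
    return $wgCacheDirectory . '/' . md5( $title ) . '.html';
}

# Return cached HTML, or false if missing or stale.  $curTouched is the
# page's cur_touched value converted to a unix timestamp; it is bumped
# whenever the page or one of its linked pages changes.
function wfFetchCachedPage( $title, $curTouched ) {
    $file = wfCacheFileName( $title );
    if ( !file_exists( $file ) || filemtime( $file ) < $curTouched ) {
        return false;
    }
    return implode( '', file( $file ) );
}

function wfSaveCachedPage( $title, $html ) {
    $fp = fopen( wfCacheFileName( $title ), 'w' );
    fwrite( $fp, $html );
    fclose( $fp );
}

Mass invalidation on a software upgrade would then just be an rm of the
cache directory, and diffs, special pages, histories and logged-in users
with odd display options would simply bypass these functions.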

* The page saving code is rather inefficient, particularly with how it
deals with the link tables (and potentially buggy -- sometimes pages end
up with their link table entries missing, possibly due to the system
timing out between the main save chunk and the link table update). If
someone would like to work on this, it would be very welcome. Nothing
that needs to be _discussed_, it just needs to be _done_ and changes
checked in.

* Various special pages are so slow they've been disabled. Most of them
could be made much more efficient with better queries and/or by
maintaining summary tables. Some remaining ones are also pretty
inefficient, like the Watchlist. Someone needs to look into these and
make the necessary adjustments to the code. Nothing to _chat_ about; if
you know how to make them more efficient, please rewrite them and check
in the _code_.
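
As a concrete (untested) illustration of the summary-table idea -- the
summary table name is invented, and the brokenlinks column names are
from memory, so check them against the live schema -- a cron job could
do something like:

# Rebuild a small summary table offline instead of scanning link tables
# on every hit of the special page.  Names here are illustrative only.
mysql_query( "CREATE TABLE IF NOT EXISTS wantedpages_summary (
                  ws_title VARCHAR(255) BINARY NOT NULL,
                  ws_count INT NOT NULL,
                  PRIMARY KEY (ws_title)
              )", $db );
mysql_query( "DELETE FROM wantedpages_summary", $db );
mysql_query( "INSERT INTO wantedpages_summary
              SELECT bl_to, COUNT(*) FROM brokenlinks GROUP BY bl_to", $db );

# The special page itself is then a trivial query:
#   SELECT ws_title, ws_count FROM wantedpages_summary
#   ORDER BY ws_count DESC LIMIT 50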

* Can MySQL 4 handle fulltext searches better under load? Is boolean
mode faster or slower? Someone needs to test this (Lee has a test rig
with mysql4 already, but as far as I know hasn't tested the fulltext
search with boolean mode yet), and if it's good news, we need to make an
upgrade a high priority. Not much to _chat_ about, it just needs to get
_done_.
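
For anyone who wants to pick this up, the test amounts to timing the two
query forms against a full dump; this is only a crude harness, and the
column and index names are from memory, so check them against the real
schema first:

# Crude timing harness -- column/index names may not match the live schema.
function wfTime() {
    list( $usec, $sec ) = explode( ' ', microtime() );
    return (float)$usec + (float)$sec;
}

$queries = array(
    'natural' => "SELECT cur_id FROM cur
                  WHERE MATCH(cur_text) AGAINST('sparrow hawk')",
    'boolean' => "SELECT cur_id FROM cur
                  WHERE MATCH(cur_text) AGAINST('+sparrow +hawk' IN BOOLEAN MODE)"
);
foreach ( $queries as $mode => $sql ) {
    $t = wfTime();
    mysql_query( $sql, $db );
    print $mode . ": " . ( wfTime() - $t ) . " sec\n";
}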

* Alternately, would a completely separate search system (not using
MySQL) be more efficient? Or even just running searches on a dedicated
box with a replicated database to keep it from bogging down the main db?
Which leads us back to hardware...

As for the server: I don't know what's going on here. What I do know is
that Jimbo posted this to wikitech-l in February:

-----Forwarded Message-----

From: Jimmy Wales <jwales@bomis.com>
To: wikitech-l@wikipedia.org
Subject: [Wikitech-l] Hardware inventory
Date: 07 Feb 2003 02:56:57 -0800

Jason and I are taking stock of our hardware, and I'm going to find a
secondary machine to devote exclusively to doing apache for wikipedia,
i.e. with no other websites on it or anything. I'll loan the machine
to the Wikipedia Foundation until the Foundation has money to buy a
new machine later on this year.

We'll keep the MYSQL where it is, on the powerful machine. The new
machine will be no slouch, either.

Today is Friday, and I think we'll have to wait for Jason to take a
trip to San Diego next week sometime (or the week following) to get
this all setup. (The machine I have in mind is actually in need of
minor repair right now.)

By having this new machine be exclusively wikipedia, I can give the
developers access to it, which is a good thing.

This will *not* involve a "failover to read-only" mechanism, I guess,
but then, it's still going to be a major improvement -- such a
mechanism is really a band-aid on a fundamental problem, anyway.

------

Lots of people think it's a good thing to set up mirror servers all
over the Internet. It's really not that simple. There are issues of
organizational trust with user data, issues with network latency, etc.
Some things should be decentralized, some things should be
centralized.

--- end forwarded message ---


and this to wikipedia-l in March:


-----Forwarded Message-----

From: Jimmy Wales <jwales@bomis.com>
To: wikipedia-l@wikipedia.org, wikien-l@wikipedia.org
Subject: [Wikipedia-l] Off today
Date: 19 Mar 2003 04:47:52 -0800

My wife and little girl are feeling ill today with a cold, so I'm
going to be taking off work to help out. I'm already a little behind
in wikipedia email, so I'll probably be slow for a few days as I dig
out.

We're getting a new (second) machine for wikipedia -- the parts have
been ordered and are being shipped to Jason, and then at some point
soon, he'll drive down to San Diego to install everything.

--Jimbo

--- end forwarded message ---


I e-mailed Jimbo and Jason the other day about this; I haven't heard
back from Jimbo, and Jason still doesn't know anything concrete about
the new server.

Jimbo, we really need some news on this front. If parts and/or a whole
machine really *is* on order and can be set up in the near future, we
need to know that. If it's *not*, then it may be time to pass around the
plate and have interested parties make sure one does get ordered, as had
begun to be discussed prior to the March 19 announcement.

-- brion vibber (brion @ pobox.com)
Re: Chat about Wikipedia performance? [ In reply to ]
> We don't need to sit and chat. We need *code* and we need a
> second server to divide the "must-do-fast" web work and the
> "chug-chug-chug" database labor.

The code issue is mostly a matter of focus: one or two developers
is probably sufficient to keep the codebase up to date, but neither
Brion nor I are focused on that right now.

So while Brion has issued a call for coders, that could be answered
in other ways: for example, if a good admin stepped up to take some
of the admin tasks Brion is currently swamped with, he might be more free
to code (assuming he's interested, which is not a given either).
I've chosen to focus more on long-term goals because like Brion I was
expecting hardware to bail us out in the short term. If that's going
to be delayed, then I can put off things like testing file systems
and focus on caching and tuning.

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: Chat about Wikipedia performance? [ In reply to ]
Brion, you're missing my point. I agree with you entirely that things need
to "get done". My suggestion to have a public discussion was to find out
which things we can get done reasonably quickly (because, realistically,
we all have other things to do) with substantial impact; to figure out the
server situation, which features should be disabled, who might contribute
which piece of code etc. If we can sort these things out in the next few
days via mail, fine. I'm no IRC junkie. But we need to implement at least
some reasonable emergency fixes, and think about a mid term strategy.

As for code, this is one thing I'd like to talk about: If we have the
Nupedia Foundation set up, we can collect donations. It would be stupid
not to use some of that money for funding development. I don't care who is
funded, but I think this could greatly speed things up. If we can't get
the NF set up reasonably quickly, we should collect donations regardless,
tax-deductible or not.

> * Page viewing is still kinda inefficient. Rendering everything on every
> view is not so good...

Why? It's just PHP stuff. Our bottleneck is the database server. Fetching
stuff from CUR and converting it into HTML is not an issue. 20 pass
parser? Add another zero. Until I see evidence that this has any impact on
performance, I don't care. Turn off link checking and all pages are
rendered lightning fast.

What would be useful is to maintain a persistent (over several sessions)
index of all existing and non existing pages in memory for the link
checking. A file on a ramdisk maybe? I think it would be worth giving it a
try at least, and not a lot of work.
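
Something like the following is what I have in mind -- just a sketch,
with made-up paths and function names, and it would of course have to be
optional:

# Sketch only: dump all existing titles to a file (e.g. on a ramdisk)
# whenever a page is created or deleted, then load it once per request.
function wfDumpTitleIndex( $db ) {
    $res = mysql_query( "SELECT cur_namespace, cur_title FROM cur", $db );
    $fp  = fopen( "/mnt/ramdisk/titles.idx", "w" );
    while ( $row = mysql_fetch_row( $res ) ) {
        fwrite( $fp, $row[0] . ":" . $row[1] . "\n" );
    }
    fclose( $fp );
}

function wfLoadTitleIndex() {
    $index = array();
    foreach ( file( "/mnt/ramdisk/titles.idx" ) as $line ) {
        $index[ rtrim( $line ) ] = true;
    }
    return $index;
}

# The link check during rendering then becomes an array lookup
# instead of a SELECT per link:
#   $exists = isset( $index["0:Foo_bar"] );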

> We need to tell which pages are or aren't cacheable (not a diff, not a
> special page, not a history revision, not a user with really weird
> display options -- or on the other hand, maybe we _could_ cache those,
> if only we can distinguish them), we need to be able to generate and
> save the cached material appropriately, we need to make sure it's
> invalidated properly, and we need to be able to do mass invalidation
> when, for instance, the software is upgraded. Cached pages may be kept
> in files, rather than the database.

Wasted effort, IMHO. Cache improvements have added little measurable
performance benefit, and there are many, many different situations to
test here (different browsers, different browser cache settings, etc.).
Meanwhile, our real bottlenecks (search, special pages, out-of-control
queries) remain in place.

> * The page saving code is rather inefficient, particularly with how it
> deals with the link tables (and potentially buggy -- sometimes pages end
> up with their link table entries missing, possibly due to the system
> timing out between the main save chunk and the link table update). If
> someone would like to work on this, it would be very welcome. Nothing
> that needs to be _discussed_, it just needs to be _done_ and changes
> checked in.

I doubt that a *relatively* rare activity like that makes much of an
impact, but I'll be happy to be proven wrong. Bugs are annoying, but I'm
writing this for one reason: we need to make Wikipedia usable again on a
regular basis. There are countless small problems that need to be fixed.
This is not the issue here.

> * Various special pages are so slow they've been disabled. Most of them
> could be made much more efficient with better queries and/or by
> maintaining summary tables. Some remaining ones are also pretty
> inefficient, like the Watchlist. Someone needs to look into these and
> make the necessary adjustments to the code.

Caching special pages seems like a reasonable approach. Watchlists could
definitely be improved, haven't seen a good way to do this yet, though. It
could be done on page save, but with a much-watched page, this again would
add severe drain, with possibly no overall benefit. Improve the SQL and
indexes? Maybe, but I'm no SQL guru.

> * Can MySQL 4 handle fulltext searches better under load? Is boolean
> mode faster or slower? Someone needs to test this (Lee has a test rig
> with mysql4 already, but as far as I know hasn't tested the fulltext
> search with boolean mode yet), and if it's good news, we need to make an
> upgrade a high priority.

Sounds good to me. If safe enough, we should update in any case; it is my
understanding that MySQL4 has support for subqueries which could, if we
know what we're doing, potentially be used to write significantly more
efficient queries.

Regards,

Erik
Re: Chat about Wikipedia performance? [ In reply to ]
On Mon, 2003-04-28 at 18:04, Kurt Jansson wrote:
> Could we set the length of the watchlist to 50 or something like
> that by default, and not make it dependent on the length you choose
> in the preferences for RecentChanges?

Well, that wouldn't help for performance as the data goes through a
temporary table. Basically the DB's grabbing your *entire* watchlist,
then only sending the most recent X items to the wiki for formatting in
a list.

> Even with a second server, and the software and database being faster,
> how long will it take until this again isn't enough because articles,
> editors and visitors should be growing exponentially in theory. There
> will be the foundation, and maybe we'll get some money through it and
> can buy new hardware, but will it be sufficient? And for how long?

How long will the internet be able to deal with all those new users?
Won't we run out of IP addresses if IPv6 never rolls out? When will the
sun burn out, leaving the earth a lifeless ball of coal?? :) Hopefully,
we'll be able to keep up.

> And
> aren't there other important things we could spend the money on, if an
> other free project or a university would host us for free? But maybe I
> cherish an illusion here.

Do feel free to ask other free projects and universities if they'd be
interested in supporting the project...

-- brion vibber (brion @ pobox.com)
Re: Chat about Wikipedia performance? [ In reply to ]
Brion Vibber schrieb:

> * Various special pages are so slow they've been disabled. Most of them
> could be made much more efficient with better queries and/or by
> maintaining summary tables. Some remaining ones are also pretty
> inefficient, like the Watchlist. Someone needs to look into these and
> make the necessary adjustments to the code.

Could we set the length of the watchlist to 50 or something like
that by default, and not make it dependent on the length you choose
in the preferences for RecentChanges? (I, for example, have set it to
150, because otherwise the "show changes since ..." list is cut off too
early. On the English WP I'd have to set it even higher.)


> Jimbo, we really need some news on this front. If parts and/or a whole
> machine really *is* on order and can be set up in the near future, we
> need to know that. If it's *not*, then it may be time to pass around the
> plate and have interested parties make sure one does get ordered, as had
> begun to be discussed prior to the March 19 announcement.

Even with a second server, and the software and database being faster,
how long will it take until this again isn't enough, since articles,
editors and visitors should in theory be growing exponentially? There
will be the foundation, and maybe we'll get some money through it and
can buy new hardware, but will it be sufficient? And for how long? And
aren't there other important things we could spend the money on, if
another free project or a university would host us for free? But maybe
I'm chasing an illusion here.


Kurt
Re: Chat about Wikipedia performance? [ In reply to ]
On Mon, 2003-04-28 at 16:46, Erik Moeller wrote:
> Brion, you're missing my point. I agree with you entirely that things need
> to "get done". My suggestion to have a public discussion was to find out
> which things we can get done reasonably quickly (because, realistically,
> we all have other things to do) with substantial impact; to figure out the
> server situation, which features should be disabled, who might contribute
> which piece of code etc. If we can sort these things out in the next few
> days via mail, fine. I'm no IRC junkie. But we need to implement at least
> some reasonable emergency fixes, and think about a mid term strategy.

Well maybe, but my experience with using online chats like this is:
* Everyone sits around for several hours babbling, waiting for the other
folks to show up and complaining about the problems they're having
logging in.
* By the end, someone has scribbled up a page with a work plan, which
everyone ignores in the future.
* During all this time, they _could_ have been doing something
productive instead...

> As for code, this is one thing I'd like to talk about: If we have the
> Nupedia Foundation set up...

The status of the non-profit is indeed another thing Jimbo could shed
some light on...

> > * Page viewing is still kinda inefficient. Rendering everything on every
> > view is not so good...
>
> Why? It's just PHP stuff.

Obviously not, since that PHP stuff needs data from the database to
work. :) Buying milk at the grocery store is more convenient than
keeping and milking cows at home not because the milking process is more
time consuming than grabbing a bottle from the fridge, but because
maintaining the cow is a huge effort and milk is only available from the
cow under certain conditions. Or, um, something like that.

> Our bottleneck is the database server. Fetching
> stuff from CUR and converting it into HTML is not an issue. 20 pass
> parser? Add another zero. Until I see evidence that this has any impact on
> performance, I don't care. Turn off link checking and all pages are
> rendered lightning fast.

And that would be a pretty piss-poor wiki, wouldn't it? :)

> What would be useful is to maintain a persistent (over several sessions)
> index of all existing and non existing pages in memory for the link
> checking. A file on a ramdisk maybe? I think it would be worth giving it a
> try at least, and not a lot of work.

Sure, it _might_ help. Code it up and see!

> > * The page saving code is rather inefficient, particularly with how it
> > deals with the link tables...
>
> I doubt that a *relatively* rare activity like that makes much of an
> impact, but I'll be happy to be proven wrong.

Slow saving impacts everyone who tries to edit articles; four edits per
minute may be _relatively_ rare compared to page views, but we're still
running thousands of edits per day and it's a fundamental part of what a
wiki is. It's absolutely vital that editing be both swift and bug-free,
and if we can reduce the opportunities for saving to get hung up, so
much the better.

> Caching special pages seems like a reasonable approach.

Unfortunately that doesn't really solve the problem any more than
replacing the search with a link to Google solves the search problem.

If updating these cached pages is so slow and db-intensive that it takes
the 'pedia offline for fifteen-twenty minutes (which it does), then
nobody's going to want to update the caches (last updated April 9...)
and they become outdated and useless.

It works as a temporary crutch in place of "blank page -- this feature
has been disabled, please find a less popular web site to play on", but
it's not a solution.

> Watchlists could
> definitely be improved, haven't seen a good way to do this yet, though. It
> could be done on page save, but with a much-watched page, this again would
> add severe drain, with possibly no overall benefit. Improve the SQL and
> indexes? Maybe, but I'm no SQL guru.

Which Himalayan mountain do we have to climb to find one? :)

-- brion vibber (brion @ pobox.com)
Re: Chat about Wikipedia performance? [ In reply to ]
> Well maybe, but my experience with using online chats like this is:
> * Everyone sits around for several hours babbling, waiting for the other
> folks to show up and complaining about the problems they're having
> logging in.
> * By the end, someone has scribbled up a page with a work plan, which
> everyone ignores in the future.
> * During all this time, they _could_ have been doing something
> productive instead...

Depends on the moderator. No moderation = unpredictable, bad moderator =
bad result, good moderator = possibly good result. Just like in real-life
meetings.

> Obviously not, since that PHP stuff needs data from the database to
> work. :)

Duh. But if our database is so slow that it can't even answer simple
SELECTs, we can't do anything useful, cache or no cache. And if it isn't,
then we should concentrate on the queries which aren't simple. The
linkcache might still be one of those bottlenecks (simply because of the
sheer number of queries involved); I haven't checked your latest changes
to that code.

>> Our bottleneck is the database server. Fetching
>> stuff from CUR and converting it into HTML is not an issue. 20 pass
>> parser? Add another zero. Until I see evidence that this has any impact on
>> performance, I don't care. Turn off link checking and all pages are
>> rendered lightning fast.

> And that would be a pretty piss-poor wiki, wouldn't it? :)

Yes, but this is really one of the more expensive wiki features that also
limits all caching options severely. Impossible to work without it, but
apparently hard to implement in a scalable fashion.

>> What would be useful is to maintain a persistent (over several sessions)
>> index of all existing and non existing pages in memory for the link
>> checking. A file on a ramdisk maybe? I think it would be worth giving it a
>> try at least, and not a lot of work.

> Sure, it _might_ help. Code it up and see!

I might. I'll have to see if it makes any difference on the relatively
small de database which I'm currently using locally. It would have to be
optional -- setting up the software is already difficult enough.

> Slow saving impacts everyone who tries to edit articles; four edits per
> minute may be _relatively_ rare compared to page views, but we're still
> running thousands of edits per day and it's a fundamental part of what a
> wiki is. It's absolutely vital that editing be both swift and bug-free,
> and if we can reduce the opportunities for saving to get hung up, so
> much the better.

Yeah yeah yeah. I still think we should care more about the real
showstoppers. But hey, you can always _code it_. (Finally an opportunity
to strike back ;-)

> If updating these cached pages is so slow and db-intensive that it takes
> the 'pedia offline for fifteen-twenty minutes (which it does), then
> nobody's going to want to update the caches (last updated April 9...)
> and they become outdated and useless.

If this downtime is unacceptable, we might indeed have to think about a
query-only server with somewhat delayed data availability. This could be a
replacement for the sysops, too. Mirroring the Wikipedia database files
(raw) should be no issue with a SCSI system, or a low-priority copy
process.

>> Watchlists could
>> definitely be improved, haven't seen a good way to do this yet, though. It
>> could be done on page save, but with a much-watched page, this again would
>> add severe drain, with possibly no overall benefit. Improve the SQL and
>> indexes? Maybe, but I'm no SQL guru.

> Which Himalayan mountain do we have to climb to find one? :)

Maybe we should stop looking in the Himalayan mountains and start
searching the lowlands... In other words: don't look for those who will do
it for society or for the glory. Just hand over the cash and be done with
it.

Regards,

Erik
Re: Chat about Wikipedia performance? [ In reply to ]
Brion Vibber schrieb:
> On Mon, 2003-04-28 at 18:04, Kurt Jansson wrote:
>
>>Could we set the length of the watchlist to 50 or something like
>>that by default, and not make it dependent on the length you choose
>>in the preferences for RecentChanges?
>
> Well, that wouldn't help for performance as the data goes through a
> temporary table. Basically the DB's grabbing your *entire* watchlist,
> then only sending the most recent X items to the wiki for formatting in
> a list.

I see. I hadn't thought this through completely.
(Maybe a link in the Watchlist for easy removal of articles would help
people to keep their watchlist small and tidy.)


>>Even with a second server, and the software and database being faster,
>>how long will it take until this again isn't enough because articles,
>>editors and visitors should be growing exponentially in theory. There
>>will be the foundation, and maybe we'll get some money through it and
>>can buy new hardware, but will it be sufficient? And for how long?
>
> How long will the internet be able to deal with all those new users?
> Won't we run out of IP addresses if IPv6 never rolls out? When will the
> sun burn out, leaving the earth a lifeless ball of coal?? :) Hopefully,
> we'll be able to keep up.

Okay, I'll remind you in a year or two :-)


> Do feel free to ask other free projects and universities if they'd be
> interested in supporting the project...

I'll do that. Could you describe what our requirements are? (Sorry, I'm not
very experienced with this server stuff. I'm just trying to install
Debian with a friend's help.) I'll go knocking on doors at the
three universities in Berlin, then.


Kurt
Re: Chat about Wikipedia performance? [ In reply to ]
On Mon, 2003-04-28 at 19:21, Erik Moeller wrote:
[on a persistent link-existence table]
> I might. I'll have to see if it makes any difference on the relatively
> small de database which I'm currently using locally. It would have to be
> optional -- setting up the software is already difficult enough.

I don't know whether you've already looked into this, but PHP does seem
to have some support for shared memory:

http://www.php.net/manual/en/ref.sem.php
or
http://www.php.net/manual/en/ref.shmop.php

These seem to require enabling compile-time options for PHP.

It's also possible to create an in-memory-only table in MySQL
(type=HEAP), which may be able to bypass other MySQL slownesses (but it
may not, I haven't tested it).
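
To illustrate the HEAP idea (untested; the table name is invented, and a
HEAP table loses its contents on a server restart, so it has to be
rebuildable from cur at any time):

# Build an in-memory copy of the title list for link-existence checks.
mysql_query( "CREATE TABLE IF NOT EXISTS title_cache (
                  tc_namespace INT NOT NULL,
                  tc_title VARCHAR(255) BINARY NOT NULL,
                  PRIMARY KEY (tc_namespace, tc_title)
              ) TYPE=HEAP", $db );

mysql_query( "DELETE FROM title_cache", $db );
mysql_query( "INSERT INTO title_cache
              SELECT cur_namespace, cur_title FROM cur", $db );

# A link check then never touches the on-disk tables:
$res = mysql_query( "SELECT 1 FROM title_cache
                     WHERE tc_namespace=0 AND tc_title='Foo_bar'", $db );
$exists = ( mysql_num_rows( $res ) > 0 );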

> > Slow saving impacts everyone who tries to edit articles; four edits per
> > minute may be _relatively_ rare compared to page views, but we're still
> > running thousands of edits per day and it's a fundamental part of what a
> > wiki is. It's absolutely vital that editing be both swift and bug-free,
> > and if we can reduce the opportunities for saving to get hung up, so
> > much the better.
>
> Yeah yeah yeah. I still think we should care more about the real
> showstoppers. But hey, you can always _code it_. (Finally an opportunity
> to strike back ;-)

Touché. :) My point is just that we need to keep that critical path
clean and smooth -- and working. (I would consider not differentiating
live from broken links, or getting frequent failures on page save to be
fatal flaws, whereas not having a working search or orphans function is
just danged annoying.)

> If this downtime is unacceptable, we might indeed have to think about a
> query only server with somewhat delayed data availability. This could be a
> replacement for the sysops, too. Mirroring the Wikipedia database files
> (raw) should be no issue with a SCSI system, or a low priority copy
> process.

Sure, MySQL's database replication can provide for keeping a synched db
on another server. (Which, too, could provide for some emergency
fail-over in case the main machine croaks.)

The wiki would just need a config option to query the replicated server
for certain slow/nonessential operations (search, various special pages,
sysop queries) and leave the main db server free to take care of the
business of showing and saving pages and logging in users.
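
The config side of that could be as simple as something like this
(variable and function names invented for the sake of the example):

$wgDBserver        = "db-master.example";   # page views, saves, logins
$wgDBreplicaServer = "db-replica.example";  # search, special pages, sysop queries

# Pass $slow = true for queries where slightly stale data is acceptable.
function wfGetDB( $slow = false ) {
    global $wgDBserver, $wgDBreplicaServer, $wgDBuser, $wgDBpassword;
    $host = ( $slow && $wgDBreplicaServer != "" )
            ? $wgDBreplicaServer : $wgDBserver;
    return mysql_pconnect( $host, $wgDBuser, $wgDBpassword );
}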

However this is all academic until we have reason to believe that a
second server will be available to us in the near future.

> Maybe we should stop looking in the Himalayan mountains and start
> searching the lowlands .. In other words: Don't search those who will do
> it for society or for the glory. Just hand over the cash and be done with
> it.

A lovely idea, but there _isn't_ any cash as of yet, nor a non-profit
foundation to formally solicit donations with which to fund programmers.
Until this gets done, or unless someone wants to fund people more
directly, all we've got is volunteer developers, who are only rarely
unemployed database gurus who can spend all day working on Wikipedia. :)

-- brion vibber (brion @ pobox.com)
Re: Chat about Wikipedia performance? [ In reply to ]
Brion Vibber schrieb:
> On Mon, 2003-04-28 at 16:46, Erik Moeller wrote:
>> Improve the SQL and indexes? Maybe, but I'm no SQL guru.
>
> Which Himalayan mountain do we have to climb to find one? :)

Wasn't there someone on this list a while ago who has written a book
about MySQL? Or am I fantasising about this?


Kurt
Re: Chat about Wikipedia performance? [ In reply to ]
On Mon, Apr 28, 2003 at 05:45:18PM -0500, Lee Daniel Crocker wrote:
> > We don't need to sit and chat. We need *code* and we need a
> > second server to divide the "must-do-fast" web work and the
> > "chug-chug-chug" database labor.
>
> The code issue is mostly a matter of focus: one or two developers
> is probably sufficient to keep the codebase up to date, but neither
> Brion nor I are focused on that right now.
>
> So while Brion has issued a call for coders, that could be answered
> in other ways: for example, if a good admin stepped up to take some
> of admin tasks Brion is currently swamped with, he might be more free
> to code (assuming he's interested, which is not a given either).
> I've chosen to focus more on long-term goals because like Brion I was
> expecting hardware to bail us out in the short term. If that's going
> to be delayed, then I can put off things like testing file systems
> and focus on caching and tuning.

I'm certainly willing to help out here. I'm not in SoCal, but I should
be able to help out with most administrivial tasks. I'm going to be
able to help out much more with tuning at a file system/OS/Apache
level than I will be at a PHP/SQL level.

--
Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN
Re: Chat about Wikipedia performance? [ In reply to ]
On Tue, 29 Apr 2003, Nick Reinking wrote:

> Date: Tue, 29 Apr 2003 11:44:45 -0500
> From: Nick Reinking <nick@twoevils.org>
> Subject: Re: [Wikitech-l] Chat about Wikipedia performance?
>
> On Mon, Apr 28, 2003 at 05:45:18PM -0500, Lee Daniel Crocker wrote:
> >
> > The code issue is mostly a matter of focus: one or two developers
> > is probably sufficient to keep the codebase up to date, but neither
> > Brion nor I are focused on that right now.
> >
> > So while Brion has issued a call for coders, that could be answered
> > in other ways: for example, if a good admin stepped up to take some
> > of admin tasks Brion is currently swamped with, he might be more free
> > to code (assuming he's interested, which is not a given either).
> > I've chosen to focus more on long-term goals because like Brion I was
> > expecting hardware to bail us out in the short term. If that's going
> > to be delayed, then I can put off things like testing file systems
> > and focus on caching and tuning.
>
> I'm certainly willing to help out here. I'm not in SoCal, but I should
> be able to help out with most administrivial tasks. I'm going to be
> able to help out much more with tuning at a file system/OS/Apache
> level than I will be at a PHP/SQL level.

Since I've just joined on to the tech list, might as well introduce myself
in the tech context. My specialty is administration of routers and WANs,
and along the way I've come to know general Linux quite well, HTML & HTTP,
Apache, Perl & CGI, & MySQL. A few other things that I don't think would
be very relevant, but you never know, would be DNS (Bind, of course),
Sendmail, and TCP/IP details. The only thing holding me back from being a
useful coder so far seems to be that I don't know beans about PHP, but I
could certainly be similarly helpful in that "relief pitcher" kind of way.

--
John R. Owens http://www.ghiapet.homeip.net/
Sleep is a common substitute for adequate caffeine intake.
--John Owens
Chat about Wikipedia performance? [ In reply to ]
Hi - clearly, it'd be great if Wikipedia had better performance.

I looked at some of the "Database benchmarks" postings,
but I don't see any analysis of what's causing the ACTUAL bottlenecks
on the real system (with many users & full database).
Has someone done that analysis?

I suspect you guys have considered far more options, but as a
newcomer who's just read the source code documentation, maybe
some of these ideas will be helpful:

1. Perhaps for simple reads of the current article (cur),
you could completely skip using MySQL and use the filesystem instead.
Simple encyclopedia articles could be simply stored in the
filesystem, one article per file. To avoid the huge directory problem
(which many filesystems don't handle well, though Reiser does),
you could use the terminfo trick: create subdirectories for the
first, second, and maybe even the third characters. E.g., "Europe"
is in "wiki/E/u/r/Europe.text". The existence of a file can be used as
the link test. This may or may not be faster than MySQL, but it
probably is: the OS developers have been optimizing
file access for a very long time, and instead of having
userspace<->kernel<->userspace interaction, it's
userspace<->kernel interaction. You also completely avoid
locking and other joyless issues.
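
In code, the layout I mean is roughly this (titles containing characters
that aren't filename-safe would need escaping, which I'm ignoring here):

# Map an article title to a nested path, e.g. "Europe" -> wiki/E/u/r/Europe.text
function wiki_article_path( $title ) {
    $t   = str_replace( ' ', '_', $title );
    $dir = "wiki";
    for ( $i = 0; $i < min( 3, strlen( $t ) ); $i++ ) {
        $dir .= "/" . $t[$i];
    }
    return $dir . "/" . $t . ".text";
}

# The link test is then just:
#   file_exists( wiki_article_path( "Europe" ) )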

2. The generation of HTML from the Wiki format could be cached,
as has been discussed. It could also be sped up, e.g., by
rewriting it in flex. I suspect it'd be easy to rewrite the
translation of Wiki to HTML in flex and produce something quite fast.
My "html2wikipedia" is written in flex - it's really fast and didn't
take long to write. The real problem is, I suspect, that this
isn't the bottleneck.

3. You could start sending out text ASAP, instead of batching it.
Many browsers start displaying text as it's available, so to
users it might _feel_ faster. Also, holding text in-memory
may create memory pressure that forces more useful stuff out of
memory.


Anyway, I don't know if these ideas are all that helpful,
but I hope they are.
Re: Chat about Wikipedia performance? [ In reply to ]
> (David A. Wheeler <dwheeler@dwheeler.com>):
>
> 1. Perhaps for simple reads of the current article (cur), you
> could completely skip using MySQL and use the filesystem instead.

In other words, caching. Yes, various versions of that have been
tried and proposed, and more will be. The major hassles are (1) links,
which are displayed differently when they point to existing pages, so
a page may appear differently from one view to the next depending on
the existence of other pages, and (2) user settings, which will cause
a page to appear differently for different users. But caching is
still possible within limits, and using the filesystem rather than
the database to store cached page info is certainly one possible
implementation to be tried.

> [Rendering] could also be sped up, e.g., by rewriting it in flex.
> My "html2wikipedia" is written in flex - it's really fast and didn't
> take long to write. The real problem is, I suspect that
> isn't the bottleneck.

It isn't. And there's no reason to expect flex to be any faster
than any other language.

> 3. You could start sending out text ASAP, instead of batching it.
> Many browsers start displaying text as it's available, so to
> users it might _feel_ faster. Also, holding text in-memory
> may create memory pressure that forces more useful stuff out of
> memory.

Not an issue. HTML is sent out immediately after it's rendered.
Things like database updates are deferred until after sending;
the only time taken before that is spent in rendering, and as I
said, that's not a bottleneck.

One thing that would be nice is if the HTTP connection could be
dropped immediately after sending and before those database updates.
That's easy to do with threads in Java Servlets, but I haven't
found any way to do it with Apache/PHP.

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: Chat about Wikipedia performance? [ In reply to ]
On Tue, 2003-04-29 at 23:33, Lee Daniel Crocker wrote:
> > (David A. Wheeler <dwheeler@dwheeler.com>):
> > 1. Perhaps for simple reads of the current article (cur), you
> > could completely skip using MySQL and use the filesystem instead.
>
> In other words, caching.

Not necessarily; it would also be possible to keep the wiki text in
files. But I'm not sure what great benefit this would have, as you still
have to go looking up various information to render it.

> Yes, various versions of that have been
> tried and proposed, and more will be. The major hassles are (1) links,
> which are displayed differently when they point to existing pages, so
> a page may appear differently from one view to the next depending on
> the existence of other pages,

That's not a problem; one simply invalidates the caches of all linking
pages when creating/deleting.

This is already done in order to handle browser-side caching; each
page's cur_touched timestamp is updated whenever a linked page is
created or deleted. Simply regenerate the page if cur_touched is more
recent than the cached HTML.

> > 3. You could start sending out text ASAP, instead of batching it.
> > Many browsers start displaying text as it's available, so to
> > users it might _feel_ faster.

A few things (like language links) currently require parsing the entire
wikitext before we output the topbar. Hypothetically we could output the
topbar after the text and let CSS take care of its location as we do for
the sidebar, but this may be problematic (ie in case of varying vertical
size due to word wrap) and would leave users navigationally stranded
while loading.

> > Also, holding text in-memory
> > may create memory pressure that forces more useful stuff out of
> > memory.
>
> Not an issue. HTML is sent out immediately after it's rendered.

Well... many passes of processing are done over the wikitext on its way
to HTML, then the whole bunch is dumped out in a chunk.

> Things like database updates are deferred until after sending;

I'm not 100% sure how safe this is; if the user closes the connection
from their browser deliberately (after all, the page _seems_ to be done
loading, why is the icon still spinning?) or due to an automatic
timeout, does the script keep running through the end or is it halted in
between queries?

> One thing that would be nice is if the HTTP connection could be
> dropped immediately after sending and before those database updates.
> That's easy to do with threads in Java Servlets, but I haven't
> found any way to do it with Apache/PHP.

For some things (search index updates) we use INSERT/REPLACE DELAYED
queries, whose actual action will happen at some point in the future,
taken care of for us by the database. There doesn't seem to be an
equivalent for UPDATE queries.
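
For anyone who hasn't run into it, the DELAYED form looks like this (the
column names here are only illustrative, not necessarily our real
search-index schema); the call returns as soon as the row is queued, the
actual write happens when the table is otherwise idle, it only works for
some table types, and there is no UPDATE DELAYED:

mysql_query( "INSERT DELAYED INTO searchindex (si_page, si_title, si_text)
              VALUES (42, 'sparrow_hawk', 'sparrow hawk accipiter nisus')", $db );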

Hypothetically we could have an entirely separate process to perform
asynchronous updates and just shove commands at it via a pipe or shared
memory, but that's probably more trouble than it's worth.

-- brion vibber (brion @ pobox.com)
Re: Chat about Wikipedia performance? [ In reply to ]
>From: Lee Daniel Crocker <lee@piclab.com>
>One thing that would be nice is if the HTTP connection could be
>dropped immediately after sending and before those database updates.
>That's easy to do with threads in Java Servlets, but I haven't
>found any way to do it with Apache/PHP.

:P No, I looked into exactly this problem in connection with my own little
project (improved Special:Movepage). PHP and threads don't mix. As far as I
could see, the PHP subprocess has to exit (taking all threads with it)
before Apache will drop the connection. Like Brion said, you'd have to set
up another process, and use PHP's poorly documented IPC functions. As for
what improvement it would achieve: it wouldn't reduce database load per
view, it would just allow users to hit more pages sooner.

I think caching HTML is the way to go, in the short term. If people don't
want to code something complicated, you could ignore user preferences for
now and only cache pages for "anonymous" users. The cached version could
leave little notes in the HTML like

<strong>Isaac Newton</strong> was a <<WIKILINK[[physics|physicist]]>> born
in...

and maybe

<<USERIP>> (<a href ="http://www.wikipedia.org/wiki/User_talk:<<USERIP>>"
class='internal' title="User talk:<<USERIP>>">Talk</a>

Then a cache processing script would look up the link table and replace the
links with real HTML. I imagine looking up the link table is much, much
faster than looking up cur_text. Plus the cached text would be stored on the
web server, thereby distributing disk load more evenly.
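
The processing script could be as dumb as something like this (regex and
function names invented, untested; $goodLinks would be an array keyed by
title, loaded from the link table):

function wfExpandCachedPage( $html, $goodLinks ) {
    $parts = preg_split( '/<<WIKILINK\[\[(.*?)\]\]>>/', $html, -1,
                         PREG_SPLIT_DELIM_CAPTURE );
    $out = '';
    for ( $i = 0; $i < count( $parts ); $i++ ) {
        if ( $i % 2 == 0 ) {       # even slices are plain HTML
            $out .= $parts[$i];
            continue;
        }
        $bits   = explode( '|', $parts[$i] );
        $target = $bits[0];
        $text   = isset( $bits[1] ) ? $bits[1] : $bits[0];
        $class  = isset( $goodLinks[$target] ) ? 'internal' : 'new';
        $out   .= '<a href="/wiki/' . urlencode( $target ) .
                  '" class="' . $class . '">' . $text . '</a>';
    }
    return $out;
}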

As for invalidation, the easiest, and possibly ugliest way I can think of is
implementing it in wfQuery() *cringe*. That's a very simple function with
very diverse uses, but every single update query passes through that point.
Just use a hash table (always in RAM) to store the article name of every
cache entry, and remove the rows when they're invalidated.

There'd also have to be a check for an altered user talk page. This could be
handled with another of my <<TAGS>>.

This idea is likely to be met with apathy. I'd like to code it myself, but I
don't have Linux on my PC, or a broadband connection, or much free hard
drive space, or... time. So there you have it: my two cents, backed up by
hot air.

-- Tim Starling.


Re: Chat about Wikipedia performance? [ In reply to ]
> (Tim Starling <ts4294967296@hotmail.com>):
> [Notes on caching]

When I recompiled Apache to add mod_mmap_static, I also updated
PHP to 4.3, and compiled the latter with the shared memory functions.
After we get things stable again, the first optimization I plan to
try is to have a shared-memory cache of existing page titles, so we
don't have to go to the database dozens of times per page render.
I suspect that will gain us more than caching whole pages.
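
The rough shape of it, using the sysvshm functions (the key and segment
size are picked arbitrarily, and shm_get_var copies the whole array out
on each call, which is part of why a more compact structure might win):

# Untested sketch.  Build step: store the title list once after a rebuild.
$shm = shm_attach( 0x57494B49, 4 * 1024 * 1024 );   # ~4 MB segment
shm_put_var( $shm, 1, $titleArray );                 # title => true

# Per request: one fetch, then every link check is a local array lookup.
$titles = shm_get_var( $shm, 1 );
$exists = isset( $titles['Foo_bar'] );
shm_detach( $shm );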

But for now, we're in rescue mode.

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: Chat about Wikipedia performance? [ In reply to ]
Lee Daniel Crocker <lee@piclab.com> said:
> (David A. Wheeler <dwheeler@dwheeler.com>) said:
> >
> > 1. Perhaps for simple reads of the current article (cur), you
> > could completely skip using MySQL and use the filesystem instead.
>
> In other words, caching.

Sorry, I wasn't clear.
I wasn't thinking of caching - I was thinking of accessing the
filesystem INSTEAD of MySQL when getting the current wikitext.

Why? Well, I suspect that accessing the filesystem directly
is much faster than accessing the data via MySQL - if most
accesses are simple reads, then you can access it without
user-level locks, etc., etc. Even more importantly,
checking for existence is a simple filesystem check - which
is likely to be much faster than the MySQL request.

Would it be faster? I don't know; the only _real_ way to find out
is to benchmark it.

Of course, if Wikipedia is near the breaking point for performance,
another approach would be to change the design so that reading
only requires one lookup (for the data itself).
You noted the two big problems, and I agree that they're the
sticking points.
You could abandon many user settings, except ones that the user
can supply themselves to select between different stylesheets, and
abandon displaying links differently depending on whether or not
they're there. Less desirable, but you've already abandoned supporting
search! Then you can cache the generated HTML as well.

If it's a choice between having a working wikipedia, and
having the bells & whistles, I think working is the better plan.
You can always include them as settable options, to be returned once
the system doesn't have performance problems.

Although databases are more flexible for storing structured
data, for simple unstructured data, a simple filesystem-based
approach might be more suitable. This also lets you use other
existing tools (like the many tools that let you store
indexes for later rapid searching based on files).

A quick start might be to temporarily disable all checking
of links, and see if that helps much.

> > [Rendering] could also be sped up, e.g., by rewriting it in flex.
> > My "html2wikipedia" is written in flex - it's really fast and
> didn't
> > take long to write. The real problem is, I suspect that
> > isn't the bottleneck.
>
> It isn't. And there's no reason to expect flex to be any faster
> than any other language.

Actually, for some lexing applications flex can be MUCH faster.
That's because it can pre-compile a large set of patterns
into C, and compile the result. Its "-C" option can, for
some applications, result in blazingly fast operations.
You CAN do the same thing by hand, but it takes a long time to
hand-optimize that kind of code.

However, there's no point in rewriting what is not the bottleneck.
Which I why I was hoping to hear if someone has done measurements
to identify the real bottlenecks, e.g., "50% of the system
time is spent doing X". If most time is spent rendering
articles for display (without editing), then it's worth examining
what's taking the time. If the time is spent on checking if
links exist, then clearly that's worth examining.

Oh, one note - if you want to simply store whether or not a
given article entry exists, and quickly check it, one
fancy way of doing this is by using a Bloom filter.
You can hash the article title, and then a fancy data
structure can store its existence or non-existence.
More info, and MIT-licensed code, for a completely different
application are at:
http://www.ir.bbn.com/projects/SPIE
(there, they hash packets so that later queries can ask
"did you see this packet"?). Given the relatively small
size of article text, it's not clear you need this
(you can store all the titles in memory), but I just thought
I'd mention it.
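
In PHP it might look something like this toy version (not the SPIE code;
$m is the number of bits, $k the number of hash positions, both to be
tuned to the title count):

# Toy Bloom filter: an $m-bit array packed into a string, $k positions
# per title derived from md5 with different salts.
function bloom_positions( $title, $m, $k ) {
    $pos = array();
    for ( $i = 0; $i < $k; $i++ ) {
        $pos[] = hexdec( substr( md5( $i . ':' . $title ), 0, 7 ) ) % $m;
    }
    return $pos;
}

function bloom_add( &$bits, $title, $m, $k ) {
    foreach ( bloom_positions( $title, $m, $k ) as $p ) {
        $byte = (int)( $p / 8 );
        $bits[$byte] = chr( ord( $bits[$byte] ) | ( 1 << ( $p % 8 ) ) );
    }
}

function bloom_check( $bits, $title, $m, $k ) {
    foreach ( bloom_positions( $title, $m, $k ) as $p ) {
        if ( !( ord( $bits[ (int)( $p / 8 ) ] ) & ( 1 << ( $p % 8 ) ) ) ) {
            return false;      # definitely not present
        }
    }
    return true;               # probably present; false positives possible
}

# Start with $bits = str_repeat( chr( 0 ), $m / 8 )  (with $m a multiple
# of 8), then bloom_add() every existing title.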

Anyway, thanks for listening. My hope is that the Wikipedia
doesn't become a victim of its own success :-).
Re: Re: Chat about Wikipedia performance? [ In reply to ]
David A. Wheeler wrote:
> If it's a choice between having a working wikipedia, and
> having the bells & whistles, I think working is the better plan.

I agree completely.

We've been advised by... I'm sorry, but I forgot who it was, but he's
the author of a well-known book on this sort of thing... that
separating webserving and database should be a huge win. If that's
right, then we should be good to go after the new server is installed
this weekend, and after some time spent getting it into service.

In general, I think that it is absolutely true that responsiveness is
more important than frills. I have never thought of the feature of
links appearing differently depending on whether or not the article
exists as a frill, but I suppose it is. We could conceivably abandon
that and any other feature that requires "on the fly" anything, and
make the site very fast.

But it's probably better, for some features, to throw hardware at it.

--Jimbo
RE: Re: Chat about Wikipedia performance? [ In reply to ]
> A quick start might be to temporarily disable all checking
> of links, and see if that helps much.

This seems to be a helpful suggestion. Without profiling, it's hard to
tell where the bottleneck is, but I think link checking is a good guess.


And this is fairly simple to try (now that Lee has created a functioning
test suite), and could probably tell us if this is a bottleneck. If so,
then at least we know where to focus our optimization efforts.

If this is the problem, we are in luck because there have been a lot of
good improvement suggestions. But they all add complexity to the code
(or database setup) and "premature optimization is the root of all kinds
of evil," so if link checking isn't a bottleneck it would be
counterproductive to spend a lot of time to try to optimize it.

--Mark
Re: Re: Chat about Wikipedia performance? [ In reply to ]
> (David A. Wheeler <david_a_wheeler@yahoo.com>):
>
>>> 1. Perhaps for simple reads of the current article (cur), you
>>> could completely skip using MySQL and use the filesystem instead.
>>
>> In other words, caching.
>
> Sorry, I wasn't clear.
> I wasn't thinking of caching - I was thinking of accessing the
> filesystem INSTEAD of MySQL when getting the current wikitext.

No, you were clear. I am using "caching" in the plain English
sense of the word. Using the file system as a cache in front of
the database is just one possible implementation of the idea.

>> It isn't. And there's no reason to expect flex to be any
>> faster than any other language.

> Actually, for some lexing applications flex can be MUCH faster.
> That's because it can pre-compile a large set of patterns
> into C, and compile the result. Its "-C" option can, for
> some applications, result in blazingly fast operations.

I suppose that's true. I do want to formalize the wikitext
grammar at some point, and using something like Lex/Yacc
code compiled and linked into PHP as a module is certainly
a possibility.

> Oh, one note - if you want to simply store whether or not a
> given article entry exists or not, and quickly check it, one
> fancy way of doing this is by using a Bloom filter.
> You can hash the article title, and then using a fancy data
> structure can store its existance or non-existance.

Yes, that's a very good idea. I just recompiled the PHP on
the server to have the shared memory extensions, so putting
a Bloom filter into that memory is probably better than a
more typical hash table.

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC

1 2 3 4  View All