Mailing List Archive

Performance
Many thanks to Neil Harris for the bots; they've been pounding on the
site for a while now, and here's what we've learned:

- The size of the database isn't an issue. If Wikipedia doubles, or
more, performance won't be affected at all. This is what I would have
expected.

- Concurrent access does slow things down, but not pathologically.
I've got 16 bots running over two high-speed connections right now,
and the server isn't swapping. However, some pages do take longer to
serve than when load is light.

- Some special pages are still moderately slow (particularly "wanted"
and "random page"), but the real time hogs now are very long pages
with lots of links. Some particular hogs are "Current events",
"Chinese sovereign", "List of rare diseases" and long history pages
with lots of changes, like the main page and bug reports. The sample
scrabble game is the longest page, but it has very few links so it's
not as much of a hog, though it's still a problem.

"Current events" strikes me as a particularly big, yet solvable,
problem. We should come up with a way of breaking it into manageable
pieces. Chinese sovereign can clearly be broken up as well, and we
can do things like replace the Scrabble diagrams with images.

Re: Performance
> Many thanks to Neil Harris for the bots;

Do we have an emergency tool to revert all edits from a certain IP
within a certain time period? Just in case the bot falls into the wrong
hands, or a script kiddie makes one himself.

I know bots can also be useful. Ben-Zin has made one to automatically
generate the year pages. It could also be helpful in other cases, so it
would be a good idea to share it with others, but I am afraid someone
might misuse it.

Kurt
Re: performance
> Some special pages are still moderately slow (particularly "wanted"
> and "random page"), but the real time hogs now are very long pages
> with lots of links.

I looked at the random page code, and right now it fetches a complete
list of all article IDs, in order to pick one out randomly. There must
be a better way to do this. This should be an O(log n) operation, not O(n).

The "wanted" special page will get a lot faster if we implement
Jan's idea of a table recording the number of broken links to every
unwritten article.
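
Roughly, I'm imagining something like the following (just a sketch; the
table and column names below are placeholders, not a finished design):

    -- One row per unwritten title, with a running count of the broken
    -- links pointing at it, kept up to date whenever pages are saved
    -- or deleted.
    CREATE TABLE brokenlinks_count (
        bl_title VARCHAR(255) NOT NULL PRIMARY KEY,
        bl_count INT NOT NULL DEFAULT 0
    );

    -- The "wanted" special page then becomes one small query over this
    -- table instead of a scan over the whole link structure:
    SELECT bl_title, bl_count
    FROM brokenlinks_count
    ORDER BY bl_count DESC
    LIMIT 50;

The sort still touches the whole counter table, but that is one row per
missing article rather than one row per link, which should be far cheaper.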

If long pages with lots of links cause trouble, maybe we should revive
the caching idea of the current code: the cur table gets another
column cur_cache where we store the rendered HTML. When displaying an
article, we simply pump out the contents of cur_cache, or, if cur_cache
is empty, we render, display and store in cur_cache. If a newly saved
article necessitates the updating of links, we junk the cur_cache of
all affected articles.
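
In schema terms, not much would be needed; something like this (only a
sketch -- apart from cur and cur_cache, the column names and the exact
invalidation query are illustrative):

    -- NULL (or empty) means "no cached rendering yet".
    ALTER TABLE cur ADD COLUMN cur_cache MEDIUMTEXT;

    -- On view: read cur_cache; if it is empty, render the wikitext,
    -- send it out, and store the result back:
    UPDATE cur SET cur_cache = '...rendered HTML...' WHERE cur_id = 123;

    -- On a save that changes links: junk the cache of every affected
    -- article so it gets re-rendered on its next view:
    UPDATE cur SET cur_cache = NULL WHERE cur_id IN ( ...affected ids... );

The win is that a long page with thousands of links gets rendered once
per edit instead of once per view.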

Axel
Re: Performance
On 7/13/02 4:19 AM, "lcrocker@nupedia.com" <lcrocker@nupedia.com> wrote:
> "Current events" strikes me as a particularly big, yet solvable,
> problem. We should come up with a way of breaking it into manageable
> pieces. Chinese sovereign can clearly be broken up as well, and we
> can do things like replace the Scrabble diagrams with images.
>
We really shouldn't replace the diagrams with images. That would be a major
step backward. It's bearable that the page is slow--it should be a test case
for improvement.

However, I agree that super-high-traffic pages like "Current events" should
be handled expediently.
Re: Re: performance
On Sun, Jul 14, 2002 at 03:50:20AM +0200, Axel Boldt wrote:
> > Some special pages are still moderately slow (particularly "wanted"
> > and "random page"), but the real time hogs now are very long pages
> > with lots of links.
>
> I looked at the random page code, and right now it fetches a complete
> list of all article IDs, in order to pick one out randomly. There must
> be a better way to do this. This should be an O(log n) operation, not O(n).

It can be. If you do a COUNT(*) without any conditions, MySQL looks it up
in the index, so that's very fast. Then you can pick a random number n
under this upper bound and ask for the n-th record with LIMIT. MySQL will
use the index for that, so it should be O(log n). If the record doesn't
satisfy the conditions (not the right namespace), you simply guess again,
but of course it will usually be OK the first time.
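
Something along these lines (column names are just for illustration; the
random offset would be computed in the wiki code between the two queries):

    -- Step 1: cheap; MySQL answers this without scanning the rows.
    SELECT COUNT(*) FROM cur;

    -- Step 2: pick a random r in [0, count) and fetch that one row.
    -- Here r = 1745, purely as an example:
    SELECT cur_id, cur_namespace, cur_title
    FROM cur
    LIMIT 1745, 1;

    -- Step 3: if the row is in the wrong namespace, pick a new r and
    -- repeat step 2.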

> The "wanted" special page will get a lot faster if we implement
> Jan's idea of a table recording the number of broken links to every
> unwritten article.

Well, I'm back from my short trip, so if Lee tells me how I can get
started I will. Actually, after that, if everybody agrees, I may begin to work
on the TeX extension.

> If long pages with lots of links cause trouble, maybe we should revive
> the caching idea of the current code: the cur table gets another
> column cur_cache where we store the rendered HTML. When displaying an
> article, we simply pump out the contents of cur_cache, or, if cur_cache
> is empty, we render, display and store in cur_cache. If a newly saved
> article necessitates the updating of links, we junk the cur_cache of
> all affected articles.

A good idea.

-- Jan Hidders