Mailing List Archive

Re: [Wikipedia-l] proposal to speed up Wikipedia
(moving to the wikitech-l list; see sign-up and archive page at
http://www.wikipedia.org/mailman/listinfo/wikitech-l )

Jonathan Walther wrote:
> I've done some work at converting the Wikipedia to Postgres, but am not
> there yet. So, let's put that aside for now.

Great! I did get PostgreSQL installed on my machine, but got bogged down
in details of converting the table definitions and various interface
behaviors. Someone with prior experience working with Postgres would be
a big help there.

> It seems that the wiki "source" is "interpreted" into html every single
> time someone accesses a link. That seems like a lot of overhead.
> Given that for every time a change is made to the wiki source to a page,
> several people "view" it, why not just regenerate the html only when
> changes are made, and store it? It would take more storage space, but
> should be MUCH faster. And if storage is an issue, I can donate some
> hard drives...

We used to cache in the phase II days on the old server. This was
removed for two reasons:
1) Wiki->HTML rendering is still pretty darn fast, particularly with our
new dedicated server; database contention seems to be our main problem
during high-load periods.
2) We had problems keeping the cache consistent with the old code.
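
For reference, the regenerate-on-edit idea proposed above amounts to
something like the following minimal sketch. Everything in it is
hypothetical (the RenderCache class, the toy_render function, the
in-memory dicts); it is not the old phase II cache, and it ignores
cross-page dependencies, which any real design would have to handle to
stay consistent.

```python
class RenderCache:
    """Hypothetical render cache: HTML is rebuilt only after a page is saved."""

    def __init__(self, render):
        self._render = render   # function: wikitext -> html
        self._source = {}       # title -> wikitext
        self._html = {}         # title -> cached html
        self.renders = 0        # counts how often the renderer actually ran

    def save(self, title, wikitext):
        # Store the new source and drop the stale HTML for this page only.
        self._source[title] = wikitext
        self._html.pop(title, None)

    def view(self, title):
        # Render lazily on the first view after a save, then reuse the result.
        if title not in self._html:
            self._html[title] = self._render(self._source[title])
            self.renders += 1
        return self._html[title]


def toy_render(wikitext):
    # Stand-in for the real wiki->HTML conversion.
    return "<p>" + wikitext + "</p>"


cache = RenderCache(toy_render)
cache.save("Apple", "A fruit.")
first = cache.view("Apple")
second = cache.view("Apple")        # served from the cache, no re-render
cache.save("Apple", "A tasty fruit.")
third = cache.view("Apple")         # re-rendered once after the edit
```

The saving depends entirely on how many views land between edits: a
page that is edited about as often as it is viewed gets re-rendered
nearly every time anyway.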

On number 2, I would certainly welcome an improved cache subsystem
that's designed right from the ground up. The old one was hacked in as a
"crap! the system's unusably slow, let's hack in some improved code"
stopgap.

On number 1, note that LinkCache::addLink() does a brief query on the
cur table for every link when rendering a wikipage. These could probably
be consolidated somehow or other. (Note that this does not apply to
Recentchanges, which loads everything in a big chunk.)
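
One way the per-link queries might be consolidated is to collect all
the link titles on a page first and check them with a single IN query.
The following is only a rough sketch: it uses Python and an in-memory
SQLite database rather than our actual code, and the reduced table
layout (just cur_title and cur_text) and the existing_titles function
are assumptions made for illustration.

```python
import sqlite3


def existing_titles(conn, titles):
    """Return the subset of `titles` that have a row in `cur`, using one query."""
    titles = list(dict.fromkeys(titles))  # drop duplicates, keep order
    if not titles:
        return set()
    # One placeholder per title; a very long list would need to be chunked
    # to stay under the database's bound-parameter limit.
    placeholders = ",".join("?" for _ in titles)
    rows = conn.execute(
        f"SELECT cur_title FROM cur WHERE cur_title IN ({placeholders})",
        titles,
    )
    return {row[0] for row in rows}


# Hypothetical, heavily reduced stand-in for the cur table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cur (cur_title TEXT PRIMARY KEY, cur_text TEXT)")
conn.executemany(
    "INSERT INTO cur VALUES (?, ?)",
    [("Apple", "..."), ("Banana", "...")],
)

# Links found on one page, including a repeat and a nonexistent target.
found = existing_titles(conn, ["Apple", "Cherry", "Banana", "Apple"])
```

The renderer would then look each link up in the returned set instead
of hitting the database once per link.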

> The savings on the Recent Changes page alone should work wonders.

On the English Wikipedia, Recentchanges is loaded at default options
about 3000 times per day; the number of edits per day is a similar
figure, and every edit means the page has to change to reflect it.
Caching the rendered display wouldn't seem to save significantly over
rerendering it on each view.

-- brion vibber (brion @ pobox.com)