Thinking about Phase IV
After getting back into wikiland, catching up with wikipedia-l
was pretty easy, but catching up with the wikitech list took a
little longer. It seems you guys have had interesting times
lately (in the Chinese curse sense). Sorry I abandoned you,
but you guys do seem to have risen to the challenge.

Magnus did a great service by giving us code with features
that made Wikipedia usable and popular. When that code bogged
down to the point where the wiki became nearly unusable, there
wasn't much time to sit down and properly architect and develop
a solution, so I just reorganized the existing architecture for
better performance and hacked all the code. This got us over
the immediate crisis, but now my code is bogging down, and we
are having to remove useful features to keep performance up.

I think it's time for Phase IV. We need to sit down and design
an architecture that will allow us to grow without constantly
putting out fires, and that can become a stable base for a fast,
reliable Wikipedia in years to come. I'm now available and
equipped to help in this, but I thought I'd start out by asking
a few questions here and making a few suggestions.

* Question 1: How much time do we have?

Can we estimate how long we'll be able to limp along with
the current code, adding performance hacks and hardware to
keep us going? If it's a year, that will give us certain
opportunities and guide some choices; if it's only a month
or two, that will constrain a lot of those choices.

* Suggestion 1: The test suite.

I think the most critical piece of code to develop right now
is a comprehensive test suite. This will enable lots of
things. For example, if we have a performance question, I
can set up one set of wiki code on my test server, run the
suite to get timing data, tweak the code, then run the suite
again to get new timing. The success of the suite will tell
us if anything broke, and timing will tell us if we're on
the right track. This will be useful even during the
limp-along-with-current-code phase. I have a three-machine
network at home, with one machine I plan to dedicate 100% to
wiki code testing, and my test server in San Antonio that we
can use. This will also allow us to safely refactor code.
I'd like to use something like Latka for the suite (see
http://jakarta.apache.org/commons/latka/index.html).

* Question 2: How wedded are we to the current tools?

Apache/MySQL/PHP seems a good combo, and it probably would
be possible to scale them up further, but there certainly
are other options. Also, are we willing to take chances on
semi-production quality versions like Apache 2.X and MySQL 4.X?
I'd even like to revisit the decision of using a database
at all. After all, a good file system like ReiserFS (or to
a lesser extent, ext3) is itself a pretty well-optimized
database for storing pieces of free-form text, and there are
good tools available for text indexing, etc. Plus it's
easier to maintain and port.

* Suggestion 2: Use the current code for testing features.

In re-architecting the codebase, we will almost certainly
come to points where we think a minor feature change will
make a big performance difference without hurting usability,
or where there are features we want to implement anyway. For
example, we could probably make it easier to cache page
requests if we made most of the article content HTML not
dependent on skin by tagging elements well and using CSS
appropriately. Also, we probably want to eventually render
valid XHTML. I propose that while we are building the
phase IV code, we add little features like this to the
existing code to gauge things like user reactions and
visual impact.

Other suggestions/questions/answers humbly requested
(including "Are you nuts? Let's stick with Phase III!" if
you have that opinion).

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: Thinking about Phase IV
Lee,

I don't think we should completely redesign things from scratch. See
http://www.joelonsoftware.com/articles/fog0000000069.html
about rewriting in general.

We are getting to the point where we know what the performance bottlenecks
are, and we are fixing them. Brion has built some basic profiling into the
code, and we've checked the slow query log. We still haven't fully
understood everything, but I think we are pretty certain about the
following:

1) PHP is not and has never been a problem. Virtually all our performance
problems have been related to specific SQL queries (either a very high
number of them, or complex ones). I do not see any reason at all to stop
using PHP.

2) Getting MySQL to perform properly largely depends on using indexes the
right way. This means providing composite indexes where needed. In the
case of timestamps, we had to add a reverse timestamp column for it to be
index-sorted fast; while this is a hack, it is a needed one until MySQL 4.
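
For anyone who hasn't seen the trick, it looks roughly like this
(table and column names quoted from memory, so double-check against
the live schema before trusting the details):

  -- Timestamps are stored as YYYYMMDDHHMMSS strings, so the
  -- numeric complement sorts in exactly the reverse order:
  ALTER TABLE cur ADD COLUMN inverse_timestamp CHAR(14) NOT NULL;
  UPDATE cur SET inverse_timestamp = 99999999999999 - cur_timestamp;
  CREATE INDEX cur_inv_ts ON cur (inverse_timestamp);

  -- "Newest first" then becomes a plain ascending index scan:
  SELECT cur_namespace, cur_title FROM cur
  ORDER BY inverse_timestamp LIMIT 50;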

It is specifically complex SQL queries which require ordering the whole
result set that create headaches. These have been disabled for the moment,
but I believe we can fix them. None of them are mission critical.

As for MySQL4, I support trying it out on our test server,
test.wikipedia.org, and possibly on meta.wikipedia.org as well. We
shouldn't switch the main site(s) until these two have run on it for a
while.

We also need to keep in mind that we are growing very fast. We now have
several highly active Wikipedias, all of them residing on the same server.
While I think our server still has some room, at some point we will have
to upgrade and no amount of hacking will prevent that. Separating web and
database server, as is planned, should help, but I don't know how much.

I think our priorities should be this:

1) Get some of the other language Wikipedias up that people are waiting
for. If there are motivated users who want to start a Wikipedia in their
language, we should not let them wait.

2) Fix known bugs and try to improve the speed in case of remaining
bottlenecks.

3) Implement suggested improvements.
- Improve search + redirect handling
- Finish Magnus' interlanguage links redesign
- Fix Recent changes layout
- Redesign image pages
- Redesign talk pages
- Improved edit conflict handling (CVS style merge)
- Backends (SVG, Lilypond), syntax improvements etc.

Aside from this, an entirely new project is the dedicated Wikipedia
client, for offline reading and, hopefully, ultimately for editing as
well. Magnus has started working on this.

What I understand to be "Phase IV" is, then, a point where we have
finished all the important fixes and improvements and then decide to move
on to "nice to have" stuff. Among this is the much requested multilanguage
portal for Wikipedia, with a multilanguage search, RC etc., and possibly
merging the databases of the different Wikipedias (at least the user
data). I do think the current software can and should be used as a basis
for the next phase(s).

Regards,

Erik
Re: Thinking about Phase IV
On Fri, 21 Feb 2003, Lee Daniel Crocker wrote:
> Can we estimate how long we'll be able to limp along with
> the current code, adding performance hacks and hardware to
> keep us going? If it's a year, that will give us certain
> opportunities and guide some choices; if it's only a month
> or two, that will constrain a lot of those choices.

The immediate crisis is over. Now that we're on the track of proper
indexing, performance should no longer significantly degrade with
increased size.

The special pages that are currently disabled just need to be rewritten to
have and use appropriate indexes or summary tables. Performance hacks?
Sure.
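
To give that concrete shape (purely a sketch; the wantedpages table
here is hypothetical): a page like "Wanted pages" could read from a
small summary table that a maintenance script rebuilds periodically,
instead of grovelling over the whole link set on every hit:

  -- Hypothetical summary table, rebuilt offline:
  CREATE TABLE wantedpages (
    wp_title VARCHAR(255) NOT NULL,
    wp_count INT UNSIGNED NOT NULL,
    INDEX (wp_count)
  ) TYPE=MyISAM;

  -- The special page is then a cheap read; the table is small
  -- enough that the sort doesn't matter:
  SELECT wp_title, wp_count FROM wantedpages
  ORDER BY wp_count DESC LIMIT 50;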

We're planning to move the database and web server to two separate
machines, which should help quite a bit as well, and there's still a lot
of optimization to be done in the common path. (Caching HTML would save
trips to the database as well as rendering time, though it's not the
biggest priority yet.)

I'd feel quite confident giving us another year with the current codebase.

> * Suggestion 1: The test suite.

AMEN BROTHER!

> I'd even like to revisit the decision of using a database
> at all. After all, a good file system like ReiserFS (or to
> a lesser extent, ext3) is itself a pretty well-optimized
> database for storing pieces of free-form text, and there are
> good tools available for text indexing, etc. Plus it's
> easier to maintain and port.

Really though, our text _isn't_ free-form. It's tagged with metadata that
either needs to be tucked into a filesystem (non-portably) or a structured
file format (XML?). And now we have to worry about locking multiple files
for consistency, which likely means separate lockfiles... and we quickly
find we've reinvented the database, just using more file descriptors. ;)

The great advantage of the database though is the ability to perform
ad-hoc queries. Obviously our regular operations have to be optimized, and
special queries have to be set up such that they don't bog down the
general functioning of the wiki, but in general the coolest thing about
the phase II/III PediaWiki is the SQL query ability: savvy (and
responsible) users can cook up their own queries to do useful little
things such as:

* looking up new user accounts who haven't yet been greeted
* checking for "orphan" talk pages
* listing the most frequent contributors

etc, without downloading a 4-gigabyte database to their home machines or
begging the developers to write a special-purpose script.
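
For instance, off the top of my head (schema details from memory,
so treat these as sketches; note too that titles use underscores
where user names use spaces):

  -- "Orphan" talk pages: talk pages whose subject article is gone
  -- (namespace 0 = article, 1 = talk in the current convention):
  SELECT talk.cur_title
  FROM cur AS talk
  LEFT JOIN cur AS subject
         ON subject.cur_namespace = 0
        AND subject.cur_title = talk.cur_title
  WHERE talk.cur_namespace = 1
    AND subject.cur_title IS NULL;

  -- Most frequent contributors, counted by saved revisions:
  SELECT old_user_text, COUNT(*) AS edits
  FROM old
  GROUP BY old_user_text
  ORDER BY edits DESC
  LIMIT 20;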

Now, it may well be that it would make sense to store the rendered HTML in
files which could be rapidly spit out on request, but that's supplementary
to what the database does for us.

> For
> example, we could probably make it easier to cache page
> requests if we made most of the article content HTML not
> dependent on skin by tagging elements well and using CSS
> appropriately.

You mean, like we had in phase II before you rewrote it? ;)

-- brion vibber (brion @ pobox.com)
Re: Thinking about Phase IV
> (Erik Moeller <erik_moeller@gmx.de>):
> Lee,
>
> I don't think we should completely redesign things from scratch. See
> http://www.joelonsoftware.com/articles/fog0000000069.html
> about rewriting in general.

I'm well aware of refactoring; that's the main reason I want the
test suite first. But this case is a little different from the one
Joel describes: we're starting with a complete feature set and
(at least initially) not making any changes at all to it. That
in a sense makes it a refactoring job even if we do replace the
actual code. And if we decide that Apache/PHP/MySQL is the tool
of choice, we won't throw away code at all, but simply refactor
all the way.

> 1) PHP is not and has never been a problem. Virtually all our
> performance problems have been related to specific SQL queries
> (either a very high number of them, or complex ones). I do not
> see any reason at all to stop using PHP.

I can believe that.

> 2) Getting MySQL to perform properly largely depends on using
> indexes the right way. This means providing composite indexes
> where needed. In the case of timestamps, we had to add a reverse
> timestamp column for it to be index-sorted fast; while this is
> a hack, it is a needed one until MySQL 4...

I'm still concerned, though, that even if we optimize all the
indexing, we'll never achieve a speedup of more than 2-3x. I don't
know if that will be enough in the long run. After the test suite
is done, I can do some head-to-head testing of things like indexes
and MySQL 4.X.

> I think our priorities should be this:
> 1) Get some of the other language Wikipedias up that people are
> waiting for. If there are motivated users who want to start a
> Wikipedia in their language, we should not let them wait.

I agree that this should be a priority for the project. I'm not
as convinced that it's the best use of /my/ time, and since my
absence I'm more committed to ensuring that I don't burn out again
by spending my own time on things I'm not best suited to. I think
it should be up to motivated foreign users to migrate their own
wiki. The presence of someone skilled enough to do that should
be evidence of the level of desire.

> 2) Fix known bugs and try to improve the speed in case of
> remaining bottlenecks.

Agreed. This can also be done in parallel with new development
if needed, and if the speedups are dramatic enough, perhaps it
will show that new development isn't needed after all.

> 3) Implement suggested improvements.
> - Improve search + redirect handling
> - Finish Magnus' interlanguage links redesign
> - Fix Recent changes layout

That's one of my concerns too: I spent about three weeks trying
to do this, but it just wasn't possible to get the features I
wanted with the current architecture while keeping performance
acceptable.

> - Redesign image pages
> - Redesign talk pages
> - Improved edit conflict handling (CVS style merge)

Hmm. I'm not sure about that one.

> - Backends (SVG, Lilypond), syntax improvements etc.
>
> Aside from this, an entirely new project is the dedicated
> Wikipedia client, for offline reading and, hopefully,
> ultimately for editing as well. Magnus has started working
> on this.

That's great. That's probably better suited to his talents.

> What I understand to be "Phase IV" is, then, a point where
> we have finished all the important fixes and improvements and
> then decide to move on to "nice to have" stuff. Among this is
> the much requested multilanguage portal for Wikipedia, with a
> multilanguage search, RC etc., and possibly merging the
> databases of the different Wikipedias (at least the user data).
> I do think the current software can and should be used as a basis
> for the next phase(s).

Cross-language stuff is a big issue too. I confess that I
ignored that issue 100% in the present design. If we can add
those features without looking like a hack, I'm all for it, but
I suspect a new architecture will help there more than anywhere.

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: Thinking about Phase IV
> The great advantage of the database though is the ability
> to perform ad-hoc queries.

Yep, that's a big one, no doubt about it. That's why I'm mostly
just brainstorming at this point. I _suspect_ that the database
costs us a lot in things like unnecessary locking and indexing,
but that may well be offset by the gains.

>> For example, we could probably make it easier to cache page
>> requests if we made most of the article content HTML not
>> dependent on skin by tagging elements well and using CSS
>> appropriately.
>
> You mean, like we had in phase II before you rewrote it? ;)

Now, waitaminnit, that's not true at all. Magnus had all kinds
of dynamic nonsense that I removed--his code changed the actual
HTML of links depending on the user's preference for link color,
for example. I removed a lot of those, but I don't know if I
caught all of them. I know for sure that the sidebars are fully
dynamic and likely uncacheable, but the article content should
be mostly cacheable now or close to it.

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: Thinking about Phase IV
On Fri, 21 Feb 2003, Lee Daniel Crocker wrote:
> > You mean, like we had in phase II before you rewrote it? ;)
>
> Now, waitaminnit, that's not true at all. Magnus had all kinds
> of dynamic nonsense that I removed--his code changed the actual
> HTML of links depending on the user's preference for link color,
> for example.

You must have been working from an older version of the codebase, because
I know for a fact that I replaced those with stylesheets, which is how we
had caching of rendered HTML in phase II.

-- brion vibber (brion @ pobox.com)
Re: Thinking about Phase IV
> (Brion Vibber <vibber@aludra.usc.edu>):
> On Fri, 21 Feb 2003, Lee Daniel Crocker wrote:
> > > You mean, like we had in phase II before you rewrote it? ;)
> >
> > Now, waitaminnit, that's not true at all. Magnus had all kinds
> > of dynamic nonsense that I removed--his code changed the actual
> > HTML of links depending on the user's preference for link color,
> > for example.
>
> You must have been working from an older version of the codebase, because
> I know for a fact that I replaced those with stylesheets, which is how we
> had caching of rendered HTML in phase II.

That's possible, I suppose, but I'm sure I didn't _add_ any
dynamic HTML except outside the article content. If I did, then
mea maxima culpa.

Yes, it is necessary to eliminate dynamically rendered article
content to make caching effective. I don't think that's where
the biggest bang-for-the-buck will be, though. The things I
personally think would be the biggest wins are (1) fine-tuning
the hell out of the queries needed to do RecentChanges, and
(2) making the link cache persistent across requests so we don't
have to look up every page that's linked to when rendering.
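
Even short of full persistence, batching the existence checks would
help: instead of one query per link while rendering, one query per
page, something like this (the titles are placeholders, and the
real code would need to handle namespaces properly):

  SELECT cur_namespace, cur_title FROM cur
  WHERE cur_namespace = 0
    AND cur_title IN ('Foo', 'Bar', 'Baz');

One round trip tells us which of a page's links exist; a persistent
cache would then only be consulted for the misses.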

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: Thinking about Phase IV
> I think our priorities should be this:
> 3) Implement suggested improvements.
> - Improve search + redirect handling
> - Finish Magnus' interlanguage links redesign
> - Fix Recent changes layout
> - Redesign image pages
> - Redesign talk pages
> - Improved edit conflict handling (CVS style merge)
> - Backends (SVG, Lilypond), syntax improvements etc.

Improvements to the management of deleted pages.
Right now, it's *very* difficult to find a page
which has been deleted, and then to check it to
decide whether it should be restored or not.
It would be great if this page could at least be
sorted by date of deletion, author of deletion,
namespace and alphabetical order. Does anybody else
have trouble with this log?

Re: deletion management
> Improvements to the management of deleted pages.
> Right now, it's *very* difficult to find a page
> which has been deleted, and then to check it to
> decide whether it should be restored or not.
> It would be great if this page could at least be
> sorted by date of deletion, author of deletion,
> namespace and alphabetical order. Does anybody else
> have trouble with this log?

Okay, the deafening silence of the comments hurt my
ears, so I'll try this another time.

Here's our deletion log
http://fr.wikipedia.org/wiki/Wikip%E9dia:Deletion_log

Maybe 100 pages were deleted in the past four
days.

I think (though we of course all trust one another)
an *error* is always possible.

Especially since the "votes for deletion" page is not
used at all, any sysop basically deletes those pages
he wants to delete, whenever he decides to (yup, me
too; when in Rome, do as the Romans do).

We assume *peer pressure* when creating/editing
articles, peers being given the same tools as the
editors.

Right now, we have *very* few tools for *checking*
our fellow sysops' deletions (you know, *sample
checking*...)

It takes about 15 seconds to delete an article, and
about 5 minutes to find it (when we find it at all)
in the page for deleted articles.

Look at that huge deletion log!

Ok, if nobody wants to help me on that, could someone
just tell me... a sort of SQL query... which would
allow me to get a list of the latest deletions, with
direct access to the article in the bin (something
like the deletion log, but with access to the deleted
articles)?


--------

Or imagine the worst: a sysop goes mad and starts
deleting articles very unwisely. It's going to be a
mess to recover everything quickly...
Re: deletion management
On Mon, 24 Feb 2003, Anthere wrote:
> Ok, if nobody wants to help me on that, could someone
> just tell me... a sort of SQL query... which would
> allow me to get a list of the latest deletions, with
> direct access to the article in the bin (something
> like the deletion log, but with access to the deleted
> articles)?

The archive table (which holds deleted pages) does not keep track of when
articles were deleted. Currently the only way to see when a page was
deleted is to look in the deletion log, which provides no
connection to the undelete system.

Yeah, it sucks.

What I've suggested before is to have a deletion log _table_. If I have
time I'll try to implement it, but it'd be lovely if someone else got to
it, because I'm stretched a little thin right now. (Or perhaps a general
'event log' which also keeps track of bans, protections, unbans,
unprotections, creation of user accounts, sysopization, etc., from which
we can extract just the deletions for purposes of showing a deletion log.)

This could then be easily sorted by timestamp or by deleting user, and for
sysops a link to the undelete could be made instantly available.
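
A rough sketch of the kind of table I mean (hypothetical; nothing
like it exists in the schema yet):

  CREATE TABLE del_log (
    dl_timestamp CHAR(14) NOT NULL,    -- when it was deleted
    dl_user INT UNSIGNED NOT NULL,     -- user id of the deleting sysop
    dl_namespace TINYINT NOT NULL,
    dl_title VARCHAR(255) NOT NULL,
    dl_comment TINYBLOB,               -- the reason given
    INDEX (dl_timestamp),
    INDEX (dl_user, dl_timestamp)
  ) TYPE=MyISAM;

Sorting by date or by deleting sysop is then just a matter of which
index you order by, and the namespace/title pair gives us the link
into the undelete system.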

-- brion vibber (brion @ pobox.com)
Re: Thinking about Phase IV
On Fri, Feb 21, 2003 at 05:31:51PM -0600, Lee Daniel Crocker wrote:
> Other suggestions/questions/answers humbly requested
> (including "Are you nuts? Let's stick with Phase III!" if
> you have that opinion).

My suggestion:
* first, add all the most important features we need (SVG,
more rendering engines, XHTML, PS/PDF output, what else?)
and move everything to UTF-8 and to some nice skin like
Cologne Blue without link underlining, with a minimal
amount of changes
* then, make Phase IV

If we do step 2 before step 1, we will have to redesign
it into Phase V soon.