After getting back into wikiland catching up with wikipedia-l
was pretty easy, but catching up with the wikitech list took a
little longer. It seems you guys have had interesting times
lately (in the Chinese curse sense). Sorry I abandoned you,
but you guys do seem to have risen to the challenge.
Magnus did a great service by giving us code with features
that made Wikipedia usable and popular. When that code bogged
down to the point where the wiki became nearly unusable, there
wasn't much time to sit down and properly architect and develop
a solution, so I just reorganized the existing architecture for
better performance and hacked all the code. This got us over
the immediate crisis, but now my code is bogging down, and we
are having to remove useful features to keep performance up.
I think it's time for Phase IV. We need to sit down and design
an architecture that will allow us to grow without constantly
putting out fires, and that can become a stable base for a fast,
reliable Wikipedia in years to come. I'm now available and
equipped to help in this, but I thought I'd start out by asking
a few questions here and making a few suggestions.
* Question 1: How much time do we have?
Can we estimate how long we'll be able to limp along with
the current code, adding performance hacks and hardware to
keep us going? If it's a year, that will give us certain
opportunities and guide some choices; if it's only a month
or two, that will constrain a lot of those choices.
* Suggestion 1: The test suite.
I think the most critical piece of code to develop right now
is a comprehensive test suite. This will enable lots of
things. For example, if we have a performance question, I
can set up one set of wiki code on my test server, run the
suite to get timing data, tweak the code, then run the suite
again to get new timing. The success of the suite will tell
us if anything broke, and timing will tell us if we're on
the right track. This will be useful even during the
limp-along-with-the-current-code phase. I have a three-machine
network at home, with one machine I plan to dedicate 100% to
wiki code testing, and my test server in San Antonio that we
can use. This will also allow us to safely refactor code.
I'd like to use something like Latka for the suite (see
http://jakarta.apache.org/commons/latka/index.html ).
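Latka drives HTTP tests declaratively from XML, but the core idea of the suite can be sketched in a few lines: run a list of named checks against a page-fetching function, and record pass/fail plus wall-clock time for each. Everything here (the function names, the check format) is illustrative, not a design.

```python
import time

def run_suite(fetch, checks):
    """fetch(path) -> page text; checks is a list of
    (name, path, expected-substring) tuples.  Returns
    (name, passed, seconds) for each check."""
    results = []
    for name, path, expected in checks:
        start = time.perf_counter()
        try:
            ok = expected in fetch(path)
        except Exception:
            ok = False           # a fetch error counts as a failure
        results.append((name, ok, time.perf_counter() - start))
    return results

# Stand-in fetcher for illustration; a real run would do HTTP GETs
# against the test server and compare timings between code versions.
def fake_fetch(path):
    return "<html><title>%s</title></html>" % path

report = run_suite(fake_fetch,
                   [("main page", "/wiki/Main_Page", "Main_Page")])
```

Running the same checks before and after a tweak gives both the regression signal (did anything break?) and the timing delta in one pass.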
* Question 2: How wedded are we to the current tools?
Apache/MySQL/PHP seems a good combo, and it probably would
be possible to scale them up further, but there certainly
are other options. Also, are we willing to take chances on
semi-production-quality versions like Apache 2.X and MySQL 4.X?
I'd even like to revisit the decision of using a database
at all. After all, a good file system like ReiserFS (or to
a lesser extent, ext3) is itself a pretty well-optimized
database for storing pieces of free-form text, and there are
good tools available for text indexing, etc. Plus it's
easier to maintain and port.
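To make the no-database idea concrete, here is a minimal sketch of articles stored as one file per title, fanned out by first letter so no single directory grows huge (on ReiserFS even the fan-out would be unnecessary). All names are made up for illustration; a real store would also need revision history, locking, and better title escaping.

```python
import os
import tempfile

def article_path(root, title):
    safe = title.replace("/", "%2F")   # keep path separators out of titles
    return os.path.join(root, safe[0].upper(), safe)

def save_article(root, title, text):
    path = article_path(root, title)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text)

def load_article(root, title):
    with open(article_path(root, title)) as f:
        return f.read()

root = tempfile.mkdtemp()
save_article(root, "Test_Article", "Some wikitext.")
text = load_article(root, "Test_Article")
```

The filesystem then handles caching, journaling, and backup with ordinary tools, which is the maintenance and portability win.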
* Suggestion 2: Use the current code for testing features.
In re-architecting the codebase, we will almost certainly
come to points where we think a minor feature change will
make a big performance difference that won't hurt usability,
or just features that we want to implement anyway. For
example, we could probably make it easier to cache page
requests if we made most of the article content HTML not
dependent on skin by tagging elements well and using CSS
appropriately. Also, we probably want to eventually render
valid XHTML. I propose that while we are building the
phase IV code, we add little features like this to the
existing code to gauge things like user reactions and
visual impact.
Other suggestions/questions/answers humbly requested
(including "Are you nuts? Let's stick with Phase III!" if
you have that opinion).
--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC