Re: [Wikipedia-l] Feature request
(Wikitech-l: this is more on automatic subject classification, which Axel
brought up recently on Wikipedia-l.)

On Mon, 23 Sep 2002, Axel Boldt wrote:

[snip excellent comments that I agree with]

> I still believe that all of this can and should be
> done automatically, by tracing link paths from the
> main page.

I'm going to repeat some of what you've said earlier, adding my own
perspective. I really hope some programmers pursue this--they needn't ask
anyone's permission. The proof's in the pudding.

If automatic categorization could be done, and it sounds very plausible to
me, it would be *far* superior to a hand-maintained list of subject area
links. And incredibly useful, too.

OK, the following will reiterate some of the earlier discussion.

Presumably, nearly every page on Wikipedia can be reached from nearly
every other page. (There are orphans; and there are pages that do not
link to any other pages, though other pages link to them.)

This suggests that we can basically assign a number--using *some*
algorithm (not necessarily any one in particular: here is where
programmers can be creative)--giving the "closeness" of a page to all the
listed subjects. (This is very much like the Kevin Bacon game, of course,
and the "six degrees of separation" phenomenon.)

The question whether a *useful* algorithm can be stated is interesting
from a theoretical point of view. As I understand it, the suggestion is
that there is a simple and reliable (but how reliable?) algorithm, such
that, given simply a list of all the links in Wikipedia (viz., the source
page and destination page), and a list of subject categories, we can
reliably sort all pages into their proper categories.
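Just to make the input concrete, here is a rough Python sketch (nothing
official, and the link data is made up) of turning that list of (source,
destination) pairs into an adjacency list, alongside the list of subject
pages:

    # Sketch only: build an adjacency list (page -> pages it links to)
    # from (source, destination) link pairs, plus the subject pages we
    # want to classify against.  Toy data, not the real link table.
    from collections import defaultdict

    links = [
        ("Philosophy", "Plato"),
        ("Plato", "Plato's allegory of the cave"),
        ("Plato's allegory of the cave", "Socrates"),
        ("Mathematics", "Number theory"),
        ("Number theory", "Prime number"),
    ]
    subjects = ["Philosophy", "Mathematics"]

    graph = defaultdict(list)
    for source, destination in links:
        graph[source].append(destination)

    # graph["Plato"] == ["Plato's allegory of the cave"], and so on.

Everything the algorithm needs is in those two variables.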

It will not do to say, "There are obvious counterexamples, so let's not
even try." We can live with some slop. This is Wikipedia! We could even
fix errors by hand (ad hoc corrections are possible; why not?). As far as
I'm concerned, the real question is, once we try *various* algorithms,
what's the highest reliability we can actually generate? I'll bet it'll
be reasonably high, certainly high enough to be quite useful.

Here's an attempt at expressing an algorithm:

For a given page P (e.g., [[Plato's allegory of the cave]]), if the
average number of clicks (not backtracking to any page already reached--
otherwise you deal with infinite regresses) needed to reach P from the
subject page S (e.g., [[Philosophy]]) through all possible link paths
between S and P (or, perhaps, only paths below a certain benchmark
length?) is lower than the average number of clicks needed to reach P
from any other subject page, then P is "about" S.

The algorithm could be augmented in useful ways. In case of ties, or near
ties, a page could be listed under multiple subjects. I have no idea
if this algorithm is correct, but that doesn't matter--it's just an
example. If you think harder and longer, I'm sure you'll think of a
better one.
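The multiple-subjects wrinkle is easy to bolt on. A rough sketch,
assuming we've already computed the closeness scores for a page by
whatever means (the 0.5-click tolerance is an arbitrary knob, not a
recommendation):

    # Sketch: given one page's closeness scores (subject -> average
    # clicks, or None if unreachable from that subject page), list the
    # page under every subject within the tolerance of the best score,
    # so ties and near ties yield multiple subjects.
    def assign_subjects(scores, tolerance=0.5):
        reachable = {s: d for s, d in scores.items() if d is not None}
        if not reachable:
            return []    # unreachable from every subject page
        best = min(reachable.values())
        return sorted(s for s, d in reachable.items()
                      if d <= best + tolerance)

    # assign_subjects({"Philosophy": 2.0, "Religion": 2.3,
    #                  "Mathematics": 7.5})
    # -> ["Philosophy", "Religion"]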

This would be fascinating, I'm sure, for the programmers. Can't we just
take the question of how long the processing will take as a constraint
on the algorithm rather than as a knock-down argument that it's not
feasible? The *exercise* is to find (and implement!) an algorithm that
*is* feasible. We don't even have to do this on Wikipedia's server, if
it would be too great a load; anyone could download the tarball and
process it. You could run a cron job once a day, compile the 40-odd
"subject numbers" for each article in Wikipedia, and sort articles into
subject groups (in some cases, multiple subject groups for a given
article--why not?). From there we could use scripts already written to
create the many "recent changes" pages.
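To sketch what that nightly job might look like: path enumeration would
be far too slow over the whole database, but a single breadth-first
search out from each of the 40-odd subject pages gives every article a
shortest-click distance from every subject, in one pass per subject.
That's a cheap stand-in for the average-clicks measure above, and
everything here (the file names especially) is just my guess at one
feasible shape, not a spec:

    # Sketch of the once-a-day batch job: one BFS per subject page
    # gives every article its click distance from that subject; then
    # each article is grouped under its nearest subject(s), and the
    # groups are dumped to files that a "recent changes by subject"
    # script could filter against.  File names are made up.
    from collections import deque

    def distances_from(graph, start):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        return dist

    def group_by_subject(graph, subjects, tolerance=0):
        per_subject = {s: distances_from(graph, s) for s in subjects}
        pages = set(graph) | {d for dsts in graph.values() for d in dsts}
        groups = {s: [] for s in subjects}
        for page in pages:
            scores = {s: per_subject[s].get(page) for s in subjects}
            reachable = {s: d for s, d in scores.items() if d is not None}
            if not reachable:
                continue    # an orphan, as far as subjects go
            best = min(reachable.values())
            for s, d in reachable.items():
                if d <= best + tolerance:
                    groups[s].append(page)
        return groups

    if __name__ == "__main__":
        graph = {
            "Philosophy": ["Plato", "Logic"],
            "Plato": ["Plato's allegory of the cave"],
            "Mathematics": ["Logic", "Number theory"],
        }
        groups = group_by_subject(graph, ["Philosophy", "Mathematics"])
        for subject, pages in groups.items():
            with open(subject + "_articles.txt", "w") as out:
                out.write("\n".join(sorted(pages)) + "\n")

Note that "Logic" lands under both subjects there, which is exactly the
multiple-subjects behavior we want.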

I really, really, really want to see [[Philosophy Recent Changes]]. We
desperately need pages like that, and this is one of the best possible
ways we have of getting them. It's worth actually exploring.

--Larry