Mailing List Archive: Automatic subject classification

axel at uni-paderborn

Sep 9, 2002, 11:34 AM

Post #1 of 11 (1774 views)

I think that a subject classification of articles would vastly improve
"soft security" and would save regulars a lot of time, since not
everyone would have to check every edit as currently seems to be the
case.

>I'd still like to see if we couldn't build those subjects
>automatically in some way based on links in the database.

How about this: the possible topics coincide with the major pages
listed on [[Main Page]] (from "Astronomy" to "Visual Arts"). The
shortest link path from such a topic page to an article defines that
article's topic. If there is no such path, then the article is
classified as a topic orphan.
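
A minimal sketch of this rule, assuming the link graph is available as a plain dict from page title to linked titles (a hypothetical stand-in for the real link data). It runs one breadth-first search that starts from all topic pages at once; the function name and the tie-breaking rule (the topic page listed first wins) are assumptions, not part of the proposal.

```python
from collections import deque

def classify_by_topic(links, topic_pages):
    """Multi-source breadth-first search from the topic pages.

    links: dict mapping a page title to the titles it links to (assumed format).
    Returns {page: (topic, distance)}; pages never reached are topic orphans.
    """
    result = {t: (t, 0) for t in topic_pages}
    queue = deque(topic_pages)
    while queue:
        page = queue.popleft()
        topic, dist = result[page]
        for target in links.get(page, ()):
            if target not in result:  # the first visit is along a shortest path
                result[target] = (topic, dist + 1)
                queue.append(target)
    return result

# Tiny made-up graph for illustration only.
links = {
    "Astronomy": ["Celestial sphere"],
    "Statistics": ["Analysis of variance"],
    "Celestial sphere": ["Analysis of variance"],
    "Stray page": ["Celestial sphere"],
}
topics = classify_by_topic(links, ["Astronomy", "Statistics"])
orphans = set(links) - set(topics)  # pages with no path from any topic page
```

Here "Stray page" links out but nothing links to it, so it ends up as a topic orphan.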

To compute these topics quickly, the cur table gets two new columns:
topic and distance, where distance stands for the link distance from the
Main Page topic page. If a new article is created, looking at the
distance entries of all articles that link to the new one, and taking
the minimum, immediately classifies the new one. If an existing
article is saved, the topic and distance entries of all articles it
links to (and their children) may need to be updated; these changes
can be propagated in a recursive manner.
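
The new-article case could look roughly like this. The dict `cur` is a hypothetical in-memory stand-in for the two proposed columns, and adding one to the minimum distance is my reading of "taking the minimum", since the new article sits one link further out than its closest referrer.

```python
def classify_new_article(cur, incoming):
    """Pick topic and distance for a new article from the pages linking to it.

    cur: dict {title: (topic, distance)} standing in for the two proposed
         columns of the cur table (hypothetical layout).
    incoming: titles of pages that link to the new article.
    Returns (topic, distance), or (None, None) for a topic orphan.
    """
    classified = [cur[p] for p in incoming if p in cur and cur[p][0] is not None]
    if not classified:
        return (None, None)
    topic, dist = min(classified, key=lambda entry: entry[1])
    return (topic, dist + 1)

# Illustrative entries; the distances are made up.
cur = {
    "Sport": ("Sport", 0),
    "American football": ("Sport", 1),
    "Indiana": ("Geography", 3),
}
cur["Indianapolis Colts"] = classify_new_article(cur, ["American football", "Indiana"])
```

Saving an existing article is the harder case, because a changed distance has to be pushed on to everything downstream.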

Would that work?

Axel

Re: Automatic subject classification

saintonge at telus

Sep 9, 2002, 11:39 AM

Post #2 of 11 (1770 views)

Axel Boldt wrote:

>I think that a subject classification of articles would vastly improve
>"soft security" and would save regulars a lot of time, since not
>everyone would have to check every edit as currently seems to be the
>case.
>
>>I'd still like to see if we couldn't build those subjects
>>automatically in some way based on links in the database.
>>
>How about this: the possible topics coincide with the major pages
>listed on [[Main Page]] (from "Astronomy" to "Visual Arts"). The
>shortest link path from such a topic page to an article defines that
>article's topic. If there is no such path, then the article is
>classified as a topic orphan.
>
>To compute these topics quickly, the cur table gets two new columns:
>topic and distance, where distance stands for the link distance from the
>Main Page topic page. If a new article is created, looking at the
>distance entries of all articles that link to the new one, and taking
>the minimum, immediately classifies the new one. If an existing
>article is saved, the topic and distance entries of all articles it
>links to (and their children) may need to be updated; these changes
>can be propagated in a recursive manner.
>
>Would that work?
>
>Axel
>
Interesting! I had a very similar thought a couple months ago, and
never bothered to mention it. I guess that qualifies Axel for the
moral copyrights on the idea.

The orphan (like the main page) would simply have a distance value of 0.

The path for a page could appear at the top of the article. This
already happens in some places, and we manually do something similar in
our Tree of Life project. I hope nobody complains about polyphyletic
Wikipedia pages.

As for saving time for regulars, that may be a very limited benefit. We
all have certain subject areas where we tend to track things and tend to
ignore topics outside of that. The real time loss often doesn't appear
until we have looked at an article to see if a minor change really is minor.

Eclecticology

Re: Automatic subject classification

Sep 9, 2002, 11:58 AM

Post #3 of 11 (1764 views)

On Mon, Sep 09, 2002 at 08:34:24PM +0200, Axel Boldt wrote:
> I think that a subject classification of articles would vastly improve
> "soft security" and would save regulars a lot of time, since not
> everyone would have to check every edit as currently seems to be the
> case.
>
> >I'd still like to see if we couldn't build those subjects
> >automatically in some way based on links in the database.
>
> How about this: the possible topics coincide with the major pages
> listed on [[Main Page]] (from "Astronomy" to "Visual Arts"). The
> shortest link path from such a topic page to an article defines that
> article's topic. If there is no such path, then the article is
> classified as a topic orphan.
>
> To compute these topics quickly, the cur table gets two new columns:
> topic and distance, where distance stands for the link distance from the
> Main Page topic page. If a new article is created, looking at the
> distance entries of all articles that link to the new one, and taking
> the minimum, immediately classifies the new one. If an existing
> article is saved, the topic and distance entries of all articles it
> links to (and their children) may need to be updated; these changes
> can be propagated in a recursive manner.
>
> Would that work?

1.
No, it wouldn't.

Both deletion and creation of links are hard problems:

Main page -> Biology -> A1
-> Chemistry -> A2 -> A3 -> A4 -> Target
A5 -> Target

Now what happens when we add a link from A1 to A5?
There are a lot of links from A5 to other articles, so
recursion here would mean recalculating a major part of the topology.

Deleting any link on the current shortest path (if we store it
somewhere) requires recalculating the whole topology too.

2.
But it would be possible to create initial classification that way.
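
The link-addition case from point 1 can be sketched as follows. This is only an illustration on the small example graph, with distances counted in hops from the Main page; the function name and data layout are assumptions. It handles additions only. A deleted link can lengthen paths, which this kind of relaxation cannot repair.

```python
from collections import deque

def add_link(links, dist, source, target):
    """Add source -> target and relax distances downstream.

    links: {page: set of pages it links to}; dist: {page: hops from Main page}.
    Returns the set of pages whose distance had to be rewritten.
    """
    links.setdefault(source, set()).add(target)
    changed = set()
    queue = deque([source])
    while queue:
        page = queue.popleft()
        if page not in dist:  # unreachable pages cannot improve anything
            continue
        for nxt in links.get(page, ()):
            if dist[page] + 1 < dist.get(nxt, float("inf")):
                dist[nxt] = dist[page] + 1
                changed.add(nxt)
                queue.append(nxt)
    return changed

links = {
    "Main page": {"Biology", "Chemistry"},
    "Biology": {"A1"},
    "Chemistry": {"A2"},
    "A2": {"A3"}, "A3": {"A4"}, "A4": {"Target"},
    "A5": {"Target"},
}
dist = {"Main page": 0, "Biology": 1, "Chemistry": 1, "A1": 2,
        "A2": 2, "A3": 3, "A4": 4, "Target": 5}
changed = add_link(links, dist, "A1", "A5")
```

In this toy graph only A5 and Target are rewritten, but every page reachable from A5 is a candidate, which is the cost being objected to.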

Re: Automatic subject classification

Magnus.Manske at t-online

Sep 9, 2002, 12:10 PM

Post #4 of 11 (1770 views)

I don't think it would be a good idea to hardwire article *subject
categories* at all. We had that discussion some time ago, as Lee said. I
was talking about *article types*.

Re: Automatic subject classification

Sep 9, 2002, 3:30 PM

Post #5 of 11 (1773 views)

On Mon, Sep 09, 2002 at 09:10:22PM +0200, Magnus Manske wrote:
> I don't think it would be a good idea to hardwire article *subject
> categories* at all.

What do you mean by "hardwiring"? I don't think anyone suggested a fixed
list of subject categories.

-- Jan Hidders

Re: Automatic subject classification

axel at uni-paderborn

Sep 9, 2002, 8:15 PM

Post #6 of 11 (1782 views)

>> Would that work?

>No, it wouldn't.

You're probably right; too much computation for every article save
operation. On the other hand, it is not of utmost importance that
every article always carries the precisely correct and current topic
classification; rough approximations are good enough. New pages can
then still be assigned topics quickly. I doubt that topic assignments
of a given article would change very often anyway. And every couple of
weeks or so, we can recompute every article's topic from scratch in a
breadth first manner.

Would that work?

>As for saving time for regulars, that may be a very limited benefit.
>We all have certain subject areas where we tend to track things and
>tend to ignore topics outside of that.

Yup, but I would like to see a list of the 10-20 daily edits in my
subject area without having to wade through 2000+ RecentChanges
entries.
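
Given any topic assignment, the filtered list is a one-liner. A sketch, assuming RecentChanges entries are available as (title, editor) tuples and topics as a dict; both shapes and the sample data are hypothetical.

```python
def changes_in_topic(recent_changes, topic_of, wanted_topic):
    """Keep only edits to articles classified under wanted_topic.

    recent_changes: list of (title, editor) tuples (assumed shape).
    topic_of: {title: topic} as produced by some classification run.
    """
    return [(title, editor) for title, editor in recent_changes
            if topic_of.get(title) == wanted_topic]

# Made-up sample data.
recent_changes = [("Celestial sphere", "A"), ("Nursing", "B"), ("Mars", "C")]
topic_of = {"Celestial sphere": "Astronomy", "Nursing": "Health science",
            "Mars": "Astronomy"}
mine = changes_in_topic(recent_changes, topic_of, "Astronomy")
```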

Axel

Re: Automatic subject classification

ImranG at btinternet

Sep 10, 2002, 2:52 AM

Post #7 of 11 (1770 views)

On 9 Sep 2002, at 20:34, Axel Boldt wrote:

> I think that a subject classification of articles would vastly improve
> "soft security" and would save regulars a lot of time, since not
> everyone would have to check every edit as currently seems to be the
> case.

Maybe a way around it would be to have a new level of op, say op1,
one which would be awarded to anyone who has had an account
for 30 days or so and hasn't been banned.

Whenever someone who isn't at op1 level makes an edit to a page,
an "edit check" counter appears and counts days. When anyone
with op1 status looks at this page, after checking it for vandalism
they could reset the counter back to zero.

That way we wouldn't get multiple people needlessly checking the
same page for vandalism, and we could ensure that every newbie
edit was checked for vandalism.
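
A rough sketch of the counter, assuming each page stores the date of its oldest unchecked newbie edit. The class, method names, and the exact 30-day threshold are illustrative assumptions.

```python
from datetime import date, timedelta

OP1_MIN_ACCOUNT_AGE = timedelta(days=30)  # "30 days or so" in the proposal

def is_op1(registered_on, banned, today):
    """An account qualifies once it is old enough and has not been banned."""
    return not banned and today - registered_on >= OP1_MIN_ACCOUNT_AGE

class Page:
    def __init__(self):
        self.unchecked_since = None  # date of the oldest unchecked newbie edit

    def record_edit(self, editor_is_op1, today):
        # Only edits by non-op1 users start the counter.
        if not editor_is_op1 and self.unchecked_since is None:
            self.unchecked_since = today

    def days_unchecked(self, today):
        if self.unchecked_since is None:
            return 0
        return (today - self.unchecked_since).days

    def mark_checked(self, checker_is_op1):
        # Only op1 users may reset the counter to zero.
        if checker_is_op1:
            self.unchecked_since = None

page = Page()
page.record_edit(editor_is_op1=False, today=date(2002, 9, 10))
waiting = page.days_unchecked(date(2002, 9, 13))
page.mark_checked(checker_is_op1=is_op1(date(2002, 1, 1), False, date(2002, 9, 13)))
```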

> How about this: the possible topics coincide with the major pages
> listed on [[Main Page]] (from "Astronomy" to "Visual Arts"). The
> shortest link path from such a topic page to an article defines that
> article's topic. If there is no such path, then the article is
> classified as a topic orphan.

An alternative idea:

For any page follow all the links from it down to about 3-4 levels,
and assume these are all on related topics. To make this more
accurate we could follow only two way links. Then strip out any
article which has more than say 50 double links as it's likely to be
the front page, or something similar unrelated to the topic.

Not only will this provide autoclassification but we could also use it
for finding pages that needed to be written on a specific topic by
automatically generating a list of unwritten articles related to a
topic.
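
A sketch of the two-way-link walk, assuming the link graph is a dict of sets. The depth of 3 and the limit of 50 come from the numbers above; the function name and the sample pages are made up.

```python
def related_pages(links, start, max_depth=3, hub_limit=50):
    """Collect pages within max_depth two-way links of start.

    links: {page: set of pages it links to}. A two-way link means both pages
    link to each other. Pages with more than hub_limit two-way links are
    treated as hubs (such as a front page) and skipped.
    """
    def two_way(page):
        return {q for q in links.get(page, ()) if page in links.get(q, ())}

    seen = {start}
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for page in frontier:
            for q in two_way(page):
                if q not in seen and len(two_way(q)) <= hub_limit:
                    seen.add(q)
                    next_frontier.append(q)
        frontier = next_frontier
    seen.discard(start)
    return seen

# Made-up sample: "1969" is linked one way only, so it is ignored.
links = {
    "Nursing": {"Health science", "1969"},
    "Health science": {"Nursing", "Medicine"},
    "Medicine": {"Health science"},
    "1969": set(),
}
related = related_pages(links, "Nursing")
```

Link targets that turn up in this walk but have no article yet would form the list of unwritten articles for the topic.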

Imran

Re: Automatic subject classification

saintonge at telus

Sep 10, 2002, 8:05 AM

Post #8 of 11 (1771 views)

Imran Ghory wrote:

>>How about this: the possible topics coincide with the major pages
>>listed on [[Main Page]] (from "Astronomy" to "Visual Arts"). The
>>shortest link path from such a topic page to an article defines that
>>article's topic. If there is no such path, then the article is
>>classified as a topic orphan.
>>
>An alternative idea:
>
>For any page follow all the links from it down to about 3-4 levels,
>and assume these are all on related topics. To make this more
>accurate we could follow only two way links. Then strip out any
>article which has more than say 50 double links as it's likely to be
>the front page, or something similar unrelated to the topic.
>
I think that this would be more problematical than using "what links
here". The links on a page include ones to years and countries where
the discussion usually has nothing to do with our subject of interest.
The pages under "What links here" have a more specific reason to link to
our subject.

Eclecticology

Re: Automatic subject classification

saintonge at telus

Sep 10, 2002, 12:24 PM

Post #9 of 11 (1770 views)

> Axel Boldt wrote:
>
>> How about this: the possible topics coincide with the major pages
>> listed on [[Main Page]] (from "Astronomy" to "Visual Arts"). The
>> shortest link path from such a topic page to an article defines that
>> article's topic. If there is no such path, then the article is
>> classified as a topic orphan.
>
I looked a little more into this, manually tracing the path of 10
randomly chosen articles. I don't know what it does to the automatic
path tracing idea but it did lead to a number of observations.

First the data:

1. Abu Zubaydah <- Ibn al-Shaykh al-Libi <- Abu Zubaydah
(Circular orphan) nothing else leads to these two
(Score 0)

2. Analysis of variance <- Statistics <- Main Page
(Score 2)

3. Indianapolis Colts <- National Football League <- American football
<- Sport <- Main Page
(Score 4)
Indianapolis Colts <- Indiana <- United States (=United States of America)
<- List of Countries (=Countries of the World) <- Geography <- Main Page
(Score 5)
Indianapolis Colts <- 1969 <- 20th century <- Historical timeline|Centuries
<- Main Page
(Score 4)

4. Jerry Springer <- List of television programs <- List of reference
tables (=Reference tables) <- Main Page
(Score 3)

5. Heinrich Schliemann <- Archaeology <- Main Page
(Score 2)

6. Hitchhiking <- User: Branko <- Special Pages: Registered Users (=User
list) <- Main Page
(Score 3) Same via User: Rootbeer
Access is only through two user pages; it's an orphaned orphan!

7. Nursing <- Health science <- Main Page
(Score 2)

8. Vsevolod I, Prince of Kiev <- Kievan Rus' <- History of Russia
(=Russian history) <- History <- Main Page
(Score 4)
A score of 3 was possible through the page [[User: H. Jonat]]. (I
swear this was random; I didn't ask for THAT user to appear)

9. Morrisville <- Wikipedia:Links to disambiguating pages <- Wikipedia:
utilities <- Main Page
(Score 3)

10. Celestial sphere <- Astronomy and astrophysics <- Main Page
(Score 2)

Observations:
1. In the samples the longest minimum path to the Main Page was only 4
articles. Any article linked from a user page would be 3 steps away
from the Main Page, but this should not be considered a meaningful path.

2. Two kinds of effectively orphan pages became evident, but these would
never appear on the special page listing of orphans. In the first
example two pages link to each other but nothing else links to them. In
example 6 the only links to the article are on user pages. Who would
ever think to look there for a reference to an article?

3. [[List of countries]] and [[United States]] should probably be
linked from the Main Page. The number of paths through these is enormous.

4. Many of the links to [[United States]] are excessive. Many of the
uses are in passing where more information about the United States is
unlikely to be needed. I think we can always assume a very basic level
of understanding about what is meant by "United States". What would
surprise me most about those who don't have that very basic level of
understanding is how they managed to find Wikipedia in the first place.
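
The two kinds of effective orphan in observation 2 could be found mechanically along these lines. This is a sketch under assumptions: the link graph is a dict of lists, user pages are recognisable by a "User:" title prefix, and the sample graph only mirrors examples 1, 2 and 6 above.

```python
from collections import deque

def effective_orphans(links, root="Main Page", skip_prefixes=("User:",)):
    """Pages that have incoming links but cannot be reached from root.

    links: {page: list of pages it links to}. Pages whose title starts with
    one of skip_prefixes are not walked through (user pages by default).
    """
    reachable = {root}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target in reachable or target.startswith(skip_prefixes):
                continue
            reachable.add(target)
            queue.append(target)
    linked_to = {t for targets in links.values() for t in targets}
    return {p for p in linked_to
            if p not in reachable and not p.startswith(skip_prefixes)}

links = {
    "Main Page": ["Statistics", "User list"],
    "Statistics": ["Analysis of variance"],
    "User list": ["User: Branko"],
    "User: Branko": ["Hitchhiking"],
    "Abu Zubaydah": ["Ibn al-Shaykh al-Libi"],
    "Ibn al-Shaykh al-Libi": ["Abu Zubaydah"],
}
found = effective_orphans(links)
```

Both the circular pair and the article linked only from a user page show up, although each of them has an incoming link.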

Eclecticology

Re: Automatic subject classification

Sep 10, 2002, 2:25 PM

Post #10 of 11 (1782 views)

On Tue, Sep 10, 2002 at 12:24:27PM -0700, Ray Saintonge wrote:
> I looked a little more into this, manually tracing the path of 10
> randomly chosen articles. I don't know what it does to the automatic
> path tracing idea but it did lead to a number of observations.
[...]
> Observations:
> 1. In the samples the longest minimum path to the Main Page was only 4
> articles. Any article linked from a user page would be 3 steps away
> from the Main Page, but this should not be considered a meaningful path.
>
> 2. Two kinds of effectively orphan pages became evident, but these would
> never appear on the special page listing of orphans. In the first
> example two pages link to each other but nothing else links to them. In
> example 6 the only links to the article are on user pages. Who would
> ever think to look there for a reference to an article?
>
> 3. [[List of countries]] and [[United States]] should probably be
> linked from the Main Page. The number of paths through these is enormous.
>
> 4. Many of the links to [[United States]] are excessive. Many of the
> uses are in passing where more information about the United States is
> unlikely to be needed. I think we can always assume a very basic level
> of understanding about what is meant by "United States". What would
> surprise me most about those who don't have that very basic level of
> understanding is how they managed to find Wikipedia in the first place.

I have done much more complex and almost-automatic topological
analysis (of Polish Wikipedia).

If you can read Polish or think that you can find out what's going on
by just looking at numbers and lists, check:
http://pl.wikipedia.com/wiki.cgi?Taw/Topologia_Wikipedii (stats are a couple days old)

Things that are done before computations:
* all empty, talk and user pages are removed
* all links to redirects are replaced by links to final articles,
and then redirects are removed
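
These cleanup steps might look something like the sketch below. It is not the actual script. The "Talk:" and "User:" prefixes, the data layout, and the sample redirect are assumptions for illustration.

```python
def clean_graph(links, redirects, empty_pages, skip_prefixes=("Talk:", "User:")):
    """Drop empty, talk and user pages, and point links past redirects.

    links: {page: list of targets}; redirects: {redirect title: target title};
    empty_pages: set of titles with no content. The prefixes are assumptions
    about how talk and user pages are named.
    """
    def resolve(title):
        seen = set()
        while title in redirects and title not in seen:  # guard against loops
            seen.add(title)
            title = redirects[title]
        return title

    def keep(title):
        return (title not in empty_pages and title not in redirects
                and not title.startswith(skip_prefixes))

    cleaned = {}
    for page, targets in links.items():
        if keep(page):
            resolved = {resolve(t) for t in targets}
            cleaned[page] = sorted(t for t in resolved if keep(t))
    return cleaned

# Made-up sample; the redirect direction is hypothetical.
links = {
    "Geography": ["Countries of the World", "Talk:Geography"],
    "Countries of the World": ["List of countries"],
    "List of countries": ["United States", "Stub"],
    "Talk:Geography": ["Geography"],
    "User:Example": ["Geography"],
    "Stub": [],
}
redirects = {"Countries of the World": "List of countries"}
cleaned = clean_graph(links, redirects, empty_pages={"Stub"})
```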

(About 1)
Stats for Polish Wikipedia:
* not accessible   227  ( 4.716393102%)
* main page          1  ( 0.02077706212%)
* 1 hop             78  ( 1.620610846%)
* 2 hops          1199  (24.91169749%)
* 3 hops          2492  (51.77643881%)
* 4 hops           614  (12.75711614%)
* 5 hops           175  ( 3.635985872%)
* 6 hops            22  ( 0.4570953667%)
* 7 hops             5  ( 0.1038853106%)
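
A table of this shape can be produced with one breadth-first search and a counter. The sketch below is an illustration on a made-up graph, not the script used for the numbers above; pages that are never reached are counted under the key None.

```python
from collections import Counter, deque

def hop_histogram(links, root):
    """Count pages by shortest link distance from root; None = not accessible."""
    pages = {root} | set(links) | {t for targets in links.values() for t in targets}
    dist = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    counts = Counter(dist.get(p) for p in pages)
    total = len(pages)
    # Map each hop count to (number of pages, percentage of all pages).
    return {hops: (n, 100.0 * n / total) for hops, n in counts.items()}

links = {"Main": ["A", "B"], "A": ["C"], "B": [], "C": [], "X": ["A"]}
histogram = hop_histogram(links, "Main")
```

For reference, the counts in the list above add up to 4813 pages, which is the base of the percentages.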

(About 2)
Much more interesting patterns can be found.
Don't forget about articles linked from talk pages and yearbook pages.

(About 3)
The most interesting thing computed was the "importance of links on the
main page". The algorithm is simple: the sum of distances from the main
page to each non-orphan node is computed, and a link is as valuable as the
amount by which it improves this number. Both links to be added and links
to be removed are computed. We now have a rather more useful main page.
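
The scoring for a link to be added could be sketched like this. It is a simplified illustration, not the actual script: it assumes the candidate page is already reachable from the main page, so the set of non-orphan nodes does not change, and all names are hypothetical. A link to be removed would be scored the same way with the link taken out, but how newly orphaned pages were counted is not stated.

```python
from collections import deque

def total_distance(links, root):
    """Sum of shortest link distances from root to every reachable page."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    return sum(dist.values())

def value_of_new_link(links, root, candidate):
    """How much the distance sum drops if root links directly to candidate."""
    before = total_distance(links, root)
    trial = dict(links)  # shallow copy; only the root entry is replaced
    trial[root] = list(links.get(root, ())) + [candidate]
    return before - total_distance(trial, root)

links = {"Main": ["A"], "A": ["B"], "B": ["C", "D"], "C": [], "D": []}
gain_b = value_of_new_link(links, "Main", "B")
gain_c = value_of_new_link(links, "Main", "C")
```

In this toy graph a direct link to B is worth more than one to C, because B also shortens the paths to C and D.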

(and ...)
If anybody wants the scripts, tell me,
but expect to do some work to adapt them to another Wikipedia.

Re: Automatic subject classification

axel at uni-paderborn

Sep 10, 2002, 3:48 PM

Post #11 of 11 (1771 views)

I would exclude "Reference tables" and "Historical timeline" from the
list of topic pages (only Astrophysics through Visual Arts), since too
many articles are linked through those and we wouldn't get a
meaningful classification. I would also exclude any path that goes
through any namespace but the main one.

Axel