Mailing List Archive: Spam Filtering Contest

I propose a spam filtering contest: who can most accurately identify
spam and ham?

Here is how it would work. The "judges" will build a corpus of, say,
25000 spam and 25000 ham messages. These are messages that everyone
would agree are clearly spam or clearly ham. It would be best if they
were taken from real email and covered a wide variety of contexts -
including adult conversation - mortgage companies - attached web pages
with ads - etc. Contestants cannot donate any of the spam/ham to the
corpus.

So - you mix up the messages and the contestants download the file -
make the run - and upload the marked-up results.
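
To make the mixing step concrete, here is a rough sketch of how the
judges might build the download file. Everything in it is my own
assumption, not a settled format: the function name, the "msgNNNNN"
ids, and the idea of keeping a separate answer key that only the
judges see.

```python
import random


def build_contest_file(spam, ham, seed=None):
    """Mix spam and ham into one unlabeled list (hypothetical helper).

    Returns (messages, answer_key):
      messages   -- list of (message_id, text) in shuffled order
      answer_key -- dict of message_id -> "spam" or "ham", judges only
    """
    labeled = [(text, "spam") for text in spam]
    labeled += [(text, "ham") for text in ham]

    # A seeded generator lets the judges reproduce the same ordering.
    rng = random.Random(seed)
    rng.shuffle(labeled)

    messages = []
    answer_key = {}
    for index, (text, label) in enumerate(labeled):
        # Ids are assigned after shuffling so they reveal nothing.
        message_id = f"msg{index:05d}"
        messages.append((message_id, text))
        answer_key[message_id] = label
    return messages, answer_key
```

Contestants would get only the messages list. The answer key stays
with the judges until scoring.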

Scoring would be based on how many messages were flagged correctly,
with extra points for autolearned ham/spam that are correct.
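
A minimal sketch of that scoring, just to pin down what I mean. The
details are assumptions on my part: one point per correctly flagged
message, a bonus of one extra point when a correct message was also
autolearned, and a results format of message id -> (label, autolearned
flag). The real weights would be up to the judges.

```python
def score_results(answer_key, results, autolearn_bonus=1):
    """Score one contestant's upload (hypothetical scoring rule).

    answer_key -- dict of message_id -> "spam" or "ham" (the truth)
    results    -- dict of message_id -> (label, autolearned), where
                  autolearned is True if the filter autolearned it
    """
    score = 0
    for message_id, truth in answer_key.items():
        if message_id not in results:
            continue  # unmarked messages earn nothing
        label, autolearned = results[message_id]
        if label == truth:
            score += 1  # correctly flagged
            if autolearned:
                score += autolearn_bonus  # correct and autolearned
    return score
```

Under this rule a wrong autolearn earns nothing. Whether it should
cost points instead is an open question.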

Who likes this idea?