Mailing List Archive

Spam Filtering Contest - Challenge
I propose a spam filtering contest: who can most accurately identify
spam and ham?

Here's how it would work. The "judges" will build a corpus of, say, 25000
spam and 25000 ham messages. These are messages that everyone would
agree are clearly spam or clearly ham. Best if they were taken from
real email and had a wide variety of context - including adult
conversation - mortgage companies - attached web pages with ads - etc.
Contestants cannot donate any of the spam/ham to the corpus.

So - you mix up the messages and the contestants download the file -
make the run - and upload the marked-up results.

Scoring would be done based on how many were flagged correctly. Extra
points for autolearn ham/spam that are correct.
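
To make the scoring rule concrete, here is a rough sketch in Python. It is only one possible reading of the proposal: the function name `score_run`, the message ids, the one-point-per-correct-flag rule, and the 0.5 bonus for a correct autolearn are all assumptions made for illustration, not something specified above.

```python
def score_run(truth, verdicts, autolearned=frozenset(), bonus=0.5):
    """Score one contestant's run.

    truth:       dict of message id -> "spam" or "ham" (the judges' labels)
    verdicts:    dict of message id -> "spam" or "ham" (the contestant's flags)
    autolearned: ids the contestant's filter chose to autolearn from
    bonus:       assumed extra credit for a correct autolearned message
    """
    points = 0.0
    for msg_id, label in truth.items():
        # A missing verdict counts the same as a wrong one.
        if verdicts.get(msg_id) == label:
            points += 1.0
            if msg_id in autolearned:
                points += bonus
    return points


# Hypothetical four-message corpus; "m4" was never flagged by the contestant.
truth = {"m1": "spam", "m2": "ham", "m3": "spam", "m4": "ham"}
verdicts = {"m1": "spam", "m2": "ham", "m3": "ham"}
print(score_run(truth, verdicts, autolearned={"m1", "m3"}))  # 2.5
```

Note that this sketch gives no penalty for autolearning a message with the wrong label, and treats a ham flagged as spam the same as a spam flagged as ham. A real contest would probably want to weigh those differently.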

Who likes this idea?
Re: Spam Filtering Contest - Challenge
Marc Perkel wrote:
> I propose a spam filtering contest. Who can most
> accurately identify spam and ham.

If you were to do it, there has to be some way of getting results that
are not based on people tweaking (or learning) on the test corpus. Some
way to be sure that contestants prepare their software without access to
the test corpus, and then run it through once for the contest. Otherwise
the winner would be overtrained for the contest, not the best overall
spam filter.

But that means you would have to make available a training corpus that
you can be sure is representative of the contest corpus, so that
contestants have something to train and test with.
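
One way to get a training corpus that is representative of the contest corpus is to collect a single pool of labeled mail and split it at random, keeping the spam/ham ratio the same on both sides. A minimal sketch in Python follows; the name `split_corpus`, the 50/50 split, and the `(id, label)` message format are assumptions for illustration only.

```python
import random


def split_corpus(messages, held_out_fraction=0.5, seed=0):
    """Split labeled messages into (training, contest) sets.

    messages: list of (msg_id, label) pairs, label being "spam" or "ham".
    Each label is split separately so both sets keep the same spam/ham mix.
    """
    rng = random.Random(seed)  # the judges would keep the seed private
    training, contest = [], []
    for label in sorted({lab for _, lab in messages}):
        group = [m for m in messages if m[1] == label]
        rng.shuffle(group)
        cut = int(len(group) * held_out_fraction)
        contest.extend(group[:cut])   # held back until contest day
        training.extend(group[cut:])  # published for contestants
    # Shuffle again so messages are not grouped by label in the output.
    rng.shuffle(training)
    rng.shuffle(contest)
    return training, contest


# Hypothetical pool of 10 spam and 10 ham messages.
pool = [(f"s{i}", "spam") for i in range(10)] + [(f"h{i}", "ham") for i in range(10)]
training, contest = split_corpus(pool)
print(len(training), len(contest))
```

This only helps if the contest half stays unpublished until each contestant's single run, which is the hard part described above.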

Tricky problem.

-- sidney