Mailing List Archive

[Spamassassin Wiki] Update of "UploadedCorpora" by JustinMason
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/UploadedCorpora

The comment on the change is:
refactor shared stuff into new page

New page:
= Uploaded Corpora =

There are two ways to run mass-checks on HandClassifiedCorpora: check out the SpamAssassin SVN trunk on your own machine and run them locally, or upload your corpora to our mass-check server, as described in this document.

== Administrivia: how the corpus is laid out ==

The corpora rsynced up to the server are laid out on the filesystem like this:

{{{
/home/bbmass/rawcor/WHO/TYPE/FOLDER
}}}

"WHO" is the person who submitted it via rsync, e.g. "doc", "jm", "zmi".

Under that, we have "TYPE", which is either "ham" or "spam".

Under that is "FOLDER", which is whatever the submitter feels is appropriate. For example, some of us use date-stamped directories here. An mbox can also be used in place of a directory, as long as it is a regular file and its filename ends in ".mbox".

Note that only files which (a) are directly in the "ham" or "spam" directory, not a subdirectory, and (b) are named ending in ".mbox", will be treated as mboxes. Anything else will be considered a ''single email message''.
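
As a rough illustration of the rule above, the following shell sketch classifies a path given relative to the "WHO" directory. The function name `classify_corpus_entry` is hypothetical and is not part of the mass-check scripts; it only mirrors the rule as stated on this page.

```shell
# classify_corpus_entry REL_PATH
# REL_PATH is relative to the WHO directory, e.g. "ham/2006-01.mbox".
# Prints "mbox" or "message". (Hypothetical helper, for illustration only.)
classify_corpus_entry() {
    case "$1" in
        # In a case pattern "*" also matches "/", so the subdirectory
        # check has to come before the ".mbox" check.
        ham/*/*|spam/*/*)
            echo "message" ;;   # inside a subdirectory: always a single message
        ham/*.mbox|spam/*.mbox)
            echo "mbox" ;;      # directly under ham/ or spam/ and ends in .mbox
        ham/*|spam/*)
            echo "message" ;;   # any other file directly under ham/ or spam/
        *)
            echo "unknown"      # not under ham/ or spam/ at all
            return 1 ;;
    esac
}
```

For example, `classify_corpus_entry ham/2006-01.mbox` prints "mbox", while `classify_corpus_entry spam/2006-01/archive.mbox` prints "message" because the file sits in a subdirectory.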

== How to get your corpus up there ==

This is done via rsync.

Give somebody on the PMC a shout, since they have the privileges needed to
create an rsync area for you to upload to. The easiest way is to mail the dev
list. (If you're on the PMC, just SSH in and copy over a tarball yourself,
or create yourself an rsync account using a random password.)

Once they've done this, they'll send you the username and password;
you can then sync your files like so:

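{{{
export RSYNC_PASSWORD="$YOURPASS"
# The trailing slash on the source uploads the contents of the directory
# (ham/ and spam/) rather than the directory itself.
rsync -vr /path/to/your/files/ \
  "rsync://$YOURUSER@rsync.spamassassin.org/mailcorpus_$YOURUSER"
}}}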

(where $YOURPASS and $YOURUSER are whatever the PMC member mailed to
you.)

It's important that you have two directories in the {{{/path/to/your/files}}} directory,
{{{ham}}} and {{{spam}}}. Any files ending in {{{.mbox}}} directly inside those directories
will be treated as UNIX mbox-format files; any other files will be treated as
individual messages (one message per file).
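
Before uploading, it may be worth checking that the local directory has the expected shape. The sketch below does only that; `check_corpus_layout` is a hypothetical helper name, not something shipped with SpamAssassin.

```shell
# check_corpus_layout DIR
# Returns 0 if DIR contains both a "ham" and a "spam" directory,
# otherwise reports what is missing on stderr and returns 1.
# (Hypothetical helper, for illustration only.)
check_corpus_layout() {
    dir=$1
    rc=0
    for kind in ham spam; do
        if [ ! -d "$dir/$kind" ]; then
            echo "missing directory: $dir/$kind" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

It could be run as `check_corpus_layout /path/to/your/files` just before the rsync command, so that the upload is only attempted when the check succeeds.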

== Administrivia ==

Some stuff for PMC people hacking on this...

=== Admin: Creating a new rsync area for someone to upload corpora ===

{{{
sudo vi /etc/rsyncd.conf
}}}

Add something like this to the end, changing "CORPUSUSER" to the username you want to give out:

{{{
[mailcorpus_CORPUSUSER]
path = /home/bbmass/rawcor/CORPUSUSER
read only = false
auth users = CORPUSUSER
secrets file = /home/corpus-rsync/secrets
}}}

Then create the upload directory for the new user:

{{{
CORPUSUSER="[username you want to give out]"
cd /home/bbmass/rawcor/
mkdir "$CORPUSUSER"
chmod 1777 "$CORPUSUSER"
}}}

Then create a random password string, and add a line to {{{/home/corpus-rsync/secrets}}} with $CORPUSUSER and that password.
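
One possible way to do this step is sketched below. It assumes the standard rsync daemon secrets format of one `username:password` pair per line; the helper names `generate_password` and `add_rsync_secret`, and the choice of a 32-character hex password, are illustrative assumptions rather than anything this page prescribes.

```shell
# generate_password
# Prints 32 hex characters taken from 16 random bytes of /dev/urandom.
# (Hypothetical helper; any other source of random strings would do.)
generate_password() {
    od -An -N16 -tx1 /dev/urandom | tr -d ' \t\n'
}

# add_rsync_secret USER SECRETS_FILE
# Appends a "USER:PASSWORD" line to SECRETS_FILE and prints the password,
# so that it can be passed on to the submitter.
add_rsync_secret() {
    user=$1
    secrets=$2
    password=$(generate_password)
    printf '%s:%s\n' "$user" "$password" >> "$secrets"
    printf '%s\n' "$password"
}
```

On the server this might be invoked as `add_rsync_secret "$CORPUSUSER" /home/corpus-rsync/secrets`, run with whatever privileges are needed to write to that file.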

Finally, let the submitter know their new username and password.
[Spamassassin Wiki] Update of "UploadedCorpora" by JustinMason [ In reply to ]
The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/UploadedCorpora

The comment on the change is:
note about corpus privacy issue

------------------------------------------------------------------------------
will be treated as UNIX mbox-format files; any other files will be treated as
individual messages (one message per file).

+ === Privacy ===
+
+ Uploaded corpora are not considered public knowledge. The people with accounts on that machine should treat the uploaded messages responsibly, and respect the uploader's privacy. If you are concerned about the privacy of these messages, you may be advised to remove the more private mails before uploading, or mass-check on your own machine instead.
+
=== (Administrivia: Creating a new rsync area for someone to upload corpora) ===

Some stuff for PMC people hacking on this...