[Spamassassin Wiki] Update of "RulesProjMoreInput" by JustinMason

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/RulesProjMoreInput

New page:
== Rules Project: Encouraging Contributions ==

'''(part of RulesProjectPlan)'''

Problem description: 'The SpamAssassin committers are not spending much time writing rules. Attempts to recruit people to become committers to write rules have been somewhat unsuccessful. We could always use more committers and contributors; what can we do to encourage more contribution?'

The ideas are as follows:

== Sandboxes ==

DanielQuinlan proposed a new system of 'sandboxes' in SVN [http://news.gmane.org/gmane.mail.spam.spamassassin.devel (mail)]. Summary:

* easy to join
* no quality or experience requirement for the sandbox
* easy to get rules into main rule body
* can produce multiple "output rule sets" in the long run: conservative, aggressive, and sub-area sets (bounces, drug rules, etc.)
* uses SVN and therefore has version control

Some notes picked out of followup discussion:

Rules can be kept 'private' by leaving them in your own checkout only and not checking them into SVN.

To make rules visible for collaboration without having them used for automatic mass-checks or promotion, keep them in a file whose name doesn't end in ".cf". (SpamAssassin's standard is that files have to end in ".cf" to be considered valid rules files.)
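
As an illustration of the ".cf" convention, here is a minimal Python sketch of how a sandbox directory could be filtered so that only rules files are picked up. The function name and the example file names are hypothetical, and this is not how SpamAssassin or the mass-check scripts implement the check.

```python
from pathlib import Path


def active_rule_files(sandbox_dir):
    """Return the files in sandbox_dir whose names end in '.cf'.

    Hypothetical helper: anything else (for example 'ideas.cf.disabled'
    or 'notes.txt') stays visible in SVN but is skipped here.
    """
    return sorted(
        p for p in Path(sandbox_dir).iterdir()
        if p.is_file() and p.name.endswith(".cf")
    )
```

With this convention, renaming a file so that it no longer ends in ".cf" is enough to take its rules out of the automatic runs while keeping them in version control.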

== Mass-checking ==

LorenWilton noted 'A big part (perhaps the biggest part) of rules development
is the mass check. Most anyone can develop a rule on their home system and see
how they *think* it works. Some few (but not many) people can do a mass-check
on their home system and see how it *really* works - *for them*. As proposed,
this rules project doesn't address the most important part of a rules project -
some way to check the rules against a fairly large corpus.'

=== Nightly Mass-Checks ===

The existing NightlyMassCheck systems do this, but their turnaround
time is too slow for most rule developers.

Nightly mass-checking does, however, offer the following advantages:

* info on how a new rule compares to the full *existing* ruleset
* overlaps between rules, using "hit-frequencies -o"
* collated results across all users' corpora, which can be broken down to view each user separately or all together
* checked rules and their results are kept in a version-control history, so the benefits of version control are available
* ongoing visibility of hit-rates of the existing ruleset, against fresh corpora
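
To make the idea of rule overlap concrete, the following Python sketch computes pairwise overlap from per-rule hit sets. It assumes the hits are already available as a mapping from rule name to the set of message ids that rule matched; that data structure, the function name, and the rule names in the usage note are assumptions for illustration, and this is not the "hit-frequencies -o" implementation or its output format.

```python
from itertools import combinations


def rule_overlaps(hits):
    """Compute pairwise overlap between rules.

    hits: dict mapping rule name -> set of message ids the rule hit
          (an assumed structure, not the real mass-check log format).
    Returns {(a, b): (share of a's hits also hit by b,
                      share of b's hits also hit by a)}
    for every pair of rules that share at least one message.
    """
    overlaps = {}
    for a, b in combinations(sorted(hits), 2):
        shared = len(hits[a] & hits[b])
        if shared == 0:
            continue  # no common messages, nothing to report
        overlaps[(a, b)] = (shared / len(hits[a]), shared / len(hits[b]))
    return overlaps
```

For example, if a hypothetical RULE_B hits only messages that RULE_A also hits, the second share for that pair is 1.0, which suggests RULE_B adds little on its own.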

(TODO: migrate the ruleqa CGI onto the SpamAssassin zone so this is still visible, even though the automc stuff is disabled)

=== List-Driven Mass-Checks ===

Loren outlined the system used in SARE:

* a rule developer sends mail, with the rules attached, to the mailing list
* various other participants run scripts that automatically extract certain attachments posted to the list
* the scripts turn those attachments into rules files
* lint them
* run a mass-check immediately with just the rules in that file
* post the results
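
The extraction step above can be sketched in Python using the standard `email` module. This is only an illustration under stated assumptions: that rules arrive as attachments whose file names end in ".cf", and that a raw message is available as bytes. The function name is hypothetical, the SARE scripts themselves are not shown in this page, and the lint and mass-check steps are left out.

```python
from email import message_from_bytes


def extract_rule_attachments(raw_mail):
    """Return {filename: text} for every '.cf' attachment in a raw message.

    Assumption: rules are posted as attachments named '*.cf';
    all other parts of the mail are ignored.
    """
    msg = message_from_bytes(raw_mail)
    found = {}
    for part in msg.walk():
        name = part.get_filename()
        if not name or not name.endswith(".cf"):
            continue
        payload = part.get_payload(decode=True)  # undoes base64/quoted-printable
        if payload is not None:
            found[name] = payload.decode("utf-8", errors="replace")
    return found
```

Each returned entry could then be written out as a rules file, linted, and mass-checked on its own before the results are posted back to the list.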

For active rule development, this is obviously quite useful! If you can't run
mass-check locally for whatever reason, it offers a way to do this using other
people's corpora in almost-real-time.

JustinMason: 'I'd like to see if there's a way to combine the two (that is,
nightly and list-driven mass-checks) somehow, so that new SVN commits that
update sandbox rules, are immediately mass-checked alone. However, I can't see
a way to do that reliably from SVN commits alone, because (for example) meta
rules may depend on other rules that were not changed as part of the same
commit. So I think the "email with attached rules file" is still a better
model.'
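
The meta-rule problem can be illustrated with a small Python sketch that lists the rule names a file's meta rules refer to but the file itself does not define. It is a deliberately simplified parser: it only recognises a handful of rule types, ignores conditionals and other configuration lines, and the function name and rule names used with it are hypothetical.

```python
import re

# Simplified: only a few rule types are recognised.
RULE_DEF = re.compile(r"^\s*(header|body|rawbody|uri|full|meta)\s+(\w+)\s+(.*)$")
# Rule names inside a meta expression (numbers and operators are skipped).
NAME = re.compile(r"\b[A-Za-z_]\w*\b")


def external_dependencies(cf_text):
    """Return rule names used by meta rules in cf_text but not defined in it."""
    defined = set()
    needed = set()
    for line in cf_text.splitlines():
        m = RULE_DEF.match(line)
        if not m:
            continue  # comments, scores, and anything else we do not parse
        kind, name, rest = m.groups()
        defined.add(name)
        if kind == "meta":
            expression = rest.split("#", 1)[0]  # drop a trailing comment
            needed.update(NAME.findall(expression))
    return needed - defined
```

A non-empty result means the file cannot be mass-checked alone: the missing rules would have to be pulled in from elsewhere, which a single SVN commit does not necessarily contain.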