Mailing List Archive: [Spamassassin Wiki] Update of "RulesProjMoreInput" by JustinMason

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/RulesProjMoreInput

The comment on the change is:
updates

------------------------------------------------------------------------------

Problem description: 'The SpamAssassin committers are not spending much time writing rules. Attempts to recruit people to become committers to write rules have been somewhat unsuccessful. We could always use more committers and contributors; what can we do to encourage more contribution?'

- The ideas are as follows:
+ Here are some ideas.

== Sandboxes ==

- DanielQuinlan proposed a new system of 'sandboxes' in SVN [http://news.gmane.org/gmane.mail.spam.spamassassin.devel (mail)]. Summary:
+ Initially, the rules project sandboxes will live in the existing (empty)
+ rules directory in Subversion (the CVS replacement used by the ASF). Each
+ committer will have their own sandbox in which to begin development in an
+ unconstrained manner:

- * easy to join
- * no quality or experience requirement for the sandbox
- * easy to get rules into main rule body
+ rules/sandbox/<username>/
+
+ Every person who has listed their rule set on the Apache SpamAssassin
+ Wiki will be invited once the PMC approves the project; there are some rule
+ sets only listed at SARE or Exit0, but those people are invited to join
+ too, of course. There is absolutely no quality or experience
+ requirement for the sandbox although we may later provide some tools to
+ make it easier to avoid name collisions and such.
+
+ It is expected that someone (don't know and don't care who) will
+ eventually write scripts to test, filter, and pull rules automatically
+ into the production rules. I am intentionally deferring decisions
+ around that area, though.
+
+ What does providing a sandbox for everyone do?
+
+ * easy to join (you just have to sign a CLA and get an @apache.org account)
+ * no expectation of... well, much anything; no quality or experience requirement for the sandbox
+ * easy for us to import rules (manually or automatically) into main rule body
+ * easy to move forward with further development around automatic updates and all of the other (hard) ideas we've talked about, but I really want to keep this dirt simple.
+ * ability to help direct future development of the rules project (as it extends beyond sandboxes, sandboxes will remain just sandboxes, of course).
* can produce multiple "output rule sets" in the long run: conservative, aggressive, sub-areas: bounces, drug rules, etc.
* uses SVN and therefore has version control
+
+ In other words, this solves the main part of our "rules problem" -- the
+ hurdle of getting rules "over the wall". No longer will we need
+ individual bugs for rule submissions, or need to go to 3 different sites
+ to look for rule ideas, etc. Many of our best rules have come from SARE
+ and the Wiki.
+
+ Also, it's expected that many of the rules will never go
+ into the main rules body -- someone may write rules for a specific type
+ of annoying mail (not even necessarily spam), or maybe someone will be
+ focused on super-aggressive rules for the brave folks out there. We can
+ even produce multiple "output rule sets" in the long run: conservative,
+ aggressive, sub-areas: bounces, drug rules, etc.
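
The proposal mentions possible later tooling to help avoid rule-name collisions between sandboxes. A minimal sketch of what such a check might look like follows. It assumes each sandbox holds `.cf` files under `rules/sandbox/<username>/` and that a rule is introduced by one of the basic `header`, `body`, `rawbody`, `uri`, `full` or `meta` lines; the function names are hypothetical and nothing here is part of the proposal itself.

```python
import re
from collections import defaultdict
from pathlib import Path

# Basic rule-definition keywords; plugin-provided rule types are not covered
# (assumption made to keep the sketch short).
RULE_DEF = re.compile(
    r"^\s*(?:header|body|rawbody|uri|full|meta)\s+([A-Za-z0-9_]+)\b"
)


def rule_names(cf_text):
    """Return the set of rule names defined in the text of one .cf file."""
    names = set()
    for line in cf_text.splitlines():
        match = RULE_DEF.match(line)
        if match:
            names.add(match.group(1))
    return names


def find_collisions(rules_by_user):
    """Map each rule name defined by more than one user to those users.

    rules_by_user maps a username to a list of .cf file contents.
    """
    owners = defaultdict(set)
    for user, texts in rules_by_user.items():
        for text in texts:
            for name in rule_names(text):
                owners[name].add(user)
    return {name: sorted(users) for name, users in owners.items() if len(users) > 1}


def load_sandboxes(root):
    """Read every <root>/<username>/*.cf file (hypothetical on-disk layout)."""
    result = {}
    for user_dir in sorted(Path(root).iterdir()):
        if user_dir.is_dir():
            result[user_dir.name] = [
                path.read_text(errors="replace")
                for path in sorted(user_dir.glob("*.cf"))
            ]
    return result
```

With a checkout in place, `find_collisions(load_sandboxes("rules/sandbox"))` would list the names that two or more sandboxes define, which is the point at which an import script would need to rename or skip a rule.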

Some notes picked out of followup discussion:

@@ -46, +80 @@

* checked rules and their results are kept in a version-control history, so benefits of VC are available
* ongoing visibility of hit-rates of the existing ruleset, against fresh corpora

+ The ruleqa CGI is now in the SpamAssassin zone, so this is still visible, even though the automc stuff is disabled. Here it is: http://buildbot.spamassassin.org/ruleqa/
+
- (LOAFER: Suggestion: It would be good to know the % runtime figure for a sandbox rule as a missing boundary can take a rule from 1.5% to 0.0n% performance hit easily
+ 'LOAFER': Suggestion: It would be good to know the % runtime figure for a sandbox rule as a missing boundary can take a rule from 1.5% to 0.0n% performance hit easily:
+
{{{
perl -d:DProf mass-check -j=1 spam:dir:some_reasonable_sample_set_including_hits_and_misses
dprofpp -O 2000 > perf.log
}}}
+
+ JustinMason: Agreed, this would be useful.
+
Some way of scheduling a small run during the development day would be useful, rather than waiting for the nightly.
An email of the user's completed results would be nice to see too.
- )

- (TODO: migrate the ruleqa CGI onto the SpamAssassin zone so this is still visible, even though the automc stuff is disabled)
+ JustinMason: I think the more immediate, email-based, system is better done using List-Driven Mass-Checks as below; this is good for slow-but-comprehensive daily tests.

=== List-Driven Mass-Checks ===

@@ -80, +119 @@

rules may depend on other rules that were not changed as part of the same
commit. So I think the "email with attached rules file" is still a better
model.'
- LOAFER: There are eval rules to consider too.

+ 'LOAFER': There are eval rules to consider too.
+
+ JustinMason: I think we have to do those as plugins, via the sandboxes.
+
+ Here's the current proposal:
+
+ * apache.org-hosted mailing list
+ * subscription is open to invited mass-checkers/rule developers/committers, and Members of the ASF (the latter is a requirement for ASF project lists)
+ * archives are publicly available, but delayed by 1 month
+ * automated mass-checks of attachments in specific file format
+ * rules considered suitable for use are manually checked into the "sandbox" area by one of the committers who has privs to do that
+ * with luck, they'll go into the core based on the automated testing described in RulesProjStreamlining.
+
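
As a rough illustration of the "automated mass-checks of attachments" step, the sketch below pulls rule files out of a message posted to the list. The proposal does not say what the "specific file format" is, so the `.cf` suffix is an assumption, as are the function names; running the extracted rules through mass-check is left out.

```python
from email import message_from_bytes
from email.message import EmailMessage


def extract_rule_attachments(raw_message, suffix=".cf"):
    """Return (filename, text) pairs for attachments whose name ends in suffix.

    raw_message is the full message as bytes. The ".cf" default is an
    assumption; the proposal only says "specific file format".
    """
    msg = message_from_bytes(raw_message)
    found = []
    for part in msg.walk():
        filename = part.get_filename()
        if not filename or not filename.lower().endswith(suffix):
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is not None:
            found.append((filename, payload.decode("utf-8", errors="replace")))
    return found


def build_example_message():
    """Build a small test message with one rule file and one unrelated file."""
    msg = EmailMessage()
    msg["Subject"] = "mass-check request"
    msg.set_content("Please test the attached rules.")
    msg.add_attachment(
        b"body TEST_RULE /foo/\n",
        maintype="application", subtype="octet-stream", filename="test.cf",
    )
    msg.add_attachment(
        b"not a rule file",
        maintype="application", subtype="octet-stream", filename="notes.txt",
    )
    return msg.as_bytes()
```

Calling `extract_rule_attachments(build_example_message())` returns only the `test.cf` attachment. A real handler would also want to check that the sender is a list subscriber before queueing anything.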