Mailing List Archive: [Spamassassin Wiki] Update of "RulesProjSandboxes" by JustinMason

Aug 13, 2005, 3:14 PM

Post #1 of 3

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/RulesProjSandboxes

The comment on the change is:
consolidate sandbox stuff

New page:
= Rules Project: Sandboxes =

Initially, the rules project's sandbox area will consist of the existing (empty)
rules directory in Subversion (SVN, the CVS replacement used by the ASF). Each
committer will have their own sandbox in which to begin development in an
unconstrained manner:

rules/sandbox/<username>/

Every person who has listed their rule set on the Apache SpamAssassin
Wiki will be invited once the PMC approves the project. Some rule sets are
listed only at SARE or Exit0, and their authors are invited to join too, of
course. There is absolutely no quality or experience requirement for the
sandbox, although we may later provide some tools to make it easier to avoid
name collisions and the like.

It is expected that someone (don't know and don't care who) will
eventually write scripts to test, filter, and pull rules automatically
into the production rules. I am intentionally deferring decisions
around that area, though.

What does providing a sandbox for everyone do?

* easy to join (you just have to sign a CLA and get an @apache.org account)
* no expectation of... well, much anything; no quality or experience requirement for the sandbox
* easy for us to import rules (manually or automatically) into main rule body
* easy to move forward with further development around automatic updates and all of the other (hard) ideas we've talked about; for now, though, I really want to keep this dirt simple
* ability to help direct future development of the rules project as it extends beyond sandboxes (the sandboxes themselves will remain just sandboxes, of course)
* can produce multiple "output rule sets" in the long run: conservative, aggressive, sub-areas: bounces, drug rules, etc.
* uses SVN and therefore has version control

In other words, this solves the main part of our "rules problem" -- the
hurdle of getting rules "over the wall". No longer will we need
individual bugs for rule submissions, or need to go to 3 different sites
to look for rule ideas, etc. Many of our best rules have come from SARE
and the Wiki.

Also, it's expected that many of the rules will never go
into the main rules body -- someone may write rules for a specific type
of annoying mail (not even necessarily spam), or maybe someone will be
focused on super-aggressive rules for the brave folks out there. We can
even produce multiple "output rule sets" in the long run: conservative,
aggressive, sub-areas: bounces, drug rules, etc.

Some notes picked out of followup discussion:

It is possible to keep rules 'private', and in your own checkout only, by not checking them into SVN.

If you do want to have the rules visible for collaboration, but not used for automatic mass-checks or promotion, that could be done by just keeping them in a file that doesn't end in ".cf". (SpamAssassin's standard is that they have to end in ".cf" to be considered valid rules files.)
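
As a rough illustration of the ".cf" convention just described, the following Python sketch selects only the files in a sandbox directory whose names end in ".cf". The function name `rule_files` is hypothetical, and this is not how the mass-check scripts themselves are written; it only shows the filtering idea.

```python
from pathlib import Path
from typing import List


def rule_files(sandbox_dir: str) -> List[Path]:
    """Return the files in sandbox_dir whose names end in ".cf".

    Anything else (notes, drafts, rules parked under another suffix)
    is skipped, so it stays visible in SVN without being picked up.
    """
    return sorted(
        p for p in Path(sandbox_dir).iterdir()
        if p.is_file() and p.name.endswith(".cf")
    )
```

Under this sketch, a file renamed from "20_test.cf" to "20_test.cf.draft" would simply drop out of the returned list.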

== Repository Organization ==

* rules/core/ = standard rules directory
* rules/sandbox/<username>/ = per-user sandboxes
* rules/extra/<directory>/ = extra rule sets not in core

The proposal is for rules/core to become the rules directory for trunk (3.2 and later, via SVN
externals which will make their inclusion seamless in the standard SA tree). The sandbox is discussed
further in RulesProjMoreInput.

== Extras/ ==

We'll want to discuss the structure and process behind creating new extras
directories further once we reach a critical mass of committers in the rules
project; but here are some initial thoughts on typical 'extra' rulesets.

* 'Aggressive' rulesets, which are too likely to produce FPs for the base release
* non-spam-oriented rules, such as the anti-virus-bounce ruleset
* non-English-language rulesets (although see RulesNotEnglish)

= Rule Promotion =

Getting rules from the sandbox into the distribution:

* each user gets their own sandbox as discussed on RulesProjSandboxes
* checked-in rules in the sandboxes are mass-checked in the nightly mass-checks
* migrating a rule from the "sandbox" (dev) to the "core" (production) ruleset uses C-T-R (commit-then-review); i.e. votes are not required in advance
* migrating a rule from the "sandbox" to an "extra" ruleset is also C-T-R

Rules that get promoted from a "sandbox" to "core" should pass the following criteria:

* pass "--lint"!
* S/O ratio of 0.95 or greater (or 0.05 or less for nice rules)
* hits more than 0.25% of the target message type (e.g. spam for non-nice rules)
* hits less than 1.00% of the non-target message type (e.g. ham for non-nice rules)

We can automate those criteria pretty easily. We can also vote in rules that don't pass those criteria but that we think should be put into core for some other reason.
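
To make the "we can automate those criteria" point concrete, here is a minimal Python sketch of the check. The `RuleStats` record and its field names are hypothetical, and the S/O ratio is assumed to be the spam share of all hits computed from hit percentages; the real mass-check tooling may define and store these differently.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    """Hypothetical per-rule mass-check summary (field names are illustrative)."""
    name: str
    lint_ok: bool       # did the rule pass "--lint"?
    spam_pct: float     # percentage of spam messages hit
    ham_pct: float      # percentage of ham messages hit
    nice: bool = False  # nice rules are meant to hit ham


def so_ratio(spam_pct: float, ham_pct: float) -> float:
    """Assumed definition: spam share of all hits, from hit percentages."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.5


def passes_core_criteria(r: RuleStats) -> bool:
    """Apply the lint, S/O, target-hit and non-target-hit thresholds."""
    if not r.lint_ok:
        return False
    so = so_ratio(r.spam_pct, r.ham_pct)
    if r.nice:
        target, non_target, so_ok = r.ham_pct, r.spam_pct, so <= 0.05
    else:
        target, non_target, so_ok = r.spam_pct, r.ham_pct, so >= 0.95
    return so_ok and target > 0.25 and non_target < 1.00
```

A rule that fails this check could still reach core by vote, as noted above; the sketch covers only the automatic path.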

Future criteria:

* not too slow ;) TODO: need an automated way to measure that
* TODO: criteria for overlap with existing rules? See 'Overlap Criteria' below.

== Getting There From Here ==

If we're going to start pulling rules from sandboxes into core/ in
the above fashion, but we leave the current ruleset intact in the
core as well, things will get messy.

I propose we move the current core ruleset into a sandbox, called
'rules/sandbox/legacy/'. The good rules that pass the above
selection criteria get promoted into the new 'core/', just like rules
from any other sandbox; the old, stale rules (of which we have a few)
will not get back into core.

== The 'extra/' Set ==

A ruleset in the "extra" set would have different criteria; e.g.

* the virus bounce ruleset
* rules that positively identify spam from spamware, but hit <0.25% of spam
* an "aggressive" rules set might include rules that hit with an S/O of only 0.89, but push a lot of spam over the 5.0 threshold without impacting significantly on ham

(ChrisSanterre: Seeing this breakdown of dirs gave me an idea. Why not set the "aggressiveness" of SA for updates? Like how SARE has ruleset0.cf (no ham hits), ruleset1.cf (few ham, high S/O), etc., with each "level" of rule set file getting slightly more aggressive, risking (though not necessarily seeing) slightly higher FP rates. Users could set some config like supdate=(1-4), with 1 being the most conservative and 4 being the most aggressive (with the knowledge that more aggressive *could* possibly cause more FPs).

JustinMason: I think for now it's easiest to stick with the 'load aggressive rulesets by name' idea, rather than adding a new configuration variable. For example, aggressiveness is not the only criterion for which rulesets to use; we'd have to include config variables for "I want anti-viral-bounce rulesets", too.)

== Overlap Criteria ==

BobMenschel: The method I used for weeding out SARE rules that overlapped 3.0.0 rules, was to run a full mass-check with overlap analysis, and throw away anything where the overlap is less than 50% (ie: keep only those rules which have "meaningful" overlap). Manually reviewing the remaining (significantly) overlapping rules was fairly easy. The command I use is:

 perl ./overlap ../rules/tested/$testfile.ham.log ../rules/tested/$testfile.spam.log | grep -v mid= | awk ' NR == 1 { print } ; $2 + 0 == 1.000 && $3 + 0 >= 0.500 { print } ' >../rules/tested/$testfile.overlap.out

DanielQuinlan: 'By "throw away", do you mean put into the bucket that is retained going forward or did you mean to say "greater than 50%"?'

BobMenschel: 'By "throw away anything where the overlap is less than 50%" I
meant to discard (exclude from the final file) anything where the overlap was
(IMO) insignificant.
This would leave those overlaps where RULE_A hit all the emails that
RULE_B also hit (100%), and RULE_B hit somewhere between 50% and 100%
of the emails that RULE_A hit.'

JustinMason: Like Daniel, I'm confused here. As far as I can see, you want to
keep the rules that do NOT have a high degree of overlap with other rules, and
throw out the rules that do (because they're redundant). In other words, you
want to throw away when the mutual overlap is greater than some high value
(like 95% at a guess).
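
The two readings in this exchange can be put side by side in a short Python sketch. It does not reproduce the `overlap` script or its output columns; it assumes we already have, for each rule, the set of message IDs it hit, and all function names here are hypothetical. `covering_pairs` follows Bob's description (RULE_A hits everything RULE_B hits, and RULE_B hits at least half of what RULE_A hits), while `redundant_pairs` follows Justin's reading (mutual overlap above a high threshold, 95% being his guess).

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Row = Tuple[str, str, float, float]


def overlap_pairs(hits: Dict[str, Set[str]]) -> List[Row]:
    """For each ordered pair (a, b) with shared hits, return
    (a, b, share of b's hits also hit by a, share of a's hits also hit by b)."""
    rows = []
    for a, b in permutations(sorted(hits), 2):
        common = len(hits[a] & hits[b])
        if common == 0:
            continue  # no shared messages; also avoids dividing by zero
        rows.append((a, b, common / len(hits[b]), common / len(hits[a])))
    return rows


def covering_pairs(rows: List[Row], min_share: float = 0.5) -> List[Tuple[str, str]]:
    """Bob's filter: a covers all of b's hits, and b hits at least min_share of a's."""
    return [(a, b) for a, b, b_in_a, a_in_b in rows
            if b_in_a == 1.0 and a_in_b >= min_share]


def redundant_pairs(rows: List[Row], threshold: float = 0.95) -> List[Tuple[str, str]]:
    """Justin's reading: both directions overlap at or above the threshold."""
    return [(a, b) for a, b, b_in_a, a_in_b in rows
            if a < b and b_in_a >= threshold and a_in_b >= threshold]
```

Both filters work from the same pair table, so the open question in the discussion is which pairs to review and which rule of a pair to drop, not how the overlap itself is measured.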

[Spamassassin Wiki] Update of "RulesProjSandboxes" by JustinMason [ In reply to ]

wikidiffs at apache

Aug 22, 2005, 7:29 PM

Post #2 of 3

[Spamassassin Wiki] Update of "RulesProjSandboxes" by JustinMason [ In reply to ]

wikidiffs at apache

Aug 22, 2005, 7:39 PM

Post #3 of 3

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/RulesProjSandboxes

The comment on the change is:
refactor page, other was getting too big

------------------------------------------------------------------------------
* rules/sandbox/<username>/ = per-user sandboxes
* rules/extra/<directory>/ = extra rule sets not in core

- The proposal is for rules/core to become the rules directory for trunk (3.2 and later, via SVN
+ The proposal is for rules/core/ to become the rules directory for trunk (3.2 and later, via SVN
externals which will make their inclusion seamless in the standard SA tree). The sandbox is discussed
further in RulesProjMoreInput.
+
+ Promotion of rules from sandbox to rules/core/ is discussed in RulesProjPromotion.

== Extras/ ==

@@ -66, +68 @@

directories further once we reach a critical mass of committers in the rules
project; but here's some initial thoughts on typical 'extra' rulesets.

- * 'Aggressive' rulesets, which are too likely to produce FPs for the base release
* non-spam-oriented rules, such as the anti-virus-bounce ruleset
* non-English-language rulesets (although see RulesNotEnglish)
-
- = Rule Promotion =
-
- Getting rules from the sandbox, into the distribution:
-
- * each user gets their own sandbox as discussed on RulesProjSandboxes
- * checked-in rules in the sandboxes are mass-checked in the nightly mass-checks
- * to migrate a rule from "sandbox" (dev) to "core" (production) ruleset uses C-T-R; ie. votes are not required in advance
- * also C-T-R to migrate from "sandbox" to "extra" ruleset
-
- Rules that get promoted from a "sandbox" to "core" should pass the following criteria:
-
- * pass "--lint"!
- * S/O ratio of 0.95 or greater (or 0.05 or less for nice rules)
- * > 0.25% of target type hit (e.g. spam for non-nice rules)
- * < 1.00% of non-target type hit (e.g. ham for non-nice rules)
-
- These numbers are really just ball-park figures and should be fine-tuned as we go. (DuncanFindlay)
-
- We can automate those criteria pretty easily. We can also vote for rules that don't pass those criteria, but we think should be put into core for some reason.
-
- Future criteria:
-
- * not too slow ;) TODO: need an automated way to measure that
- * TODO: criteria for overlap with existing rules? see 'overlap criteria' below.
-
- == Getting There From Here ==
-
- === Moving files out of trunk into the new rules project ===
-
- JustinMason: If we're going to start pulling rules from sandboxes into core/ in
- the above fashion, but we leave the current ruleset intact in the
- core as well, things will get messy.
- I propose we move the current core ruleset into a sandbox, called
- 'rules/sandbox/legacy/'. The good rules that pass the above selection
- criteria, get promoted as any other rules from other sandboxes do, into the new
- 'core/'; the old, stale rules (of which we have a few), will not get back into
- core.
-
- DanielQuinlan: vetoed. Instead: code-tied rules stay with main tree in current
- rules directory, with the exception of 25_replace.cf which is really just
- another way to write body/header rules. Basically, the static stuff that is
- tied to code does not move to the rules project.
-
- In more detail -- files that DO NOT move to rules project:
-
- {{{
- 25_accessdb.cf (plugins in core code)
- 25_antivirus.cf
- 25_dcc.cf
- 25_domainkeys.cf
- 25_hashcash.cf
- 25_pyzor.cf
- 25_razor2.cf
- 25_spf.cf
- 25_textcat.cf
- 25_uribl.cf
- 60_awl.cf
- 60_whitelist_subject.cf
- 20_dnsbl_tests.cf (eval tests in EvalTests.pm)
- 20_html_tests.cf (rawbody ones can move to ROOT/rules/core/)
- 20_net_tests.cf
- 23_bayes.cf
- 60_whitelist.cf
- init.pre (Misc non-cf files)
- local.cf
- name-triplets.txt
- regression_tests.cf
- triplets.txt
- user_prefs.template
- v310.pre
- }}}
-
- Files that DO get moved:
-
- {{{
- 25_body_tests_es.cf -> ROOT/rules/lang/es/
- 25_body_tests_pl.cf -> ROOT/rules/lang/pl/
- 30_text_de.cf -> ROOT/rules/lang/de/
- 30_text_fr.cf -> ROOT/rules/lang/fr/
- 30_text_it.cf -> ROOT/rules/lang/it/
- 30_text_nl.cf -> ROOT/rules/lang/nl/
- 30_text_pl.cf -> ROOT/rules/lang/pl/
- 30_text_pt_br.cf -> ROOT/rules/lang/pt_br/
-
- 20_advance_fee.cf -> ROOT/rules/core/
- 20_drugs.cf -> ROOT/rules/core/
- 20_p**n.cf -> ROOT/rules/core/ [wikicensorship!]
-
- 10_misc.cf -> ROOT/rules/core/
- 20_anti_ratware.cf -> ROOT/rules/core/
- 20_body_tests.cf -> ROOT/rules/core/
- 20_compensate.cf -> ROOT/rules/core/
- 20_fake_helo_tests.cf -> ROOT/rules/core/
- 20_head_tests.cf -> ROOT/rules/core/
- 20_meta_tests.cf -> ROOT/rules/core/
- 20_phrases.cf -> ROOT/rules/core/
- 20_ratware.cf -> ROOT/rules/core/
- 20_uri_tests.cf -> ROOT/rules/core/
- 25_replace.cf (odd case, but will change a lot) -> ROOT/rules/core/
- 50_scores.cf -> ROOT/rules/core/
- 60_whitelist_spf.cf -> ROOT/rules/core/
- }}}
-
- Files that get deleted: 20_anti_ratware.cf: it's empty.
-
- JustinMason: ok, that looks good -- except for one thing. We still have the problem that ROOT/rules/core/ is going to be a mix of legacy files and auto-promoted rules. What do we do about that problem?
-
- === Algorithm for auto-promotion ===
-
- JustinMason: Aside from the criteria, we also need an idea of how the config file lines get from sandbox to core. Here's my proposal.
-
- For each sandbox directory:
- * iterate through all files in the dir
- * if a config line refers to a rule name (e.g. "header", "describe", "tflags"), then:
- * apply the criteria from 'Rule Promotion'. if the rule passes:
- * output the line
- * else:
- * ignore the line and produce no output
- * if the config line doesn't refer to a rule name, output the line.
- * send that output to a file in ROOT/rules/core/ , named according to the sandbox directory's name. e.g. lines from all files matching ROOT/rules/sandbox/jmason/*.cf would be output to ROOT/rules/core/25_jmason.cf
-
- == The 'extra/' Set ==
-
- A ruleset in the "extra" set would have different criteria; e.g.
-
- * the virus bounce ruleset
* rules that positively identify spam from spamware, but hit <0.25% of spam
* an "aggressive" rules set might include rules that hit with an S/O of only 0.89, but push a lot of spam over the 5.0 threshold without impacting significantly on ham
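
The auto-promotion steps proposed in the removed "Algorithm for auto-promotion" text of this diff can be sketched in Python as follows. This is an illustration under assumptions, not the project's implementation: `RULE_KEYWORDS` is only an illustrative subset of config keywords whose second token is a rule name, `passes` stands in for whatever check applies the promotion criteria, and both function names are hypothetical.

```python
from typing import Callable, Iterable, List

# Illustrative subset of config keywords whose second token is a rule name.
RULE_KEYWORDS = {"header", "body", "rawbody", "uri", "full", "meta",
                 "describe", "score", "tflags"}


def promote_lines(lines: Iterable[str], passes: Callable[[str], bool]) -> List[str]:
    """Keep lines for rules that pass the criteria; keep all non-rule lines."""
    out = []
    for line in lines:
        tokens = line.split()
        if len(tokens) >= 2 and tokens[0] in RULE_KEYWORDS:
            if passes(tokens[1]):
                out.append(line)
            # otherwise the line is dropped and produces no output
        else:
            out.append(line)
    return out


def core_filename(sandbox_name: str) -> str:
    """Name of the output file for a sandbox, following the 25_jmason.cf example."""
    return f"25_{sandbox_name}.cf"
```

For example, feeding the lines of every ".cf" file under a sandbox named jmason through `promote_lines` and writing the result to the file named by `core_filename("jmason")` would mirror the proposal's per-sandbox output file.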