[Spamassassin Wiki] Update of "RulesProjStreamlining" by JustinMason
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/RulesProjStreamlining

The comment on the change is:
updating

------------------------------------------------------------------------------

First, the sandboxes idea greatly increases the number of people who can check rules into SVN. Second, the barriers to entry for getting a sandbox account are much lower.

+ = Rule Promotion =
- Some bullet points from discussion; needs expanding:
-
- sandbox:

* each user gets their own sandbox as discussed on RulesProjMoreInput
* checked-in rules in the sandboxes are mass-checked in the nightly mass-checks
@@ -26, +24 @@

* S/O ratio of 0.95 or greater (or 0.05 or less for nice rules)
* > 0.25% of target type hit (e.g. spam for non-nice rules)
* < 1.00% of non-target type hit (e.g. ham for non-nice rules)
- * not too slow ;)
- * TODO: criteria for overlap with existing rules? BobMenschel: The method I used for weeding out SARE rules that overlapped 3.0.0 rules was to run a full mass-check with overlap analysis, and throw away anything where the overlap is less than 50% (i.e. keep only those rules which have "meaningful" overlap). Manually reviewing the remaining (significantly) overlapping rules was fairly easy. The command I use is: perl ./overlap ../rules/tested/$testfile.ham.log ../rules/tested/$testfile.spam.log | grep -v mid= | awk ' NR == 1 { print } ; $2 + 0 == 1.000 && $3 + 0 >= 0.500 { print } ' >../rules/tested/$testfile.overlap.out

+ Future criteria:
- A ruleset in the "extra" set would have different criteria.
- * DanielQuinlan suggested a second collection, of rules that do not qualify for rules/core. For example, SpamAssassin intentionally doesn't filter virus bounces (yet, at least), but there is a good virus bounce ruleset out there.
- * BobMenschel: Similarly, an "extra" rules set might include rules that positively identify spam from spamware, but hit <0.25% of spam. Or an "aggressive" rules set might include rules that hit with an S/O of only 0.89, but push a lot of spam over the 5.0 threshold without impacting significantly on ham.
- * ChrisSanterre: Seeing this breakdown of dirs gave me an idea. Why not set the "aggressiveness" of SA for updates? Like how SARE has ruleset0.cf (no ham hits), ruleset1.cf (few ham, high S/O), etc., with each "level" of rule set file getting slightly more aggressive, risking (though not necessarily seeing) slightly higher FP rates. Users could set some config like supdate=(1-4), with 1 being the most conservative, and 4 being the most aggressive (with the knowledge that more aggressive *could* possibly cause more FPs).

- We can also vote for extraordinary stuff that doesn't fit into those criteria...
+ * not too slow ;) TODO: need an automated way to measure that
+ * TODO: criteria for overlap with existing rules? see 'overlap criteria' below.

- private list for mass-checks:
+ We can also vote for rules that don't pass those criteria but that we think should be put into core for some reason.

+ A ruleset in the "extra" set would have different criteria; e.g.
- * archives delayed 1 month?
- * moderated signups
- * automated mass-checks of attachments in specific file format
- * rules considered suitable for use are checked into the "sandbox" area for a quick nightly-mass-check, for release

+ * the virus bounce ruleset
+ * rules that positively identify spam from spamware, but hit <0.25% of spam
+ * an "aggressive" rules set might include rules that hit with an S/O of only 0.89, but push a lot of spam over the 5.0 threshold without impacting significantly on ham
+
+ (ChrisSanterre: Seeing this breakdown of dirs gave me an idea. Why not set the "aggressiveness" of SA for updates? Like how SARE has ruleset0.cf (no ham hits), ruleset1.cf (few ham, high S/O), etc., with each "level" of rule set file getting slightly more aggressive, risking (though not necessarily seeing) slightly higher FP rates. Users could set some config like supdate=(1-4), with 1 being the most conservative, and 4 being the most aggressive (with the knowledge that more aggressive *could* possibly cause more FPs).
+
+ JustinMason: I think for now it's easiest to stick with the 'load aggressive rulesets by name' idea, rather than adding a new configuration variable. For one thing, aggressiveness is not the only criterion for choosing rulesets; we'd have to include config variables for "I want anti-viral-bounce rulesets", too.)
+
+ == Overlap Criteria ==
+
+ BobMenschel: The method I used for weeding out SARE rules that overlapped 3.0.0 rules was to run a full mass-check with overlap analysis, and throw away anything where the overlap is less than 50% (i.e. keep only those rules which have "meaningful" overlap). Manually reviewing the remaining (significantly) overlapping rules was fairly easy. The command I use is:
+
+ perl ./overlap ../rules/tested/$testfile.ham.log ../rules/tested/$testfile.spam.log | grep -v mid= | awk ' NR == 1 { print } ; $2 + 0 == 1.000 && $3 + 0 >= 0.500 { print } ' >../rules/tested/$testfile.overlap.out
+
+ (The awk filter keeps the header line, plus only those rows where the second column is exactly 1.000 and the third is at least 0.500; per the clarification below, those columns are the two rules' mutual overlap fractions.)
+
+ DanielQuinlan: 'By "throw away", do you mean put into the bucket that is retained going forward or did you mean to say "greater than 50%"?'
+
+ BobMenschel: 'By "throw away anything where the overlap is less than 50%" I
+ meant to discard (exclude from the final file) anything where the overlap was
+ (IMO) insignificant.
+ This would leave those overlaps where RULE_A hit all the emails that
+ RULE_B also hit (100%), and RULE_B hit somewhere between 50% and 100%
+ of the emails that RULE_A hit.'
+
+ JustinMason: Like Daniel, I'm confused here. As far as I can see, you want to
+ keep the rules that do NOT have a high degree of overlap with other rules, and
+ throw out the rules that do (because they're redundant). In other words, you
+ want to throw away when the mutual overlap is greater than some high value
+ (like 95% at a guess).
+
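To make the overlap fractions in this discussion concrete, here is a minimal sketch of computing the mutual overlap of two rules from per-message hit lists. The input format and rule names are illustrative assumptions, not the actual mass-check log layout:

{{{
#!/usr/bin/perl -w
use strict;

# Sketch only: assumes each input line is "<message-id> <rule1,rule2,...>";
# the real mass-check log format may differ.
my ($rule_a, $rule_b) = ('RULE_A', 'RULE_B');   # hypothetical rule names
my (%hit_a, %hit_b);

while (<>) {
  chomp;
  my ($mid, $rules) = split ' ', $_, 2;
  next unless defined $rules;
  my %hits = map { $_ => 1 } split /,/, $rules;
  $hit_a{$mid} = 1 if $hits{$rule_a};
  $hit_b{$mid} = 1 if $hits{$rule_b};
}

# count of messages hit by both rules
my $both = grep { $hit_b{$_} } keys %hit_a;

# fraction of B's hits that A also hit, and fraction of A's hits that B also hit
printf "overlap/B = %.3f\n", $both / (keys(%hit_b) || 1);
printf "overlap/A = %.3f\n", $both / (keys(%hit_a) || 1);
}}}

In these terms, Bob's cutoff keeps a pair for review when the first fraction is exactly 1.000 (one rule's hits are a subset of the other's) and the second is at least 0.500.
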
[Spamassassin Wiki] Update of "RulesProjStreamlining" by JustinMason
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/RulesProjStreamlining

The comment on the change is:
updates

------------------------------------------------------------------------------

= Rule Promotion =

+ Getting rules from the sandbox into the distribution:
+
* each user gets their own sandbox as discussed on RulesProjMoreInput
* checked-in rules in the sandboxes are mass-checked in the nightly mass-checks
* migrating a rule from the "sandbox" (dev) to the "core" (production) ruleset uses C-T-R (commit-then-review); i.e. votes are not required in advance
- * C-T-R to migrate from "sandbox" to "extra" ruleset
+ * also C-T-R to migrate from "sandbox" to "extra" ruleset

Rules that get promoted from a "sandbox" to "core" should pass the following criteria:

+ * pass "--lint"!
* S/O ratio of 0.95 or greater (or 0.05 or less for nice rules)
* > 0.25% of target type hit (e.g. spam for non-nice rules)
* < 1.00% of non-target type hit (e.g. ham for non-nice rules)
+
+ We can automate those criteria pretty easily. We can also vote for rules that don't pass those criteria but that we think should be put into core for some reason.
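
As a rough illustration of that automation, a simple filter over hit-frequencies-style output could flag promotable candidates. A minimal sketch, assuming whitespace-separated columns of overall%, spam%, ham%, S/O and rule name (an assumed layout for the example, not a committed format):

{{{
#!/usr/bin/perl -w
use strict;

# Sketch only: assumes columns of OVERALL% SPAM% HAM% S/O NAME per rule;
# the real hit-frequencies layout may differ.
while (<>) {
  my ($overall, $spam, $ham, $so, $name) = split;
  next unless defined $name && $so =~ /^[\d.]+$/;   # skip headers/blank lines

  # promotion criteria from above, for non-nice rules
  my $ok = ($so >= 0.95)      # S/O ratio of 0.95 or greater
        && ($spam > 0.25)     # > 0.25% of spam hit
        && ($ham < 1.00);     # < 1.00% of ham hit
  printf "%-30s %s\n", $name, $ok ? "promotable" : "no";
}
}}}

Nice rules would invert the tests: S/O of 0.05 or less, with ham as the target type.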

Future criteria:

* not too slow ;) TODO: need an automated way to measure that (see the timing sketch below)
* TODO: criteria for overlap with existing rules? see 'overlap criteria' below.
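
For the speed criterion, one rough starting point is just to time a candidate rule's pattern over a sample corpus. A minimal sketch (the pattern is a placeholder, and message files are passed as arguments):

{{{
#!/usr/bin/perl -w
use strict;
use Time::HiRes qw(gettimeofday tv_interval);

# Sketch only: time one candidate body regex over a set of message files.
my $re = qr/replica.{0,10}watches/is;   # placeholder rule pattern

my @msgs;
for my $file (@ARGV) {
  open my $fh, '<', $file or die "$file: $!";
  local $/;                # slurp each whole message
  push @msgs, <$fh>;
}

my $t0 = [gettimeofday];
for my $m (@msgs) {
  my $n = () = $m =~ /$re/g;   # count matches, discard result
}
printf "%.1f ms over %d messages\n", 1000 * tv_interval($t0), scalar @msgs;
}}}

Anything dramatically slower than its peers on the same corpus would be a candidate for rejection or rewriting.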

- We can also vote for rules that don't pass those criteria but that we think should be put into core for some reason.
+ == Getting There From Here ==
+
+ If we're going to start pulling rules from sandboxes into core/ in
+ the above fashion, but we leave the current ruleset intact in the
+ core as well, things will get messy.
+
+ I propose we move the current core ruleset into a sandbox, called
+ 'rules/sandbox/legacy/'. The good rules that pass the above
+ selection criteria get promoted, as any other rules from other
+ sandboxes do, into the new 'core/'; the old, stale rules (of
+ which we have a few) will not get back into core.
+
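A sketch of that move, using only standard svn operations (the exact paths are taken from the proposal above and may differ in practice):

{{{
svn mkdir rules/sandbox/legacy
for f in rules/*.cf ; do svn mv "$f" rules/sandbox/legacy/ ; done
svn commit -m "move legacy core ruleset into a sandbox for re-qualification"
}}}
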
+ == The 'extra/' Set ==

A ruleset in the "extra" set would have different criteria; e.g.
[Spamassassin Wiki] Update of "RulesProjStreamlining" by JustinMason
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/RulesProjStreamlining

------------------------------------------------------------------------------

First, the sandboxes idea greatly increases the number of people who can check rules into SVN. Second, the barriers to entry for getting a sandbox account are much lower.

- = Rule Promotion =
+ Getting rules from the sandbox into the distribution is dealt with on RulesProjSandboxes, from the 'Rule Promotion' section on down.

- Getting rules from the sandbox into the distribution:
-
- * each user gets their own sandbox as discussed on RulesProjMoreInput
- * checked-in rules in the sandboxes are mass-checked in the nightly mass-checks
- * migrating a rule from the "sandbox" (dev) to the "core" (production) ruleset uses C-T-R (commit-then-review); i.e. votes are not required in advance
- * also C-T-R to migrate from "sandbox" to "extra" ruleset
-
- Rules that get promoted from a "sandbox" to "core" should pass the following criteria:
-
- * pass "--lint"!
- * S/O ratio of 0.95 or greater (or 0.05 or less for nice rules)
- * > 0.25% of target type hit (e.g. spam for non-nice rules)
- * < 1.00% of non-target type hit (e.g. ham for non-nice rules)
-
- We can automate those criteria pretty easily. We can also vote for rules that don't pass those criteria but that we think should be put into core for some reason.
-
- Future criteria:
-
- * not too slow ;) TODO: need an automated way to measure that
- * TODO: criteria for overlap with existing rules? see 'overlap criteria' below.
-
- == Getting There From Here ==
-
- If we're going to start pulling rules from sandboxes into core/ in
- the above fashion, but we leave the current ruleset intact in the
- core as well, things will get messy.
-
- I propose we move the current core ruleset into a sandbox, called
- 'rules/sandbox/legacy/'. The good rules that pass the above
- selection criteria get promoted, as any other rules from other
- sandboxes do, into the new 'core/'; the old, stale rules (of
- which we have a few) will not get back into core.
-
- == The 'extra/' Set ==
-
- A ruleset in the "extra" set would have different criteria; e.g.
-
- * the virus bounce ruleset
- * rules that positively identify spam from spamware, but hit <0.25% of spam
- * an "aggressive" rules set might include rules that hit with an S/O of only 0.89, but push a lot of spam over the 5.0 threshold without impacting significantly on ham
-
- (ChrisSanterre: Seeing this breakdown of dirs gave me an idea. Why not set the "aggressiveness" of SA for updates? Like how SARE has ruleset0.cf (no ham hits), ruleset1.cf (few ham, high S/O), etc., with each "level" of rule set file getting slightly more aggressive, risking (though not necessarily seeing) slightly higher FP rates. Users could set some config like supdate=(1-4), with 1 being the most conservative, and 4 being the most aggressive (with the knowledge that more aggressive *could* possibly cause more FPs).
-
- JustinMason: I think for now it's easiest to stick with the 'load aggressive rulesets by name' idea, rather than adding a new configuration variable. For one thing, aggressiveness is not the only criterion for choosing rulesets; we'd have to include config variables for "I want anti-viral-bounce rulesets", too.)
-
- == Overlap Criteria ==
-
- BobMenschel: The method I used for weeding out SARE rules that overlapped 3.0.0 rules was to run a full mass-check with overlap analysis, and throw away anything where the overlap is less than 50% (i.e. keep only those rules which have "meaningful" overlap). Manually reviewing the remaining (significantly) overlapping rules was fairly easy. The command I use is: perl ./overlap ../rules/tested/$testfile.ham.log ../rules/tested/$testfile.spam.log | grep -v mid= | awk ' NR == 1 { print } ; $2 + 0 == 1.000 && $3 + 0 >= 0.500 { print } ' >../rules/tested/$testfile.overlap.out
-
- DanielQuinlan: 'By "throw away", do you mean put into the bucket that is retained going forward or did you mean to say "greater than 50%"?'
-
- BobMenschel: 'By "throw away anything where the overlap is less than 50%" I
- meant to discard (exclude from the final file) anything where the overlap was
- (IMO) insignificant.
- This would leave those overlaps where RULE_A hit all the emails that
- RULE_B also hit (100%), and RULE_B hit somewhere between 50% and 100%
- of the emails that RULE_A hit.'
-
- JustinMason: Like Daniel, I'm confused here. As far as I can see, you want to
- keep the rules that do NOT have a high degree of overlap with other rules, and
- throw out the rules that do (because they're redundant). In other words, you
- want to throw away when the mutual overlap is greater than some high value
- (like 95% at a guess).
-