Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.
The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/RulesProjStreamlining
The comment on the change is:
updating
------------------------------------------------------------------------------
First off, the sandboxes idea greatly increases the number of people who can check rules into SVN. Secondly, the barriers to entry for getting a sandboxes account are much lower.
+ = Rule Promotion =
- Some bulletpoints from discussion, needs expanding:
-
- sandbox:
* each user gets their own sandbox as discussed on RulesProjMoreInput
* checked-in rules in the sandboxes are mass-checked in the nightly mass-checks
@@ -26, +24 @@
* S/O ratio of 0.95 or greater (or 0.05 or less for nice rules)
* > 0.25% of target type hit (e.g. spam for non-nice rules)
* < 1.00% of non-target type hit (e.g. ham for non-nice rules)
- * not too slow ;)
- * TODO: criteria for overlap with existing rules? BobMenschel: The method I used for weeding out SARE rules that overlapped 3.0.0 rules, was to run a full mass-check with overlap analysis, and throw away anything where the overlap is less than 50% (ie: keep only those rules which have "meaningful" overlap). Manually reviewing the remaining (significantly) overlapping rules was fairly easy. The command I use is: perl ./overlap ../rules/tested/$testfile.ham.log ../rules/tested/$testfile.spam.log | grep -v mid= | awk ' NR == 1 { print } ; $2 + 0 == 1.000 && $3 + 0 >= 0.500 { print } ' >../rules/tested/$testfile.overlap.out
+ Future criteria:
- A ruleset in the "extra" set would have different criteria.
- * DanielQuinlan suggested: The second, a collection that do not qualify for rules/core. For example, SpamAssassin intentionally doesn't filter virus bounces (yet, at least), but there is a good virus bounce ruleset out there.
- * BobMenschel: Similarly, an "extra" rules set might include rules that positively identify spam from spamware, but hit <0.25% of spam. Or an "aggressive" rules set might include rules that hit with an S/O of only 0.89, but push a lot of spam over the 5.0 threshold without impacting significantly on ham.
- * ChrisSanterre: Seeing this breakdown of dirs, gave me an idea. Why not set the "aggresiveness" of SA for updates? Like how SARE has ruleset0.cf (no ham hits), ruleset1.cf (few ham, high S/O), etc., with each "level" of rule set file getting slightly more aggressive, risking (though not necessarily seeing) slightly higher FP rates. Users could set some config like supdate=(1-4), with 1 being the most conservative, and 4 being the most aggresive (with the knowledge that more aggresive *could* possibly cause more FPs).
- We can also vote for extraordinary stuff that doesn't fit into those criteria...
+ * not too slow ;) TODO: we need an automated way to measure this
+ * TODO: criteria for overlap with existing rules? see 'overlap criteria' below.
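To make the thresholds above concrete, here is a minimal sketch of checking a rule's mass-check hit counts against the promotion criteria. This is not SpamAssassin code; the function name and the sample counts are made up for illustration.

```python
# Hypothetical helper: does a rule meet the promotion criteria above?
# For "nice" rules the target type is ham rather than spam.

def qualifies(spam_hits, ham_hits, total_spam, total_ham, nice=False):
    """Return True if the rule meets the core-promotion criteria."""
    so = spam_hits / (spam_hits + ham_hits)        # S/O: spam hits / overall hits
    spam_pct = 100.0 * spam_hits / total_spam      # % of spam corpus hit
    ham_pct = 100.0 * ham_hits / total_ham         # % of ham corpus hit
    if nice:
        # nice rules: S/O <= 0.05, >0.25% of ham hit, <1.00% of spam hit
        return so <= 0.05 and ham_pct > 0.25 and spam_pct < 1.00
    # non-nice rules: S/O >= 0.95, >0.25% of spam hit, <1.00% of ham hit
    return so >= 0.95 and spam_pct > 0.25 and ham_pct < 1.00

# A rule hitting 500 of 100,000 spam and 5 of 100,000 ham:
# S/O = 500/505 ~ 0.990, spam hit rate 0.5%, ham hit rate 0.005%
print(qualifies(500, 5, 100_000, 100_000))  # True
```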
- private list for mass-checks:
+ We can also vote for rules that don't pass those criteria but that we think should be put into core for some reason.
+ A ruleset in the "extra" set would have different criteria, e.g.:
- * archives delayed 1 month?
- * moderated signups
- * automated mass-checks of attachments in specific file format
- * rules considered suitable for use are checked into the "sandbox" area for a quick nightly-mass-check, for release
+ * the virus bounce ruleset
+ * rules that positively identify spam from spamware, but hit <0.25% of spam
+ * an "aggressive" rules set might include rules that hit with an S/O of only 0.89, but push a lot of spam over the 5.0 threshold without impacting significantly on ham
+
+ (ChrisSanterre: Seeing this breakdown of dirs gave me an idea. Why not set the "aggressiveness" of SA for updates? Like how SARE has ruleset0.cf (no ham hits), ruleset1.cf (few ham, high S/O), etc., with each "level" of rule set file getting slightly more aggressive, risking (though not necessarily seeing) slightly higher FP rates. Users could set some config like supdate=(1-4), with 1 being the most conservative and 4 being the most aggressive (with the knowledge that more aggressive *could* possibly cause more FPs).
+
+ JustinMason: I think for now it's easiest to stick with the 'load aggressive rulesets by name' idea, rather than adding a new configuration variable. For example, aggressiveness is not the only criterion for what rulesets to use; we'd have to include config variables for "I want anti-viral-bounce rulesets", too.)
+
+ == Overlap Criteria ==
+
+ BobMenschel: The method I used for weeding out SARE rules that overlapped 3.0.0 rules was to run a full mass-check with overlap analysis, and throw away anything where the overlap is less than 50% (i.e. keep only those rules which have "meaningful" overlap). Manually reviewing the remaining (significantly) overlapping rules was fairly easy. The command I use is:
+ 
+ {{{
+ perl ./overlap ../rules/tested/$testfile.ham.log ../rules/tested/$testfile.spam.log \
+     | grep -v mid= \
+     | awk ' NR == 1 { print } ; $2 + 0 == 1.000 && $3 + 0 >= 0.500 { print } ' \
+     > ../rules/tested/$testfile.overlap.out
+ }}}
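For readers who don't speak awk, the same filter can be sketched in Python. This assumes (based on the awk fields $2 and $3) that the overlap tool prints a header line followed by rows whose second and third whitespace-separated fields are overlap fractions; the exact output format of `overlap` is an assumption here.

```python
# Sketch of the grep | awk filter above: drop per-message "mid=" lines,
# keep the header, then keep only rows where field 2 is exactly 1.000
# and field 3 is at least 0.500.

def filter_overlap(lines):
    out, nr = [], 0
    for line in lines:
        if "mid=" in line:        # grep -v mid=
            continue
        nr += 1
        if nr == 1:               # awk: NR == 1 { print }  (header)
            out.append(line)
            continue
        f = line.split()          # awk's $2 and $3, coerced to numbers
        if float(f[1]) == 1.000 and float(f[2]) >= 0.500:
            out.append(line)
    return out

# Keeps the header and the first row only:
print(filter_overlap([
    "overlap  frac_ab  frac_ba",
    "ovr 1.000 0.600 RULE_X RULE_Y",   # kept: full overlap one way, 60% the other
    "ovr 1.000 0.400 RULE_X RULE_Z",   # dropped: below the 50% cutoff
]))
```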
+
+ DanielQuinlan: 'By "throw away", do you mean put into the bucket that is retained going forward or did you mean to say "greater than 50%"?'
+
+ BobMenschel: 'By "throw away anything where the overlap is less than 50%" I
+ meant to discard (exclude from the final file) anything where the overlap was
+ (IMO) insignificant.
+ This would leave those overlaps where RULE_A hit all the emails that
+ RULE_B also hit (100%), and RULE_B hit somewhere between 50% and 100%
+ of the emails that RULE_A hit.'
+
+ JustinMason: Like Daniel, I'm confused here. As far as I can see, you want to
+ keep the rules that do NOT have a high degree of overlap with other rules, and
+ throw out the rules that do (because they're redundant). In other words, you
+ want to throw away a rule when its mutual overlap is greater than some high
+ value (like 95%, at a guess).
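A small illustration of the two overlap fractions being discussed, with made-up message IDs. For a pair of rules, the two numbers are the share of each rule's hits that both rules hit; the 100%-one-way, at-least-50%-the-other case is the one Bob's filter keeps.

```python
# Illustrative only: compute the mutual overlap between two rules'
# hit sets. Message IDs and rule names are invented.

def overlap_fractions(hits_a, hits_b):
    """Return (fraction of A's hits shared, fraction of B's hits shared)."""
    shared = hits_a & hits_b
    return len(shared) / len(hits_a), len(shared) / len(hits_b)

rule_a = {"m1", "m2", "m3", "m4"}   # messages hit by RULE_A
rule_b = {"m1", "m2", "m3"}         # messages hit by RULE_B

# RULE_B's hits all fall inside RULE_A's (1.0), while RULE_A shares
# 75% of its hits with RULE_B.
print(overlap_fractions(rule_a, rule_b))  # (0.75, 1.0)
```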
+