Mailing List Archive

[Spamassassin Wiki] Update of "RulesProjStreamlining" by BobMenschel
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by BobMenschel:
http://wiki.apache.org/spamassassin/RulesProjStreamlining

The comment on the change is:
Added suggestion for dealing with overlap

------------------------------------------------------------------------------
* > 0.25% of target type hit (e.g. spam for non-nice rules)
* < 1.00% of non-target type hit (e.g. ham for non-nice rules)
* not too slow ;)
- * TODO: criteria for overlap with existing rules?
+ * TODO: criteria for overlap with existing rules? BobMenschel: The method I used to weed out SARE rules that overlapped 3.0.0 rules was to run a full mass-check with overlap analysis and throw away anything where the overlap is less than 50%. Manually reviewing the remaining (significantly) overlapping rules was fairly easy. The command I use is: perl ./overlap ../rules/tested/$testfile.ham.log ../rules/tested/$testfile.spam.log | grep -v mid= | awk ' NR == 1 { print } ; $2 + 0 == 1.000 && $3 + 0 >= 0.500 { print } ' >../rules/tested/$testfile.overlap.out

A ruleset in the "extra" set would have different criteria.
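
The filtering pipeline quoted above can be sketched in Python as follows. This is only an illustration, assuming the overlap script emits whitespace-separated fields with the two ratios in the second and third columns, as the awk expression implies. The function names and the sample lines are invented for the example and do not reproduce real overlap output.

```python
def to_number(field):
    """Approximate awk's `field + 0`: text that is not a number counts as 0."""
    try:
        return float(field)
    except ValueError:
        return 0.0


def filter_overlap(lines):
    """Keep the first line, plus lines whose 2nd field is 1.0 and 3rd is >= 0.5.

    Unlike the awk version, a first line that also matches is kept only once.
    """
    # Mirror `grep -v mid=`: drop lines containing "mid=" before numbering.
    candidates = [line for line in lines if "mid=" not in line]
    kept = []
    for number, line in enumerate(candidates, start=1):
        fields = line.split()
        second = to_number(fields[1]) if len(fields) > 1 else 0.0
        third = to_number(fields[2]) if len(fields) > 2 else 0.0
        if number == 1 or (second == 1.0 and third >= 0.5):
            kept.append(line)
    return kept


# Hypothetical input lines, made up purely to exercise the filter.
sample = [
    "header",
    "overlap 1.000 0.750 RULE_A RULE_B",
    "overlap 1.000 0.250 RULE_C RULE_D",
    "overlap 0.900 0.800 RULE_E RULE_F",
    "mid=123 1.000 0.900",
]
for line in filter_overlap(sample):
    print(line)
```

Only the header and the RULE_A/RULE_B line survive here: the other lines either fall below the 0.500 threshold, do not have 1.000 in the second field, or contain "mid=".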
[Spamassassin Wiki] Update of "RulesProjStreamlining" by BobMenschel [ In reply to ]
The following page has been changed by BobMenschel:
http://wiki.apache.org/spamassassin/RulesProjStreamlining

The comment on the change is:
added some ideas about "extra" rule set files

------------------------------------------------------------------------------
* TODO: criteria for overlap with existing rules? BobMenschel: The method I used to weed out SARE rules that overlapped 3.0.0 rules was to run a full mass-check with overlap analysis and throw away anything where the overlap is less than 50%. Manually reviewing the remaining (significantly) overlapping rules was fairly easy. The command I use is: perl ./overlap ../rules/tested/$testfile.ham.log ../rules/tested/$testfile.spam.log | grep -v mid= | awk ' NR == 1 { print } ; $2 + 0 == 1.000 && $3 + 0 >= 0.500 { print } ' >../rules/tested/$testfile.overlap.out

A ruleset in the "extra" set would have different criteria.
+ * DanielQuinlan suggested: The second, a collection of rules that do not qualify for rules/core. For example, SpamAssassin intentionally doesn't filter virus bounces (yet, at least), but there is a good virus bounce ruleset out there.
+ * BobMenschel: Similarly, an "extra" rule set might include rules that positively identify spam from spamware, but hit <0.25% of spam. Or an "aggressive" rule set might include rules that hit with an S/O of only 0.89, but push a lot of spam over the 5.0 threshold without significantly affecting ham.

We can also vote for extraordinary stuff that doesn't fit into those criteria...
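
As a rough illustration of how the thresholds discussed on the page might be applied, here is a small Python sketch. The 0.25% and 1.00% limits come from the page itself. The function names, the two-way "core"/"extra" split, and the S/O formula (spam hit rate divided by the sum of spam and ham hit rates) are assumptions made for the example, not something the page defines.

```python
CORE_MIN_TARGET_PCT = 0.25      # from the page: > 0.25% of target type hit
CORE_MAX_NON_TARGET_PCT = 1.00  # from the page: < 1.00% of non-target type hit


def meets_core_criteria(target_pct, non_target_pct):
    """Check the two hit-rate criteria (the "not too slow" one is not modelled)."""
    return (target_pct > CORE_MIN_TARGET_PCT
            and non_target_pct < CORE_MAX_NON_TARGET_PCT)


def s_o_ratio(spam_pct, ham_pct):
    """Assumed S/O definition: share of the combined hit rate that is spam."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.0


def suggest_set(spam_pct, ham_pct):
    """Hypothetical helper for a non-nice rule: spam is the target type."""
    return "core" if meets_core_criteria(spam_pct, ham_pct) else "extra"


# Made-up hit rates, in percent of messages hit.
print(suggest_set(2.0, 0.1))   # frequent on spam, rare on ham
print(suggest_set(0.1, 0.0))   # a spamware rule that hits < 0.25% of spam
print(s_o_ratio(3.0, 1.0))
```

A rule like the spamware example above fails the 0.25% criterion even with no ham hits at all, which is why it would land in a separate set under this sketch.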
[Spamassassin Wiki] Update of "RulesProjStreamlining" by BobMenschel [ In reply to ]
The following page has been changed by BobMenschel:
http://wiki.apache.org/spamassassin/RulesProjStreamlining

The comment on the change is:
Added Chris Santerre comment about aggressiveness

------------------------------------------------------------------------------
A ruleset in the "extra" set would have different criteria.
* DanielQuinlan suggested: The second, a collection of rules that do not qualify for rules/core. For example, SpamAssassin intentionally doesn't filter virus bounces (yet, at least), but there is a good virus bounce ruleset out there.
* BobMenschel: Similarly, an "extra" rule set might include rules that positively identify spam from spamware, but hit <0.25% of spam. Or an "aggressive" rule set might include rules that hit with an S/O of only 0.89, but push a lot of spam over the 5.0 threshold without significantly affecting ham.
+ * ChrisSanterre: Seeing this breakdown of dirs gave me an idea. Why not set the "aggressiveness" of SA for updates? Like how SARE has ruleset0.cf (no ham hits), ruleset1.cf (few ham, high S/O), etc., with each "level" of rule set file getting slightly more aggressive, risking (though not necessarily seeing) slightly higher FP rates. Users could set some config like supdate=(1-4), with 1 being the most conservative, and 4 being the most aggressive (with the knowledge that more aggressive *could* possibly cause more FPs).

We can also vote for extraordinary stuff that doesn't fit into those criteria...
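
The aggressiveness idea could be sketched in Python along these lines. Only the 1-4 range of supdate comes from the suggestion above. The file names, the mapping from level to files, and the assumption that a higher setting also pulls in every more conservative level are hypothetical choices made for the example.

```python
def select_rule_files(files_by_level, supdate):
    """Return the rule files for every level up to and including `supdate`.

    Assumes levels are cumulative; the suggestion itself does not say so.
    """
    if not 1 <= supdate <= 4:
        raise ValueError("supdate must be between 1 and 4")
    selected = []
    for level in sorted(files_by_level):
        if level <= supdate:
            selected.extend(files_by_level[level])
    return selected


# Hypothetical layout: one rule file per aggressiveness level.
files_by_level = {
    1: ["level1.cf"],  # most conservative
    2: ["level2.cf"],
    3: ["level3.cf"],
    4: ["level4.cf"],  # most aggressive, highest FP risk
}
print(select_rule_files(files_by_level, 2))
```

With supdate=2 this picks level1.cf and level2.cf and leaves the two more aggressive files out.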