Mailing List Archive

[Bug 3429] bayes scores
http://bugzilla.spamassassin.org/show_bug.cgi?id=3429

spamassassin@svp.zuzino.net.ru changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement
            Version|unspecified                 |2.63





------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
[Bug 3429] bayes scores [ In reply to ]
http://bugzilla.spamassassin.org/show_bug.cgi?id=3429





------- Additional Comments From quinlan@pathname.com 2004-05-26 13:35 -------
Subject: Re: New: bayes scores

> The following logarithmic rules would be more effective:
>
> score BAYES_007 from 0 to exp(-5)
> score BAYES_018 from exp(-5) to exp(-4)
> score BAYES_049 from exp(-4) to exp(-3)
> score BAYES_135 from exp(-3) to exp(-2)
> score BAYES_367 from exp(-2) to 1 - exp(-2)
> score BAYES_633
> score BAYES_865
> score BAYES_951
> score BAYES_982
> score BAYES_993 from 1-exp(-5) to 1

I like the concept. I pretty much ended up with an experimentally
derived ranging in 3.0 that is not too different. I'm willing to give
yours a look:

current:

body BAYES_00 eval:check_bayes('0.00', '0.01')
body BAYES_05 eval:check_bayes('0.01', '0.05')
body BAYES_10 eval:check_bayes('0.05', '0.20')
body BAYES_25 eval:check_bayes('0.20', '0.40')
body BAYES_50 eval:check_bayes('0.40', '0.60')
body BAYES_75 eval:check_bayes('0.60', '0.80')
body BAYES_90 eval:check_bayes('0.80', '0.95')
body BAYES_95 eval:check_bayes('0.95', '0.99')
body BAYES_99 eval:check_bayes('0.99', '1.00')

0.000000-0.010000 39.550
0.010000-0.050000 0.579
0.050000-0.200000 0.306 <- thin
0.200000-0.400000 0.318 <- thin
0.400000-0.600000 4.385
0.600000-0.800000 1.337
0.800000-0.950000 1.401
0.950000-0.990000 1.200
0.990000-1.000000 50.923

new

0.000000-0.006738 39.485
0.006738-0.018316 0.159 <- thin
0.018316-0.049787 0.441 <- thin
0.049787-0.135335 0.258 <- thin
0.135335-0.367879 0.352 <- thin
0.367879-0.632121 4.717
0.632121-0.864665 1.511
0.864665-0.950213 0.960
0.950213-0.981684 0.779
0.981684-0.993262 0.697
0.993262-1.000000 50.641

I think some of the ranges are too empty. Let's try:

0 to exp(-8)
exp(-8) to exp(-4)
exp(-4) to exp(-2)
exp(-2) to exp(-1)
exp(-1) to 1-exp(-1)
1-exp(-1) to 1-exp(-2)
1-exp(-2) to 1-exp(-4)
1-exp(-4) to 1-exp(-8)
1-exp(-8) to 1

0.000000-0.000335 39.046
0.000335-0.018316 0.598
0.018316-0.135335 0.699
0.135335-0.367879 0.352
0.367879-0.632121 4.717
0.632121-0.864665 1.511
0.864665-0.981684 1.739
0.981684-0.999665 2.450
0.999665-1.000000 48.888

That's better. Maybe we could...
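
The boundaries in the two proposed tables can be reproduced with a short script. This is a minimal sketch in Python, not code from SpamAssassin; the names `log_boundaries` and `log_ranges` are hypothetical. It assumes the scheme is symmetric, with exp(-k) as the boundaries below 0.5 and 1-exp(-k) as the boundaries above it.

```python
import math

def log_boundaries(exponents):
    """Return sorted range boundaries from 0 to 1 for the given exponents k."""
    # Boundaries below 0.5: exp(-k), smallest value (largest k) first.
    lower = [math.exp(-k) for k in sorted(exponents, reverse=True)]
    # Boundaries above 0.5: 1 - exp(-k), smallest value (smallest k) first.
    upper = [1 - math.exp(-k) for k in sorted(exponents)]
    return [0.0] + lower + upper + [1.0]

def log_ranges(exponents):
    """Return consecutive (low, high) pairs covering 0 to 1."""
    bounds = log_boundaries(exponents)
    return list(zip(bounds[:-1], bounds[1:]))

if __name__ == "__main__":
    for low, high in log_ranges([1, 2, 4, 8]):
        print(f"{low:.6f}-{high:.6f}")
```

With the exponents 1, 2, 4, 8 this yields the nine ranges of the second proposal; with 1 through 5 it yields the eleven ranges of the first.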





[Bug 3429] bayes scores [ In reply to ]
http://bugzilla.spamassassin.org/show_bug.cgi?id=3429





------- Additional Comments From spamassassin@svp.zuzino.net.ru 2004-05-26 15:03 -------
Let's use a Ratio*Wide criterion.

>new
>
> 0.000000-0.006738 39.485
> 0.006738-0.018316 0.159 <- thin
> 0.018316-0.049787 0.441 <- thin
> 0.049787-0.135335 0.258 <- thin
They are thin (small Ratio) but they can have strong Wide (popularity, frequency)

> 0.135335-0.367879 0.352 <- thin
> 0.367879-0.632121 4.717
> 0.632121-0.864665 1.511
> 0.864665-0.950213 0.960
> 0.950213-0.981684 0.779
> 0.981684-0.993262 0.697
> 0.993262-1.000000 50.641

>I think some of the ranges are too empty. Let's try:

>0 to exp(-8)
>exp(-8) to exp(-4)
>exp(-4) to exp(-2)
>exp(-2) to exp(-1)
>exp(-1) to 1-exp(-1)
>1-exp(-1) to 1-exp(-2)
>1-exp(-2) to 1-exp(-4)
>1-exp(-4) to 1-exp(-8)
>1-exp(-8) to 1

> 0.000000-0.000335 39.046
> 0.000335-0.018316 0.598
> 0.018316-0.135335 0.699
> 0.135335-0.367879 0.352
> 0.367879-0.632121 4.717
> 0.632121-0.864665 1.511
> 0.864665-0.981684 1.739
> 0.981684-0.999665 2.450
> 0.999665-1.000000 48.888

>That's better. Maybe we could...

I don't think that it's better.

We should sum Ratio * Wide over every row and search for the combination
that maximizes the sum.







[Bug 3429] bayes scores [ In reply to ]
http://bugzilla.spamassassin.org/show_bug.cgi?id=3429





------- Additional Comments From quinlan@pathname.com 2004-05-26 17:22 -------
Subject: Re: bayes scores

> They are thin (small Ratio) but they can have strong Wide (popularity,
> frequency)

I have no idea what you mean.

By "thin", I meant that not enough messages fall into the category, so
the score optimizer (GA or the perceptron) would have a hard time
determining the correct score for the rule.

It's better to have more messages (spam plus ham) falling into each
rule bucket.
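
A minimal sketch of that check in Python, using the percentages from the "current" table earlier in this thread. The function name `thin_buckets` and the 0.5% cutoff are assumptions made for illustration; the cutoff was picked only because it agrees with the "<- thin" marks in the first two tables, and it is not a threshold used by SpamAssassin.

```python
# (low, high, percent of messages) taken from the "current" table above.
CURRENT = [
    (0.00, 0.01, 39.550),
    (0.01, 0.05, 0.579),
    (0.05, 0.20, 0.306),
    (0.20, 0.40, 0.318),
    (0.40, 0.60, 4.385),
    (0.60, 0.80, 1.337),
    (0.80, 0.95, 1.401),
    (0.95, 0.99, 1.200),
    (0.99, 1.00, 50.923),
]

def thin_buckets(table, min_percent=0.5):
    """Return the (low, high) ranges holding fewer than min_percent of messages."""
    # min_percent is a hypothetical cutoff, not a SpamAssassin setting.
    return [(low, high) for low, high, percent in table if percent < min_percent]

if __name__ == "__main__":
    for low, high in thin_buckets(CURRENT):
        print(f"{low:.2f}-{high:.2f} is thin")
```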

> We should sum Ratio * Wide over every row and search for the combination
> that maximizes the sum.

I don't understand.





[Bug 3429] bayes scores [ In reply to ]
http://bugzilla.spamassassin.org/show_bug.cgi?id=3429





------- Additional Comments From spamassassin@svp.zuzino.net.ru 2004-05-26 23:11 -------
First, we should create mathematical criteria for rule quality and effectiveness.

(I suppose these criteria would be used to accept or reject new rules and to
remove old rules.)

The first and main criterion is the ham/spam ratio for whitelist rules (score<0)
and the spam/ham ratio for blacklist rules (score>0).

The second criterion is "popularity" or "Wide": ham/totalhams for whitelist
rules and spam/totalspams for blacklist rules.

The third criterion is correlation with other rules. For Bayes rules the
correlation is 0, which is good.

For better quality all coefficients must be > 200.

There are rules that have a large Ratio (big scores) but seldom fire.
There are rules that have a small Ratio (small scores) but fire on almost
every message, and there are many rules of this type.

We don't need rules with small scores that seldom fire.

I define the overall criterion as the product Ratio*Wide.
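
The proposed product could be computed as in the following Python sketch. The function name `ratio_times_wide` and its parameters are hypothetical, and counting zero misfires as one to avoid dividing by zero is an assumption; the message does not say how that case should be handled.

```python
def ratio_times_wide(spam_hits, ham_hits, total_spam, total_ham, blacklist=True):
    """Ratio*Wide for one rule, following the definitions in the message above."""
    if blacklist:
        # Blacklist rule: Ratio = spam/ham, Wide = spam/totalspams.
        hits, misfires, total = spam_hits, ham_hits, total_spam
    else:
        # Whitelist rule: Ratio = ham/spam, Wide = ham/totalhams.
        hits, misfires, total = ham_hits, spam_hits, total_ham
    # Assumption: treat zero misfires as one so the ratio stays finite.
    ratio = hits / max(misfires, 1)
    wide = hits / total
    return ratio * wide

if __name__ == "__main__":
    # Made-up counts: a spam rule hitting 500 of 1000 spams and 5 of 1000 hams.
    print(ratio_times_wide(500, 5, 1000, 1000))
```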







[Bug 3429] bayes scores [ In reply to ]
http://bugzilla.spamassassin.org/show_bug.cgi?id=3429





------- Additional Comments From spamassassin@svp.zuzino.net.ru 2004-05-26 23:46 -------
> so the score optimizer (GA or the perceptron)

Why do you use a genetic algorithm and a perceptron instead of some statistical rules?

For example

Score = log(spams/(20+hams_detected_as_spam)) for blacklisted rules.

(20 is a penalty for low accuracy data)
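
As a rough illustration, the formula can be written in Python as below. The function name `statistical_score` is hypothetical, and the natural logarithm is an assumption because the message does not give the base.

```python
import math

def statistical_score(spam_hits, ham_false_positives, penalty=20):
    """Score = log(spams / (20 + hams_detected_as_spam)) for a blacklist rule."""
    # Assumption: natural logarithm. A rule with no spam hits has no defined
    # score here (math.log raises ValueError for 0).
    return math.log(spam_hits / (penalty + ham_false_positives))

if __name__ == "__main__":
    # Made-up counts: 1000 spam hits and no false positives.
    print(round(statistical_score(1000, 0), 3))
```

Under this formula a rule with 20 spam hits and no false positives scores exactly 0, which shows how the penalty discounts rules backed by little data.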



[Bug 3429] bayes scores [ In reply to ]
http://bugzilla.spamassassin.org/show_bug.cgi?id=3429





------- Additional Comments From quinlan@pathname.com 2004-05-27 11:05 -------
Subject: Re: bayes scores

> First, we should create mathematical criteria for rule quality and
> effectiveness.
>
> (I suppose these criteria would be used to accept or reject new rules
> and to remove old rules.)
>
> The first and main criterion is the ham/spam ratio for whitelist rules
> (score<0) and the spam/ham ratio for blacklist rules (score>0).

We already have criteria.

We use the S/O ratio (spam/overall = spam/(ham+spam) using a 50/50
weighting of ham to spam so the weighting is constant). High is good
for spam rules. Low is good for ham rules.
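
One way to read the 50/50 weighting is that hit counts are first converted to per-corpus hit rates, so the result does not depend on how many hams and spams the corpus holds. The Python sketch below follows that reading; the function name `so_ratio` is hypothetical and this is not SpamAssassin's own code.

```python
def so_ratio(spam_hits, total_spam, ham_hits, total_ham):
    """S/O = spam / (ham + spam), computed on hit rates rather than raw counts."""
    spam_rate = spam_hits / total_spam   # SPAM%: fraction of spam the rule hits
    ham_rate = ham_hits / total_ham      # HAM%: fraction of ham the rule hits
    if spam_rate + ham_rate == 0:
        return None                      # the rule never fired; S/O is undefined
    return spam_rate / (spam_rate + ham_rate)

if __name__ == "__main__":
    # Made-up counts: 50 of 100 spams hit, 10 of 1000 hams hit.
    print(round(so_ratio(50, 100, 10, 1000), 3))
```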

We also use a RANK number which is a relative ranking system of each
rule compared to every other rule.

We also use the hit rate. SPAM% for spam rules and HAM% for ham rules.

And also we use overlap (or correlation) of rules to eliminate rules
that overlap with other rules too much.

At the end of the day, however, the only thing that matters is the score
generated by the perceptron. It does a better job than other simple
measures of setting scores because interactions between rules are too
complicated to represent with simple formulas.





[Bug 3429] bayes scores [ In reply to ]
http://bugzilla.spamassassin.org/show_bug.cgi?id=3429





------- Additional Comments From spamassassin@svp.zuzino.net.ru 2004-05-27 11:33 -------
>we already have criteria.

>We use the S/O ratio (spam/overall = spam/(ham+spam) using a 50/50
>weighting of ham to spam so the weighting is constant). High is good
>for spam rules. Low is good for ham rules.

>We also use a RANK number which is a relative ranking system of each
>rule compared to every other rule.

Is RANK a simple sort by the other criteria?

>We also use the hit rate. SPAM% for spam rules and HAM% for ham rules.

Thank you for the criteria!

We have three coefficients: S/O, HitRate and Overlap.

What about potential forging?

Let's talk about S/O and HitRate.

What should S/O and HitRate be for a new rule to be accepted?

Can we publish a formula such as S/O*HitRate > something for accepting new rules?

Where can users find the S/O and HitRate values for all rules?

On the page http://www.spamassassin.org/tests.html I see only scores.

Can we publish the corpus size, the number of hams/spams, and the S/O ratio and
HitRate for every rule on this page?


Thank you.









[Bug 3429] bayes scores [ In reply to ]
http://bugzilla.spamassassin.org/show_bug.cgi?id=3429





------- Additional Comments From spamassassin@svp.zuzino.net.ru 2004-05-27 11:48 -------
For Bayes rules we should select the intervals that maximize the sum of S/O*HitRate

BAYES_INTERVAL1 S/O*HitRate = B1
......
BAYES_INTERVALN S/O*HitRate = BN

Sum = B1 + ... + BN




[Bug 3429] bayes scores [ In reply to ]
http://bugzilla.spamassassin.org/show_bug.cgi?id=3429





------- Additional Comments From spamassassin@svp.zuzino.net.ru 2004-05-27 12:23 -------
For Bayes rules we should select the intervals
that maximize the sum of S/O*HitRate

BAYES_INTERVAL1 S/O*HitRate = B1
BAYES_INTERVAL2 S/O*HitRate = B2
......
BAYES_INTERVALN S/O*HitRate = BN

Sum = B1 + B2 + ... + BN, N = fixed

I think this sum will be at its maximum when B1 is about equal to B2, B2 to
B3, and so on up to BN.
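
The sum could be evaluated for one candidate set of intervals as in the Python sketch below. Several details are assumptions, since the messages do not spell them out: the function name `interval_objective` is hypothetical, an interval counts as a spam rule when its spam/overall ratio is at least 0.5, ham intervals use one minus that ratio with the ham hit rate, and empty intervals contribute nothing.

```python
def interval_objective(buckets, total_spam, total_ham):
    """Sum of ratio * hit rate over intervals given as (spam_count, ham_count)."""
    total = 0.0
    for spam_count, ham_count in buckets:
        spam_rate = spam_count / total_spam
        ham_rate = ham_count / total_ham
        if spam_rate + ham_rate == 0:
            continue                          # assumption: skip empty intervals
        s_o = spam_rate / (spam_rate + ham_rate)
        if s_o >= 0.5:
            total += s_o * spam_rate          # spam-side interval
        else:
            total += (1 - s_o) * ham_rate     # ham-side interval (assumption)
    return total

if __name__ == "__main__":
    # Made-up counts for three intervals over 100 spams and 100 hams.
    print(interval_objective([(0, 90), (10, 10), (90, 0)], 100, 100))
```

Comparing this value across candidate boundary sets is one way to carry out the search described above; the sketch does not test the claim about when the sum is largest.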

--------

Another idea is to convert the Bayes probability directly into a score:

Bayes_Score = Constant1*log(BAYES_PROBABILITY) if BAYES_PROBABILITY < 0.5
Bayes_Score = -Constant2*log(1-BAYES_PROBABILITY) if BAYES_PROBABILITY > 0.5

We would only need to select Constant1 and Constant2.
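
A minimal Python sketch of this mapping follows. The function name `bayes_score`, the default constants, the natural logarithm, the clamping of probabilities away from 0 and 1, and the score of 0 at exactly 0.5 are all assumptions; the formula above leaves those cases open.

```python
import math

def bayes_score(p, constant1=1.0, constant2=1.0, eps=1e-9):
    """Map a Bayes probability p in [0, 1] to a score using the two formulas."""
    # Assumption: clamp p so that log(0) is never evaluated.
    p = min(max(p, eps), 1 - eps)
    if p < 0.5:
        return constant1 * math.log(p)        # negative score: looks like ham
    if p > 0.5:
        return -constant2 * math.log(1 - p)   # positive score: looks like spam
    return 0.0                                # assumption: p == 0.5 is neutral

if __name__ == "__main__":
    for p in (0.001, 0.1, 0.5, 0.9, 0.999):
        print(p, round(bayes_score(p), 3))
```

As written, the score jumps from about -0.69 just below 0.5 to about +0.69 just above it, which follows directly from the two formulas.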






[Bug 3429] bayes scores [ In reply to ]
http://bugzilla.spamassassin.org/show_bug.cgi?id=3429

quinlan@pathname.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WORKSFORME



------- Additional Comments From quinlan@pathname.com 2004-05-27 13:37 -------
Closing as WORKSFORME, maybe we'll tweak the ranges, but I don't want to argue
with you about it.



