Mailing List Archive

[Bug 2853] Rewrite masses/ (in perl)
http://bugzilla.spamassassin.org/show_bug.cgi?id=2853

duncf@debian.org changed:

What       | Removed                                  | Added
-----------+------------------------------------------+--------------------------------------
CC         |                                          | spamassassin-dev@incubator.apache.org
AssignedTo | spamassassin-dev@incubator.apache.org    | duncf@debian.org
Summary    | Clean up packagability of evolve scripts | Rewrite masses/ (in perl)



------- Additional Comments From duncf@debian.org 2004-03-16 08:18 -------
I intend to re-write most of the masses/ stuff so that it's easier to use for
end users. Namely, I intend to rewrite perceptron in perl, and fix up the other
scripts so they don't use temp files all the time (there's a lot more reading
and writing than I think is necessary).

One major advantage this will have is that a compiler won't be necessary to
rescore everything.

If anyone can think of some pitfalls I need to think about, I'd appreciate it if
you could let me know. Or, if you think what I'm doing is useless... tell me. My
goal is to create a spamassassin-tools/-utils package with this sort of thing so
people can rescore themselves.

One of the biggest weaknesses of SpamAssassin is that scores are hard to
determine and get stale quickly after a release. By allowing users to make their
own scores, we can prolong the "shelf life" of older versions of SpamAssassin.
This, combined with plugins, should be quite useful to users.

Ideally, SpamAssassin 3.0.x will be in Debian sarge. Since Debian releases so
infrequently, it is important that this last quite a while.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
You are on the CC list for the bug, or are watching someone who is.

[Bug 2853] Rewrite masses/ (in perl)

------- Additional Comments From jm@jmason.org 2004-03-16 11:13 -------
Sounds generally like a good idea -- in particular, I'd suggest making
mass-check easier to use. I'd like to see mass-check generating one
output file, e.g. by adding a "ham"/"spam" indicator to the start of the line
instead of keeping each in separate output files.
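
A minimal sketch of the single-output-file idea, in Python for illustration only. The `ham`/`spam` prefix and the function names `label_lines`, `merge_logs` and `parse_labeled` are assumptions; the rest of each line is treated as opaque, and this is not mass-check's actual log format.

```python
from typing import Iterable, Iterator, List, Tuple


def label_lines(lines: Iterable[str], label: str) -> Iterator[str]:
    """Prefix every non-empty result line with its class label."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            yield f"{label} {line}"


def merge_logs(ham_lines: Iterable[str], spam_lines: Iterable[str]) -> List[str]:
    """Combine two per-class logs into one labeled log."""
    merged = list(label_lines(ham_lines, "ham"))
    merged.extend(label_lines(spam_lines, "spam"))
    return merged


def parse_labeled(line: str) -> Tuple[str, str]:
    """Split a labeled line back into (label, original line)."""
    label, _, rest = line.partition(" ")
    if label not in ("ham", "spam"):
        raise ValueError(f"unknown label: {label!r}")
    return label, rest
```

Downstream scripts would then read one file and branch on the first field instead of opening two files.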

Also the ancillary scripts -- fp-fn-statistics, hit-frequencies, etc. -- are a
little complicated, and all of them make too many assumptions about their
location, e.g. assuming that ../rules is the rules dir.
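
One way to drop the location assumption, sketched in Python for illustration: take the rules directory from an explicit argument, then from an environment variable, and only then fall back to `../rules`. The variable name `SA_RULES_DIR` and the function `resolve_rules_dir` are hypothetical; none of the masses/ scripts is known to support them.

```python
import os
from typing import Mapping, Optional


def resolve_rules_dir(
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
    default: str = "../rules",
) -> str:
    """Pick the rules directory without assuming where the script lives."""
    if cli_value:
        # An explicit command-line value always wins.
        return cli_value
    env_value = env.get("SA_RULES_DIR")  # hypothetical variable name
    if env_value:
        return env_value
    # Last resort: the relative path the scripts assume today.
    return default
```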

However, rewriting the perceptron in perl gets -1 from me.

IMO the C nature of the perceptron is not a big problem. Pretty much every
Debian machine will have a C compiler available. Also, the amount of data (logs
and scores) in RAM needs a good, compact and fast representation, and C works
very very well for this; probably a lot better than perl can do without quite a
bit of work.

What is the problem, however, is that it currently requires a rebuild to include
the hits and scores from the C files generated from logs-to-c. That should
probably be fixed, so that the perceptron can be distributed as a binary and
read that data at runtime. (as you said)
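
To illustrate reading hits at runtime instead of compiling them in, here is a small Python sketch. It assumes a hypothetical log format of one message per line, `ham` or `spam` followed by a comma-separated list of rule names, and uses the textbook perceptron update rule. It is not the algorithm or the data format used by masses/perceptron.

```python
from typing import Dict, List, Tuple

Example = Tuple[bool, List[str]]  # (is_spam, names of rules that hit)


def parse_log(text: str) -> List[Example]:
    """Parse 'label RULE_A,RULE_B' lines (assumed format) at runtime."""
    examples: List[Example] = []
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        label = parts[0]
        if label not in ("ham", "spam"):
            raise ValueError(f"unknown label: {label!r}")
        hits: List[str] = []
        if len(parts) > 1:
            hits = [h.strip() for h in parts[1].split(",") if h.strip()]
        examples.append((label == "spam", hits))
    return examples


def train(
    examples: List[Example],
    epochs: int = 20,
    rate: float = 0.1,
    threshold: float = 5.0,
) -> Dict[str, float]:
    """Classic perceptron: nudge the scores of hit rules on each mistake."""
    weights = {h: 0.0 for _, hits in examples for h in hits}
    for _ in range(epochs):
        for is_spam, hits in examples:
            total = sum(weights[h] for h in hits)
            predicted_spam = total >= threshold
            if predicted_spam != is_spam:
                step = rate if is_spam else -rate
                for h in hits:
                    weights[h] += step
    return weights
```

Because the data arrives as text, the same program can be rerun on a new corpus without a rebuild.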

IMO, the biggest problem for users of a system like this will be in corpus
management and mass-checking. The perceptron et al. aren't too hard compared to
that.




[Bug 2853] Rewrite masses/ (in perl)

------- Additional Comments From felicity@kluge.net 2004-03-16 11:47 -------
Subject: Re: Rewrite masses/ (in perl)

On Tue, Mar 16, 2004 at 11:13:40AM -0800, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> However, rewriting the perceptron in perl gets -1 from me.

Um... Yeah, ditto to everything JM said, including the perceptron
rewrite -1... (stupid stealing my thunder ... <G>)





[Bug 2853] Rewrite masses/ (in perl)

------- Additional Comments From jm@jmason.org 2004-03-16 12:26 -------
ha! need to comment faster ;)



[Bug 2853] Rewrite masses/ (in perl)

------- Additional Comments From duncf@debian.org 2004-03-16 13:23 -------
Subject: Re: Rewrite masses/ (in perl)

As far as the perceptron in perl thing goes, I'm probably still going
to try it to see how slow it is.

My first goal will be to amalgamate many of those helper scripts --
hit-frequencies, score-ranges-from-freqs, logs-to-c, etc.




Re: [Bug 2853] Rewrite masses/ (in perl)
> http://bugzilla.spamassassin.org/show_bug.cgi?id=2853
> duncf@debian.org changed:

> I intend to re-write most of the masses/ stuff so that it's easier to
> use for end users. Namely, I intend to rewrite perceptron in perl, and
> fix up the other scripts so they don't use temp files all the time
> (there's a lot more reading and writing than I think is necessary).

Thank you!

> If anyone can think of some pitfalls I need to think about, I'd
> appreciate it if you could let me know. Or, if you think what I'm doing
> is useless... tell me. My goal is to create a spamassassin-tools/-utils
> package with this sort of thing so people can rescore themselves.

Useful, appreciated, and very beneficial to those of us interested and
(otherwise) able to run against our own corpora.

I've managed to make mass-check work for me, but it's a multiple step
process, with a structure I had to create by trial and error.
Improvements will make it much easier to move from version to version.

> One of the biggest weaknesses of SpamAssassin is that scores are hard
> to determine and get stale quickly after a release. By allowing users
> to make their own scores, we can prolong the "shelf life" of older
> versions of SpamAssassin. This, combined with plugins, should be quite
> useful to users.

Excellent! As one of the contributors to the Rules Emporium, I know that
one of our biggest challenges is the determination of workable first-pass
scores. Accepting that people may need to change the scores we develop as
they adopt our rules and/or rule sets, it's still important that we score
high enough to show whether the rules are useful, and low enough to avoid
FPs. Allowing automatic realignment of scores will be a big help to this
process, and to the accuracy of SpamAssassin itself!

Bob Menschel

[Bug 2853] Rewrite masses/ (in perl)

------- Additional Comments From gary@intrepid.com 2004-03-17 08:31 -------
Subject: RE: Rewrite masses/ (in perl)


My RFE, bug 3096, probably relates to this one, as well.






[Bug 2853] Rewrite masses/ (in perl)

jm@jmason.org changed:

What              | Removed | Added
------------------+---------+------
BugsThisDependsOn |         | 3096




