Mailing List Archive

still cant expire bayes tokens
I have been trying for well over a month now to expire tokens, and it
still just won't happen. The expiry thinks that I have had my max tokens
going way back in time, which just isn't true. Any advice here would be
great.

[root@nydb1 adam]# sa-learn --dump magic
0.000 0 2 0 non-token data: bayes db version
0.000 0 115103 0 non-token data: nspam
0.000 0 38288 0 non-token data: nham
0.000 0 2588992 0 non-token data: ntokens
0.000 0 0 0 non-token data: oldest atime
0.000 0 1134906269 0 non-token data: newest atime
0.000 0 1076421666 0 non-token data: last journal sync atime
0.000 0 1076423258 0 non-token data: last expiry atime
0.000 0 1382400 0 non-token data: last expire atime delta
0.000 0 1019 0 non-token data: last expire reduction count

-----
And the expiry run looks like this:

debug: bayes: 9510 tie-ing to DB file R/W /share/spam/bayes_toks
debug: bayes: 9510 tie-ing to DB file R/W /share/spam/bayes_seen
debug: bayes: found bayes db version 2
.................
synced Bayes databases from journal in 87 seconds: 53467 unique entries
(83965 total entries)
debug: bayes: expiry check keep size, 75% of max: 750000
debug: bayes: token count: 2588992, final goal reduction size: 1838992
debug: bayes: First pass? Current: 1076421753, Last: 1076394691, atime:
1382400, count: 1019, newdelta: 765, ratio: 1804.70264965653
debug: bayes: Can't use estimation method for expiry, something fishy,
calculating optimal atime delta (first pass)

debug: bayes: atime token reduction
debug: bayes: ======== ===============
debug: bayes: 43200 2595384
debug: bayes: 86400 2595384
debug: bayes: 172800 2595384
debug: bayes: 345600 2595384
debug: bayes: 691200 2595384
debug: bayes: 1382400 2595384
debug: bayes: 2764800 2595384
debug: bayes: 5529600 2595384
debug: bayes: 11059200 2595384
debug: bayes: 22118400 2595384
debug: bayes: couldn't find a good delta atime, need more token
difference, skipping expire.
debug: Syncing complete.
debug: bayes: 9510 untie-ing
debug: bayes: 9510 untie-ing db_toks
debug: bayes: 9510 untie-ing db_seen
debug: bayes: files locked, now unlocking lock
unlock: 9510 unlink failed: /share/spam/bayes.lock
debug: unlock: 9510 unlink /share/spam/bayes.lock
Re: still cant expire bayes tokens [ In reply to ]
At 09:31 AM 2/10/2004, Adam Denenberg wrote:

<snip>

>[root@nydb1 adam]# sa-learn --dump magic
>0.000 0 0 0 non-token data: oldest atime
>0.000 0 1134906269 0 non-token data: newest atime


<snip>


>debug: bayes: First pass? Current: 1076421753, Last: 1076394691, atime:
>1382400, count: 1019, newdelta: 765, ratio: 1804.70264965653
>debug: bayes: Can't use estimation method for expiry, something fishy,
>calculating optimal atime delta (first pass)

The first thing that REALLY jumps out at me, is that your newest token
atime is ahead of the current atime... did you have some kind of massive
clock messup on this system? Theoretically you shouldn't ever have
futuristic tokens.. SpamAssassin isn't a psychic (yet).
Re: still cant expire bayes tokens [ In reply to ]
No clock problems that I know of. I run ntp on all my servers and just
double-checked to confirm that. All my boxes are in sync to the second.

adam

On Tue, 2004-02-10 at 10:53, Matt Kettler wrote:
> <snip>
Re: still cant expire bayes tokens [ In reply to ]
On Tue, Feb 10, 2004 at 10:53:15AM -0500, Matt Kettler wrote:
> The first thing that REALLY jumps out at me, is that your newest token
> atime is ahead of the current atime... did you have some kind of massive
> clock messup on this system? Theoretically you shouldn't ever have
> futuristic tokens.. SpamAssassin isn't a psychic (yet).

The atimes in the Bayes DB are based on the dates found in the message.
It should use the date from the top Received header, but can fall
through to other headers... FYI. :)

--
Randomly Generated Tagline:
"> I'm an idiot.. At least this [bug] took about 5 minutes to find..
We need to find some new terms to describe the rest of us mere mortals
then." - Craig Schlenter in response to Linus Torvalds about a kernel bug.
Re: still cant expire bayes tokens [ In reply to ]
On Tue, Feb 10, 2004 at 09:31:34AM -0500, Adam Denenberg wrote:
> debug: bayes: expiry check keep size, 75% of max: 750000

Ok, so your max size is 1_000_000 tokens.

> debug: bayes: token count: 2588992, final goal reduction size: 1838992

Your DB says you have ~2.6m tokens, so to get to the goal of 750k tokens,
you need to remove ~1.8m tokens.
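The goal figure is straight arithmetic on the numbers above: the keep size (75% of the 1,000,000 max) subtracted from the current token count:

```shell
# final goal reduction size = token count - keep size
echo $((2588992 - 750000))
# 1838992
```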

> debug: bayes: First pass? Current: 1076421753, Last: 1076394691, atime:
> 1382400, count: 1019, newdelta: 765, ratio: 1804.70264965653

Leaving the other values aside, the ratio is way off, so the estimation method for expiry isn't going to work.

> debug: bayes: atime token reduction
> debug: bayes: ======== ===============
> debug: bayes: 43200 2595384
> debug: bayes: 86400 2595384
> debug: bayes: 172800 2595384
> debug: bayes: 345600 2595384
> debug: bayes: 691200 2595384
> debug: bayes: 1382400 2595384
> debug: bayes: 2764800 2595384
> debug: bayes: 5529600 2595384
> debug: bayes: 11059200 2595384
> debug: bayes: 22118400 2595384

The interesting thing here is that you only have 2588992 tokens in the DB
(magic token), but the atime/reduction chart shows 2595384 being removed
(an actual loop through the DB tokens)... What's up with that?

What the above chart says is that no matter what atime delta you use,
you'll be expiring too many tokens. The atime deltas here are computed
as newest_atime - token_atime. Since your newest atime is far, far in
the future, as Matt already pointed out (1134906269 == Sun Dec 18
06:44:29 2005 EST), all of your tokens are "older" than 256 days (the
last line in the chart).
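The deltas in the chart double from 43200 seconds (12 hours); the last one, 22118400 seconds, is where the 256-day figure comes from:

```shell
# Convert the largest atime delta from seconds to days.
echo $((22118400 / 86400))
# 256
```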

So ... I would do two things. 1) Fix the DB. Unless you're _very sure_
about the internal DB format, "rm bayes_*". If you are comfortable with
the format, do a db_dump, edit the output to change the "future" token
atimes to something more reasonable, modify the newest atime magic
token, and do a db_load. 2) If you save your messages, find the one that
caused the problem and attach it to the ticket specified below...

FYI: For 3.0.0, I just put in some code that stops this kind of thing from
happening (if the calculated message atime is determined to be more than
1 day in the future, it just uses the current time() value instead).
If a 2.64 release happens, the fix will probably go in there too:
http://bugzilla.spamassassin.org/show_bug.cgi?id=3025

--
Randomly Generated Tagline:
"If you think nobody cares if you're alive, try missing a couple of car
payments." - Zen Musings
Re: still cant expire bayes tokens [ In reply to ]
Thanks Theo. I would love to send my bayes_toks through db_dump and fix
the "broken" records. However, I am not familiar with the format. Is
there an existing script, or a site, that would allow me to properly
remove entries with bad atime values?

thanks
adam

On Tue, 2004-02-10 at 11:44, Theo Van Dinter wrote:
> <snip>
Re: still cant expire bayes tokens [ In reply to ]
Sorry for the repost, but this seems like my only chance at a working
expiry. Does anybody have a link to some information on how to properly
manipulate the bayes_toks file from a db_dump, to remove the bogus atime
entries so I can put it back with db_load for a fixed Bayes DB?

thanks again.
adam

On Tue, 2004-02-10 at 12:08, Adam Denenberg wrote:
> <snip>
Re: still cant expire bayes tokens [ In reply to ]
On Tue, Feb 10, 2004 at 12:08:56PM -0500, Adam Denenberg wrote:
> thanks Theo. I would love to send my bayes_toks thru db_dump and fix the
> "broken" records. However i am not familiar with the format. is there
> an existing script, or a site that will allow me to properly remove
> entries with bad atime values?

Not that I know of. If you're really keen on trying this, here are the
basics... Some of this should probably be documented somewhere besides
the code anyway...:

# stop spamassassin ...
# make a backup copy of bayes_toks!

$ db_dump -p -f out .spamassassin/bayes_toks
$ sa-learn --dump data | perl -nle 'print if ( (split)[3] > time )' > out2

out2 now contains the list of tokens you need to fix. Go through each
one in "out" and fix it. For instance, assume "anticipate" was a token
that needed fixing; in "out" you'd see something like:

anticipate
\00\fa\00\00\00\e0\00\00\00l\87*@

That's 13 bytes, which means it's the CVVV format. If it were 5 bytes,
it would be the CV format, FYI. Now you want to run the data through
unpack to get the actual values out:

$ perl -e 'print join("\n", unpack("CVVV", "\x00\xfa\x00\x00\x00\xe0\x00\x00\x00l\x87*@"),"")'
0
250
224
1076529004

There's probably an easier way to do that, but... Perl expects hex
values in "\x##" format, while db_dump outputs the "\##" format, so you
have to put the "x" in appropriately there.

The first three numbers you don't care about (for this fix), but they're
the packing format (0 for CVVV, or 192 for CV), the # of spam matches,
and the # of ham matches. The fourth number is the atime. Change the
atime to whatever you want; I'd choose the current time() value (use the
same one for all of the tokens you fix...). In my case, I'm going to use
1076685969 just as an example. Now you get to put it back in the right
format...

$ perl -e 'print map { sprintf "\\%02x",$_ } unpack("C13", pack("CVVV", 0, 250, 224, 1076685969));print "\n"'
\00\fa\00\00\00\e0\00\00\00\91\ec\2c\40

Take that and put it back into the "out" file in the appropriate place.
Now repeat for the other tokens. At the end, find the newest-atime magic
token, "\0d\01\07\09\03NEWESTAGE", and change its value (it's just a
string) from the current one to whatever atime you used, 1076685969 in
this case.

$ db_load -f out .spamassassin/bayes_toks

You can now do a "sa-learn --dump" to make sure it all looks right...:

[...]
0.000 0 1076685969 0 non-token data: newest atime
[...]
0.158 250 224 1076685969 anticipate
[...]


Now, here's the fun part: if you have tokens in CV format (which is very
likely in your case, since the ham/spam counts are very likely to be
both < 8), this whole thing becomes a lot more complicated to do by
hand... So let's switch to the simpler, but uglier, way of doing things:

$ perl -MMail::SpamAssassin::BayesStore -e 'print join("\n", \
Mail::SpamAssassin::BayesStore::tok_unpack({db_version => 2}, \
"\x00\xfa\x00\x00\x00\xe0\x00\x00\x00l\x87*@"),"")'
250
224
1076529004

$ perl -MMail::SpamAssassin::BayesStore -e 'print map { sprintf "\\%02x",$_ } \
unpack("C*", Mail::SpamAssassin::BayesStore->tok_pack(250, 224, 1076529004));print "\n"'
\00\fa\00\00\00\e0\00\00\00\6c\87\2a\40

This code has the benefit of working for both CVVV and CV formats...

For example, here's a 5-byte CV-format value, "\xd0\x1fU(@":
2
0
1076385055

[...]

\d0\1f\55\28\40



Please note that by editing your DB by hand, any future issues that
arise will be blamed on the editing. aka: no support.

--
Randomly Generated Tagline:
"The programmer needs the machine to run long enough to destroy it."
- Prof. Michaelson
Re: still cant expire bayes tokens [ In reply to ]
Theo Van Dinter wrote on Tue, 10 Feb 2004 11:44:24 -0500:

> FYI: For 3.0.0, I just put in some code that stops this kind of thing from
> happening (if the calculated message atime is determined to be more than
> 1 day in the future, it just uses the current time() value instead).
> If a 2.64 release happens, the fix will probably go in there too:
> http://bugzilla.spamassassin.org/show_bug.cgi?id=3025
>

I think I'm hitting the same problem:

debug: bayes: found bayes db version 2
debug: bayes: expiry check keep size, 75% of max: 112500
debug: bayes: token count: 638040, final goal reduction size: 525540
debug: bayes: First pass? Current: 1076602270, Last: 1076601983, atime: 0,
count: 0, newdelta: 0, ratio: 0
debug: bayes: Can't use estimation method for expiry, something fishy,
calculating optimal atime delta (first pass)

If I understand correctly, the database should be kept to around 112,500
tokens (75% of the max, which must be the 2.63 default), so expiry has
been failing for quite some time if it's now at over 600,000.

The token reduction count stays at

debug: bayes: 43200 637929
debug: bayes: 22118400 637929

so, it would expire almost everything.
What does this mean? That most tokens are within the same time range, or
that most tokens are way too old? How can I figure this out?
This is a DB which started around summer/autumn last year with some
learning, and it has been growing continually since then, with around
17,000 spam and 3,000 ham messages at the moment. I'm not sure what the
next part means; does it help to better understand the above?

0.000 0 -17982 0 non-token data: newest atime
0.000 0 1076601982 0 non-token data: last journal sync
atime
0.000 0 1076602431 0 non-token data: last expiry atime

I "fixed" this now by setting
bayes_expiry_max_db_size 1000000

Is there a way I can sanitize the db? I don't really want to throw it away.

The interesting thing is that I have this problem on two machines, but
it was detectable on only one of them. We use a milter (MailCorral)
which hands the mail over to spamd; the timeout for that is 60 seconds.
I didn't notice any increase in spam or other problems on that machine.

Since MailCorral isn't actively developed anymore, I'm looking for
alternatives, and set up MailScanner + SA on another machine, copied the
old Bayes and other SA stuff over, and keep sending a small portion of
the spamtrap spam we get directly to that machine. Almost immediately I
had a lot of SA timeouts, and searching the list I finally found the
articles about the "fishy" atime delta. MailScanner uses a smaller
timeout by default, I think 20 seconds or so; that's still unchanged.

So one could imagine that the problem wasn't detected on the first
machine because the longer timeout allowed the hanging expiry to finish.
However, this doesn't seem to be the case: most of the time the spamd
result comes back after a few seconds, and I'm not seeing many, if any,
spamd timeouts in the logs of the first machine. Is there something
different between spamd and spamassassin, such that the problem would
exist on both but only become visible with spamassassin and not with
spamd? For example, does spamd attempt the auto-expire not with every
message but just once a day, while it happens with each invocation of
spamassassin?


Kai

--

Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org
Re: still cant expire bayes tokens [ In reply to ]
On Thu, Feb 12, 2004 at 05:41:07PM +0100, Kai Schaetzl wrote:
> debug: bayes: 43200 637929
> debug: bayes: 22118400 637929
>
> so, it would expire almost everything.

yep.

> What does this mean? That most tokens are within the same time range or that
> most tokens are way too old ??? How can I figure this out?

Well, the data listing there tells you. The tokens in your DB are all over 256 days old.

> 0.000 0 -17982 0 non-token data: newest atime

That's not possible.

> Is there a way I can sanitize the db? I don't really want to throw it away.

You'd have to figure out what the problem is first. The above indicates
something is really messed up for you; you can't have a negative
"newest atime" value.

--
Randomly Generated Tagline:
#define SIGILL 6 /* blech */
-- Larry Wall in perl.c from the perl source code
Re: still cant expire bayes tokens [ In reply to ]
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Theo Van Dinter writes:
> On Tue, Feb 10, 2004 at 12:08:56PM -0500, Adam Denenberg wrote:
> > thanks Theo. I would love to send my bayes_toks thru db_dump and fix the
> > "broken" records. However i am not familiar with the format. is there
> > an existing script, or a site that will allow me to properly remove
> > entries with bad atime values?
>
> Not that I know of. If you're really keen on trying this, here's the
> basics... Some of this probably should be documented somewhere besides
> the code anyway ...:

(cough) wiki.SpamAssassin.org ;)

- --j.

> <snip>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAK8bLQTcbUG5Y7woRAkwgAKCrO/28yt1JpwdVzbD8IYXm9U5D8gCgzd8W
JFuRyMgL4Jb2tChTydnqn+g=
=b0an
-----END PGP SIGNATURE-----
Re: still cant expire bayes tokens [ In reply to ]
On Thu, Feb 12, 2004 at 10:32:44AM -0800, Justin Mason wrote:
> (cough) wiki.SpamAssassin.org ;)

(cough) I hope someone else posts it for me ... ;)

--
Randomly Generated Tagline:
And in the limiting case where the optimizer is completely broken because
it's not implemented yet, we get to work around that too. Optionally...
-- Larry Wall in <20031217195433.GB31020@wall.org>
Re: still cant expire bayes tokens [ In reply to ]
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Justin Mason writes:
>Theo Van Dinter writes:
>> On Tue, Feb 10, 2004 at 12:08:56PM -0500, Adam Denenberg wrote:
>> > thanks Theo. I would love to send my bayes_toks thru db_dump and fix the
>> > "broken" records. However i am not familiar with the format. is there
>> > an existing script, or a site that will allow me to properly remove
>> > entries with bad atime values?
>>
>> Not that I know of. If you're really keen on trying this, here's the
>> basics... Some of this probably should be documented somewhere besides
>> the code anyway ...:
>
>(cough) wiki.SpamAssassin.org ;)

BTW, having said that, I'd reckon it might be worthwhile just providing
a tool that'll take "sa-learn --dump" output and reload it into a db.
Much easier than mucking with the binary data...

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAK86xQTcbUG5Y7woRAj2vAKCcf4cL1RWpsKNqAcP/FsSUFN6rNQCcCCdQ
oEPIX+XB8/QHJu38/1UONGM=
=hZBH
-----END PGP SIGNATURE-----
Re: still cant expire bayes tokens [ In reply to ]
This would actually be great. If someone wants to help me work on a tool
like this, that would be really helpful; I'm not sure how hard it would
be. I think a lot of people could benefit from a Bayes "dump and load"
tool.

adam

On Thu, 2004-02-12 at 14:06, Justin Mason wrote:
> <snip>
Re: still cant expire bayes tokens [ In reply to ]
On Thu, Feb 12, 2004 at 11:06:25AM -0800, Justin Mason wrote:
> BTW, having said that, I'd reckon it might be worthwhile just providing
> a tool that'll take "sa-learn --dump" output and reload it into a db.
> Much easier than mucking with the binary data...

Yeah, I was thinking of a similar tool for letting people merge 2 DBs
together since that seems to come up occasionally. I haven't really
considered it a high priority though.

The whole thing would be pretty simple I'd say. Something like:

sa-learn --dump > output
sa-learn --loaddb output
sa-learn --mergedb output

Where loaddb would overwrite, and mergedb would, well, merge. ;)


This then brings up the question of the seen DB and whether that should
be dump/merge-able, if it should expire, etc, etc.

--
Randomly Generated Tagline:
"It's a chicken finger device." - Theo, looking at entree
Re: still cant expire bayes tokens [ In reply to ]
On Thu, Feb 12, 2004 at 04:14:52PM -0500, Theo Van Dinter wrote:
> On Thu, Feb 12, 2004 at 11:06:25AM -0800, Justin Mason wrote:
> > BTW, having said that, I'd reckon it might be worthwhile just providing
> > a tool that'll take "sa-learn --dump" output and reload it into a db.
> > Much easier than mucking with the binary data...
>

I worked up a quick script and sent it to Adam to try out; it would be
trivial, and actually a little smaller, to fold it into sa-learn. I'll
work up a patch.

> Yeah, I was thinking of a similar tool for letting people merge 2 DBs
> together since that seems to come up occasionally. I haven't really
> considered it a high priority though.
>
> The whole thing would be pretty simple I'd say. Something like:
>
> sa-learn --dump > output
> sa-learn --loaddb output
> sa-learn --mergedb output
>
> Where loaddb would overwrite, and mergedb would, well, merge. ;)
>
>
> This then brings up the question of the seen DB and whether that should
> be dump/merge-able, if it should expire, etc, etc.


Here is my problem with merging two databases (maybe my concerns are
unfounded and it doesn't matter): it basically has to do with
collisions. If you are merging two databases that may have "learned"
from the same data, then you could skew your results. It would be
similar to learning the same message twice. One or two messages
probably won't matter, but if it's a good number, then you basically
double the counts on those tokens. Like I said, perhaps this isn't
such a big deal. Now, if we stored which tokens were associated with
which message IDs, then it would be much easier.

Michael
Re: still cant expire bayes tokens [ In reply to ]
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Michael Parker writes:
>On Thu, Feb 12, 2004 at 04:14:52PM -0500, Theo Van Dinter wrote:
>> On Thu, Feb 12, 2004 at 11:06:25AM -0800, Justin Mason wrote:
>> > BTW, having said that, I'd reckon it might be worthwhile just providing
>> > a tool that'll take "sa-learn --dump" output and reload it into a db.
>> > Much easier than mucking with the binary data...
>>
>
>I worked up a quick script and sent it to Adam to try out, would be
>trivial and actually a little smaller to fold it into sa-learn. I'll
>work up a patch.
>
>> Yeah, I was thinking of a similar tool for letting people merge 2 DBs
>> together since that seems to come up occasionally. I haven't really
>> considered it a high priority though.
>>
>> The whole thing would be pretty simple I'd say. Something like:
>>
>> sa-learn --dump > output
>> sa-learn --loaddb output
>> sa-learn --mergedb output
>>
>> Where loaddb would overwrite, and mergedb would, well, merge. ;)
>>
>>
>> This then brings up the question of the seen DB and whether that should
>> be dump/merge-able, if it should expire, etc, etc.
>
>
>Here is my problem with merging two databases, maybe my concerns are
>unfounded and it doesn't matter. It basically has to do with
>collisions. If you are merging two databases that may have "learned"
>from the same data then you could skew your results. It would be
>similar to learning the same message twice. One or two messages
>probably won't matter, but if it's a good number, then you basically
>double the numbers on those tokens. Like I said, perhaps this isn't
>such a big deal.

yes -- this is an "emergency use only" tool, and that issue has to
be noted very clearly.

> Now, if we stored which tokens were associated with
>which message ids, then it would be much easier.

But a bigger DB, probably :(

aside: here's a possibly-good way to do this.

Basically, go back to a message counter, so the first msg learned is 1,
second 2, etc. Store msg reception time -- as used in expiry -- in a
per-message db, possibly db_seen.

Then use the message counter in the per-token db, instead of msg reception
time, and when doing expiry, expire whole messages -- including
decrementing all of the tokens that were in the message being expired.

As far as I can see, that would not be a big db bloat issue.
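The scheme above could be sketched roughly like this (a toy Python illustration; `ToyBayesStore` and all the names in it are made up for the sake of the sketch, this is not SpamAssassin code):

```python
import time

class ToyBayesStore:
    """Toy sketch of message-counter-based expiry: each learned message
    gets a counter id, its reception time lives in a per-message db,
    and expiry removes whole messages, decrementing their tokens."""

    def __init__(self):
        self.next_msg_id = 0
        self.msg_atime = {}    # msg_id -> reception time (like db_seen)
        self.msg_tokens = {}   # msg_id -> tokens learned from that message
        self.token_count = {}  # token -> count

    def learn(self, tokens, atime=None):
        self.next_msg_id += 1
        msg_id = self.next_msg_id
        self.msg_atime[msg_id] = int(time.time()) if atime is None else atime
        self.msg_tokens[msg_id] = list(tokens)
        for tok in tokens:
            self.token_count[tok] = self.token_count.get(tok, 0) + 1
        return msg_id

    def expire(self, cutoff_atime):
        # Expire whole messages older than the cutoff, decrementing
        # every token each expired message contributed.
        old = [m for m, a in self.msg_atime.items() if a < cutoff_atime]
        for msg_id in old:
            for tok in self.msg_tokens.pop(msg_id):
                self.token_count[tok] -= 1
                if self.token_count[tok] == 0:
                    del self.token_count[tok]
            del self.msg_atime[msg_id]
```

The per-token side only ever stores counts, so the extra storage is one atime per message rather than one per token.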

--j.
Re: still cant expire bayes tokens
Theo Van Dinter wrote on Thu, 12 Feb 2004 13:11:55 -0500:

> > What does this mean? That most tokens are within the same time range
> > or that most tokens are way too old? How can I figure this out?
>
> Well, the data listing there tells you.

Tells you, not me ;-) I can read that stuff to a certain extent, but only
understand portions of it.

> The tokens in your DB are all
> over 256 days old.

This is simply "impossible" because auto-learned items are added daily and
I also learn from a spam mailbox sometimes. However, it's possible that a
great portion of the db is quite old considering the fact that it didn't
expire for a while and we learned several thousand spam and ham mails at
the beginning.

>
> > 0.000 0 -17982 0 non-token data: newest atime
>
> That's not possible.

I have "-17982" on three machines, always the same value. This db started
out as Bayes DB version 1 (or 0?) with SA 2.43 possibly, then was carried
over to two other machines and they also got upgraded to 2.5x and 2.6x
versions consecutively.
There's also no "oldest atime". Wouldn't that suggest that possibly all
dates are in the future?

When I do a sa-learn --dump data, what do I need to look for? Everything
over 1076607731 (= last expiry atime, so near current date)?

0.958 1 0 1051805273 low_interest
0.206 17 12 1075495400 HX-MIMETrack:Release

For instance, are the above valid records/dates? If so, then I'm wondering
why it can't display an oldest atime (if I understand correctly what atime
means). What's the exact meaning of "atime"? Is this the time when the
token was added to the db? I think the times above are in the past, so it
should be able to show an oldest atime, shouldn't it?

I'm sure there is a command which converts that Unix Timestamp (assuming
it is one) to something human-readable, but I don't know it.

I read the dumping instructions etc. in
Message-ID: <20040212155407.GF9361@kluge.net>
Didn't understand everything, though.
I now have a readable dump of the incorrect records (at least I hope I
have).
Most of these records seem to be way in the future:
0.518 219 37 1128239545 review
0.978 2 0 1104581966 8:Ñ£
0.958 1 0 1128052147 lkalowhbrd
0.994 8 0 1093712392 WEST
0.942 90 1 1128239545 REQUIRED

Couldn't I simply remove these from bayes_toks or "out"? I'm not keen on
fixing them. It's only about 50 KB.
So remove the token and any lines until the next token? Is that the
correct thing to do? (Next thing then: learn how to convert this back to
bayes_toks.)

straight.php
\e0\c4\97\cf>
\db\d5\d4\cb\c9
\f0\91\05\dc>

For instance, remove that completely?
What is CVVV / CV?

Thanks,

Kai

--

Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org
Re: still cant expire bayes tokens
On Thu, Feb 12, 2004 at 02:10:28PM -0800, Justin Mason wrote:
> Michael Parker writes:
> >On Thu, Feb 12, 2004 at 04:14:52PM -0500, Theo Van Dinter wrote:
> >> This then brings up the question of the seen DB and whether that should
> >> be dump/merge-able, if it should expire, etc, etc.
> >
> >
> >Here is my problem with merging two databases, maybe my concerns are
> >unfounded and it doesn't matter. It basically has to do with
> >collisions. If you are merging two databases that may have "learned"
> >from the same data then you could skew your results. It would be
> >similar to learning the same message twice. One or two messages
> >probably won't matter, but if it's a good number, then you basically
> >double the numbers on those tokens. Like I said, perhaps this isn't
> >such a big deal.
>
> yes -- this is an "emergency use only" tool, and that issue has to
> be noted very clearly.
>

In this case, I'd say you need to merge the bayes_seen databases as
well. Hmm, suddenly it's a little more complicated than just reading a
dump file.

Michael
Re: still cant expire bayes tokens
On Thu, Feb 12, 2004 at 11:32:05PM +0100, Kai Schaetzl wrote:
> > The tokens in your DB are all
> > over 256 days old.
> This is simply "impossible" because auto-learned items are added daily and

I should have been a little clearer... The # of tokens listed in the
atime output are older than 256 days, as calculated from the atime of the
newest token. If you have the same issue as the other fellow (Adam?),
then you had an erroneous message get learned with an atime in the future.

> I have "-17982" on three machines, always the same value. This db started
> out as Bayes DB version 1 (or 0?) with SA 2.43 possibly, then was carried
> over to two other machines and they also got upgraded to 2.5x and 2.6x
> versions consecutively.

Well, it would have started out as DBv0 in 2.5x (2.4x had no bayes code).
If you skipped development versions of 2.60, you would have gone to DBv2
(DBv1 was a short-lived version in about 2 weeks of dev code).

> There's also no "oldest atime". Wouldn't that suggest that possibly all
> dates are in the future?

Hmm? Is there actually no oldest atime set, or is the value 0? There's a
big difference.

> When I do a sa-learn --dump data, what do I need to look for? Everything
> over 1076607731 (= last expiry atime, so near current date)?
>
> 0.958 1 0 1051805273 low_interest
> 0.206 17 12 1075495400 HX-MIMETrack:Release

Pretty much. I'd look for atime values (4th column) either < 100000000
or > time() (aka: perl -e 'print time(),"\n"')

Judging by the rest of this conversation, I'm going to guess you'll
find tokens with an atime of 0, and some > time(), probably by more than
256 days.
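The check Theo describes could be scripted along these lines (a Python sketch; `suspect_atimes` is a made-up name, and it assumes the prob/nspam/nham/atime/token column layout shown in the dumps quoted in this thread):

```python
import time

def suspect_atimes(dump_lines, now=None):
    """Return `sa-learn --dump data` lines whose atime (4th column)
    looks bogus: below 100000000 (~March 1973) or later than `now`."""
    now = int(time.time()) if now is None else now
    bad = []
    for line in dump_lines:
        if "non-token data:" in line:
            continue  # skip the magic lines, whose 4th column is not an atime
        fields = line.split()
        if len(fields) < 5:
            continue
        try:
            atime = int(fields[3])
        except ValueError:
            continue
        if atime < 100000000 or atime > now:
            bad.append(line)
    return bad
```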


> f.i., the above, are these valid records/dates? If so, then I'm wondering
> why it can't display an oldest atime (if I understand correctly what atime
> means). What's the exact meaning of "atime"? Is this the time when the
> token was added to the db? I think the times above are in the past, so it
> should be able to show an oldest atime, shouldn't it?

"atime" is short for "access time", and is the number of seconds since
the epoch (1/1/1970) that the message was received (or sent if received
can't be determined). In theory, the atime values should all be <=
current time(), although I allow for <= time()+86400 in case you need to
use sent time and the sender is on the other side of the planet somewhere.

The atime values are set when you learn the message or when the token
is seen in a new message -- ie: the last time the token was "accessed".

> I'm sure there is a command which converts that Unix Timestamp (assuming
> it is one) to something human-readable, but I don't know it.

Yeah, there's a bunch. You could probably get date to do it, but I just use:

#!/usr/bin/perl
print scalar localtime($ARGV[0]),"\n";

> Most of these records seem to be way in the future:
> 0.518 219 37 1128239545 review
> 0.978 2 0 1104581966 8:Ñ£
> 0.958 1 0 1128052147 lkalowhbrd
> 0.994 8 0 1093712392 WEST
> 0.942 90 1 1128239545 REQUIRED

Yep.

1128239545 = Sun Oct 2 03:52:25 2005 EST
1093712392 = Sat Aug 28 12:59:52 2004 EST
...

> Couldn't I simply remove these from bayes_toks or "out"? I'm not keen on
> fixing them. It's only about 50 KB.
> So remove the token and any lines until the next token? Is that the

You could do that, but then you'll have to edit more magic tokens to
change # of toks in DB, you'll still need to know the new newest atime,
etc.

> straight.php
> \e0\c4\97\cf>
> \db\d5\d4\cb\c9
> \f0\91\05\dc>
>
> f.i. remove that completely?

well, the format is:

token
value
token
value

etc.
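Pairing those alternating lines back up is trivial; here is a Python sketch (`pair_db_dump` is a made-up name, and it assumes the `db_dump -p` header lines have already been stripped):

```python
def pair_db_dump(lines):
    """Pair up alternating token/value lines from `db_dump -p` output."""
    it = iter(lines)
    return list(zip(it, it))  # consumes two lines per pair
```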

> What is CVVV / CV?

It's the perl pack() format code ... Basically C=unsigned char,
V=unsigned long (32 bits) in little-endian format.
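For illustration, the same 13-byte layout can be round-tripped with Python's struct module, whose format "<BIII" corresponds to perl's pack("CVVV", ...): "<" for little-endian with no padding, B for an unsigned char, I for a 32-bit unsigned int (the example values here are arbitrary):

```python
import struct

# perl pack("CVVV", ...) == Python struct "<BIII":
# 1 unsigned char + three 32-bit little-endian unsigned ints = 13 bytes
packed = struct.pack("<BIII", 1, 17, 12, 1075495400)
fmt, nspam, nham, atime = struct.unpack("<BIII", packed)
```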

--
Randomly Generated Tagline:
"J: Do YOU know who the Spin Doctors are?
P: Maybe your mother does..." - John West and a Pizza Delivery Guy
Re: still cant expire bayes tokens
Theo Van Dinter wrote on Thu, 12 Feb 2004 10:54:07 -0500:

> $ db_dump -p -f out .spamassassin/bayes_toks
> $ sa-learn --dump data | perl -nle 'print if ( (split)[3] > time )' > out2

I overlooked something here at first. At a quick glance it looked like this
was a sequence, so that line 2 depends on line 1 but it isn't. I think just
doing an
sa-learn --dump data > dump.file
is what I need. I then get everything neatly arranged in columns and just
need to strip away all the lines with the negative value.
0.958 1 0 -17982 bgiek
Interestingly, all of them seem to be spam tokens and all have -17982.
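Stripping those lines before rebuilding could look like this (a Python sketch; `drop_negative_atimes` is a made-up name, and magic "non-token data" lines survive since their 4th column is 0, not negative):

```python
def drop_negative_atimes(dump_lines):
    """Keep only dump lines whose atime (4th column) is non-negative."""
    kept = []
    for line in dump_lines:
        fields = line.split()
        try:
            atime = int(fields[3])
        except (IndexError, ValueError):
            kept.append(line)  # keep malformed/short lines untouched
            continue
        if atime >= 0:
            kept.append(line)
    return kept
```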

And then rebuild the database from that. Michael sent me a script which is
supposed to do that and the interesting thing is that it *seems* to create a
valid db of exactly the same size as before, but it differs at the binary level.
(For testing purposes I dumped from a *valid* non-corrupted db and then
recreated it with his tool. So there aren't any mistakes I could introduce
by editing.) sa-learn identifies it as a v0 database and does not show any
tokens or other data in it with "--dump magic". When I run --force-expire
over it, it starts converting the db to v2 and after that still lists no
tokens and all four atime values show the current time. No errors whatsoever
shown. Michael says his tool creates a v2 database, but sa-learn identifies
it as v0 and converts without an error to v2. Weird.
I'm gonna post his code here once he acknowledges.

Kai
