Mailing List Archive

Speeding up Bayes SQL
I would like to post about a very rough test I ran to get some comments
and in case anybody wants to pursue this more. I may end up not having
time to take this all the way to a finished patch.

Just to see what the effects would be, I did some crude patches to make
the following changes to Bayes SQL:

In the bayes_token table I changed username from a VARCHAR to an INT,
with the idea that we could use the user's uid instead of the username
to identify the user.

Also in bayes_token I changed token from VARCHAR to BIGINT. I patched
SQL.pm to convert the token string into the low order 15 hex digits of
the SHA-1 hash of the string. By putting a "0x" in front of that in the
SELECT, MySQL will treat the string as a 64-bit integer even though perl
itself doesn't support 64-bit integers.
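
A minimal sketch of that conversion, assuming Digest::SHA1 and DBI (the
hash_token name and the connection details are mine, not from the patch):

    use DBI;
    use Digest::SHA1 qw(sha1_hex);

    my $dbh = DBI->connect('dbi:mysql:spamassassin', 'sa', 'secret',
                           { RaiseError => 1 });
    my $uid = $<;    # real uid, standing in for the username

    # Low order 15 hex digits (60 bits) of the SHA-1 of the token.
    sub hash_token {
        my ($token) = @_;
        return substr(sha1_hex($token), -15);
    }

    # $hex contains only hex digits, so interpolating it is safe; the
    # "0x" prefix makes MySQL parse it as a 64-bit integer literal, and
    # perl never has to hold the 64-bit value itself.
    my $hex = hash_token('FREE');
    my $row = $dbh->selectrow_arrayref(
        "SELECT spam_count, ham_count, atime
           FROM bayes_token
          WHERE username = ? AND token = 0x$hex",
        undef, $uid);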

Those two changes cause bayes_token table to have no variable length
fields, which according to the MySQL documentation makes access more
efficient. It also reduces the database size quite a bit.

I then added a tok_get_all routine in SQL.pm that uses a single SELECT
... FROM bayes_token WHERE ... token IN ( ... ) to get all the
information from the database at once instead of using a separate SELECT
query for each token.
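
The shape of that routine, as a sketch building on the hash_token helper
above; the method and attribute names are illustrative, not the actual
patch:

    sub tok_get_all {
        my ($self, @tokens) = @_;
        return () unless @tokens;

        # Hash every token, then fetch them all in one round trip.
        # The 0x... literals contain only hex digits, so interpolating
        # them into the IN list is safe.
        my $in  = join(',', map { '0x' . hash_token($_) } @tokens);
        my $sth = $self->{_dbh}->prepare(
            "SELECT token, spam_count, ham_count, atime
               FROM bayes_token
              WHERE username = ? AND token IN ($in)");
        $sth->execute($self->{_username});
        return @{ $sth->fetchall_arrayref };
    }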

I tested this by training Bayes on approximately 1000 ham and 1500 spam,
which resulted in about 150,000 tokens, then running 1000 other spam
messages through spamc with no network tests and autolearning off.

There was no appreciable change in the time for sa-learn, but of course
the database is smaller with the smaller fixed fields.

The baseline test of running 1000 spams through spamc took about 25
minutes on my machine.

After I changed the username and token fields to the integer formats, it
took about 17 minutes.

After I then added the SELECT ... token IN (...) change, it went down to
14 minutes.

When I turned off Bayes completely, running the same messages through
spamc took 5 minutes.

So it looks like if we are willing to sacrifice being able to see the
tokens in a readable form when someone dumps the Bayes database, we can
make things about twice as fast and the database a lot smaller.

-- sidney
Re: Speeding up Bayes SQL
On Tuesday, March 23, 2004 03:31:35 +1200 Sidney Markowitz wrote:

> I would like to post about a very rough test I ran to get some comments
> and in case anybody wants to pursue this more. I may end up not having
> time to take this all the way to a finished patch.
>
> Just to see what the effects would be, I did some crude patches to make
> the following changes to Bayes SQL:

[...]

> Also in bayes_token I changed token from VARCHAR to BIGINT. I patched
> SQL.pm to convert the token string into the low order 15 hex digits of
> the SHA-1 hash of the string. By putting a "0x" in front of that in the
> SELECT, MySQL will treat the string as a 64-bit integer even though perl
> itself doesn't support 64-bit integers.

[...]

> So it looks like if we are willing to sacrifice being able to see the
> tokens in a readable form when someone dumps the Bayes database, we can
> make things about twice as fast and the database a lot smaller.

It is not necessary to sacrifice the ability to read the tokens, as you
could add a SHA-1-hash->token dictionary table. This would give a larger
database and a slightly slower sa-learn, but there should be no difference
in the spamc times.
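
One possible shape for such a dictionary table, as a sketch; the table
name, column sizes, and the use of MySQL's INSERT IGNORE are my
assumptions, and hash_token is the helper sketched earlier in the thread:

    $dbh->do(q{
        CREATE TABLE bayes_token_text (
            token BIGINT UNSIGNED NOT NULL,   -- the SHA-1-derived hash
            text  VARCHAR(200)    NOT NULL,   -- the original token string
            PRIMARY KEY (token)
        )
    });

    # Written once per new token during sa-learn; a dump tool can join
    # against this table, while classification never touches it.
    my $token_string = 'FREE';
    my $hex = hash_token($token_string);
    $dbh->do("INSERT IGNORE INTO bayes_token_text (token, text)
              VALUES (0x$hex, ?)", undef, $token_string);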


--
Michael Fischer v. Mollard, network administration
Heise Zeitschriften Verlag GmbH & Co KG
Helstorfer Straße 7
D-30625 Hannover
Tel: +49 511 5352 477; Email: Michael.Fischer.von.Mollard@heise.de
Re: Speeding up Bayes SQL
Sidney Markowitz <sidney@sidney.com> writes:

> Also in bayes_token I changed token from VARCHAR to BIGINT. I patched
> SQL.pm to convert the token string into the low order 15 hex digits of
> the SHA-1 hash of the string. By putting a "0x" in front of that in the
> SELECT, MySQL will treat the string as a 64-bit integer even though perl
> itself doesn't support 64-bit integers.

We should be able to make this an option used by all the Store engines,
not just SQL.

Daniel

--
Daniel Quinlan anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/ and open source consulting
Re: Speeding up Bayes SQL
On Mon, Mar 22, 2004 at 01:32:12PM -0800, Daniel Quinlan wrote:
> Sidney Markowitz <sidney@sidney.com> writes:
>
> > Also in bayes_token I changed token from VARCHAR to BIGINT. I patched
> > SQL.pm to convert the token string into the low order 15 hex digits of
> > the SHA-1 hash of the string. By putting a "0x" in front of that in the
> > SELECT, MySQL will treat the string as a 64-bit integer even though perl
> > itself doesn't support 64-bit integers.
>
> We should be able to make this an option used by all the Store engines,
> not just SQL.
>

I agree, anything we do should transfer across all storage backends.
We need to be able to dump (backup) from one backend, twiddle the
config and restore to another backend and back again.

Michael
Re: Speeding up Bayes SQL
Michael Parker <parkerm@pobox.com> writes:

> I agree, anything we do should transfer across all storage backends.
> We need to be able to dump (backup) from one backend, twiddle the
> config and restore to another backend and back again.

Agreed. I'm fine with making the option break compatibility with old
databases, though. Just make it an option.

We can debate whether or not it should be turned on by default, but I'd
lean towards turning off SHA1 by default. However, if we ever want to
turn it on by default, 3.0 is the right time.

Sticking with plain-text is a mistake in the long run:

- fewer collisions => probably *better* accuracy
- fixed length => less space, faster in many cases
- hash => easier to share and generate public Bayes DB
- smaller size enables longer tokens (multiple words, for example)

The inability to look at tokens directly does not seem to be a major
impediment to other projects like DSPAM and CRM114.

Daniel

--
Daniel Quinlan anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/ and open source consulting
Re: Speeding up Bayes SQL
Daniel Quinlan wrote:
[ ... ]

I agree with all you said. I only had time to do a quick and dirty test
of the concepts using MySQL, but they are likely to generalize.

The ability to share public DBs is a really good benefit of using
hashes. That provides another reason to stick with the low order bits of
SHA1 for the hash, so that there is cryptographic grade security against
someone reversing the hash from entries in a database.

-- sidney
Re: Speeding up Bayes SQL
Sidney Markowitz <sidney@sidney.com> writes:

> The ability to share public DBs is a really good benefit of using
> hashes. That provides another reason to stick with the low order bits of
> SHA1 for the hash, so that there is cryptographic grade security against
> someone reversing the hash from entries in a database.

As a security measure, we can also just eliminate any tokens not shared
among N people.
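
A sketch of what that pruning pass could look like before publishing a
shared database; the threshold and the two-step delete are my
illustration, not an agreed design:

    # Find tokens seen by fewer than N distinct users, then drop them.
    my $n = 5;    # example threshold
    my $rare = $dbh->selectcol_arrayref(
        'SELECT token FROM bayes_token
          GROUP BY token
         HAVING COUNT(DISTINCT username) < ?', undef, $n);

    my $del = $dbh->prepare('DELETE FROM bayes_token WHERE token = ?');
    $del->execute($_) for @$rare;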

Daniel

--
Daniel Quinlan anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/ and open source consulting
Re: Speeding up Bayes SQL
I have two questions, one for the group, and one probably more for
Michael Parker.

First, is there an open bug on Bugzilla that anyone would like to see
this discussion moved to? If not, I think I or whoever gets there first
should open one so this is easier to track.

Second, I ran a test last night in which I moved the hashing of tokens
from SQL.pm to Bayes.pm. To simplify things, I changed from using a
BIGINT of the low order 60 bits of the SHA1 to a CHAR(8) of the Base64
encoding of the low order 42 bits. That should take the same amount of
space, still provides the benefits of a fixed field, and keeps enough
hash bits to allow about 4 million unique tokens before there are any
collisions expected. I only ran one test, which also had the SELECT ...
token IN (...) optimization. It ran in 17 minutes, which is the same
time that I got with the BIGINT field without the IN, and as compared to
the 14 minutes with BIGINT and IN, and 25 minutes before I made any
changes at all.
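
For reference, eight Base64 characters carry 48 bits (six bytes), so a
CHAR(8) encoding would look roughly like the sketch below; the exact bit
count used in the test may differ:

    use Digest::SHA1 qw(sha1);
    use MIME::Base64 qw(encode_base64);

    # Six low order bytes of the binary digest encode to exactly
    # eight Base64 characters, with no '=' padding.
    my $token = 'FREE';
    my $tok8  = encode_base64(substr(sha1($token), -6), '');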

So my question is: could having token be a BIGINT really be that much
faster than having it be CHAR(8)? Or is that much variability to be
expected between test runs, so that the difference between 17 and 14
minutes is not significant without running these tests many more times?

I'll check that out empirically when I have time, but that will take a
while and maybe you already know the answer.

-- sidney
Re: Speeding up Bayes SQL
On Thu, Mar 25, 2004 at 10:08:48AM +1200, Sidney Markowitz wrote:
> I have two questions, one for the group, and one probably more for
> Michael Parker.
>
> First, is there an open bug on Bugzilla that anyone would like to see
> this discussion moved to? If not, I think I or whoever gets there first
> should open one so this is easier to track.

Not sure; I want to say there is, but I'm really not sure. Go ahead
and create one if you can't find an existing bug. I think we need to
target 3.0.0 for this change.

> Second, I ran a test last night in which I moved the hashing of tokens
> from SQL.pm to Bayes.pm. To simplify things, I changed from using a
> BIGINT of the low order 60 bits of the SHA1 to a CHAR(8) of the Base64
> encoding of the low order 42 bits. That should take the same amount of
> space, still provides the benefits of a fixed field, and keeps enough
> hash bits to allow about 4 million unique tokens before there are any
> collisions expected. I only ran one test, which also had the SELECT ...
> token IN (...) optimization. It ran in 17 minutes, which is the same
> time that I got with the BIGINT field without the IN, and as compared to
> the 14 minutes with BIGINT and IN, and 25 minutes before I made any
> changes at all.
>
> So my question is: could having token be a BIGINT really be that much
> faster than having it be CHAR(8)? Or is that much variability to be
> expected between test runs, so that the difference between 17 and 14
> minutes is not significant without running these tests many more times?
>

I've seen test times swing, but maybe not by this much. In general
you certainly should run it more than once, I'd say at least 3 times
with the same data. Make sure your test machine isn't doing anything
else that might throw things off.

Can you send me the latest and greatest patch? Or maybe attach it to
the bug.

Thanks
Michael
Re: Speeding up Bayes SQL
Michael Parker wrote:
> you certainly should run it more than once, I'd say at
> least 3 times with the same data.

I'm having the machine run through some tests while I'm at work/school.
We'll see what it says. But I did get to see the first additional run,
and it completed in 15 minutes. This may mean that the good result
for adding the SELECT ... token IN ... on top of the fixed length fields
change has no statistical significance.

One thing that I have not tried yet is setting the query cache to more
than the default of 0. From your warnings about blowing out the cache, it
seems that I should set it up to, what, 100000 or so?

-- sidney
Re: Speeding up Bayes SQL
On Thu, Mar 25, 2004 at 11:02:31AM +1200, Sidney Markowitz wrote:
> Michael Parker wrote:
> >you certainly should run it more than once, I'd say at
> >least 3 times with the same data.
>
> I'm having the machine run through some tests while I'm at work/school.
> We'll see what it says. But I did get to see the first additional run,
> and it completed in 15 minutes. This may mean that the good result
> for adding the SELECT ... token IN ... on top of the fixed length fields
> change has no statistical significance.

Do you have a benchmark script?

I think we might want to see test results without the tok_get_all
change, making small steps so we can see exactly what performance gains
are made.

> One thing that I have not tried yet is setting the query cache to more
> than the default of 0. From your warnings about blowing out the cache, it
> seems that I should set it up to, what, 100000 or so?

Keep it at 0; my understanding is that the MySQL query cache works
differently from what I'm used to with Oracle, so it doesn't matter
for MySQL.

Michael
Re: Speeding up Bayes SQL
Michael Parker wrote:
> Do you have a benchmark script?

Not really. I initialize an empty database by running sa-learn on a
folder of 967 ham message files (Maildir format) and then running
sa-learn on another 1641 spam message files.

After that, with bayes_auto_learn 0 in user_prefs, I run spamd -L and
just do

    time for i in directory_of_another_1000_spams/* ; do
        spamc -R < $i | grep BAYES
        echo $i
    done

Oh, I just thought of something... Is there an optimization that does
not update the atime of a token if the time increment is smaller than
some amount, like an hour or a day? If there is, that would explain the
first run after I initialized the database taking longer than all the
others. If we don't do that, we should. It would save most of the
updates on the most frequent significant tokens.
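
A sketch of that idea, pushing the check into the WHERE clause so the
row write is skipped when atime is recent; the threshold and the names
are my guesses, and hash_token is the helper sketched earlier:

    # Let the database decide: the row is only written when the stored
    # atime has aged past the window, so hits on hot tokens leave the
    # row untouched.
    my $ATIME_GRANULARITY = 3600;    # one hour, an arbitrary example
    my $token_hash = hash_token('FREE');
    my $upd = $dbh->prepare(
        'UPDATE bayes_token SET atime = ?
          WHERE username = ? AND token = ? AND atime < ?');
    my $now = time();
    $upd->execute($now, $uid, $token_hash, $now - $ATIME_GRANULARITY);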

> I think we might want to see test results without the tok_get_all
> change, making small steps so we can see exactly what performance gains
> are made.

Yes, my plan was that after I got the code working I would wrap the
changes in config options to make testing different combinations easier,
and then systematically get some numbers. I just ran one test as I
implemented each change, and reported those early results. But since I
lack patience :-), now that I see that the results might have been an
artifact, as soon as I get the chance I will comment out the call to
tok_get_all, uncomment the version that calls tok_get for each token,
and see what repeated runs of that with the fixed length fields do.

> Keep it at 0; my understanding is that the MySQL query cache works
> differently from what I'm used to with Oracle, so it doesn't matter
> for MySQL.

Ok, reading about it in the MySQL doc, I see that it helps if you have
repeated identical queries. A single message generates no repeated
queries, since the token array is uniquified. In a production
environment, you would probably fill the cache with queries for
different username values before seeing any repeats. So it would not be
worth it.

-- sidney
Re: Speeding up Bayes SQL
I've run some more careful tests comparing the following:

I changed username in the tokens table from varchar to INT and used the
bayes override username config option to force it to my unix UID. This
was without any change to the code, only to the schema.

I enabled DELAYED_KEY_WRITE.

The test consists of emptying the Bayes database, training it on 960 ham
and 1600 spam, then timing five consecutive runs of feeding one set of
1000 spam messages to spamc while running spamd -L.

The variables in the testing were

1a. Take the low order 64 bits of the SHA1 hash of the tokens as a hex
string and store it as a BIGINT field

or

2a. Take the low order 40 bits of the SHA1 hash of the tokens and store
it as a CHAR(5) BINARY field

The other variable was

1b. Get token values from the database one at a time using the tok_get
function

2b. Get token values for a message all at once using a tok_get_all
function that looks up all of them using one SELECT ... token IN (?,?,...)

I found that it is easy to work with arbitrary binary data in CHAR
fields if you let prepare or prepare_cached handle the quoting when it
binds the data to the placeholder ? characters.
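
A sketch of the CHAR(5) BINARY variant; the variable names are mine:

    use Digest::SHA1 qw(sha1);

    # Low order 5 bytes (40 bits) of the raw binary digest.
    my $token = 'FREE';
    my $tok5  = substr(sha1($token), -5);

    # Binding through a placeholder lets DBI quote the raw bytes, so
    # arbitrary binary data in a CHAR(5) BINARY column needs no
    # escaping by hand.
    my $sth = $dbh->prepare_cached(
        'SELECT spam_count, ham_count, atime
           FROM bayes_token
          WHERE username = ? AND token = ?');
    $sth->execute($uid, $tok5);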

The results show not much difference in speed between the BIGINT and
CHAR(5) BINARY cases. BIGINT is slightly faster, but requires about 15%
more space in the db.

Using tok_get_all does make a difference in speed.

The approximate average times for the tests were 25 minutes using
unmodified SpamAssassin, 17.5 minutes with the fixed length fields and
tok_get, and 15.5 minutes with fixed length fields and tok_get_all.

I also did one run with updating of atime disabled and found almost no
difference in speed. I do think that there is another optimization to be
done with that anyway: does anyone see any reason to expire tokens based
on atime with a granularity of less than one day, rather than the seconds
that we are using now? If atime were in days rather than seconds, we
could use a two byte int for it instead of four bytes, saving two bytes
per token in the database and not updating it as often.
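
The day-granularity representation is just a division; a short sketch
(the SMALLINT column type is my assumption):

    # Days since the epoch fit in a two-byte SMALLINT UNSIGNED
    # (65535 days reaches to roughly the year 2149).
    my $atime_days = int(time() / 86400);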

I'll create a bugzilla ticket for this and post my test code as a patch
when I get a chance to.

-- sidney
Re: Speeding up Bayes SQL
Kelsey asked me to forward this response to the list after he
accidentally replied only to me:

[forwarded message]

We've been doing some preliminary tests with large databases and high
volumes of mail. Our first target for optimization is moving the
spam/ham counts into a separate table. The bayes_toks table is getting
locked for a considerable amount of time while it's doing the select
count(*) from .....

We were surprised to see that table getting locked for this query; there
may be some other way around it, but we haven't found it yet. Perhaps
this has already been addressed.

--
Kelsey Cummings - kgc@sonic.net
sonic.net, inc. System Administrator
2260 Apollo Way, Santa Rosa, CA 95407
707.522.1000 (Voice), 707.547.2199 (Fax)
http://www.sonic.net/
Fingerprint = D5F9 667F 5D32 7347 0B79 8DB7 2B42 86B6 4E2C 3896
Re: Speeding up Bayes SQL
On Mon, Mar 29, 2004 at 10:05:21AM +1200, Sidney Markowitz wrote:
>
> We've been doing some preliminary tests with large databases and high
> volumes of mail. Our first target for optimization is moving the
> spam/ham counts into a separate table. The bayes_toks table is getting
> locked for a considerable amount of time while it's doing the select
> count(*) from .....

FYI, I'm doing away with that right now.

Michael