I would like to post about a very rough test I ran, both to get some comments
and in case anybody wants to pursue this further. I may end up not having
time to take this all the way to a finished patch.
Just to see what the effects would be, I did some crude patches to make
the following changes to Bayes SQL:
In the bayes_token table I changed username from VARCHAR to INT, with
the idea that we could use the user's uid instead of the username to
identify the user.
Also in bayes_token I changed token from VARCHAR to BIGINT. I patched
SQL.pm to convert the token string into the low-order 15 hex digits of
the SHA-1 hash of the string. Putting a "0x" in front of that in the
SELECT makes MySQL treat the string as a 64-bit integer even though Perl
doesn't itself support 64-bit integers.
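To make the token conversion concrete, here is a minimal sketch in Python rather than the Perl of SQL.pm. It is not the actual patch: the function names token_to_int and token_to_sql_literal are made up for illustration, and UTF-8 encoding of the token is an assumption. Fifteen hex digits are 60 bits, so the value always fits in a signed BIGINT.

```python
import hashlib


def token_to_hex15(token: str) -> str:
    """Return the low-order 15 hex digits of the SHA-1 hash of the token."""
    # Assumption: tokens are encoded as UTF-8 before hashing.
    digest_hex = hashlib.sha1(token.encode("utf-8")).hexdigest()  # 40 hex digits
    return digest_hex[-15:]


def token_to_sql_literal(token: str) -> str:
    """Hypothetical helper: build the "0x..." literal to place in the SELECT."""
    return "0x" + token_to_hex15(token)


def token_to_int(token: str) -> int:
    """Same value as an integer; 15 hex digits = 60 bits, so it is < 2**60."""
    return int(token_to_hex15(token), 16)


if __name__ == "__main__":
    print(token_to_sql_literal("example"), token_to_int("example"))
```

Python integers are arbitrary precision, so the integer form is easy to get here; in Perl the hex literal is what avoids needing 64-bit integer support in the language.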
Those two changes leave the bayes_token table with no variable-length
fields, which according to the MySQL documentation makes access more
efficient. They also reduce the database size quite a bit.
I then added a tok_get_all routine in SQL.pm that uses a single SELECT ...
FROM bayes_token WHERE ... token IN ( ... ) query to get all the
information from the database at once instead of using a different SELECT
query for each token.
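The batched lookup can be sketched as follows. This is an illustration in Python, using an in-memory SQLite database as a stand-in for MySQL; it is not the SQL.pm code. The column names spam_count and ham_count, the sample rows, and the uid values are assumptions made up for the example.

```python
import sqlite3


def tok_get_all(conn, user_id, tokens):
    """Fetch counts for many tokens with one SELECT ... WHERE token IN (...)."""
    tokens = list(tokens)
    if not tokens:
        return {}  # an empty IN () list is not valid SQL, so skip the query
    placeholders = ",".join("?" * len(tokens))
    sql = (
        "SELECT token, spam_count, ham_count FROM bayes_token "
        "WHERE username = ? AND token IN (%s)" % placeholders
    )
    rows = conn.execute(sql, [user_id] + tokens).fetchall()
    # Tokens that are not in the table are simply absent from the result.
    return {tok: (spam, ham) for tok, spam, ham in rows}


# Assumed schema: integer username (uid) and integer token, as in the test above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bayes_token (username INTEGER, token INTEGER, "
    "spam_count INTEGER, ham_count INTEGER, PRIMARY KEY (username, token))"
)
conn.executemany(
    "INSERT INTO bayes_token VALUES (?, ?, ?, ?)",
    [(1000, 11, 5, 0), (1000, 22, 1, 7), (1001, 11, 9, 9)],
)

found = tok_get_all(conn, 1000, [11, 22, 33])
```

A message with a very large number of tokens may need the IN list split into chunks, since databases limit query size or the number of bound parameters.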
I tested this by training Bayes on approximately 1000 ham and 1500 spam,
which resulted in about 150,000 tokens, then running 1000 other spam
messages through spamc with no network tests and autolearning off.
There was no appreciable change in the time for sa-learn, but of course
the database is smaller with the smaller fixed fields.
The baseline test of running 1000 spams through spamc took about 25
minutes on my machine.
After I changed the username and token fields to the integer formats, it
took about 17 minutes.
After I then added the SELECT ... token IN (...) query, it went down to 14
minutes.
When I turned off Bayes completely, running the same messages through
spamc took 5 minutes.
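As a rough check on those numbers, the arithmetic can be written out as below. The assumption, which I did not measure directly, is that the 5-minute run without Bayes approximates the non-Bayes cost inside the other runs.

```python
# Timings in minutes, taken from the test runs above.
baseline_min = 25    # original VARCHAR schema, one SELECT per token
int_fields_min = 17  # integer username and token fields
batched_min = 14     # integer fields plus SELECT ... token IN (...)
no_bayes_min = 5     # Bayes turned off

# Overall speedup of the full spamc run.
overall_speedup = baseline_min / batched_min

# Time attributable to Bayes, assuming the no-Bayes run is the fixed cost.
bayes_before = baseline_min - no_bayes_min
bayes_after = batched_min - no_bayes_min
bayes_speedup = bayes_before / bayes_after

print("overall: %.2fx, Bayes portion: %.2fx" % (overall_speedup, bayes_speedup))
```

So the whole run is a bit under twice as fast, and the Bayes portion alone a bit over twice as fast.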
So it looks like if we are willing to sacrifice being able to see the
tokens in a readable form when someone dumps the Bayes database, we can
make things about twice as fast and the database a lot smaller.
-- sidney