Mailing List Archive

sa-learn using multiple CPUs?
Hi all,

I am going to add some large spam archives for my Bayes database with
sa-learn.

I have a machine with six vCPUs and obviously I would like to speed up
the learning process. I am thinking of running six sa-learn processes in
parallel. Is there any issue with this like locks for the database?

Or is sa-learn itself multithreaded and I do not need to run it in
parallel (does not look so)?

Next, when running the above in parallel (if possible), should I use
"--no-sync" and do the syncing afterwards? But then again, the sync is
only single-threaded, right?

Thanks a lot for your input!

/Christian
Re: sa-learn using multiple CPUs?
Depending on your Bayes backend, your bottleneck will not be the CPUs
but I/O.
Normally there's no need for running multiple sa-learn instances.

My sa-learn is learning 40+ msgs/sec from an SSD into a Redis DB.

Re: sa-learn using multiple CPUs?
Christian Völker <cvoelker@knebb.de> writes:

> I am going to add some large spam archives for my Bayes database with
> sa-learn.
>
> I have a machine with six vCPUs and obviously I would like to speed up
> the learning process. I am thinking of running six sa-learn processes
> in parallel. Is there any issue with this like locks for the database?

I don't know, but beware that if you have TXREP configured, and you do
not use -L to sa-learn, I believe you will end up making DNSBL queries
for all of them.
Re: sa-learn using multiple CPUs?
Hi,

> I don't know, but beware that if you have TXREP configured, and you do
> not use -L to sa-learn, I believe you will end up making DNSBL queries
> for all of them.

Good catch! I have not used "-L" so far, and I am pretty sure there is
nothing configured, but from reading the man page it will not do any
harm. So I will add "-L".

Besides this, a test run really came up with 100% usage of a single CPU,
so I doubt it is doing the queries here.

Thanks!

/Christian
Re: sa-learn using multiple CPUs?
On Thu, Apr 15, 2021 at 08:39:42AM -0400, Greg Troxel wrote:
>
> I don't know, but beware that if you have TXREP configured, and you do
> not use -L to sa-learn, I believe you will end up making DNSBL queries
> for all of them.

Thanks, TxRep actually seems to be the culprit. Will look into it..

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7881
Re: sa-learn using multiple CPUs?
Please keep list mail on list!
If you run parallel sa-learn instances against the file-based backend,
you'll run into locked DB errors.
With an SDBM backend it would be a bit faster but still lock up.
AFAIK, the Redis backend won't have locking issues.
(dunno about SQL - I use Redis)
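
A minimal sketch of what a parallel run could look like, assuming a
directory with one message per file and a find/xargs that support
-print0, -0 and -P (the GNU and BSD versions do). SA_LEARN, JOBS and
BATCH are illustrative shell variables, not SpamAssassin settings. The
options --spam, --no-sync, --sync and -L are the sa-learn options
discussed in this thread. Whether the workers really run concurrently
depends on the Bayes backend, and as far as I know --no-sync mainly
matters for the file-based backend with its journal.

```shell
#!/bin/sh
# Sketch: learn a directory of spam with several sa-learn workers.
SA_LEARN="${SA_LEARN:-sa-learn}"  # command to run (assumed to be on PATH)
JOBS="${JOBS:-6}"                 # number of parallel workers
BATCH="${BATCH:-200}"             # messages handed to each invocation

learn_spam_parallel() {
    dir="$1"
    # -L: local only, no network lookups.
    # --no-sync: skip the sync after each invocation.
    find "$dir" -type f -print0 |
        xargs -0 -n "$BATCH" -P "$JOBS" "$SA_LEARN" --spam --no-sync -L
    # One final, single-process sync after all workers have finished.
    "$SA_LEARN" --sync
}
```

Usage would be something like "learn_spam_parallel /path/to/spam", where
the path is a placeholder for the archive directory.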

On 4/15/21 2:45 PM, Christian Völker wrote:
> Hi,
>
> well, here it is not I/O bound (running on RAID1-SSDs). I am using the
> "default" file based backend ~/.spamassassin/bayes*.
>
> 40 msgs/sec is not really fast enough for me. The number of messages to be
> processed is really huge.
>
> So, asking again: is it possible to do this in parallel with the
> file-based backend?
>
> Thanks
>
> /Christian
Re: sa-learn using multiple CPUs?
If you insist on file-based Bayes, at least make sure you use "lock_method flock".
Or maybe BDB backend, don't remember if it's faster.
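
A small sketch of switching the lock method, assuming the setting goes
into a SpamAssassin config file whose path you pass in (which file is
the right one depends on the installation). The function name
enable_flock is made up for illustration. As far as I recall, the
configuration docs describe flock as faster than the NFS-safe default
but unsafe on NFS.

```shell
#!/bin/sh
# Sketch: make sure a SpamAssassin config file requests flock locking.
enable_flock() {
    conf="$1"
    # Append the setting only if no lock_method line is present yet.
    # An existing line with another value is left untouched.
    if ! grep -q '^[[:space:]]*lock_method[[:space:]]' "$conf" 2>/dev/null; then
        printf 'lock_method flock\n' >> "$conf"
    fi
}
```

Calling it twice on the same file adds the line only once.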

Re: sa-learn using multiple CPUs?
Hi,

so I did some testing.

When using the bayes_ files as backend with flock, only a single process
consumes CPU (strange, I have seen different behaviour before).
When using MariaDB as backend, all processes use CPU and share it with
the MariaDB process.

So I will re-configure my installation to use MariaDB.
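
For reference, a sketch of the config fragment such a switch involves.
The four setting names are SpamAssassin's Bayes SQL options; the
database name, host, user and password are placeholders, and the
function name is made up. The Bayes tables have to exist first; the
SpamAssassin source tree ships a MySQL schema for that (under sql/, if
I remember correctly).

```shell
#!/bin/sh
# Sketch: print a Bayes SQL configuration fragment for local.cf.
# Arguments: database name, host, user, password (all placeholders).
print_bayes_mysql_conf() {
    db="$1"; host="$2"; user="$3"; pass="$4"
    cat <<EOF
bayes_store_module Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn      DBI:mysql:${db}:${host}
bayes_sql_username ${user}
bayes_sql_password ${pass}
EOF
}
```

For example, "print_bayes_mysql_conf spamassassin localhost sa_user
secret" prints four lines that can be reviewed before they go into the
config.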


Thanks for your input!

/Christian



Re: sa-learn using multiple CPUs?
On Thu, 15 Apr 2021, Christian Völker wrote:

> Hi,
>
> so I did some testing.
>
> When using bayes_ files as backend and flock only a single process consumes
> CPU (strange, I have seen different behaviour before).
> When using MariaDB as backend all processes use CPU and share them with the
> MariaDB process.
>
> So I will re-configure my installation to use MariaDB.

You should also consider the Redis backend.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
Re: sa-learn using multiple CPUs?
Hi,
>> So I will re-configure my installation to use MariaDB.
> You should also consider the Redis backend.

OK, I had a look when using MariaDB and monitored it for the last 24 hours.
My 10 vCPUs were used, with no I/O waits. But overall CPU usage was only
about 25% according to "top", which showed 75% idle. I assume there is
some locking in place limiting the CPU usage.

I have now configured it to use Redis instead of MySQL, and top shows about
25% idle with 0% I/O waits when running 10 sa-learn processes in parallel.
Increasing or decreasing the number of jobs does not significantly change
the idle percentage.

So with Redis the CPU usage is higher than with MySQL.

Thanks for ideas!

/Christian
Re: sa-learn using multiple CPUs?
Sorry to annoy you. Another addition to my tests:

When using Redis it took me around 15 seconds to scan ~1,500 messages.
When using MariaDB it took one minute to do the same.
With the file-based backend I had strange issues whichever lock type I
used (flock yes/no):
"bayes: bayes db version 0 is not able to be used, aborting! at
/usr/share/perl5/Mail/SpamAssassin/BayesStore/DBM.pm line 206."


Anyways, now using Redis which appears to be the fastest.
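
Put as throughput, the two timings above work out to roughly 100 versus
25 messages per second. A tiny helper for that arithmetic (the function
name is made up; the message count is the approximate figure from the
test):

```shell
#!/bin/sh
# Messages per second for a message count and a wall-clock time in seconds.
throughput() {
    awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n / s }'
}

throughput 1500 15   # Redis run
throughput 1500 60   # MariaDB run
```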

Thanks again!

/Christian



Re: sa-learn using multiple CPUs?
To avoid surprises, remember to watch your memory usage.
Redis reads/writes the DB in memory and only dumps to disk for backup.

"redis-cli info" is of help
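
A short sketch of pulling single values out of that output, assuming
redis-cli is on the PATH and can reach the server without extra
options. REDIS_CLI and the function name are made up for illustration;
used_memory_human and maxmemory_policy are fields of the memory section
of INFO.

```shell
#!/bin/sh
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# Print one field from "redis-cli info memory", e.g. used_memory_human.
redis_mem_field() {
    # INFO lines look like "name:value" and end in CR LF, so strip the CR.
    "$REDIS_CLI" info memory | tr -d '\r' |
        awk -F: -v key="$1" '$1 == key { print $2 }'
}
```

"redis_mem_field used_memory_human" can then be run from cron or a
monitoring check while the learning jobs are active.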


Re: sa-learn using multiple CPUs?
How hard is it to keep list mail on list and not reply directly to sender?

Have you seen
https://svn.apache.org/repos/asf/spamassassin/trunk/contrib/HOWTO.Bayes-Redis/
?

there may be some helpful info in there.
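
For orientation, a sketch of the kind of fragment the Redis backend
expects. The setting names are the ones documented for the Redis
BayesStore module as far as I know; the server address, database number
and TTL values are example values and the function name is made up, so
check them against the HOWTO.

```shell
#!/bin/sh
# Sketch: print a Bayes Redis configuration fragment for local.cf.
# Arguments: server as host:port, Redis database number (placeholders).
print_bayes_redis_conf() {
    server="$1"; database="$2"
    cat <<EOF
bayes_store_module Mail::SpamAssassin::BayesStore::Redis
bayes_sql_dsn      server=${server};database=${database}
bayes_token_ttl    21d
bayes_seen_ttl     8d
bayes_auto_expire  1
EOF
}
```

For example: "print_bayes_redis_conf 127.0.0.1:6379 2".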

On 4/16/21 9:47 AM, Christian Völker wrote:
> Thanks for the hint. I will monitor it. The machine has 16 GB of memory,
> which should be sufficient, but I already notice that Redis has
> preallocated 2 GB.
>
> It is somewhat unclear to me what happens. If there is no limit, I will
> get an OOM error and Redis will (if killed) lose the transactions made
> after the last "save 900 1" snapshot, right?
>
> If I set a limit, it will discard the oldest entries, correct?
>
> Neither seems ideal for SpamAssassin.
>
> However, I will ignore the topic for the moment and see how it goes.
> 16 GB should (hopefully) be enough. Once everything is scanned,
> SpamAssassin's expiry should kick in and reduce the amount of memory used.
>
> Greetings
>
> /Christian
>
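
On the limit question quoted above, a hedged sketch of how to inspect
and change the relevant Redis settings at runtime. As far as I know, a
maxmemory of 0 means no limit, and with the noeviction policy Redis
rejects writes with an error once the limit is reached instead of
discarding entries; other policies (for example allkeys-lru or
volatile-ttl) evict keys. REDIS_CLI and the function names are made up;
the limit and policy passed in are placeholders, and a runtime change
is not persistent unless it also goes into redis.conf.

```shell
#!/bin/sh
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# Show the current memory limit and eviction policy.
show_redis_limits() {
    "$REDIS_CLI" config get maxmemory
    "$REDIS_CLI" config get maxmemory-policy
}

# Set a memory limit (e.g. 4gb) and an eviction policy at runtime.
set_redis_limits() {
    "$REDIS_CLI" config set maxmemory "$1"
    "$REDIS_CLI" config set maxmemory-policy "$2"
}
```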
Re: sa-learn using multiple CPUs?
On 2021-04-16 03:29, John Hardin wrote:

>> So I will re-configure my installation to use MariaDB.
> You should also consider the Redis backend.

I don't like that Redis needs non-default sysctl settings.

Redis is not that much more powerful.

IMHO one could use the MEMORY engine in MySQL and then periodically dump
to SQL, or copy from memory to CSV in MariaDB. Both the MEMORY engine and
the CSV engine use very little memory while still providing fast access.

Maybe I am wrong, I just use PostgreSQL.