Mailing List Archive

timeouts on processing some messages, started October 24
I have a systeem with postfix and spamassassin 3.4.6 via spamd. It's
been generally running well. I noticed mail from one of my other
systems timing out and 471, and that caused me to look at the logs. I
have KAM rules, some RBL adjustments, a bunch of local rules for my
spam, but really nothing I consider unusual.

I realized I had DCC enabled, perhaps not correctly, and I just took
that out, since I've never really been clear on how it works and if I
want to use it.


My logs go back to October 3, but starting 24th I have lots of lines like:

Oct 24 03:23:13 bar spamd[25868]: check: exceeded time limit in Mail::SpamAssassin::Plugin::Check::_eval_tests_type9_pri1000_set1, skipping further tests

Looking further, I see

Nov 1 12:02:01 bar postfix/cleanup[18861]: 6E2D74106C3: message-id=<20211031071804.B221B16075@bar.example.com> Nov 1 12:07:01 bar postfix/cleanup[18861]: warning: milter unix:/var/run/spamass.sock: can't read SMFIC_BODYEOB reply packet header: Connection timed out
Nov 1 12:07:01 bar postfix/cleanup[18861]: 6E2D74106C3: milter-reject: END-OF-MESSAGE from foo.example.com[10.0.0.2]: 4.7.1 Service unavailable - try again later ; from=<root@foo.example.com> to=<fred@example.com> proto=ESMTP helo=<foo.example.com>
Nov 1 12:07:02 bar spamd[23510]: check: exceeded time limit in Mail::SpamAssassin::Plugin::Check::_eval_tests_type9_pri1000_set1, skipping further tests
Nov 1 12:07:02 bar spamd[13194]: spamd: clean message (-1.0/1.0) for fred:10853 in 300.2 seconds, 2064 bytes.
Nov 1 12:07:02 bar spamd[13194]: spamd: result: . 0 - ALL_TRUSTED,KAM_DMARC_STATUS,TIME_LIMIT_EXCEEDED scantime=300.2,size=2064,user=fred,uid=10853,required_score=1.0,rhost=::1,raddr=::1,rport=56983,mid=<20211031071804.B221B16075@foo.example.com>,autolearn=unavailable

so it sort of looks like:

postfix is waiting 300s
SA thinks it can spend 300s processing
postfix gives up 1s before SA is done

something is causing a delay

and thus I have two problems:

need to have postfix delay be more than spamassassin delay plus rounding

need to figure out why there is a timeout

The first is surely manual reading, but I wonder why it isn't default.

On the second, I wonder if anyone else is seeing this, and clues appreciated.

Thanks,
Greg
Re: timeouts on processing some messages, started October 24 [ In reply to ]
> postfix is waiting 300s
> SA thinks it can spend 300s processing
> postfix gives up 1s before SA is done

The default spamd child timeout is 300s.
The default postfix content milter timeout is 300s.
Each is a reasonable choice, but really postfix's timeout should be
longer.

I set in postfix main.cf: "milter_content_timeout = 330s" and now I
still get spamd child timeouts, but things are better.

So probably we should set the default spamd child timeout to 270s.

A wrinkle is that I realize that I had a learn process running, where I
run over my ham folders and spam folders and run sa-learn -L. I used to
run that often, and it would take some number of minutes, but this one
had been running for days. My guess is that it took long as a symptom
of the same bug, vs being a cause, but that remains to be seen.
Re: timeouts on processing some messages, started October 24 [ In reply to ]
On 2021-11-02 at 19:15:33 UTC-0400 (Tue, 02 Nov 2021 19:15:33 -0400)
Greg Troxel <gdt@lexort.com>
is rumored to have said:

> I have a systeem with postfix and spamassassin 3.4.6 via spamd. It's
> been generally running well. I noticed mail from one of my other
> systems timing out and 471, and that caused me to look at the logs. I
> have KAM rules, some RBL adjustments, a bunch of local rules for my
> spam, but really nothing I consider unusual.
[...]
> and thus I have two problems:
>
> need to have postfix delay be more than spamassassin delay plus
> rounding

It would generally be a bad idea to increase the Postfix timeout, as
that passes the problem back upstream as senders will generally time out
at 300s as well.

So, add '--timeout-child=295' to your spamd arguments if you want to
make spamd timeout faster than Postfix reliably.

> need to figure out why there is a timeout

That's the important part.

> The first is surely manual reading, but I wonder why it isn't default.

We don't try very hard to guess what users will want in the integration
details between SA and the tools like MTAs that use it. 300s is the SMTP
default timeout at end-of-data, which presumably is why it is spamd's
default. I think it makes sense to reduce that for most circumstances,
but I'm a bit hesitant to do so in the distribution because there could
be people relying on the specific idiosyncratic behavior of spamd timing
out after its caller has given up rather than before.

> On the second, I wonder if anyone else is seeing this, and clues
> appreciated.

I have no recent SA timeouts logged recently on any of the systems I
manage.

The most common reason for SA to hit its internal timeout is the
combination of a rule with a pattern that can generate a large number of
backtracks while scanning (exponential or factorial order) and a message
which causes such backtracking. Typically that's caused by a '*' or '+'
in a pattern where a fixed range for the number of repeats should be
used instead. A few years ago we tried to fix all cases of dangerous
rules in the default ruleset, and I think we succeeded. I believe the
KAM rules have also been audited for likely problems. If you have any
unbounded wildcards in your local rules, tightening those rules up
should be your first step. If you can't find and fix the problematic
rule by eye, you can get clues about it by scanning a problematic
message with the "-D all" option to get a detailed rundown of what SA
does in scanning a message. That will show you what rules are checked
successfully. You can find a problematic rule by comparing that debug
output from a bad message to that of a message which doesn't hang SA.


--
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire
Re: timeouts on processing some messages, started October 24 [ In reply to ]
Bill Cole <sausers-20150205@billmail.scconsult.com> writes:

> It would generally be a bad idea to increase the Postfix timeout, as
> that passes the problem back upstream as senders will generally time
> out at 300s as well.
>
> So, add '--timeout-child=295' to your spamd arguments if you want to
> make spamd timeout faster than Postfix reliably.

Thanks; I didn't think of the previous timeout. Before getting your
mail, I did set my postfix milter timeout to 330s, but the actual delay
was ~301s since the spamd timeout worked. That resulted in delivery and
the remote system (also postfix) not giving up. I have since changed
to --timeout-child=290 in spamd and restored postfix to default.

>> need to figure out why there is a timeout
>
> That's the important part.

I am narrowing the circumstances and will follow up when I figure it out.

>> The first is surely manual reading, but I wonder why it isn't default.
>
> We don't try very hard to guess what users will want in the
> integration details between SA and the tools like MTAs that use
> it. 300s is the SMTP default timeout at end-of-data, which presumably
> is why it is spamd's default. I think it makes sense to reduce that
> for most circumstances, but I'm a bit hesitant to do so in the
> distribution because there could be people relying on the specific
> idiosyncratic behavior of spamd timing out after its caller has given
> up rather than before.

It strikes me that timeouts happening is basically a symptom of bugs and
each layer should be set up to avoid being non-responsive to the calling
layer. While I see your point about not tuning for what people might
want, it seems that if a system is to meet the 300s SMTP data timeout,
spamd needs to take less than 300s, so going for 290 or 295 seems
sensible. I would guess, without any real basis, that far more people
are just sitting on latent trouble than really intend to have a milter
callout give up about a second just before spamd.

> The most common reason for SA to hit its internal timeout is the
> combination of a rule with a pattern that can generate a large number
> of backtracks while scanning (exponential or factorial order) and a
> message which causes such backtracking. Typically that's caused by a
> '*' or '+' in a pattern where a fixed range for the number of repeats
> should be used instead. A few years ago we tried to fix all cases of
> dangerous rules in the default ruleset, and I think we succeeded. I
> believe the KAM rules have also been audited for likely problems. If
> you have any unbounded wildcards in your local rules, tightening those
> rules up should be your first step. If you can't find and fix the
> problematic rule by eye, you can get clues about it by scanning a
> problematic message with the "-D all" option to get a detailed rundown
> of what SA does in scanning a message. That will show you what rules
> are checked successfully. You can find a problematic rule by comparing
> that debug output from a bad message to that of a message which
> doesn't hang SA.

Thanks, that regexp hint is a huge clue to me.
Re: timeouts on processing some messages, started October 24 [ In reply to ]
>On 2021-11-02 at 19:15:33 UTC-0400 (Tue, 02 Nov 2021 19:15:33 -0400)
>Greg Troxel <gdt@lexort.com>
>is rumored to have said:
>
>>I have a systeem with postfix and spamassassin 3.4.6 via spamd. It's
>>been generally running well. I noticed mail from one of my other
>>systems timing out and 471, and that caused me to look at the logs. I
>>have KAM rules, some RBL adjustments, a bunch of local rules for my
>>spam, but really nothing I consider unusual.
>[...]
>>and thus I have two problems:
>>
>> need to have postfix delay be more than spamassassin delay plus
>>rounding

On 03.11.21 09:32, Bill Cole wrote:
>It would generally be a bad idea to increase the Postfix timeout, as
>that passes the problem back upstream as senders will generally time
>out at 300s as well.

the required timeout for DATA is 10 minutes (RFC 5321 section 4.5.3.2.6.)
and that is the default timeout for postfix (sendmail has even longer
timeout).


--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
We are but packets in the Internet of life (userfriendly.org)
Re: timeouts on processing some messages, started October 24 [ In reply to ]
On Wed, 3 Nov 2021, Bill Cole wrote:

> The most common reason for SA to hit its internal timeout is the combination
> of a rule with a pattern that can generate a large number of backtracks while
> scanning (exponential or factorial order) and a message which causes such
> backtracking. Typically that's caused by a '*' or '+' in a pattern where a
> fixed range for the number of repeats should be used instead.

...or a non-greedy match if you're running a newer Perl.

> If you have any unbounded wildcards in your local rules, tightening
> those rules up should be your first step. If you can't find and fix the
> problematic rule by eye, you can get clues about it by scanning a problematic
> message with the "-D all" option to get a detailed rundown of what SA does in
> scanning a message. That will show you what rules are checked successfully.
> You can find a problematic rule by comparing that debug output from a bad
> message to that of a message which doesn't hang SA.

There's also the HitFreqsRuleTiming plugin if you're running in a dev
environment and can let it scan for a potentially long time (until
completion).


--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
"Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never
does quite what I want. I wish Christopher Robin was here."
-- Peter da Silva in a.s.r
-----------------------------------------------------------------------
4 days until Daylight Saving Time ends in U.S. - Fall Back
Getting an extra hour of 2021 is like
getting a free track on a Yoko Ono album.
Re: timeouts on processing some messages, started October 24 [ In reply to ]
I have captured a bad message. It seems innocuous; it's from me at a
host in my domain, to me, basically

From: gdt@foo.lexort.com
To: gdt@lexort.com

and has a body "foo", no DKIM headers, just Received, Subject,
Message-Id.


Processing this with my normal config results in the timeout.


I noticed lockfiles for txrep, even though I couldn't figure out that
txrep was involved from' -D all', and turned off txrep in my config
("use_txrep 0" instead of 1). Then, the message processes in 2s.

When I had txrep enabled, I saw a tx-reputation.lock with a single line
that was a pid of the spamd child process that was accumulating CPU
time. I also had files like:
tx-reputation.lock.bar.lexort.com.5023
where that was another pid, and this second file seemed to be
accumulating lines.

I did find a stray sa-learn from October and killed it.

Running my spam learning script, which just calls sa-learn with --spam
or --ham (and -L always) is turning out slow, probably from the same thing.

So it sort of smells like one of
- something is wrong with my txrep database
- some code is hitting O(n^k) or something
- there is some strange locking/spinning behavior
- something else I don't understand, as always



Does anyone have pointers to a database export/import script for txrep?