Mailing List Archive

av_scanner is broken suddenly?
Dear Colleagues,

Something strange happened to clamd TCP communication after upgrading
exim to exim-4.94_4 (FreeBSD).

The configuration is standard:

av_scanner = clamd:192.168.153.104 3310
[...]
acl_check_data:
deny malware = */defer_ok
message = This message contains a virus ($malware_name).

clamav is alive (first thing I checked):

$ telnet 192.168.153.104 3310
Trying 192.168.153.104...
Connected to 192.168.153.104.
Escape character is '^]'.
VERSION
ClamAV 0.103.0/26031/Mon Dec 28 19:43:18 2020
Connection closed by foreign host.

However, on mail delivery exim complains in the log:

2020-12-28 21:57:21 1kttxZ-0000Xw-3x malware acl condition: clamd : unable to send file body to socket (192.168.153.104)

The packet dump is available at http://admin.sibptus.ru/~vas/2.pcap
You can see from it that exim's host opens a TCP connection to the
clamav host, the clamav host replies with SYN+ACK, but the exim host sends a TCP RST
immediately. Why could that happen?

Accessing exim locally via a Unix socket would probably work around the
problem, but for the present I have to keep exim and clamav on different
hosts.

Any bright ideas please? Any debug I could enable and look at?

--
Victor Sudakov, VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
On Tue, Dec 29, 2020 at 10:20:51AM +0700, Victor Sudakov via Exim-users wrote:
> The packet dump is available at http://admin.sibptus.ru/~vas/2.pcap
> You can see from it that exim's host opens a TCP connection to the
> clamav host, the clamav host replies with SYN+ACK, but the exim host sends a TCP RST
> immediately. Why could that happen?

Normally tcp connection initiator should respond with ACK to SYN+ACK.
It looks like firewalling issue, I can't imagine how to produce such
behaviour with socket API.
--
Eugene Berdnikov

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
Evgeniy Berdnikov via Exim-users wrote:
> On Tue, Dec 29, 2020 at 10:20:51AM +0700, Victor Sudakov via Exim-users wrote:
> > The packet dump is available at http://admin.sibptus.ru/~vas/2.pcap
> > You can see from it that exim's host opens a TCP connection to the
> > clamav host, the clamav host replies with SYN+ACK, but the exim host sends a TCP RST
> > immediately. Why could that happen?
>
> Normally tcp connection initiator should respond with ACK to SYN+ACK.

Have you looked at the actual packet dump? I mean, don't trust my word.

> It looks like firewalling issue,

This dump http://admin.sibptus.ru/~vas/3.pcap was collected on the very
host where exim lives (95.170...). I can't imagine how any firewall could
be involved.

> I can't imagine how to produce such
> behaviour with socket API.

Neither do I. Moreover, the problem began after upgrading the OS from
FreeBSD 11 to 12.2-RELEASE-p1, and exim to exim-4.94_4. Maybe the OS is
to blame.

--
Victor Sudakov, VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
On 29/12/2020 11:03, Victor Sudakov via Exim-users wrote:
> Evgeniy Berdnikov via Exim-users wrote:
>> On Tue, Dec 29, 2020 at 10:20:51AM +0700, Victor Sudakov via Exim-users wrote:
>>> You can see from it that exim's host opens a TCP connection to the
>>> clamav host, the clamav host replies with SYN+ACK, but the exim host sends a TCP RST
>>> immediately. Why could that happen?

>> I can't imagine how to produce such
>> behaviour with socket API.
>
> Neither do I. Moreover, the problem began after upgrading the OS from
> FreeBSD 11 to 12.2-RELEASE-p1, and exim to exim-4.94_4. Maybe the OS is
> to blame.

Can you manage to get a syscall trace of exim's actions, isolating those
for that specific socket?
--
Cheers,
Jeremy

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
On Tue, Dec 29, 2020 at 06:03:10PM +0700, Victor Sudakov via Exim-users wrote:
> Evgeniy Berdnikov via Exim-users wrote:
> > On Tue, Dec 29, 2020 at 10:20:51AM +0700, Victor Sudakov via Exim-users wrote:
> > > The packet dump is available at http://admin.sibptus.ru/~vas/2.pcap
> > > You can see from it that exim's host opens a TCP connection to the
> > > clamav host, the clamav host replies with SYN+ACK, but the exim host sends a TCP RST
> > > immediately. Why could that happen?
> >
> > Normally tcp connection initiator should respond with ACK to SYN+ACK.
>
> Have you looked at the actual packet dump? I mean, don't trust my word.

Yes, I have. And have you looked at packet time stamps?
IMHO, 3-5 us time interval between SYN-ACK and RST is
too low for userspace responce.

However, get strace as Jeremy suggested.
I expect ECONNREFUSED on connect().
--
Eugene Berdnikov

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
Am 29.12.20 um 12:03 schrieb Victor Sudakov via Exim-users:
>
>> I can't imagine how to produce such
>> behaviour with socket API.
> Neither do I. Moreover, the problem began after upgrading the OS from
> FreeBSD 11 to 12.2-RELEASE-p1, and exim to exim-4.94_4. Maybe the OS is
> to blame.
>

If you are on it, check if firewalld got installed and is running. It
has it's own firewall and some VERY illlogical settings. a "iptables -L"
won't help you with that.

Best regards,
Marius

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
On 29 Dec 2020, at 7:38, Cyborg via Exim-users wrote:

> Am 29.12.20 um 12:03 schrieb Victor Sudakov via Exim-users:
>>
>>> I can't imagine how to produce such
>>> behaviour with socket API.
>> Neither do I. Moreover, the problem began after upgrading the OS from
>> FreeBSD 11 to 12.2-RELEASE-p1, and exim to exim-4.94_4. Maybe the OS
>> is
>> to blame.
>>
>
> If you are on it, check if firewalld got installed and is running. It
> has it's own firewall and some VERY illlogical settings. a "iptables
> -L" won't help you with that.

Not relevant for FreeBSD. It uses a completely different collection of
onboard firewall tools.

--
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
Jeremy Harris via Exim-users wrote:
> On 29/12/2020 11:03, Victor Sudakov via Exim-users wrote:
> > Evgeniy Berdnikov via Exim-users wrote:
> > > On Tue, Dec 29, 2020 at 10:20:51AM +0700, Victor Sudakov via Exim-users wrote:
> > > > You can see from it that exim's host opens a TCP connection to the
> > > > clamav host, the clamav host replies with SYN+ACK, but the exim host sends a TCP RST
> > > > immediately. Why could that happen?
>
> > > I can't imagine how to produce such
> > > behaviour with socket API.
> >
> > Neither do I. Moreover, the problem began after upgrading the OS from
> > FreeBSD 11 to 12.2-RELEASE-p1, and exim to exim-4.94_4. Maybe the OS is
> > to blame.
>
> Can you manage to get a syscall trace of exim's actions, isolating those
> for that specific socket?

I hope this is what you mean: https://termbin.com/zjsv

This is a test mail sent by SMTP and what's happening to it.

--
Victor Sudakov, VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
Evgeniy Berdnikov via Exim-users wrote:
> On Tue, Dec 29, 2020 at 06:03:10PM +0700, Victor Sudakov via Exim-users wrote:
> > Evgeniy Berdnikov via Exim-users wrote:
> > > On Tue, Dec 29, 2020 at 10:20:51AM +0700, Victor Sudakov via Exim-users wrote:
> > > > The packet dump is available at http://admin.sibptus.ru/~vas/2.pcap
> > > > You can see from it that exim's host opens a TCP connection to the
> > > > clamav host, the clamav host replies with SYN+ACK, but the exim host sends a TCP RST
> > > > immediately. Why could that happen?
> > >
> > > Normally tcp connection initiator should respond with ACK to SYN+ACK.
> >
> > Have you looked at the actual packet dump? I mean, don't trust my word.
>
> Yes, I have. And have you looked at packet time stamps?
> IMHO, 3-5 us time interval between SYN-ACK and RST is
> too low for userspace responce.

Well, the whole story is happening within one hypervisor (bhyve) so
network delays can be very low.

>
> However, get strace as Jeremy suggested.
> I expect ECONNREFUSED on connect().

Is this ktrace informative https://termbin.com/zjsv ?

--
Victor Sudakov, VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
Bill Cole via Exim-users wrote:
> On 29 Dec 2020, at 7:38, Cyborg via Exim-users wrote:
>
> > Am 29.12.20 um 12:03 schrieb Victor Sudakov via Exim-users:
> > >
> > > > I can't imagine how to produce such
> > > > behaviour with socket API.
> > > Neither do I. Moreover, the problem began after upgrading the OS from
> > > FreeBSD 11 to 12.2-RELEASE-p1, and exim to exim-4.94_4. Maybe the OS
> > > is
> > > to blame.
> > >
> >
> > If you are on it, check if firewalld got installed and is running. It
> > has it's own firewall and some VERY illlogical settings. a "iptables -L"
> > won't help you with that.
>
> Not relevant for FreeBSD. It uses a completely different collection of
> onboard firewall tools.

Of which none are enabled on the relevant virtual hosts, because
otherwise this would probably not work either:

$ telnet 192.168.153.104 3310
Trying 192.168.153.104...
Connected to 192.168.153.104.
Escape character is '^]'.
VERSION
ClamAV 0.103.0/26032/Tue Dec 29 19:42:39 2020
Connection closed by foreign host.

$ telnet 192.168.153.104 3310
Trying 192.168.153.104...
Connected to 192.168.153.104.
Escape character is '^]'.
STREAM
PORT 30113



--
Victor Sudakov, VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
On Wed, Dec 30, 2020 at 02:25:19PM +0700, Victor Sudakov via Exim-users wrote:
> Evgeniy Berdnikov via Exim-users wrote:
> > Yes, I have. And have you looked at packet time stamps?
> > IMHO, 3-5 us time interval between SYN-ACK and RST is
> > too low for userspace responce.
>
> Well, the whole story is happening within one hypervisor (bhyve) so
> network delays can be very low.

It does not matter: any hypervisor should perform a context switch bitween
kernel process and userspace process, even for container virtulization.
It's havy task.

> > However, get strace as Jeremy suggested.
> > I expect ECONNREFUSED on connect().
>
> Is this ktrace informative https://termbin.com/zjsv ?

I have never seen such a trace before... However, it seems that relevant
part for connection to 192.168.153.104:3310 is here:

8889 exim CALL socket(PF_INET,0x1<SOCK_STREAM>,IPPROTO_IP)
8889 exim RET socket 5
8889 exim CALL setitimer(0,0x7fffffff30d0,0x7fffffff30b0)
8889 exim STRU itimerval { .interval = {0, 0}, .value = {5, 0} }
8889 exim STRU itimerval { .interval = {0, 0}, .value = {0, 0} }
8889 exim RET setitimer 0
8889 exim CALL setsockopt(0x5,IPPROTO_TCP,TCP_FASTOPEN,0x254270,0x4)
8889 exim RET setsockopt 0
8889 exim CALL sendto(0x5,0x2322d6,0xa,0,0x7fffffff3150,0x10)
8889 exim STRU struct sockaddr { AF_INET, 192.168.153.104:3310 }
8889 exim GIO fd 5 wrote 10 bytes
"zINSTREAM\0"
8889 exim RET sendto 10/0xa
8889 exim CALL setitimer(0,0x7fffffff30d0,0x7fffffff30b0)
8889 exim STRU itimerval { .interval = {0, 0}, .value = {0, 0} }
8889 exim STRU itimerval { .interval = {0, 0}, .value = {4, 999970} }
8889 exim RET setitimer 0
8889 exim CALL close(0xffffffff)
8889 exim RET close -1 errno 9 Bad file descriptor

As packet is sent, it may be some problem with TCP_FASTOPEN, probably
with its handling in hypervisor and/or external firewall.

Consequent close(-1) is definitely an error, but let us ignore it now.
Then exim reads file and tries to write into this socket:

8889 exim CALL sendto(0x5,0x7fffffff3344,0x4,0,0,0)
8889 exim RET sendto -1 errno 57 Socket is not connected
8889 exim CALL close(0x5)

So, the fact that connection is broken is not visible for exim process.

BTW, this trace may be more meaningful with time stamps, turn them on
if possible.
--
Eugene Berdnikov

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
On Wed, Dec 30, 2020 at 11:04:39AM +0300, Evgeniy Berdnikov via Exim-users wrote:
> Then exim reads file and tries to write into this socket:
>
> 8889 exim CALL sendto(0x5,0x7fffffff3344,0x4,0,0,0)
> 8889 exim RET sendto -1 errno 57 Socket is not connected
> 8889 exim CALL close(0x5)
>
> So, the fact that connection is broken is not visible for exim process.

I mean not visible until this send().
--
Eugene Berdnikov

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
On 30/12/2020 08:04, Evgeniy Berdnikov via Exim-users wrote:
> On Wed, Dec 30, 2020 at 02:25:19PM +0700, Victor Sudakov via Exim-users wrote:
>> Is this ktrace informative https://termbin.com/zjsv ?

Yes; thanks.

>
> 8889 exim CALL socket(PF_INET,0x1<SOCK_STREAM>,IPPROTO_IP)
> 8889 exim RET socket 5
> 8889 exim CALL setitimer(0,0x7fffffff30d0,0x7fffffff30b0)
> 8889 exim STRU itimerval { .interval = {0, 0}, .value = {5, 0} }
> 8889 exim STRU itimerval { .interval = {0, 0}, .value = {0, 0} }
> 8889 exim RET setitimer 0
> 8889 exim CALL setsockopt(0x5,IPPROTO_TCP,TCP_FASTOPEN,0x254270,0x4)
> 8889 exim RET setsockopt 0
> 8889 exim CALL sendto(0x5,0x2322d6,0xa,0,0x7fffffff3150,0x10)
> 8889 exim STRU struct sockaddr { AF_INET, 192.168.153.104:3310 }
> 8889 exim GIO fd 5 wrote 10 bytes
> "zINSTREAM\0"
> 8889 exim RET sendto 10/0xa
> 8889 exim CALL setitimer(0,0x7fffffff30d0,0x7fffffff30b0)
> 8889 exim STRU itimerval { .interval = {0, 0}, .value = {0, 0} }
> 8889 exim STRU itimerval { .interval = {0, 0}, .value = {4, 999970} }
> 8889 exim RET setitimer 0
> 8889 exim CALL close(0xffffffff)
> 8889 exim RET close -1 errno 9 Bad file descriptor
>
> As packet is sent, it may be some problem with TCP_FASTOPEN, probably
> with its handling in hypervisor and/or external firewall.

Kernel, I think. The packet capture showed the SYNs not carrying
any TFO request, despite that TCP_FASTOPEN setsockopt. Probably the
FreeBSD implementation has changed since I worked on the Exim
implementation, in such a way as to break the combination.

In case it helps, my notes from then include:

# FreeBSD: it looks like you have to compile a custom kernel, with
# 'options TCP_RFC7413' in the config. Also set
# 'net.inet.tcp.fastopen.server_enable=1' in /etc/sysctl.conf

If there's a sysctl for enabling the client side, try changing
it. If that affects this, we need to know.


I could try to code up a backstop, retrying the connection without
TFO... sigh. Without some effort it wouldn't be particularly efficient
in service operation, and I call having to do that "ugly". Better
to get it working correctly, or deciding before trying that the
feature is not usable on the platform.
Slightly less ugly would be a config option "no TFO", either global
or just on the av_scanner address.


Meantime, the Exim debug channels "acl" and "transport" would show
the sequence from a higher-level view. We might be able to guess
what that close(-1) was (yes, that's a bug. Not an important one).

A compile-time workaround would be to disable TFO support, by commenting
out the line "#define EXIM_SUPPORT_TFO" in src/ip.c
>
> Consequent close(-1) is definitely an error, but let us ignore it now.
> Then exim reads file and tries to write into this socket:
>
> 8889 exim CALL sendto(0x5,0x7fffffff3344,0x4,0,0,0)

That write will be the filesize, given the zINSTREAM protocol
being used. The coding is assuming that it will either block until
the TCP connection has been made (and the TFO data either sent
as part of that or immediately following), or queue the data for
when it is made. Either way, it's not expecting an error return.

> 8889 exim RET sendto -1 errno 57 Socket is not connected

The error would be reasonable if we'd not tried to connect - which
is what the first sendto() (under TCP_FASTOPEN) is supposed to do.

--
Cheers,
Jeremy

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
Evgeniy Berdnikov via Exim-users wrote:
> On Wed, Dec 30, 2020 at 02:25:19PM +0700, Victor Sudakov via Exim-users wrote:
> > Evgeniy Berdnikov via Exim-users wrote:
> > > Yes, I have. And have you looked at packet time stamps?
> > > IMHO, 3-5 us time interval between SYN-ACK and RST is
> > > too low for userspace responce.
> >
> > Well, the whole story is happening within one hypervisor (bhyve) so
> > network delays can be very low.
>
> It does not matter: any hypervisor should perform a context switch bitween
> kernel process and userspace process, even for container virtulization.
> It's havy task.
>
> > > However, get strace as Jeremy suggested.
> > > I expect ECONNREFUSED on connect().
> >
> > Is this ktrace informative https://termbin.com/zjsv ?
>
> I have never seen such a trace before...

It's FreeBSD's native ktrace. FreeBSD also has dtrace but I have no
experience therewith.

>
> As packet is sent, it may be some problem with TCP_FASTOPEN, probably
> with its handling in hypervisor and/or external firewall.

You must be a wise man Evgeniy! On the exim host, I set sysctl
net.inet.tcp.fastopen.client_enable=0 and the problem seemed to go away.

Now if you care :-) please explain what it was and what I've lost by
disabling fastopen client.
>
> BTW, this trace may be more meaningful with time stamps, turn them on
> if possible.

For its academic value, the same ktrace with Unix timestamps:
https://termbin.com/iemr

--
Victor Sudakov, VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
Jeremy Harris via Exim-users wrote:
> On 30/12/2020 08:04, Evgeniy Berdnikov via Exim-users wrote:
> > On Wed, Dec 30, 2020 at 02:25:19PM +0700, Victor Sudakov via Exim-users wrote:
> > > Is this ktrace informative https://termbin.com/zjsv ?
>
> Yes; thanks.
>
> >
> > 8889 exim CALL socket(PF_INET,0x1<SOCK_STREAM>,IPPROTO_IP)
> > 8889 exim RET socket 5
> > 8889 exim CALL setitimer(0,0x7fffffff30d0,0x7fffffff30b0)
> > 8889 exim STRU itimerval { .interval = {0, 0}, .value = {5, 0} }
> > 8889 exim STRU itimerval { .interval = {0, 0}, .value = {0, 0} }
> > 8889 exim RET setitimer 0
> > 8889 exim CALL setsockopt(0x5,IPPROTO_TCP,TCP_FASTOPEN,0x254270,0x4)
> > 8889 exim RET setsockopt 0
> > 8889 exim CALL sendto(0x5,0x2322d6,0xa,0,0x7fffffff3150,0x10)
> > 8889 exim STRU struct sockaddr { AF_INET, 192.168.153.104:3310 }
> > 8889 exim GIO fd 5 wrote 10 bytes
> > "zINSTREAM\0"
> > 8889 exim RET sendto 10/0xa
> > 8889 exim CALL setitimer(0,0x7fffffff30d0,0x7fffffff30b0)
> > 8889 exim STRU itimerval { .interval = {0, 0}, .value = {0, 0} }
> > 8889 exim STRU itimerval { .interval = {0, 0}, .value = {4, 999970} }
> > 8889 exim RET setitimer 0
> > 8889 exim CALL close(0xffffffff)
> > 8889 exim RET close -1 errno 9 Bad file descriptor
> >
> > As packet is sent, it may be some problem with TCP_FASTOPEN, probably
> > with its handling in hypervisor and/or external firewall.
>
> Kernel, I think. The packet capture showed the SYNs not carrying
> any TFO request, despite that TCP_FASTOPEN setsockopt. Probably the
> FreeBSD implementation has changed since I worked on the Exim
> implementation, in such a way as to break the combination.
>
> In case it helps, my notes from then include:
>
> # FreeBSD: it looks like you have to compile a custom kernel, with
> # 'options TCP_RFC7413' in the config. Also set
> # 'net.inet.tcp.fastopen.server_enable=1' in /etc/sysctl.conf

Well, the defaults in FreeBSD 12.2 (GENERIC kernel) are:

$ sysctl -a | grep fastopen
net.inet.tcp.fastopen.server_enable: 0
net.inet.tcp.fastopen.psk_enable: 0
net.inet.tcp.fastopen.path_disable_time: 900
net.inet.tcp.fastopen.numpsks: 0
net.inet.tcp.fastopen.numkeys: 0
net.inet.tcp.fastopen.maxpsks: 2
net.inet.tcp.fastopen.maxkeys: 2
net.inet.tcp.fastopen.keylen: 16
net.inet.tcp.fastopen.client_enable: 1
net.inet.tcp.fastopen.ccache_buckets: 2048
net.inet.tcp.fastopen.ccache_bucket_limit: 16
net.inet.tcp.fastopen.autokey: 120


In my situation, I set net.inet.tcp.fastopen.client_enable=0 on the
client (exim host) and it cured the problem. I could probably have set
net.inet.tcp.fastopen.server_enable=1 on the server side (clamd host)
and it would cure the problem too?

>
> If there's a sysctl for enabling the client side, try changing
> it. If that affects this, we need to know.

Please see above.

>
>
> I could try to code up a backstop, retrying the connection without
> TFO... sigh. Without some effort it wouldn't be particularly efficient
> in service operation, and I call having to do that "ugly". Better
> to get it working correctly, or deciding before trying that the
> feature is not usable on the platform.
> Slightly less ugly would be a config option "no TFO", either global
> or just on the av_scanner address.

Jeremy, I did not quite understand if the whole problem is a bug in
FreeBSD, or a bug in Exim, or both, but if I can provide any help or
additional info/testing to clear the situation once and for all, I'd be
glad to.

>
>
> Meantime, the Exim debug channels "acl" and "transport" would show
> the sequence from a higher-level view. We might be able to guess
> what that close(-1) was (yes, that's a bug. Not an important one).
>
--
Victor Sudakov, VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
On 30/12/2020 13:10, Victor Sudakov via Exim-users wrote:
> You must be a wise man Evgeniy! On the exim host, I set sysctl
> net.inet.tcp.fastopen.client_enable=0 and the problem seemed to go away.

That's good to know; thanks.
It may be worth raising a bug against FreeBSD.

> Now if you care :-) please explain what it was and what I've lost by
> disabling fastopen client.

The fastopen implementation in the FreeBSD kernel is broken
or has changed in some way that breaks previously-reasonable
use of it.

For this use (a very low-latency path, to a sibling VM) it's
a very minor efficiency measure. Disabling it won't lose you
anything measurable. For the more-general SMTP use of it, over
global-latency paths - you still won't lose much at this time
because I've not managed yet to find an implementation of TFO
which can place data on the SYN,ACK packet. Because SMTP is a
server-first protocol that's the one that matters (apart from
for TLS-on-connect, usable by MSA connections [commonly port 465],
where there's potential for getting the TLS Client Hello in the
SYN). But (at least) Linux has only support for data-on-SYN TFO
and not for data-on-SYN,ACK. I live in hope.
--
Cheers,
Jeremy

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
On 30/12/2020 13:20, Victor Sudakov via Exim-users wrote:
> In my situation, I set net.inet.tcp.fastopen.client_enable=0 on the
> client (exim host) and it cured the problem.

Good to know. It's a workaround, not a fix, IMO.

> I could probably have set
> net.inet.tcp.fastopen.server_enable=1 on the server side (clamd host)
> and it would cure the problem too?

It's worth trying. However, a TFO implementation is *required*
to operate properly talking to a non-TFO peer.

> Jeremy, I did not quite understand if the whole problem is a bug in
> FreeBSD

This.

> or a bug in Exim, or both, but if I can provide any help or
> additional info/testing to clear the situation once and for all, I'd be
> glad to.

If you could get a run with the original configuration, but with debug
enabled (command-line "-d+all") on the exim that ends up calling
out to Clam, that will help to locate that "close(-1)" we saw.

If that's the exim daemon:

- Check using "ps" for any extra args normally used on your
exim daemon process
- stop the exim service
- run
# exim -d+all -bd 2>&1 | tee logfile
to get a daemon with debug.
--
Cheers,
Jeremy

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
Jeremy Harris via Exim-users wrote:
> On 30/12/2020 13:20, Victor Sudakov via Exim-users wrote:
> > In my situation, I set net.inet.tcp.fastopen.client_enable=0 on the
> > client (exim host) and it cured the problem.
>
> Good to know. It's a workaround, not a fix, IMO.
>
> > I could probably have set
> > net.inet.tcp.fastopen.server_enable=1 on the server side (clamd host)
> > and it would cure the problem too?
>
> It's worth trying. However, a TFO implementation is *required*
> to operate properly talking to a non-TFO peer.
>
> > Jeremy, I did not quite understand if the whole problem is a bug in
> > FreeBSD
>
> This.

Maybe I should file a PR to the FreeBSD team, but could you suggest a
very simple test case? Maybe a couple of lines of code to open a TCP
connection and fail?


>
> > or a bug in Exim, or both, but if I can provide any help or
> > additional info/testing to clear the situation once and for all, I'd be
> > glad to.
>
> If you could get a run with the original configuration, but with debug
> enabled (command-line "-d+all") on the exim that ends up calling
> out to Clam, that will help to locate that "close(-1)" we saw.
>
> If that's the exim daemon:
>
> - Check using "ps" for any extra args normally used on your
> exim daemon process
> - stop the exim service
> - run
> # exim -d+all -bd 2>&1 | tee logfile
> to get a daemon with debug.

I now have this log and am ready to send it privately to you or another
person requesting it, preferably in encrypted form. I would not like to
publish such a detailed log somewhere on the Internet.

A relevant snippet from the log is below:


13:54:33 63708 Malware scan: clamd tmo=2m
13:54:33 63708 trying server name 192.168.153.104, port 3310
13:54:33 63708 TFO mode connection attempt to 192.168.153.104, 10 data
13:54:33 63708 Malware scan: issuing clamd new-style remote scan (zINSTREAM)
13:54:33 63708 socket: domain AF_INET lcl [95.170.141.50]:47149 type SOCK_STREAM proto tcp
13:54:33 63708 LOG: MAIN PANIC
13:54:33 63708 malware acl condition: clamd : unable to send file body to socket (192.168.153.104)
13:54:33 63708 deny: condition test failed in ACL "acl_check_data"


--
Victor Sudakov, VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/
Re: av_scanner is broken suddenly? [ In reply to ]
Victor Sudakov via Exim-users wrote:
> >
> > > or a bug in Exim, or both, but if I can provide any help or
> > > additional info/testing to clear the situation once and for all, I'd be
> > > glad to.
> >
> > If you could get a run with the original configuration, but with debug
> > enabled (command-line "-d+all") on the exim that ends up calling
> > out to Clam, that will help to locate that "close(-1)" we saw.
> >
> > If that's the exim daemon:
> >
> > - Check using "ps" for any extra args normally used on your
> > exim daemon process
> > - stop the exim service
> > - run
> > # exim -d+all -bd 2>&1 | tee logfile
> > to get a daemon with debug.
>
> I now have this log and am ready to send it privately to you or another
> person requesting it, preferably in encrypted form. I would not like to
> publish such a detailed log somewhere on the Internet.

Does anyone need the "-d+all" log?

>
> A relevant snippet from the log is below:
>
>
> 13:54:33 63708 Malware scan: clamd tmo=2m
> 13:54:33 63708 trying server name 192.168.153.104, port 3310
> 13:54:33 63708 TFO mode connection attempt to 192.168.153.104, 10 data
> 13:54:33 63708 Malware scan: issuing clamd new-style remote scan (zINSTREAM)
> 13:54:33 63708 socket: domain AF_INET lcl [95.170.141.50]:47149 type SOCK_STREAM proto tcp
> 13:54:33 63708 LOG: MAIN PANIC
> 13:54:33 63708 malware acl condition: clamd : unable to send file body to socket (192.168.153.104)
> 13:54:33 63708 deny: condition test failed in ACL "acl_check_data"
>
>

--
Victor Sudakov, VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: av_scanner is broken suddenly? [ In reply to ]
On 04/01/2021 06:53, Victor Sudakov via Exim-users wrote:
> Does anyone need the "-d+all" log?

No, I think we've made as much progress as we can with
the snippet and other info you provided. Thanks.

I managed to create a set with FreeBSD 12.2 which exhibits
the "SYN-only" symptom, and have added a bit of hardening
in the code (3eb0bcd7a1) to ensure that the connection
open times out reasonably quickly. That's obviously not a
real fix, and I'm fairly sure this needs to be addressed in
the FreeBSD kernel stack. It would help a *lot* if there was
documentation around TFO - for example, the TCP_FASTOPEN
sockopt being described in the tcp(7) manpage (Linux is just
as bad; neither TCP_FASTOPEN not MSG_FASTOPEN are described,
and the bug I opened was ignored for long enough for it to be
auto-closed).

I was hindered somewhat by another apparent bug, probably
related to the ipfw/dummynet delay mechanism I was using
to get userland processing interspersed with packets...
after some undefined time, the interface lost it's IP addresses
(so became unusable). A reboot seemed to be required.

Creating a standalone demo program would be pretty simple;
it's not hard code. Working out the combination of
conditions it the difficult bit (it took me ages to realise
that the kernel was negative-caching, hence not trying TFO/R
for a recently non-responding peer).
I've not been sufficiently enthusiastic to do that. If
anyone want to pick it up, please ask; I can point to the
its of Exim source to steal from.



Other lacking bits of TFO implementations:
- means for placing the first server data on the SYN,ACK
- observability, from userland, of what a particular connection
sequence actually achieved
--
Cheers,
Jeremy

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/