Mailing List Archive

Google SMTP Timeouts on large mails
Hi all,



I've seen this issue raised in:



https://lists.exim.org/lurker/message/20220216.071725.892984cd.en.html

and

https://lists.exim.org/lurker/message/20220313.200645.624cc373.en.html



but haven't seen a definite resolution as yet.



As per other reports, I have a Debian Bullseye (11.3) system running Exim
4.94.2 #2. It is set up with virtual domains using dovecot for local delivery
and aliases defined for some simple forwarding. I wasn't aware of any
similar issue in Exim 4.92 (on Debian 10). I see log entries similar to
those in other reports, e.g.:



/var/log/exim4/mainlog:2022-04-27 07:47:30 1njbGQ-005LxL-M5
H=gmail-smtp-in.l.google.com [2a00:1450:4010:c0e::1a]: SMTP timeout after
sending data block (199774 bytes written): Connection timed out

/var/log/exim4/mainlog:2022-04-27 07:50:10 1njbGU-005Lz8-RV
H=gmail-smtp-in.l.google.com [74.125.131.26]: SMTP timeout after end of data
(246239 bytes written): Connection timed out
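
For anyone wanting to tally entries like these, here is a minimal Python sketch that picks the host, address, stage and byte count out of such lines. It assumes each entry sits on a single line in the real mainlog (the lines above are wrapped by the mail client). The names parse_timeout and TIMEOUT_RE are illustrative only and are not part of Exim.

```python
import re

# Pattern modelled on the two mainlog entries quoted above; other Exim
# versions or log_selector settings may format these lines differently.
TIMEOUT_RE = re.compile(
    r"(?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<msgid>\S+) "
    r"H=(?P<host>\S+) \[(?P<addr>[^\]]+)\]: "
    r"SMTP timeout after (?P<stage>.+?) "
    r"\((?P<written>\d+) bytes written\)"
)


def parse_timeout(line):
    """Return a dict describing an SMTP timeout entry, or None if no match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    entry = m.groupdict()
    entry["written"] = int(entry["written"])  # byte count as a number
    return entry


if __name__ == "__main__":
    sample = (
        "2022-04-27 07:50:10 1njbGU-005Lz8-RV "
        "H=gmail-smtp-in.l.google.com [74.125.131.26]: "
        "SMTP timeout after end of data (246239 bytes written): "
        "Connection timed out"
    )
    print(parse_timeout(sample))
```

Grouping the results by stage and byte count might help show whether the timeouts cluster around a particular message size.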



This is for both IPv4 and IPv6 connections, and to only Google mail servers,
and only when delivering "large" messages (bigger than, say, about 100 kB,
though I haven't fully investigated the limits; short, text-only mail is
fine). Eventually, the messages do get through, but with delays of hours in
some cases. As per other reports, delivery of the same mail to all other
hosts works perfectly. This occurs both with firewall rules set to allow
everything, as well as with a "normal" ruleset allowing: all
OUTBOUND/FORWARD, all ICMP INBOUND and all TCP INBOUND with ctstate
RELATED,ESTABLISHED (as well as ports opened for relevant services).



If I do: sysctl net.ipv4.tcp_window_scaling=0 , then everything works
perfectly - with tcp_window_scaling=1, the issue is reproduced.



I have a packet capture which is available here:



https://tinyurl.com/742s855d



The Session log from Exim in debug mode is here (with redacted hosts,
addresses, etc) - the message was delivered to the server, and is being
forwarded onto an email in a Google workspace account (following a
forwarding rule in an aliases file)



https://tinyurl.com/22nn887u





Is it possible from these traces to pin down the issue at all and maybe come
up with a workaround (without having to turn off tcp_window_scaling), or a
pointer as to where I need to formally raise a bug? I'll be happy to do
so!



Thanks



Graeme

--

graeme at chromosphere dot co dot uk

--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Re: Google SMTP Timeouts on large mails
On 29/04/2022 10:56, Graeme Coates via Exim-users wrote:
> I have a packet capture which is available here:

> https://tinyurl.com/742s855d

Thank you so much for gathering this.

It seems to show buggy behaviour in your Debian TCP implementation
(or possibly a software firewall).
I don't see any way that Exim could be forcing this.

Specifically, we see (multiple) retries of a TCP segment for which
we saw both the original data and the ACK from the peer (Google).

There are no SACKs, despite further ACKs after the apparently missed one
(and it being a SACK-enabled connection). This implies *no* ACKs
from that point on were received by the TCP code.


We can't tell exactly what data was involved, lacking the TLS session
keys, but given the above it's probably moot. If you care to investigate
that, see the text around "Add SSLKEYLOGFILE to keep_environment in the exim config"
and feed the resulting file to wireshark.

> The Session log from Exim in debug mode is here (with redacted hosts,
> addresses, etc) - the message was delivered to the server, and is being
> forwarded onto an email in a Google workspace account (following a
> forwarding rule in an aliases file)

> https://tinyurl.com/22nn887u

It all looks reasonable there, up to the point that the GnuTLS library
tells us "The TLS connection was non-properly terminated." - which would
follow on from the pcap-observed problem at the TCP level.

> Is it possible from these traces to pin down the issue at all and maybe come
> up with a workaround (without having to turn off tcp_window_scaling), or a
> pointer as to where I need to formally raise a bug? I'll be happy to do
> so!

You already mentioned IPv4/6 makes no difference.
You could try disabling TFO (but I think it's unlikely to help),
TLSv1.3 (ditto), CHUNKING (more possible, but again it's entirely the
wrong protocol layer), PIPELINING (ditto).

The problem going away when you disable TCP window scaling is interesting,
but it might just be shifting the point it bites to somewhere else
in other size flows.
Exim has no facility to set a small transmit socket buffer size (which
would have the same effect, without massacring performance for other
users of the network), I'm afraid.


I guess, if ACKs are not being seen by your TCP endpoint, the socket will
still be holding un-ack'd data in the transmit queue. If you can catch that
(use "ss -panmit dport = 25") it would confirm my interpretation.
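
As a rough aid for spotting such sockets, the following Python sketch scans text captured from that ss command and lists connections whose Send-Q is non-zero. For an established socket, Send-Q is the number of bytes not yet acknowledged by the peer. The column order (State, Recv-Q, Send-Q, local address, peer address) and the indented detail lines are assumptions about the ss output layout, and the function name is hypothetical.

```python
def sockets_with_unsent_data(ss_output):
    """Return (peer, send_q) pairs for sockets whose Send-Q is non-zero."""
    results = []
    for line in ss_output.splitlines():
        # Detail lines produced by -i/-m are indented; skip them and blanks.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        # Skip the header row and anything too short to be a socket line.
        if len(fields) < 5 or not fields[2].isdigit():
            continue
        send_q = int(fields[2])
        if send_q > 0:
            results.append((fields[4], send_q))
    return results
```

Feeding it output captured a few seconds apart should show whether the same byte count stays stuck in the queue for the Google connection.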

If it's the firewall that's dropping inbound TCP ACK packets, I guess there's
the possibility of configuring it to log drops.
--
Cheers,
Jeremy

Re: Google SMTP Timeouts on large mails
On 29/04/2022 10:56, Graeme Coates via Exim-users wrote:
> a
> pointer as to where I need to formally raise a bug, and I'll be happy to do
> so!

I forgot to answer this point.

You could open one at bugs.exim.org just so the info doesn't
get lost. But, currently, I don't think it's likely a bug
in Exim.

You should, I think, open a bug against Debian.
Include that packet capture; it's a red flag.
Feel free to include my analysis of it, too.


(I do hope you're not running any bolt-on "security"
products. I've seen too many bugs associated with such.)
--
Cheers,
Jeremy

Re: Google SMTP Timeouts on large mails
On Fri, 2022-04-29 at 10:56 +0100, Graeme Coates via Exim-users wrote:
> Hi all,
>
> I've seen this issue raised in:
>
> https://lists.exim.org/lurker/message/20220216.071725.892984cd.en.html
> and
> https://lists.exim.org/lurker/message/20220313.200645.624cc373.en.html
>
> but haven't seen a definite resolution as yet.
>
> As per other reports, I have a Debian Bullseye (11.3) system running

This is likely to be the result of a known issue with Google's TCP Fast
Open setup - see e.g.
https://blog.apnic.net/2021/07/05/tcp-fast-open-not-so-fast/

Exim 4.93 changed the default for the "hosts_try_fastopen" transport
option to be "*", and the default for the
net.ipv4.tcp_fastopen_blackhole_timeout_sec sysctl changed from 3600
(i.e. an hour) to 0 at some point between the kernel versions in Debian
buster (10) and bullseye (11).
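
To see what a given host is actually running with, a small Python sketch along these lines reads the relevant sysctls straight from /proc/sys. The helper name read_sysctl and its root parameter are illustrative, and the list of names is simply the set of sysctls mentioned in this thread plus net.ipv4.tcp_fastopen.

```python
from pathlib import Path

# Sysctls relevant to this thread.
SYSCTLS = (
    "net.ipv4.tcp_window_scaling",
    "net.ipv4.tcp_fastopen",
    "net.ipv4.tcp_fastopen_blackhole_timeout_sec",
)


def read_sysctl(name, root="/proc/sys"):
    """Return the sysctl's value as a string, or None if it cannot be read."""
    # A dotted sysctl name maps onto a path below /proc/sys.
    path = Path(root).joinpath(*name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for name in SYSCTLS:
        print(f"{name} = {read_sysctl(name)}")
```

Comparing the output on a buster host and a bullseye host would be one way to check the change of default described above.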

A workaround is to add something similar to
"hosts_try_fastopen = !*.l.google.com" to your SMTP transports.

Regards,

Adam


Re: Google SMTP Timeouts on large mails
On 30/04/2022 17:43, Adam D. Barratt via Exim-users wrote:
> This is likely to be the result of a known issue with Google's TCP Fast
> Open setup - see e.g.
> https://blog.apnic.net/2021/07/05/tcp-fast-open-not-so-fast/

Always worth a try, but that blog description doesn't match
what the packet capture showed.
--
Cheers,
Jeremy

Re: Google SMTP Timeouts on large mails
On 2022-04-29, Graeme Coates via Exim-users <exim-users@exim.org> wrote:
> Is it possible from these traces to pin down the issue at all and maybe come
> up with a workaround (without having to turn off tcp_window_scaling), or a
> pointer as to where I need to formally raise a bug? I'll be happy to do
> so!

Make sure that your DNS and return-path MX are working. We recently
had some sort of firewall issue, unrelated to SMTP, that was causing
timeouts on deliveries to Gmail. Removing the firewall rules cleared
it up.

--
Jasen.

Re: Google SMTP Timeouts on large mails
Time to feed back on this issue. I gave this a try, and have been running
it for a few weeks now before getting to the point of raising a bug/bugs. In
/etc/exim4/conf.d/transport/30_exim4-config_remote_smtp, I added:

hosts_try_fastopen = !*.l.google.com

Conveniently, when coming to add it, I had a pile of queued messages for
delivery via Google MX hosts, and after adding the above config and
restarting exim, they all delivered first time. I've not seen any issues of
this sort since, both in general day to day running of the server, and in
targeted testing. So, it looks like this is a workaround (though, the blog
description doesn't match what my packet trace was showing!).

Cheers

Graeme


On 2022-04-30 17:34, Jeremy Harris wrote:
> On 30/04/2022 17:43, Adam D. Barratt via Exim-users wrote:
> > This is likely to be the result of a known issue with Google's TCP Fast
> > Open setup - see e.g.
> > https://blog.apnic.net/2021/07/05/tcp-fast-open-not-so-fast/
>
> Always worth a try, but that blog description doesn't match
> what the packet capture showed.
> --
> Cheers,
>   Jeremy
>
>

