Mailing List Archive

Temporary Crypto Glitches ... ??
This has got to be one of the weirdest problem descriptions I've ever
dared publish ...

Yesterday evening, I had problems SSHing from a jump host through an
IPsec VPN to a couple customer servers (everything running CentOS 7). I
was able to work around the problem by fiddling with the crypto
settings; some more details below.

This morning, those connections were back to normal, but the supporter
on duty reported that he could not SSH into an entirely different server
(also CentOS 7, and straight from his workplace machine); that problem
fixed itself a couple hours later, too.

Is this just the spookiest coincidence since last Halloween, or did we
chance onto a rare, time-triggered malfunction somewhere in the
OpenSSH(/OpenSSL?) crypto ... ?

-------

Alas, the supporter isn't up to SSH connection debugging, so he never
did a -vv and couldn't tell any symptom beyond "it times out". I failed
to save my -vv's output, but I remember that roughly where you'd
normally get to the KEXINIT, my client claimed to be waiting for some
ECDH - and then just sat until the timeout.

I usually have two keypairs - one ed25519, one RSA - loaded into my
agent, and now that things are back to normal, the Kex chosen is
curve25519-sha256@libssh.org. In order to circumvent the problem, I had
to remove my RSA keypair from the agent and use

> $ ssh -o "KexAlgorithms diffie-hellman-group-exchange-sha256" $SERVER

to get logged in.

I started haveged on "my" target machines, but
/proc/sys/kernel/random/entropy_avail reported > 3kbit anyway and my
colleague's remote system had haveged running already, so I doubt that
that actually did anything.

Our monitoring and automated data fetchers apparently never had any
problem SSHing into those servers - using RSA keypairs. The server, set
to LogLevel VERBOSE and typically logging

> Connection from $CLIENT_IP ...
> Postponed publickey for $LOCAL_USER ...

at the beginning of a connection, never wrote the second line for the
failed attempts. (With all our accesses getting SNATed, I'm not sure yet
whether there are any dangling instances of the *first* line.)
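
(For reference, that's the stock CentOS 7 setup, i.e. roughly

> # /etc/ssh/sshd_config
> LogLevel VERBOSE

on the servers, with the lines above then pulled from /var/log/secure -
the default log location on CentOS 7; adjust if syslog is set up
differently. Treat this snippet as a sketch of the config, not a dump.)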

Nothing in hosts.allow/hosts.deny, and DNS lookups of the client IP
garner an NXDOMAIN normally.

Thanks for any pointers,
--
Jochen Bern
Systemingenieur

Binect GmbH
Re: Temporary Crypto Glitches ... ??
Hi Jochen

We run a few thousand hosts with varying quality of internet lines.
One of our fallback procedures is to try using only ed25519 crypto if the
connection fails halfway through. The reason is that it needs smaller
packets, which can help if there is (more) trouble with bigger network
packets.
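
For illustration, forcing such an ed25519-only handshake from the client
side could look roughly like this (option names per ssh_config(5); just a
sketch, not our exact procedure):

> $ ssh -o "KexAlgorithms curve25519-sha256@libssh.org" \
>       -o "HostKeyAlgorithms ssh-ed25519" \
>       -o "PubkeyAcceptedKeyTypes ssh-ed25519" $SERVER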

Cheers

Konrad

On 09.11.21 17:35, Jochen Bern wrote:
> This has got to be one of the weirdest problem descriptions I've ever
> dared publish ...
>
> Yesterday evening, I had problems SSHing from a jump host through an
> IPsec VPN to a couple customer servers (everything running CentOS 7). I
> was able to work around the problem by fiddling with the crypto
> settings; some more details below.
>
> This morning, those connections were back to normal, but the supporter
> on duty reported that he could not SSH into an entirely different server
> (also CentOS 7, and straight from his workplace machine); that problem
> fixed itself a couple hours later, too.
>
> Is this just the spookiest coincidence since last Halloween, or did we
> chance onto a rare, time-triggered malfunction somewhere in the
> OpenSSH(/OpenSSL?) crypto ... ?
>
> -------
>
> Alas, the supporter isn't up to SSH connection debugging, so he never
> did a -vv and couldn't tell any symptom beyond "it times out". I failed
> to save my -vv's output, but I remember that roughly where you'd
> normally get to the KEXINIT, my client claimed to be waiting for some
> ECDH - and then just sat until the timeout.
>
> I usually have two keypairs - one ed25519, one RSA - loaded into my
> agent, and now that things are back to normal, the Kex chosen is
> curve25519-sha256@libssh.org. In order to circumvent the problem, I had
> to remove my RSA keypair from the agent and use
>
>> $ ssh -o "KexAlgorithms diffie-hellman-group-exchange-sha256" $SERVER
>
> to get logged in.
>
> I started haveged on "my" target machines, but
> /proc/sys/kernel/random/entropy_avail reported > 3kbit anyway and my
> colleague's remote system had haveged running already, so I doubt that
> that actually did anything.
>
> Our monitoring and automated data fetchers apparently never had any
> problem SSHing into those servers - using RSA keypairs. The server, set
> to LogLevel VERBOSE and typically logging
>
>> Connection from $CLIENT_IP ...
>> Postponed publickey for $LOCAL_USER ...
>
> at the beginning of a connection, never wrote the second line for the
> failed attempts. (With all our accesses getting SNATed, I'm not sure yet
> whether there are any dangling instances of the *first* line.)
>
> Nothing in hosts.allow/hosts.deny, and DNS lookups of the client IP
> garner an NXDOMAIN normally.
>
> Thanks for any pointers,
>
> _______________________________________________
> openssh-unix-dev mailing list
> openssh-unix-dev@mindrot.org
> https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev
>


_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@mindrot.org
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev
Re: Temporary Crypto Glitches ... ??
On 2021/11/11 12:49, Konrad Bucheli wrote:
> Hi Jochen
>
> We run a few thousand hosts with varying quality of internet lines.
> One of our fallback procedures is to try using only ed25519 crypto if the
> connection fails halfway through. The reason is that it needs smaller
> packets, which can help if there is (more) trouble with bigger network
> packets.

This often indicates problems where some links have smaller than usual
MTUs, in combination with missing ICMP fragmentation-needed messages
(usually due to incorrect firewall configuration somewhere on the path).
The handshake won't be the only place where you run into problems, though;
using ed25519 to sidestep this just pushes the problem deeper, and you're
likely to run into stalls during either file transfers or with large
amounts of output. Reducing the MTU (or clamping the TCP MSS) might be a
better idea if you know you have to work over broken networks.
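
On a Linux gateway the usual MSS clamp is something along the lines of

> # iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

applied on the box carrying the tunnel - take that as a sketch, the right
table/chain depends on the firewall setup in question.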

_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@mindrot.org
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev
Re: Temporary Crypto Glitches ... ??
On 11.11.21 13:31, Stuart Henderson wrote:
> On 2021/11/11 12:49, Konrad Bucheli wrote:
>> Hi Jochen
>>
>> We run a few thousand hosts with varying quality of internet lines.
>> One of our fallback procedures is to try using only ed25519 crypto if the
>> connection fails halfway through. The reason is that it needs smaller
>> packets, which can help if there is (more) trouble with bigger network
>> packets.
>
> This often indicates problems where some links have smaller than usual
> MTUs, in combination with missing ICMP fragmentation-needed messages
> (usually due to incorrect firewall configuration somewhere on the path).
> The handshake won't be the only place where you run into problems, though;
> using ed25519 to sidestep this just pushes the problem deeper, and you're
> likely to run into stalls during either file transfers or with large
> amounts of output. Reducing the MTU (or clamping the TCP MSS) might be a
> better idea if you know you have to work over broken networks.

Thanks for the pointers, but after running into the problem again today,
I'm afraid that the actual crypto algorithms do not play a role in the
problem ...

This time, the problem appeared between a devel test VM (OpenSSH config
tightened according to lynis guidelines, no agent involved, just one
4096 bit RSA and one ed25519 keypair in ~/.ssh) and one of our jump
hosts (serving as a test server in this instance, sshd config hardened
manually). The -vvv outputs from the commands that I also tried a week
ago differ as follows:

> $ diff nok.client.log ok.client.log
> 1c1
> < $ ssh -vvv binect-support
> ---
>> $ ssh -vvv -o "KexAlgorithms diffie-hellman-group-exchange-sha256" binect-support
> 40c40
> < debug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha256,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1,ext-info-c
> ---
>> debug2: KEX algorithms: diffie-hellman-group-exchange-sha256,ext-info-c
> 65c65
> < debug1: kex: algorithm: curve25519-sha256@libssh.org
> ---
>> debug1: kex: algorithm: diffie-hellman-group-exchange-sha256
> 69,74c69,230
> < debug1: kex: curve25519-sha256@libssh.org need=16 dh_need=16
> < debug1: kex: curve25519-sha256@libssh.org need=16 dh_need=16
> < debug3: send packet: type 30
> < debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
> <
> < [ ... wait for timeout ... ]
> ---
>> debug1: kex: diffie-hellman-group-exchange-sha256 need=16 dh_need=16
>> debug1: kex: diffie-hellman-group-exchange-sha256 need=16 dh_need=16
>> debug3: send packet: type 34
>> debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<3072<8192) sent
>> debug3: receive packet: type 31
>> debug1: got SSH2_MSG_KEX_DH_GEX_GROUP
>> [...]

Upon seeing the difference in sheer *length* of the KEX algorithm list
offered by the client, I started experimenting with lists shortened from
either end ... the result being that I could cut out *entirely disjoint*
sets of algorithms to make the connection work.

Then I tried *this*:

> $ ssh -vvv -o "KexAlgorithms curve25519-sha256@libssh.org,curve25519-sha256@libssh.org,curve25519-sha256@libssh.org,curve25519-sha256@libssh.org,curve25519-sha256@libssh.org,curve25519-sha256@libssh.org,curve25519-sha256@libssh.org,curve25519-sha256@libssh.org" binect-support
> OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017
[...]
> debug1: kex: algorithm: curve25519-sha256@libssh.org
> debug1: kex: host key algorithm: ecdsa-sha2-nistp256
> debug1: kex: server->client cipher: aes128-ctr MAC: umac-64-etm@openssh.com compression: none
> debug1: kex: client->server cipher: aes128-ctr MAC: umac-64-etm@openssh.com compression: none
> debug1: kex: curve25519-sha256@libssh.org need=16 dh_need=16
> debug1: kex: curve25519-sha256@libssh.org need=16 dh_need=16
> debug3: send packet: type 30
> debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
[ ... waiting until timeout ...]

Yes, that's eight times the *same* algorithm (the one that would get
picked if there were no problem at all). Now let's try giving it only
*seven* thumbs-up:

> [bongo@cube-ng-06 ~]$ ssh -vvv -o "KexAlgorithms curve25519-sha256@libssh.org,curve25519-sha256@libssh.org,curve25519-sha256@libssh.org,curve25519-sha256@libssh.org,curve25519-sha256@libssh.org,curve25519-sha256@libssh.org,curve25519-sha256@libssh.org" binect-support
> OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017
[...]
> debug1: kex: algorithm: curve25519-sha256@libssh.org
> debug1: kex: host key algorithm: ecdsa-sha2-nistp256
> debug1: kex: server->client cipher: aes128-ctr MAC: umac-64-etm@openssh.com compression: none
> debug1: kex: client->server cipher: aes128-ctr MAC: umac-64-etm@openssh.com compression: none
> debug1: kex: curve25519-sha256@libssh.org need=16 dh_need=16
> debug1: kex: curve25519-sha256@libssh.org need=16 dh_need=16
> debug3: send packet: type 30
> debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
> debug3: receive packet: type 31
[ ... continue to successful connection]

It's still possible that this is a pMTU detection problem or something
like it, though; I will have to look into the tcpdumps I now have to see
whether that's the case ...
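
(To pick the suspects out of those captures, I figure something like

> # tcpdump -nr kexfail.pcap 'tcp port 22 and greater 1400'

should show whether the big packets - presumably the client's KEXINIT with
the full algorithm list - leave the client but never make it to the server
side. The file name and the 1400-byte threshold are just made-up examples
for my setup.)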

(Both VMs are CentOS 7.9, the client a "free-range" one, the server a
cloud provider's sub-flavor. There's a handful of VLANs, leased line
uplink to a colo, then an IPsec VPN through the Internet into the cloud,
and finally the usual cloud networking between the two.)

Thanks again,
--
Jochen Bern
Systemingenieur

Binect GmbH
Re: Temporary Crypto Glitches ... ??
> Then I tried *this*:
...
> Yes, that's eight times the *same* algorithm (the one that would get
> picked if there were no problem at all). Now let's try giving it only
> *seven* thumbs-up:
...
> [ ... continue to successful connection]

Yeah, that smells like MTU.

> It's still possible that this is a pMTU detection problem or something
> like it, though; I will have to look into the tcpdumps I now have to see
> whether that's the case ...

When you have the blocking case, run "ss -i" to see the PMTU;
and/or run "tracepath -p 22 <host>" to diagnose.

Furthermore, you could try lowering your own VM's MTU to
see whether that solves the problem.


> (Both VMs are CentOS 7.9, the client a "free-range" one, the server a
> cloud provider's sub-flavor. There's a handful of VLANs, leased line
> uplink to a colo, then an IPsec VPN through the Internet into the
> cloud, and finally the usual cloud networking between the two.)

Yeah, lots of PMTU trouble points in between.

If that's the case, you could either run one of the VMs with a
smaller permanent MTU, or set a route-specific MTU ("ip route via mtu").
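
Something along those lines would be

> # ip link set dev eth0 mtu 1400

for the whole interface, or

> # ip route add 203.0.113.10/32 via 192.0.2.1 mtu 1400

for just the route towards that one server (interface name, addresses and
MTU value being placeholders, of course).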


Good luck!
_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@mindrot.org
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev
Re: Temporary Crypto Glitches ... ??
On 17.11.21 18:48, Philipp Marek wrote:
> Yeah, that smells like MTU.

Ayup, at least *this* instance turned out to have had a pMTU discovery
issue as its root cause (*#%!! legacy firewall).

Not sure that the two previous instances of the problem had the same
cause, though, as they fixed themselves before I could nail down the
details. Neither went over that same FW and one did not go across *any*
connections where I'd expect MTUs to change without *our* physical
intervention ...

> When you have the blocking case, run "ss -i" to see the PMTU;
> and/or run "tracepath -p 22 <host>" to diagnose.

Not sure that those two actually showed any telltale signs, but I admit
I'm not used to them ... :

> tcp ESTAB 0 0 $CLIENT:44006 $SERVER:ssh
> cubic wscale:9,7 rto:226 rtt:25.756/18.351 ato:40 mss:1448
> rcvmss:1386 advmss:1448 cwnd:10 bytes_acked:16274 bytes_received:17929
> segs_out:365 segs_in:336 send 4.5Mbps
> lastsnd:21559 lastrcv:21548 lastack:21509 pacing_rate 9.0Mbps
> rcv_rtt:140 rcv_space:29200


> # tracepath -p 22 $SERVER -n
> 1?: [LOCALHOST] pmtu 1500
> 1: $B0RKEN_FW 0.760ms
> 1: $B0RKEN_FW 0.505ms
> 2: $NEW_EDGE_FW 2.523ms
> 3: $NEW_EDGE_FW 2.617ms reached
> Resume: pmtu 1500 hops 3 back 2

However, good ole fat ping

> # ping -M do -s $SIZE $SERVER

showed pongs up to SIZE=1410, "too long"s from SIZE=1473 upward, and no
replies in between (because the NEED TO FRAGMENTs came from transfer net
IPs and $B0RKEN_FW doesn't do proper ESTABLISHED,RELATED).
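
(In case someone wants to replicate that, the sweep boils down to a loop
like

> # for SIZE in 1400 1410 1450 1473 1480; do echo "== $SIZE"; ping -M do -c 1 -W 1 -s $SIZE $SERVER; done

and then eyeballing the output for pongs, "message too long" errors, and
plain silence - the exact sizes here are just examples.)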

Thanks again,
--
Jochen Bern
Systemingenieur

Binect GmbH
Re: Temporary Crypto Glitches ... ??
>> Yeah, that smells like MTU.
>
> Ayup, at least *this* instance turned out to have had a pMTU discovery
> issue as its root cause (*#%!! legacy firewall).

Ack.


> Not sure that the two previous instances of the problem had the same
> cause, though, as they fixed themselves before I could nail down the
> details.

Well, perhaps some routing changed.


>> When you have the blocking case, run "ss -i" to see the PMTU;
>> and/or run "tracepath -p 22 <host>" to diagnose.
>
> Not sure that those two actually showed any telltale signs, but I
> admit I'm not used to them ... :
>
>> tcp ESTAB 0 0 $CLIENT:44006 $SERVER:ssh
>> cubic wscale:9,7 rto:226 rtt:25.756/18.351 ato:40 mss:1448
>> rcvmss:1386 advmss:1448 cwnd:10 bytes_acked:16274 bytes_received:17929
>> segs_out:365 segs_in:336 send 4.5Mbps
>> lastsnd:21559 lastrcv:21548 lastack:21509 pacing_rate 9.0Mbps
>> rcv_rtt:140 rcv_space:29200

Grrr, the "mtu" field was cut out on your end because of the long lines.


>> # tracepath -p 22 $SERVER -n
>> 1?: [LOCALHOST] pmtu 1500
>> 1: $B0RKEN_FW 0.760ms
>> 1: $B0RKEN_FW 0.505ms
>> 2: $NEW_EDGE_FW 2.523ms
>> 3: $NEW_EDGE_FW 2.617ms reached
>> Resume: pmtu 1500 hops 3 back 2

Ok, so no ICMPs are returned...


> However, good ole fat ping
>
>> # ping -M do -s $SIZE $SERVER
>
> showed pongs up to SIZE=1410, "too long"s from SIZE=1473 upward, and
> no replies in between (because the NEED TO FRAGMENTs came from
> transfer net IPs and $B0RKEN_FW doesn't do proper
> ESTABLISHED,RELATED).

Well, there you are.

_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@mindrot.org
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev