
Race condition when using ControlMaster=auto with simultaneous connections
Hello,

I'm trying to multiplex many simultaneous SSH connections through a single
master connection, and I'm hitting a race condition while doing this.
This is not a bug; I'm either hitting a limit in the design of OpenSSH or
misusing it.

The use case is configuring many hosts simultaneously with Ansible, where all
connections need to go through a single "SSH bastion" via ProxyJump.
For efficiency and to avoid hitting MaxStartups limits, I would like to
use a control master for the connection to the bastion, via the following
client configuration:

Host bastion.example.com
ControlMaster auto
ControlPath /dev/shm/ssh-%h
ControlPersist 30

Host !bastion.example.com *.example.com
ProxyJump bastion.example.com
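
With connections made one at a time, this behaves as intended; assuming the
configuration above is active, the shared bastion master can be checked through
its control socket:

$ ssh myhost.example.com true          # first connection sets up the bastion master
$ ssh -O check bastion.example.com     # confirms the master is reachable via /dev/shm/ssh-bastion.example.com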

However, this does not work when making simultaneous connections: each SSH
process ends up creating its own, separate connection to the bastion. Here is a
simple way to reproduce:

$ for i in {1..3}; do ssh myhost.example.com "sleep 1" & done
ControlSocket /dev/shm/ssh-bastion.example.com already exists, disabling multiplexing
ControlSocket /dev/shm/ssh-bastion.example.com already exists, disabling multiplexing

What happens is the following:

1) each SSH process tries to connect to the control socket and fails
(this is expected: the control socket is not yet bound)

2) each SSH process then creates a new SSH connection

3) once connected, each process tries to bind to the control socket

4a) one process successfully binds the control socket
4b) all other processes fail to bind the control socket (error message above)

5) in either case, each process is now using its own separate SSH connection to the bastion

The window for the race condition is between 1) and 4), so it's rather
large: it includes the time to establish a new SSH connection.
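
For now, one way to work around this seems to be to pre-establish the bastion
master before launching the parallel connections, so that the race window is
avoided entirely:

$ ssh -Nf bastion.example.com          # create a persistent master up front
$ for i in {1..3}; do ssh myhost.example.com "sleep 1" & done

but this has to be wrapped around every parallel invocation, so a built-in
solution would be nicer.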

I believe that taking a lock between steps 1) and 4) could solve the issue
(a rough shell sketch follows the new steps below):

1.1) each process tries to take an exclusive lock related to the control socket
1.1a) one process gets the lock and can proceed to create an SSH connection
1.1b) all other processes wait on the lock; when the lock is released, they
go back to step 1) to connect to the control socket

4.1) once the control socket has been bound, the "lucky process" releases the lock
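
To illustrate the idea outside of OpenSSH, here is a rough shell approximation
of the same scheme using flock(1); the lock file name is just an example:

lock=/dev/shm/ssh-bastion.example.com.lock    # example lock file next to the control socket
connect() {
    (
        flock 9                               # 1.1) take the exclusive lock
        ssh -O check bastion.example.com 2>/dev/null \
            || ssh -Nf bastion.example.com    # 1.1b/1): reuse the socket, or 2)-4a): become the master
    ) 9>"$lock"                               # 4.1) the lock is released when the subshell exits
    ssh "$1" "sleep 1"                        # all sessions now multiplex through the single master
}
for i in {1..3}; do connect myhost.example.com & done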

Does this make sense? Would the project accept a patch implementing this as
an additional option?

Thanks,
Baptiste

--
Baptiste Jonglez
Research Engineer, Inria <https://www.inria.fr/>
STACK team <https://stack-research-group.gitlabpages.inria.fr/web/>
_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@mindrot.org
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev

Re: Race condition when using ControlMaster=auto with simultaneous connections
On 8/31/22 09:24, Baptiste Jonglez wrote:
> [...]
>
> Does this make sense? Would the project accept a patch implementing this as
> an additional option?

Not sure if this is related, but I would like to have an option to *only* use the
control socket, instead of falling back to a separate connection.
--
Sincerely,
Demi Marie Obenour (she/her/hers)