Dear VPNC-maintainers,
I would like to submit the attached patch proposal for your review:
20150912_fix_phase2_rekey_gap.patch
The patch provides a possible solution for the connectivity gap and ESP
packet
loss resulting from phase 2 re-keying between VPNC and Cisco ASA's.
The extensive testing I did with Cisco ASA's shows very promising results
with
the phase 2 re-keying gap reduced from 55 seconds to less than 1 second and
ESP
packet loss down to a minimum as well: most of the times no ESP packets are
lost at all, only once every so many vpn connections a single ESP packet is
lost (i.e. received by VPNC on the already terminated rx spi).
Compared to the current behaviour of VPNC this would be a huge improvement.
I don't have access to Nortell, Fortigate etc. and I am therefore unable to
test and verify the behaviour of the patch with those concentrators,
unfortunately.
I have kept the patch as minimal and straightforward as possible. It applies
cleanly against:
vpnc-0.5.3-r550 (May 26 20:09:51 2014, Joerg Mayer).
Should the patch proposal pass your review, I am hoping that you are willing
to consider including it into the VPNC-code base. I'm not looking for
credits
in any way, shape or form. My only aim is to get this gap and packet loss
issue resolved simply and elegantly and keep code changes to a minimum as
they
are bound to introduce new issues that I imagine nobody really wants at this
stage.
Most of this email is just intended to provide background and documentation,
so some will consider it to have an extremely high tl;dr-index. To those I
can
only say: just read the summary, or skip all reading entirely, if that's
what
works for you, apply the patch if you feel it applies to your situation or
just for the fun of it and start testing.
Comments, possible concerns and suggestions for further improvement are most
welcome. I'll try to address them to best of my ability.
Patch summary
~~~~~~~~~~~~~
vpnc: fix connectivity gap and ESP packet loss resulting from phase 2
re-keying
On Cisco ASA5510 8.2(4) and ASA5515X 9.1(4) there is a delay of 30-55
seconds
between VPNC instantly 'deleting' the old ipsec SA's and moving to the new
ones
and the ASA (peer) 'deleting' the old ipsec SA's and moving to the new ones.
This delay results in a connectivity gap between VPNC and the ASA. During
this
gap tunnels generally do remain in tact but tunnel traffic stalls and ESP
packets exchanged across the tunnel are lost (i.e. being ignored by VPNC).
Explicitly confirming the (local) termination of the old ipsec SA's by VPNC
to
the peer, immediately following a phase 2 re-key QM packet exchange, tells
the
peer to stop using the old ipsec SA's and causes it to switch to the new
ones
instantly as well.
The explicit delete notification significantly speeds up the completion of
the
entire phase 2 re-key operation and minimises, if not entirely prevents, ESP
packet loss.
This patch implements sending the confirmation of the (local) termination of
the old ipsec SA's to the peer, by sending an ISAKMP_PAYLOAD_D delete
notification to the peer that contains both the terminated inbound and the
terminated outbound ipsec SA's.
The delete notification is sent to the peer immediately following the
successful completion of the phase 2 re-keying QM packet exchange sequence
initiated by the peer.
Additionally some change-related debug and log messages have been revised or
made more meaningful and consistent. A number of debug messages have been
added
to vpnc.c:do_rekey() to provide a clearer view on each of the steps in the
phase 2 re-keying process.
Background
~~~~~~~~~~
I am a long term and very happy user of VPNC and would like to start with
expressing my admiration for and gratitude to all those who have come up
with
the original idea, created it, made it available to the community and have
maintained and helped improving it over the years.
VPNC has served me well and has proven to be a very reliable vpn client
that is
both easy to install and easy to configure on all distributions I work with.
Of course for a while there were occasions when I got disconnected when
using it
longer than 1 hour, but r512 proved the be the perfect solution for that,
and
things got even better with r550.
So, what more could I possibly want?
Well, now that I no longer got disconnected, I noticed that my vpn sessions
would stall for about 55 seconds following a phase 2 re-key.
Following such a stall tunnel traffic generally would pick up again as if
nothing had ever happened and then would continue to run smoothly until the
next phase 2 re-key, at which point the whole thing would repeat, and so on.
I thought that as a token of my appreciation of the amount of effort that
has
gone and still goes into maintaining VPNC, it would be nice if I could help
improve VPNC a little by providing a possible solution for the delay and
submit
that as a patch proposal to the VPNC maintainers.
Similar cases
~~~~~~~~~~~~~
Searching for possible explanations and trying to become more familiar with
the phase 2 re-key process I came across a bug that was posted to this
mailing
list early 2014 by Friedemann Stoyan:
Thread: VPN connectivity gap during Phase2 Rekeying
URL: http://www.gossamer-threads.com/lists/vpnc/devel/4072
Friedemann starts by stating
I have become aware of a 30s VPN connectivity gap during Phase2
rekeying between VPNC and a Cisco ASA5515X with 9.1(4).
and then proceeds to describe in great detail the exact same 'annoyance'
that
I ran across myself.
Note that the bug report that Friedemann submitted is about r512 but if you
look at the change logs since r512 the relevant phase 2 re-key code actually
hasn't changed that much, if at all.
Research
~~~~~~~~
As a Unix-admin I don't have admin access to our vpn concentrators (ASA's
mostly). Unable to view this from the peer side, I decided to analise packet
flow with wireshark on the VPNC side instead.
Below is output extracted from a capture that I ran during a test of r550.
For
brevity the capture only shows packets relevant to the issue.
The capture shows the overall duration of the gap between the peer
confirming phase 2 re-key completion by sending a QM3 (T 3431.465392) and
the
tunnel traffic returning to normal (T 3486.285648) to be 55 seconds.
During that gap 8 ESP packets are lost, discarded by VPNC for their 'unknown
spi'. To make matters even worse, the ISAKMP_PAYLOAD_D delete notification
sent
by a peer in utter despair is also discarded by VPNC for proposing for
deletion
what VPNC by then considers to be a 'bogus spi'.
From the moment VPNC receives the ISAKMP_PAYLOAD_D delete notification from
the
peer (T 3461.446333), already 30 (!) seconds after phase 2 re-key
completion,
it takes another 25 (!) seconds more for tunnel traffic to return to normal
and stop stalling.
No. Time Src Dst Protocol Info Comment
~~~ ~~~~ ~~~ ~~~ ~~~~~~~~ ~~~~ ~~~~~~~
7504 3429.712316 PEER VPNC ESP ESP (SPI=0x3342d642) \
7505 3429.713395 VPNC PEER ESP ESP (SPI=0x19e26f70) |- normal
exchange of tunnel pkts on old rx and tx spi
7506 3430.712439 PEER VPNC ESP ESP (SPI=0x3342d642) |
7507 3430.713520 VPNC PEER ESP ESP (SPI=0x19e26f70) /
7508 3431.447189 PEER VPNC ISAKMP --- peer initiates
phase 2 re-key by sending QM1 with proposed new tx spi
7509 3431.447963 VPNC PEER ISAKMP --- VPNC replies
by sending QM2 with proposed new rx spi and switches to new SA's
7510 3431.465392 PEER VPNC ISAKMP --- peer confirms
phase 2 re-key completion by sending QM3
7511 3431.714920 PEER VPNC ESP ESP (SPI=0x3342d642) \
7512 3431.934957 PEER VPNC ESP ESP (SPI=0x3342d642) |
7513 3432.357945 PEER VPNC ESP ESP (SPI=0x3342d642) |- peer is
unaware of VPNC having switched to the new SA's already and
7514 3433.212068 PEER VPNC ESP ESP (SPI=0x3342d642) | continues to
use the old SA's (i.c. rx spi) for tunnel traffic;
7517 3434.925162 PEER VPNC ESP ESP (SPI=0x3342d642) | this results
in VPNC complaining (8x in this test) and discarding those
7518 3438.353240 PEER VPNC ESP ESP (SPI=0x3342d642) | packets with:
7521 3445.196854 PEER VPNC ESP ESP (SPI=0x3342d642) | 'unknown
spi 0x3342d642 from peer'
7524 3458.892060 PEER VPNC ESP ESP (SPI=0x3342d642) /
7525 3461.446333 PEER VPNC ISAKMP --- peer proposes
old tx spi termination (ISAKMP_PAYLOAD_D); as VPNC has already
terminated
the old tx spi and 'forgotten' all about it, this causes VPNC
to just
complain and ignore the packet with:
'got
isakmp delete with bogus spi (expected
0x96b54bc9,
received 0x19e26f70), ignoring...'
7531 3486.285648 PEER VPNC ESP ESP (SPI=0xe91f1836) \
7532 3486.286465 VPNC PEER ESP ESP (SPI=0x96b54bc9) |- normal
exchange of tunnel pkts resumes on new rx and tx spi
7533 3486.299460 PEER VPNC ESP ESP (SPI=0xe91f1836) |
7534 3486.301185 VPNC PEER ESP ESP (SPI=0x96b54bc9) /
Solution
~~~~~~~~
The fact that the tunnels kept stalling even after the peer had sent the
old tx
spi delete notification, led me to think that, although the applicable
RFC's do
not require the use of the delete notification, perhaps, all along, the peer
was just waiting desperately for VPNC to send one.
So, in essence, that is what I added to the code and tested fairly
extensively
against our ASA's (there's a wide array of concentrators I connect to all
day
every day) on various distributions.
Below is output extracted from a capture that I ran during a test of r550
with
the proposed patch applied to it. Again, for brevity the capture only shows
packets relevant to the issue.
The capture shows that the patch reduces the former 55 second gap between
the
peer confirming phase 2 re-key completion (T 3429.552318) and tunnel traffic
resuming on the new SA's (T 3430.272291), to less than 1 second.
The entire 'conversation' between VPNC and the peer now runs smoothly. The
delay has disappeared, there are no more 'unknow spi' (i.e. lost ESP)
packets
and, since the peer doesn't need to propose old tx spi termination anymore,
the 'bogus spi' packet now belongs to the past as well.
No. Time Src Dst Protocol Info Comment
~~~ ~~~~ ~~~ ~~~ ~~~~~~~~ ~~~~ ~~~~~~~
7582 3428.267254 PEER VPNC ESP ESP (SPI=0xe6425491) \
7583 3428.268257 VPNC PEER ESP ESP (SPI=0x19f581c8) |- normal
exchange of tunnel pkts on old rx and tx spi
7584 3429.272243 PEER VPNC ESP ESP (SPI=0xe6425491) |
7585 3429.273613 VPNC PEER ESP ESP (SPI=0x19f581c8) /
7586 3429.532348 PEER VPNC ISAKMP --- peer initiates
phase 2 re-key by sending QM1 with proposed new tx spi
7587 3429.533922 VPNC PEER ISAKMP --- VPNC replies
by sending QM2 with proposed new rx spi and switches to new SA's
7588 3429.552318 PEER VPNC ISAKMP --- peer confirms
phase 2 re-key completion by sending QM3
7589 3429.552955 VPNC PEER ISAKMP --- VPNC confirms
termination of old rx and tx spi to peer (ISAKMP_PAYLOAD_D)
7590 3430.272291 PEER VPNC ESP ESP (SPI=0x02e643ad) \
7591 3430.273550 VPNC PEER ESP ESP (SPI=0xa2e7bbcf) |- normal
exchange of tunnel pkts resumes on new rx and tx spi
7592 3431.274455 PEER VPNC ESP ESP (SPI=0x02e643ad) |
7594 3431.276151 VPNC PEER ESP ESP (SPI=0xa2e7bbcf) /
The proposed solution has been tested to work as shown here both on RHEL 6
and
Ubuntu 14 icm Cisco ASA551x 8.2(4) and up.
Managing expectations
~~~~~~~~~~~~~~~~~~~~~
I would like to be able to claim that reducing the gap also prevents packet
loss entirely, but I'm not. Testing shows that once every so many vpn
sessions
you will lose the odd 1 ESP packet as a result of phase 2 re-key operation.
Given the stable state the r550 code is in and the apparent development
freeze
and base review it has been undergoing for quite a while now, I doubt
whether
the benefit of fixing that will outweigh the effort required.
--
Piet Starreveld
I would like to submit the attached patch proposal for your review:
20150912_fix_phase2_rekey_gap.patch
The patch provides a possible solution for the connectivity gap and ESP
packet
loss resulting from phase 2 re-keying between VPNC and Cisco ASA's.
The extensive testing I did with Cisco ASA's shows very promising results
with
the phase 2 re-keying gap reduced from 55 seconds to less than 1 second and
ESP
packet loss down to a minimum as well: most of the times no ESP packets are
lost at all, only once every so many vpn connections a single ESP packet is
lost (i.e. received by VPNC on the already terminated rx spi).
Compared to the current behaviour of VPNC this would be a huge improvement.
I don't have access to Nortell, Fortigate etc. and I am therefore unable to
test and verify the behaviour of the patch with those concentrators,
unfortunately.
I have kept the patch as minimal and straightforward as possible. It applies
cleanly against:
vpnc-0.5.3-r550 (May 26 20:09:51 2014, Joerg Mayer).
Should the patch proposal pass your review, I am hoping that you are willing
to consider including it into the VPNC-code base. I'm not looking for
credits
in any way, shape or form. My only aim is to get this gap and packet loss
issue resolved simply and elegantly and keep code changes to a minimum as
they
are bound to introduce new issues that I imagine nobody really wants at this
stage.
Most of this email is just intended to provide background and documentation,
so some will consider it to have an extremely high tl;dr-index. To those I
can
only say: just read the summary, or skip all reading entirely, if that's
what
works for you, apply the patch if you feel it applies to your situation or
just for the fun of it and start testing.
Comments, possible concerns and suggestions for further improvement are most
welcome. I'll try to address them to best of my ability.
Patch summary
~~~~~~~~~~~~~
vpnc: fix connectivity gap and ESP packet loss resulting from phase 2
re-keying
On Cisco ASA5510 8.2(4) and ASA5515X 9.1(4) there is a delay of 30-55
seconds
between VPNC instantly 'deleting' the old ipsec SA's and moving to the new
ones
and the ASA (peer) 'deleting' the old ipsec SA's and moving to the new ones.
This delay results in a connectivity gap between VPNC and the ASA. During
this
gap tunnels generally do remain in tact but tunnel traffic stalls and ESP
packets exchanged across the tunnel are lost (i.e. being ignored by VPNC).
Explicitly confirming the (local) termination of the old ipsec SA's by VPNC
to
the peer, immediately following a phase 2 re-key QM packet exchange, tells
the
peer to stop using the old ipsec SA's and causes it to switch to the new
ones
instantly as well.
The explicit delete notification significantly speeds up the completion of
the
entire phase 2 re-key operation and minimises, if not entirely prevents, ESP
packet loss.
This patch implements sending the confirmation of the (local) termination of
the old ipsec SA's to the peer, by sending an ISAKMP_PAYLOAD_D delete
notification to the peer that contains both the terminated inbound and the
terminated outbound ipsec SA's.
The delete notification is sent to the peer immediately following the
successful completion of the phase 2 re-keying QM packet exchange sequence
initiated by the peer.
Additionally some change-related debug and log messages have been revised or
made more meaningful and consistent. A number of debug messages have been
added
to vpnc.c:do_rekey() to provide a clearer view on each of the steps in the
phase 2 re-keying process.
Background
~~~~~~~~~~
I am a long term and very happy user of VPNC and would like to start with
expressing my admiration for and gratitude to all those who have come up
with
the original idea, created it, made it available to the community and have
maintained and helped improving it over the years.
VPNC has served me well and has proven to be a very reliable vpn client
that is
both easy to install and easy to configure on all distributions I work with.
Of course for a while there were occasions when I got disconnected when
using it
longer than 1 hour, but r512 proved the be the perfect solution for that,
and
things got even better with r550.
So, what more could I possibly want?
Well, now that I no longer got disconnected, I noticed that my vpn sessions
would stall for about 55 seconds following a phase 2 re-key.
Following such a stall tunnel traffic generally would pick up again as if
nothing had ever happened and then would continue to run smoothly until the
next phase 2 re-key, at which point the whole thing would repeat, and so on.
I thought that as a token of my appreciation of the amount of effort that
has
gone and still goes into maintaining VPNC, it would be nice if I could help
improve VPNC a little by providing a possible solution for the delay and
submit
that as a patch proposal to the VPNC maintainers.
Similar cases
~~~~~~~~~~~~~
Searching for possible explanations and trying to become more familiar with
the phase 2 re-key process I came across a bug that was posted to this
mailing
list early 2014 by Friedemann Stoyan:
Thread: VPN connectivity gap during Phase2 Rekeying
URL: http://www.gossamer-threads.com/lists/vpnc/devel/4072
Friedemann starts by stating
I have become aware of a 30s VPN connectivity gap during Phase2
rekeying between VPNC and a Cisco ASA5515X with 9.1(4).
and then proceeds to describe in great detail the exact same 'annoyance'
that
I ran across myself.
Note that the bug report that Friedemann submitted is about r512 but if you
look at the change logs since r512 the relevant phase 2 re-key code actually
hasn't changed that much, if at all.
Research
~~~~~~~~
As a Unix-admin I don't have admin access to our vpn concentrators (ASA's
mostly). Unable to view this from the peer side, I decided to analise packet
flow with wireshark on the VPNC side instead.
Below is output extracted from a capture that I ran during a test of r550.
For
brevity the capture only shows packets relevant to the issue.
The capture shows the overall duration of the gap between the peer
confirming phase 2 re-key completion by sending a QM3 (T 3431.465392) and
the
tunnel traffic returning to normal (T 3486.285648) to be 55 seconds.
During that gap 8 ESP packets are lost, discarded by VPNC for their 'unknown
spi'. To make matters even worse, the ISAKMP_PAYLOAD_D delete notification
sent
by a peer in utter despair is also discarded by VPNC for proposing for
deletion
what VPNC by then considers to be a 'bogus spi'.
From the moment VPNC receives the ISAKMP_PAYLOAD_D delete notification from
the
peer (T 3461.446333), already 30 (!) seconds after phase 2 re-key
completion,
it takes another 25 (!) seconds more for tunnel traffic to return to normal
and stop stalling.
No. Time Src Dst Protocol Info Comment
~~~ ~~~~ ~~~ ~~~ ~~~~~~~~ ~~~~ ~~~~~~~
7504 3429.712316 PEER VPNC ESP ESP (SPI=0x3342d642) \
7505 3429.713395 VPNC PEER ESP ESP (SPI=0x19e26f70) |- normal
exchange of tunnel pkts on old rx and tx spi
7506 3430.712439 PEER VPNC ESP ESP (SPI=0x3342d642) |
7507 3430.713520 VPNC PEER ESP ESP (SPI=0x19e26f70) /
7508 3431.447189 PEER VPNC ISAKMP --- peer initiates
phase 2 re-key by sending QM1 with proposed new tx spi
7509 3431.447963 VPNC PEER ISAKMP --- VPNC replies
by sending QM2 with proposed new rx spi and switches to new SA's
7510 3431.465392 PEER VPNC ISAKMP --- peer confirms
phase 2 re-key completion by sending QM3
7511 3431.714920 PEER VPNC ESP ESP (SPI=0x3342d642) \
7512 3431.934957 PEER VPNC ESP ESP (SPI=0x3342d642) |
7513 3432.357945 PEER VPNC ESP ESP (SPI=0x3342d642) |- peer is
unaware of VPNC having switched to the new SA's already and
7514 3433.212068 PEER VPNC ESP ESP (SPI=0x3342d642) | continues to
use the old SA's (i.c. rx spi) for tunnel traffic;
7517 3434.925162 PEER VPNC ESP ESP (SPI=0x3342d642) | this results
in VPNC complaining (8x in this test) and discarding those
7518 3438.353240 PEER VPNC ESP ESP (SPI=0x3342d642) | packets with:
7521 3445.196854 PEER VPNC ESP ESP (SPI=0x3342d642) | 'unknown
spi 0x3342d642 from peer'
7524 3458.892060 PEER VPNC ESP ESP (SPI=0x3342d642) /
7525 3461.446333 PEER VPNC ISAKMP --- peer proposes
old tx spi termination (ISAKMP_PAYLOAD_D); as VPNC has already
terminated
the old tx spi and 'forgotten' all about it, this causes VPNC
to just
complain and ignore the packet with:
'got
isakmp delete with bogus spi (expected
0x96b54bc9,
received 0x19e26f70), ignoring...'
7531 3486.285648 PEER VPNC ESP ESP (SPI=0xe91f1836) \
7532 3486.286465 VPNC PEER ESP ESP (SPI=0x96b54bc9) |- normal
exchange of tunnel pkts resumes on new rx and tx spi
7533 3486.299460 PEER VPNC ESP ESP (SPI=0xe91f1836) |
7534 3486.301185 VPNC PEER ESP ESP (SPI=0x96b54bc9) /
Solution
~~~~~~~~
The fact that the tunnels kept stalling even after the peer had sent the
old tx
spi delete notification, led me to think that, although the applicable
RFC's do
not require the use of the delete notification, perhaps, all along, the peer
was just waiting desperately for VPNC to send one.
So, in essence, that is what I added to the code and tested fairly
extensively
against our ASA's (there's a wide array of concentrators I connect to all
day
every day) on various distributions.
Below is output extracted from a capture that I ran during a test of r550
with
the proposed patch applied to it. Again, for brevity the capture only shows
packets relevant to the issue.
The capture shows that the patch reduces the former 55 second gap between
the
peer confirming phase 2 re-key completion (T 3429.552318) and tunnel traffic
resuming on the new SA's (T 3430.272291), to less than 1 second.
The entire 'conversation' between VPNC and the peer now runs smoothly. The
delay has disappeared, there are no more 'unknow spi' (i.e. lost ESP)
packets
and, since the peer doesn't need to propose old tx spi termination anymore,
the 'bogus spi' packet now belongs to the past as well.
No. Time Src Dst Protocol Info Comment
~~~ ~~~~ ~~~ ~~~ ~~~~~~~~ ~~~~ ~~~~~~~
7582 3428.267254 PEER VPNC ESP ESP (SPI=0xe6425491) \
7583 3428.268257 VPNC PEER ESP ESP (SPI=0x19f581c8) |- normal
exchange of tunnel pkts on old rx and tx spi
7584 3429.272243 PEER VPNC ESP ESP (SPI=0xe6425491) |
7585 3429.273613 VPNC PEER ESP ESP (SPI=0x19f581c8) /
7586 3429.532348 PEER VPNC ISAKMP --- peer initiates
phase 2 re-key by sending QM1 with proposed new tx spi
7587 3429.533922 VPNC PEER ISAKMP --- VPNC replies
by sending QM2 with proposed new rx spi and switches to new SA's
7588 3429.552318 PEER VPNC ISAKMP --- peer confirms
phase 2 re-key completion by sending QM3
7589 3429.552955 VPNC PEER ISAKMP --- VPNC confirms
termination of old rx and tx spi to peer (ISAKMP_PAYLOAD_D)
7590 3430.272291 PEER VPNC ESP ESP (SPI=0x02e643ad) \
7591 3430.273550 VPNC PEER ESP ESP (SPI=0xa2e7bbcf) |- normal
exchange of tunnel pkts resumes on new rx and tx spi
7592 3431.274455 PEER VPNC ESP ESP (SPI=0x02e643ad) |
7594 3431.276151 VPNC PEER ESP ESP (SPI=0xa2e7bbcf) /
The proposed solution has been tested to work as shown here both on RHEL 6
and
Ubuntu 14 icm Cisco ASA551x 8.2(4) and up.
Managing expectations
~~~~~~~~~~~~~~~~~~~~~
I would like to be able to claim that reducing the gap also prevents packet
loss entirely, but I'm not. Testing shows that once every so many vpn
sessions
you will lose the odd 1 ESP packet as a result of phase 2 re-key operation.
Given the stable state the r550 code is in and the apparent development
freeze
and base review it has been undergoing for quite a while now, I doubt
whether
the benefit of fixing that will outweigh the effort required.
--
Piet Starreveld