Mailing List Archive

[PATCH] Improve endian conversion in umac.c
Howdy all,

I was poking at the MAC routines looking for some efficiencies for high
performance environments. I was looking at the umac.c and comparing it
to the original source at https://fastcrypto.org/front/umac/umac.c After
a couple of false starts I found that reverting the endian conversion
routines back to what Krovetz wrote realized a 8% to 16% improvement in
throughput with aes256-ctr. For example, 715MB/s vs 856MB/s. I may be
missing something in using the get/put routines in misc.c in terms of
portability or security though.

Test bed 1: 2 10Gb connected AMD Epyc 32 core hosts. RTT < .5ms
Test bed 2: Intel Xeon x5675, AMD Ryzen 7 5800X 10Gb connected.
RTT < .5ms

I saw more performance gains in test bed 1 than test bed 2 with (16% vs
8%) but I think the gains are proportional the number of packets. The
Epycs can push data about twice as fast as the Xeon can.

In 1Gb environments I'm not seeing any benefit in throughput as I can
max out that path with stock.

Anyway, I wanted to make this patch available for discussion.

Chris

diff --git a/umac.c b/umac.c
index e5ec19f0..d5de3806 100644
--- a/umac.c
+++ b/umac.c
@@ -134,17 +134,33 @@ typedef unsigned int UWORD; /* Register */
/* --- Endian Conversion --- Forcing assembly on some platforms
*/
/*
---------------------------------------------------------------------- */

+static UINT32 LOAD_UINT32_REVERSED(void *ptr)
+{
+ UINT32 temp = *(UINT32 *)ptr;
+ temp = (temp >> 24) | ((temp & 0x00FF0000) >> 8 )
+ | ((temp & 0x0000FF00) << 8 ) | (temp << 24);
+ return (UINT32)temp;
+}
+
+static void STORE_UINT32_REVERSED(void *ptr, UINT32 x)
+{
+ UINT32 i = (UINT32)x;
+ *(UINT32 *)ptr = (i >> 24) | ((i & 0x00FF0000) >> 8 )
+ | ((i & 0x0000FF00) << 8 ) | (i << 24);
+}
+
+/* The following definitions use the above reversal-primitives to do
the right
+ * thing on endian specific load and stores.
+ */
+
#if (__LITTLE_ENDIAN__)
-#define LOAD_UINT32_REVERSED(p) get_u32(p)
-#define STORE_UINT32_REVERSED(p,v) put_u32(p,v)
+#define LOAD_UINT32_LITTLE(ptr) (*(UINT32 *)(ptr))
+#define STORE_UINT32_BIG(ptr,x) STORE_UINT32_REVERSED(ptr,x)
#else
-#define LOAD_UINT32_REVERSED(p) get_u32_le(p)
-#define STORE_UINT32_REVERSED(p,v) put_u32_le(p,v)
+#define LOAD_UINT32_LITTLE(ptr) LOAD_UINT32_REVERSED(ptr)
+#define STORE_UINT32_BIG(ptr,x) (*(UINT32 *)(ptr) = (UINT32)(x))
#endif

-#define LOAD_UINT32_LITTLE(p) (get_u32_le(p))
-#define STORE_UINT32_BIG(p,v) put_u32(p, v)
-
/*
---------------------------------------------------------------------- */
/*
---------------------------------------------------------------------- */
/* ----- Begin KDF & PDF Section
---------------------------------------- */
_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@mindrot.org
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev
Re: [PATCH] Improve endian conversion in umac.c [ In reply to ]
On Wed, 9 Mar 2022 at 09:59, rapier <rapier@psc.edu> wrote:
> I was poking at the MAC routines looking for some efficiencies for high
> performance environments. I was looking at the umac.c and comparing it
> to the original source at https://fastcrypto.org/front/umac/umac.c After
> a couple of false starts I found that reverting the endian conversion
> routines back to what Krovetz wrote realized a 8% to 16% improvement

Interesting! One obvious difference is what you have is potentially
inline-able static functions instead of function calls across
compilation units that (barring whole program optimization) can't be
inlined. If you put the existing functions from misc.c into umac.c as
statics do you see the same improvement?

--
Darren Tucker (dtucker at dtucker.net)
GPG key 11EAA6FA / A86E 3E07 5B19 5880 E860 37F4 9357 ECEF 11EA A6FA (new)
Good judgement comes with experience. Unfortunately, the experience
usually comes from bad judgement.
_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@mindrot.org
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev
Re: [PATCH] Improve endian conversion in umac.c [ In reply to ]
On Tue, 8 Mar 2022, rapier wrote:

> + UINT32 temp = *(UINT32 *)ptr;

Can you guarantee it’s aligned? Otherwise, SIGBUS. Maybe not on x86,
other platforms exist and enforce that though (x86’ll just slowdown).

> + *(UINT32 *)ptr = (i >> 24) | ((i & 0x00FF0000) >> 8 )
> + | ((i & 0x0000FF00) << 8 ) | (i << 24);

Can you guarantee it’s aligned *and* not UB? Otherwise, a version of
GCC, possibly a future one, is guaranteed to miscompile that and
pinskia will tell you it’s a feature…

bye,
//mirabilos
--
Infrastrukturexperte • tarent solutions GmbH
Am Dickobskreuz 10, D-53121 Bonn • http://www.tarent.de/
Telephon +49 228 54881-393 • Fax: +49 228 54881-235
HRB AG Bonn 5168 • USt-ID (VAT): DE122264941
Geschäftsführer: Dr. Stefan Barth, Kai Ebenrett, Boris Esser, Alexander Steeg

****************************************************
/?\ The UTF-8 Ribbon
? ? Campaign against Mit dem tarent-Newsletter nichts mehr verpassen:
 ?  HTML eMail! Also, https://www.tarent.de/newsletter
? ? header encryption!
****************************************************
_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@mindrot.org
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev
Re: [PATCH] Improve endian conversion in umac.c [ In reply to ]
Haven't checked that yet but I'll look into it tomorrow.

On Tue, Mar 8, 2022, 18:13 Darren Tucker <dtucker@dtucker.net> wrote:

> On Wed, 9 Mar 2022 at 09:59, rapier <rapier@psc.edu> wrote:
> > I was poking at the MAC routines looking for some efficiencies for high
> > performance environments. I was looking at the umac.c and comparing it
> > to the original source at https://fastcrypto.org/front/umac/umac.c After
> > a couple of false starts I found that reverting the endian conversion
> > routines back to what Krovetz wrote realized a 8% to 16% improvement
>
> Interesting! One obvious difference is what you have is potentially
> inline-able static functions instead of function calls across
> compilation units that (barring whole program optimization) can't be
> inlined. If you put the existing functions from misc.c into umac.c as
> statics do you see the same improvement?
>
> --
> Darren Tucker (dtucker at dtucker.net)
> GPG key 11EAA6FA / A86E 3E07 5B19 5880 E860 37F4 9357 ECEF 11EA A6FA (new)
> Good judgement comes with experience. Unfortunately, the experience
> usually comes from bad judgement.
>
_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@mindrot.org
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev
Re: [PATCH] Improve endian conversion in umac.c [ In reply to ]
On 3/8/22 6:12 PM, Darren Tucker wrote:
> On Wed, 9 Mar 2022 at 09:59, rapier <rapier@psc.edu> wrote:
>> I was poking at the MAC routines looking for some efficiencies for high
>> performance environments. I was looking at the umac.c and comparing it
>> to the original source at https://fastcrypto.org/front/umac/umac.c After
>> a couple of false starts I found that reverting the endian conversion
>> routines back to what Krovetz wrote realized a 8% to 16% improvement
>
> Interesting! One obvious difference is what you have is potentially
> inline-able static functions instead of function calls across
> compilation units that (barring whole program optimization) can't be
> inlined. If you put the existing functions from misc.c into umac.c as
> statics do you see the same improvement?


That worked and I saw the same improvement. For a 20GB test (a dd pipe
with aes2560ctr) I'm seeing peaks at 870MB/s versus 720MB/s for stock.
So it does look like that its being inlined. I'm going to poke at a
couple more things and then provide an updated patch. I think I have a
big endian system around here somewhere so I want to test on that as well.

This is pleasing. Initially I was looking at improving performance by
pipelining the MAC but that's not possible with ETM. This is about the
level of performance gain I was hoping to get with that and it's a lot
easier.

Anyway, I'll get the new patch up soon.

Chris
_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@mindrot.org
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev
Re: [PATCH] Improve endian conversion in umac.c [ In reply to ]
On Wed, 9 Mar 2022, rapier wrote:

> > inlined. If you put the existing functions from misc.c into umac.c as
> > statics do you see the same improvement?
>
> That worked and I saw the same improvement. For a 20GB test (a dd pipe with

I’d wager that the compiler optimises the inlined functions from
misc.c, which are suitably generic *and* UB-safe, to the same
result on x86, anyway.

bye,
//mirabilos
--
Infrastrukturexperte • tarent solutions GmbH
Am Dickobskreuz 10, D-53121 Bonn • http://www.tarent.de/
Telephon +49 228 54881-393 • Fax: +49 228 54881-235
HRB AG Bonn 5168 • USt-ID (VAT): DE122264941
Geschäftsführer: Dr. Stefan Barth, Kai Ebenrett, Boris Esser, Alexander Steeg

****************************************************
/?\ The UTF-8 Ribbon
? ? Campaign against Mit dem tarent-Newsletter nichts mehr verpassen:
 ?  HTML eMail! Also, https://www.tarent.de/newsletter
? ? header encryption!
****************************************************
_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@mindrot.org
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev
Re: [PATCH] Improve endian conversion in umac.c [ In reply to ]
Please forgive the overly formalized way I wrote this up. This is what I
use for the notes I make as I need to report on a lot of this back to
the NSF. So I just copied and pasted it here.

Test subject: OpenSSH 8.8p1 commit bf944e37

Test environment: 2 AMD EPYC 7502P 32-Core servers connected via 10Gb
Intel X710 NICs in LAN, RTT < 0.5ms. ESNet TCP tuning applied.

Test method: 50 iterations of 'dd if=/dev/zero bs=1M count=20000 | ssh
-caes256-ctr remote.host "cat > /dev/null"'. 5 second pause between each
iteration.

Versions tested:
Stock openssh 8.8p1 (stock)
Migrating all endian functions from misc.c (full)
Migrating only [get|put]_u32_le from misc.c (little)

Test Results:

Flavor mean, median, min, max, adev
stock 669.68, 670, 653, 688, 4.9728
full 800.32, 804.5, 762, 823, 11.9216
little 803.1, 806, 772, 831, 10.484

Interpretation:
Inlining the endian functions, via compiler optimization, in umac.c
leads to an average 16.6% improvement in throughput, with a maximized
gain of 21.4%, in this test environment. The more pronounced deviation
is likely a result of the higher throughput rather than intrinsic
instabilities in the code. This performance gain comes entirely from
inlining the little endian conversion function [get|put]_u32_le. These
functions are defined in misc.c but only used in umac.c. As such,
migrating them to umac.c would impose minimal disruption on the rest of
the code base.

Patch:
diff --git a/misc.c b/misc.c
index 0134d694..20a2a186 100644
--- a/misc.c
+++ b/misc.c
@@ -1533,20 +1533,6 @@ get_u32(const void *vp)
return (v);
}

-u_int32_t
-get_u32_le(const void *vp)
-{
- const u_char *p = (const u_char *)vp;
- u_int32_t v;
-
- v = (u_int32_t)p[0];
- v |= (u_int32_t)p[1] << 8;
- v |= (u_int32_t)p[2] << 16;
- v |= (u_int32_t)p[3] << 24;
-
- return (v);
-}
-
u_int16_t
get_u16(const void *vp)
{
@@ -1585,17 +1571,6 @@ put_u32(void *vp, u_int32_t v)
p[3] = (u_char)v & 0xff;
}

-void
-put_u32_le(void *vp, u_int32_t v)
-{
- u_char *p = (u_char *)vp;
-
- p[0] = (u_char)v & 0xff;
- p[1] = (u_char)(v >> 8) & 0xff;
- p[2] = (u_char)(v >> 16) & 0xff;
- p[3] = (u_char)(v >> 24) & 0xff;
-}
-
void
put_u16(void *vp, u_int16_t v)
{
diff --git a/misc.h b/misc.h
index 2e2dca54..b0c64270 100644
--- a/misc.h
+++ b/misc.h
@@ -154,12 +154,6 @@ void put_u32(void *, u_int32_t)
void put_u16(void *, u_int16_t)
__attribute__((__bounded__( __minbytes__, 1, 2)));

-/* Little-endian store/load, used by umac.c */
-u_int32_t get_u32_le(const void *)
- __attribute__((__bounded__(__minbytes__, 1, 4)));
-void put_u32_le(void *, u_int32_t)
- __attribute__((__bounded__(__minbytes__, 1, 4)));
-
struct bwlimit {
size_t buflen;
u_int64_t rate; /* desired rate in kbit/s */
diff --git a/umac.c b/umac.c
index e5ec19f0..64502365 100644
--- a/umac.c
+++ b/umac.c
@@ -134,15 +134,56 @@ typedef unsigned int UWORD; /* Register */
/* --- Endian Conversion --- Forcing assembly on some platforms
*/
/*
---------------------------------------------------------------------- */

+
+/* Using local statically defined versions of the get/put functions
+ * found in misc.c allows them to be inlined. This improves throughput
+ * performance by 10% to 15% on well connected (10Gb/s+) systems.
+ * Forward declaration of the functions in order to maintain
+ * the attributes.
+ * Chris Rapier <rapier@psc.edu> 2022-03-09 */
+
+static u_int32_t umac_get_u32_le(const void *)
+ __attribute__((__bounded__(__minbytes__, 1, 4)));
+
+static u_int32_t
+umac_get_u32_le(const void *vp)
+{
+ const u_char *p = (const u_char *)vp;
+ u_int32_t v;
+
+ v = (u_int32_t)p[0];
+ v |= (u_int32_t)p[1] << 8;
+ v |= (u_int32_t)p[2] << 16;
+ v |= (u_int32_t)p[3] << 24;
+
+ return (v);
+}
+
+#if (! __LITTLE_ENDIAN__) /* compile time warning thown otherwise */
+static void umac_put_u32_le(void *, u_int32_t)
+ __attribute__((__bounded__(__minbytes__, 1, 4)));
+
+static void
+umac_put_u32_le(void *vp, u_int32_t v)
+{
+ u_char *p = (u_char *)vp;
+
+ p[0] = (u_char)v & 0xff;
+ p[1] = (u_char)(v >> 8) & 0xff;
+ p[2] = (u_char)(v >> 16) & 0xff;
+ p[3] = (u_char)(v >> 24) & 0xff;
+}
+#endif
+
#if (__LITTLE_ENDIAN__)
#define LOAD_UINT32_REVERSED(p) get_u32(p)
#define STORE_UINT32_REVERSED(p,v) put_u32(p,v)
#else
-#define LOAD_UINT32_REVERSED(p) get_u32_le(p)
-#define STORE_UINT32_REVERSED(p,v) put_u32_le(p,v)
+#define LOAD_UINT32_REVERSED(p) umac_get_u32_le(p)
+#define STORE_UINT32_REVERSED(p,v) umac_put_u32_le(p,v)
#endif

-#define LOAD_UINT32_LITTLE(p) (get_u32_le(p))
+#define LOAD_UINT32_LITTLE(p) (umac_get_u32_le(p))
#define STORE_UINT32_BIG(p,v) put_u32(p, v)

/*
---------------------------------------------------------------------- */

_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@mindrot.org
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev
Re: [PATCH] Improve endian conversion in umac.c [ In reply to ]
rapier:

> Versions tested:
> Stock openssh 8.8p1 (stock)
> Migrating all endian functions from misc.c (full)
> Migrating only [get|put]_u32_le from misc.c (little)

put_u32_le() isn't used anywhere.

And a quick look at the code shows that indeed get_u32_le() is used
to process _all_ data to be hashed. Any other endian functions
aren't called in tight loops.

clang on amd64 optimizes these functions to a simple mov or mov+bswap.
I'm kinda curious whether PPC and SPARC compilers are smart enough
to use the load-swapped instructions.

--
Christian "naddy" Weisgerber naddy@mips.inka.de
_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@mindrot.org
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev
Re: [PATCH] Improve endian conversion in umac.c [ In reply to ]
On Thu, 10 Mar 2022, Christian Weisgerber wrote:

> I'm kinda curious whether PPC and SPARC compilers are smart enough
> to use the load-swapped instructions.

If the data is aligned… perhaps.

bye,
//mirabilos
--
Infrastrukturexperte • tarent solutions GmbH
Am Dickobskreuz 10, D-53121 Bonn • http://www.tarent.de/
Telephon +49 228 54881-393 • Fax: +49 228 54881-235
HRB AG Bonn 5168 • USt-ID (VAT): DE122264941
Geschäftsführer: Dr. Stefan Barth, Kai Ebenrett, Boris Esser, Alexander Steeg

****************************************************
/?\ The UTF-8 Ribbon
? ? Campaign against Mit dem tarent-Newsletter nichts mehr verpassen:
 ?  HTML eMail! Also, https://www.tarent.de/newsletter
? ? header encryption!
****************************************************
_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@mindrot.org
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev
Re: [PATCH] Improve endian conversion in umac.c [ In reply to ]
> + * Forward declaration of the functions in order to maintain
> + * the attributes.
> + * Chris Rapier <rapier@psc.edu> 2022-03-09 */
> +
> +static u_int32_t umac_get_u32_le(const void *)
> + __attribute__((__bounded__(__minbytes__, 1, 4)));
> +
> +static u_int32_t
> +umac_get_u32_le(const void *vp)

You should be able to combine this into one statement if you write:

static __attribute__((__bounded__(__minbytes__, 1, 4)))
u_int32_t umac_get_u32_le(const void *)
{

Eike
--
Rolf Eike Beer, emlix GmbH, https://www.emlix.com
Fon +49 551 30664-0, Fax +49 551 30664-11
Gothaer Platz 3, 37083 Göttingen, Germany
Sitz der Gesellschaft: Göttingen, Amtsgericht Göttingen HR B 3160
Geschäftsführung: Heike Jordan, Dr. Uwe Kracke – Ust-IdNr.: DE 205 198 055

emlix - smart embedded open source
Re: [PATCH] Improve endian conversion in umac.c [ In reply to ]
On 3/11/22 12:56 AM, Rolf Eike Beer wrote:
>> + * Forward declaration of the functions in order to maintain
>> + * the attributes.
>> + * Chris Rapier <rapier@psc.edu> 2022-03-09 */
>> +
>> +static u_int32_t umac_get_u32_le(const void *)
>> + __attribute__((__bounded__(__minbytes__, 1, 4)));
>> +
>> +static u_int32_t
>> +umac_get_u32_le(const void *vp)
>
> You should be able to combine this into one statement if you write:
>
> static __attribute__((__bounded__(__minbytes__, 1, 4)))
> u_int32_t umac_get_u32_le(const void *)
> {

Excellent! Thank you. I had it in one statement before but the order was
wrong so it kept throwing errors.



diff --git a/misc.c b/misc.c
index 0134d694..20a2a186 100644
--- a/misc.c
+++ b/misc.c
@@ -1533,20 +1533,6 @@ get_u32(const void *vp)
return (v);
}

-u_int32_t
-get_u32_le(const void *vp)
-{
- const u_char *p = (const u_char *)vp;
- u_int32_t v;
-
- v = (u_int32_t)p[0];
- v |= (u_int32_t)p[1] << 8;
- v |= (u_int32_t)p[2] << 16;
- v |= (u_int32_t)p[3] << 24;
-
- return (v);
-}
-
u_int16_t
get_u16(const void *vp)
{
@@ -1585,17 +1571,6 @@ put_u32(void *vp, u_int32_t v)
p[3] = (u_char)v & 0xff;
}

-void
-put_u32_le(void *vp, u_int32_t v)
-{
- u_char *p = (u_char *)vp;
-
- p[0] = (u_char)v & 0xff;
- p[1] = (u_char)(v >> 8) & 0xff;
- p[2] = (u_char)(v >> 16) & 0xff;
- p[3] = (u_char)(v >> 24) & 0xff;
-}
-
void
put_u16(void *vp, u_int16_t v)
{
diff --git a/misc.h b/misc.h
index 2e2dca54..b0c64270 100644
--- a/misc.h
+++ b/misc.h
@@ -154,12 +154,6 @@ void put_u32(void *, u_int32_t)
void put_u16(void *, u_int16_t)
__attribute__((__bounded__( __minbytes__, 1, 2)));

-/* Little-endian store/load, used by umac.c */
-u_int32_t get_u32_le(const void *)
- __attribute__((__bounded__(__minbytes__, 1, 4)));
-void put_u32_le(void *, u_int32_t)
- __attribute__((__bounded__(__minbytes__, 1, 4)));
-
struct bwlimit {
size_t buflen;
u_int64_t rate; /* desired rate in kbit/s */
diff --git a/umac.c b/umac.c
index e5ec19f0..81040b97 100644
--- a/umac.c
+++ b/umac.c
@@ -134,15 +134,48 @@ typedef unsigned int UWORD; /* Register */
/* --- Endian Conversion --- Forcing assembly on some platforms
*/
/*
---------------------------------------------------------------------- */

+
+/* Using local statically defined versions of the get/put functions
+ * found in misc.c allows them to be inlined. This improves throughput
+ * performance by 10% to 15% on well connected (10Gb/s+) systems.
+ * Chris Rapier <rapier@psc.edu> 2022-03-09 */
+
+static __attribute__((__bounded__(__minbytes__, 1, 4)))
+u_int32_t umac_get_u32_le(const void *vp)
+{
+ const u_char *p = (const u_char *)vp;
+ u_int32_t v;
+
+ v = (u_int32_t)p[0];
+ v |= (u_int32_t)p[1] << 8;
+ v |= (u_int32_t)p[2] << 16;
+ v |= (u_int32_t)p[3] << 24;
+
+ return (v);
+}
+
+#if (! __LITTLE_ENDIAN__) /* compile time warning thrown otherwise */
+static __attribute__((__bounded__(__minbytes__, 1, 4)));
+void umac_put_u32_le(void *vp, u_int32_t v)
+{
+ u_char *p = (u_char *)vp;
+
+ p[0] = (u_char)v & 0xff;
+ p[1] = (u_char)(v >> 8) & 0xff;
+ p[2] = (u_char)(v >> 16) & 0xff;
+ p[3] = (u_char)(v >> 24) & 0xff;
+}
+#endif
+
#if (__LITTLE_ENDIAN__)
#define LOAD_UINT32_REVERSED(p) get_u32(p)
#define STORE_UINT32_REVERSED(p,v) put_u32(p,v)
#else
-#define LOAD_UINT32_REVERSED(p) get_u32_le(p)
-#define STORE_UINT32_REVERSED(p,v) put_u32_le(p,v)
+#define LOAD_UINT32_REVERSED(p) umac_get_u32_le(p)
+#define STORE_UINT32_REVERSED(p,v) umac_put_u32_le(p,v)
#endif

-#define LOAD_UINT32_LITTLE(p) (get_u32_le(p))
+#define LOAD_UINT32_LITTLE(p) (umac_get_u32_le(p))
#define STORE_UINT32_BIG(p,v) put_u32(p, v)

/*
---------------------------------------------------------------------- */
_______________________________________________
openssh-unix-dev mailing list
openssh-unix-dev@mindrot.org
https://lists.mindrot.org/mailman/listinfo/openssh-unix-dev