Mailing List Archive

[PATCH 4/5] aria-avx512: small optimization for aria_diff_m
* cipher/aria-gfni-avx512-amd64.S (aria_diff_m): Use 'vpternlogq' for
3-way XOR operation.
---

Using vpternlogq gives a small performance improvement on AMD Zen4. On
Intel Tiger Lake, speed is the same as before.

Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off):

Before:
 ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.204 ns/B      4682 MiB/s     0.957 c/B      4700
        ECB dec |     0.204 ns/B      4668 MiB/s     0.960 c/B      4700
        CTR enc |     0.212 ns/B      4509 MiB/s     0.994 c/B      4700
        CTR dec |     0.212 ns/B      4490 MiB/s     0.998 c/B      4700

After (~3% faster):
 ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.198 ns/B      4812 MiB/s     0.932 c/B      4700
        ECB dec |     0.198 ns/B      4824 MiB/s     0.929 c/B      4700
        CTR enc |     0.204 ns/B      4665 MiB/s     0.961 c/B      4700
        CTR dec |     0.206 ns/B      4631 MiB/s     0.968 c/B      4700
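
As a quick sanity check of the table above, the sketch below recomputes cycles/byte from ns/B at the reported 4700 MHz and derives the per-mode gain. The figures are copied from the benchmark; the helper names are illustrative only and are not part of libgcrypt's benchmark tooling.

```python
# Figures copied from the Zen4 benchmark tables: mode -> (ns/B, cycles/B).
BEFORE = {
    "ECB enc": (0.204, 0.957),
    "ECB dec": (0.204, 0.960),
    "CTR enc": (0.212, 0.994),
    "CTR dec": (0.212, 0.998),
}
AFTER = {
    "ECB enc": (0.198, 0.932),
    "ECB dec": (0.198, 0.929),
    "CTR enc": (0.204, 0.961),
    "CTR dec": (0.206, 0.968),
}
MHZ = 4700  # "auto Mhz" column, turbo off


def cycles_per_byte(ns_per_byte, mhz=MHZ):
    """Convert ns/B to cycles/B: 1 ns at `mhz` MHz is mhz/1000 cycles."""
    return ns_per_byte * mhz / 1000.0


def gain_pct(before_cpb, after_cpb):
    """Relative reduction in cycles/byte, in percent."""
    return 100.0 * (before_cpb - after_cpb) / before_cpb


if __name__ == "__main__":
    for mode in BEFORE:
        b_ns, b_cpb = BEFORE[mode]
        a_ns, a_cpb = AFTER[mode]
        print(f"{mode}: recomputed {cycles_per_byte(b_ns):.3f} -> "
              f"{cycles_per_byte(a_ns):.3f} c/B, "
              f"gain {gain_pct(b_cpb, a_cpb):.1f}%")
```

The recomputed cycles/byte differ slightly from the reported column because ns/B is rounded to three decimals.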

Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
---
cipher/aria-gfni-avx512-amd64.S | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/cipher/aria-gfni-avx512-amd64.S b/cipher/aria-gfni-avx512-amd64.S
index 849c744b..24a49a89 100644
--- a/cipher/aria-gfni-avx512-amd64.S
+++ b/cipher/aria-gfni-avx512-amd64.S
@@ -406,21 +406,17 @@
 	vgf2p8affineinvqb $0, t2, y3, y3;		\
 	vgf2p8affineinvqb $0, t2, y7, y7;
 
-
 #define aria_diff_m(x0, x1, x2, x3,			\
 		    t0, t1, t2, t3)			\
 	/* T = rotr32(X, 8); */				\
 	/* X ^= T */					\
-	vpxorq x0, x3, t0;				\
-	vpxorq x1, x0, t1;				\
-	vpxorq x2, x1, t2;				\
-	vpxorq x3, x2, t3;				\
 	/* X = T ^ rotr(X, 16); */			\
-	vpxorq t2, x0, x0;				\
-	vpxorq x1, t3, t3;				\
-	vpxorq t0, x2, x2;				\
-	vpxorq t1, x3, x1;				\
-	vmovdqu64 t3, x3;
+	vmovdqa64 x0, t0;				\
+	vmovdqa64 x3, t3;				\
+	vpternlogq $0x96, x2, x1, x0;			\
+	vpternlogq $0x96, x2, x1, x3;			\
+	vpternlogq $0x96, t0, t3, x2;			\
+	vpternlogq $0x96, t0, t3, x1;
 
 #define aria_diff_word(x0, x1, x2, x3,			\
 		       x4, x5, x6, x7,			\
--
2.37.2
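
For readers who want to convince themselves that the two instruction sequences in the patch are equivalent, here is a minimal Python sketch. It models one 64-bit lane of vpternlogq as a truth-table lookup (imm8 0x96 has bits 1, 2, 4 and 7 set, i.e. the inputs with odd parity, which is a 3-way XOR) and compares the old vpxorq-based sequence with the new one. The function names are made up for this illustration and do not exist in libgcrypt.

```python
import random

MASK64 = (1 << 64) - 1


def ternlog(imm8, a, b, c):
    """Model one 64-bit lane of vpternlogq.

    For every bit position, the three input bits form an index
    (a << 2 | b << 1 | c) into the 8-bit truth table imm8.
    'a' is the destination operand, 'b' and 'c' the two sources.
    """
    out = 0
    for bit in range(64):
        idx = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        out |= ((imm8 >> idx) & 1) << bit
    return out


def diff_m_old(x0, x1, x2, x3):
    """Removed sequence: eight vpxorq plus one register move."""
    t0 = x0 ^ x3
    t1 = x1 ^ x0
    t2 = x2 ^ x1
    t3 = x3 ^ x2
    x0 = t2 ^ x0
    t3 = x1 ^ t3
    x2 = t0 ^ x2
    x1 = t1 ^ x3  # x3 still holds its original value here
    x3 = t3
    return x0, x1, x2, x3


def diff_m_new(x0, x1, x2, x3):
    """Added sequence: two register moves plus four vpternlogq $0x96."""
    t0, t3 = x0, x3
    x0 = ternlog(0x96, x0, x1, x2)
    x3 = ternlog(0x96, x3, x1, x2)
    x2 = ternlog(0x96, x2, t3, t0)
    x1 = ternlog(0x96, x1, t3, t0)
    return x0, x1, x2, x3


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the check is reproducible
    for _ in range(100):
        x = [rng.getrandbits(64) for _ in range(4)]
        assert ternlog(0x96, x[0], x[1], x[2]) == x[0] ^ x[1] ^ x[2]
        assert diff_m_old(*x) == diff_m_new(*x)
    print("old and new aria_diff_m sequences agree")
```

Both variants leave x0 = x0^x1^x2, x1 = x0^x1^x3, x2 = x0^x2^x3 and x3 = x1^x2^x3 (in terms of the input values); the new one needs six instructions instead of nine.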


_______________________________________________
Gcrypt-devel mailing list
Gcrypt-devel@gnupg.org
https://lists.gnupg.org/mailman/listinfo/gcrypt-devel
Re: [PATCH 4/5] aria-avx512: small optimization for aria_diff_m
On 2/19/23 17:49, Jussi Kivilinna wrote:

Hi Jussi,
Thank you so much for this optimization!

I tested this optimization in the kernel.
It works very well.
On my machine (i3-12100), it improves performance by ~9%, awesome!
It will be really helpful for improving the performance of the
kernel-side aria-avx512 driver.

Thank you so much!
Taehee Yoo

Re: [PATCH 4/5] aria-avx512: small optimization for aria_diff_m
Hello,

On 20.2.2023 12.54, Taehee Yoo wrote:
> On 2/19/23 17:49, Jussi Kivilinna wrote:
>
> Hi Jussi,
> Thank you so much for this optimization!
>
> I tested this optimization in the kernel.
> It works very well.
> On my machine (i3-12100), it improves performance by ~9%, awesome!

Interesting. I'd expect Alder Lake to behave similarly to Tiger Lake. Did you
test with the version that has unrolled round functions?

In libgcrypt, I changed from round unrolling to using loops in order to reduce
code size and to allow the code to fit into the uop cache. Maybe the speed
increase happens because vpternlogq reduces the code size of the unrolled
version enough that the algorithm fits into the i3-12100's uop cache, giving
the extra performance.

-Jussi

Re: [PATCH 4/5] aria-avx512: small optimization for aria_diff_m
On 2023. 2. 21. ?? 2:38, Jussi Kivilinna wrote:

Hi Jussi,

> Hello,
>
> On 20.2.2023 12.54, Taehee Yoo wrote:
>> On 2/19/23 17:49, Jussi Kivilinna wrote:
>>
>> Hi Jussi,
>> Thank you so much for this optimization!
>>
>> I tested this optimization in the kernel.
>> It works very well.
>> On my machine (i3-12100), it improves performance by ~9%, awesome!
>
> Interesting. I'd expect Alder Lake to behave similarly to Tiger Lake. Did you
> test with the version that has unrolled round functions?
>
> In libgcrypt, I changed from round unrolling to using loops in order to reduce
> code size and to allow the code to fit into the uop cache. Maybe the speed
> increase happens because vpternlogq reduces the code size of the unrolled
> version enough that the algorithm fits into the i3-12100's uop cache, giving
> the extra performance.
>

After your response, I retested it and found that my benchmark data was wrong.
When I first implemented aria-avx512, the benchmark result was as below.

testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
tcrypt: 1 operation in 1504 cycles (1024 bytes)
tcrypt: 1 operation in 4595 cycles (4096 bytes)
tcrypt: 1 operation in 1763 cycles (1024 bytes)
tcrypt: 1 operation in 5540 cycles (4096 bytes)
testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
tcrypt: 1 operation in 1502 cycles (1024 bytes)
tcrypt: 1 operation in 4615 cycles (4096 bytes)
tcrypt: 1 operation in 1759 cycles (1024 bytes)
tcrypt: 1 operation in 5554 cycles (4096 bytes)

But the current result, without your optimization, is like this.
tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
tcrypt: 1 operation in 1443 cycles (1024 bytes)
tcrypt: 1 operation in 4396 cycles (4096 bytes)
tcrypt: 1 operation in 1683 cycles (1024 bytes)
tcrypt: 1 operation in 5368 cycles (4096 bytes)
tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
tcrypt: 1 operation in 1458 cycles (1024 bytes)
tcrypt: 1 operation in 4416 cycles (4096 bytes)
tcrypt: 1 operation in 1723 cycles (1024 bytes)
tcrypt: 1 operation in 5358 cycles (4096 bytes)

So, the result after your optimization is like this.
tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
tcrypt: 1 operation in 1388 cycles (1024 bytes)
tcrypt: 1 operation in 4107 cycles (4096 bytes)
tcrypt: 1 operation in 1595 cycles (1024 bytes)
tcrypt: 1 operation in 5011 cycles (4096 bytes)
tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
tcrypt: 1 operation in 1379 cycles (1024 bytes)
tcrypt: 1 operation in 4163 cycles (4096 bytes)
tcrypt: 1 operation in 1603 cycles (1024 bytes)
tcrypt: 1 operation in 5098 cycles (4096 bytes)

The 9% performance gain I mentioned was actually wrong.
I don't know why the result changed... anyway, this optimization
increases performance by 5~7%.
Also, I tested it with both the loop and the unrolled versions, but I
couldn't find any performance gap between them.
I don't have enough knowledge about the uop cache, so I couldn't provide
anything useful on that point.
Sorry that the previous benchmark result was wrong.
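
The gain can be recomputed directly from the tcrypt cycle counts quoted above. The sketch below is only an illustration: it takes the "current" run as the baseline and the run with the optimization as the result, pairs the lines in the order they were printed, and uses helper names invented for this example.

```python
# Cycle counts copied from the tcrypt output above, in printed order.
# Baseline = the "current" run, optimized = the run with vpternlogq.
BASELINE = {
    "encryption": [1443, 4396, 1683, 5368],
    "decryption": [1458, 4416, 1723, 5358],
}
OPTIMIZED = {
    "encryption": [1388, 4107, 1595, 5011],
    "decryption": [1379, 4163, 1603, 5098],
}


def cycle_reduction_pct(before, after):
    """Percentage of cycles saved going from `before` to `after`."""
    return 100.0 * (before - after) / before


def all_reductions():
    """Return {direction: [percent saved per printed line]}."""
    return {
        direction: [
            cycle_reduction_pct(b, a)
            for b, a in zip(BASELINE[direction], OPTIMIZED[direction])
        ]
        for direction in BASELINE
    }


if __name__ == "__main__":
    for direction, values in all_reductions().items():
        print(direction, " ".join(f"{v:.1f}%" for v in values))
```

Most of the eight data points land between roughly 5% and 7%; the first encryption line is somewhat lower at a little under 4%.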

Thank you so much!
Taehee Yoo

Re: [PATCH 4/5] aria-avx512: small optimization for aria_diff_m
On 22.2.2023 14.07, Taehee Yoo wrote:
> The 9% performance gain I mentioned was actually wrong.
> I don't know why the result changed... anyway, this optimization increases performance by 5~7%.
> Also, I tested it with both the loop and the unrolled versions, but I couldn't find any performance gap between them.
> I don't have enough knowledge about the uop cache, so I couldn't provide anything useful on that point.
> Sorry that the previous benchmark result was wrong.

Ok, thanks for testing. I was just wondering where the improvement came from.

Anyway, good to see that there was a performance increase on another CPU in
addition to AMD Zen4.

-Jussi