x86: add AVX512-IceLake implementation of HBD 16x64 DCT^2
nop: 39.4
inv_txfm_add_16x64_dct_dct_0_10bpc_c: 2208.0 ( 1.00x)
inv_txfm_add_16x64_dct_dct_0_10bpc_sse4: 133.5 (16.54x)
inv_txfm_add_16x64_dct_dct_0_10bpc_avx2: 71.3 (30.98x)
inv_txfm_add_16x64_dct_dct_0_10bpc_avx512icl: 102.0 (21.66x)
inv_txfm_add_16x64_dct_dct_1_10bpc_c: 25757.0 ( 1.00x)
inv_txfm_add_16x64_dct_dct_1_10bpc_sse4: 1366.1 (18.85x)
inv_txfm_add_16x64_dct_dct_1_10bpc_avx2: 657.6 (39.17x)
inv_txfm_add_16x64_dct_dct_1_10bpc_avx512icl: 378.9 (67.98x)
inv_txfm_add_16x64_dct_dct_2_10bpc_c: 25771.0 ( 1.00x)
inv_txfm_add_16x64_dct_dct_2_10bpc_sse4: 1739.7 (14.81x)
inv_txfm_add_16x64_dct_dct_2_10bpc_avx2: 772.1 (33.38x)
inv_txfm_add_16x64_dct_dct_2_10bpc_avx512icl: 469.3 (54.92x)
inv_txfm_add_16x64_dct_dct_3_10bpc_c: 25775.7 ( 1.00x)
inv_txfm_add_16x64_dct_dct_3_10bpc_sse4: 1968.1 (13.10x)
inv_txfm_add_16x64_dct_dct_3_10bpc_avx2: 886.5 (29.08x)
inv_txfm_add_16x64_dct_dct_3_10bpc_avx512icl: 662.6 (38.90x)
inv_txfm_add_16x64_dct_dct_4_10bpc_c: 25745.9 ( 1.00x)
inv_txfm_add_16x64_dct_dct_4_10bpc_sse4: 2330.9 (11.05x)
inv_txfm_add_16x64_dct_dct_4_10bpc_avx2: 1008.5 (25.53x)
inv_txfm_add_16x64_dct_dct_4_10bpc_avx512icl: 662.3 (38.88x)
The dc_only case being slower than AVX2 is a real effect; I'm not sure why. The 16x32 is also slightly slower than AVX2 in my tests, but the obvious fix (unrolling the loop x2 to compensate for the 1-cycle latency of the ZMM instructions) speeds up 16x32 and below, but not 16x64. Suggestions are very welcome here.
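For context, the entire .dconly path just adds one rounded DC term to every pixel and clamps to the 10-bit range, so the loop body is almost nothing but loads and stores. A rough scalar sketch of what the loop computes (illustrative only; the function name is made up, stride is in pixels here for simplicity, and the DC scaling chain from the asm is not reproduced):

#include <stddef.h>
#include <stdint.h>

/* Scalar model of the .dconly loop: add the pre-scaled DC value to every
 * pixel and clamp to [0, 1023]. Not the actual dav1d C code. */
static void dconly_add_10bpc(uint16_t *dst, ptrdiff_t stride,
                             int w, int h, int dc)
{
    const int pixel_max = (1 << 10) - 1;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int v = dst[x] + dc;
            dst[x] = v < 0 ? 0 : v > pixel_max ? pixel_max : v;
        }
        dst += stride;
    }
}

Each vector iteration is therefore just a couple of loads, two arithmetic ops and a couple of stores, so it is presumably bound by memory traffic and instruction latency rather than ALU width.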
Numbers for unrolling 16x8's .dconly loop x2 (before => after):
inv_txfm_add_16x4_dct_dct_0_10bpc_c: 156.1 ( 1.00x)
inv_txfm_add_16x4_dct_dct_0_10bpc_sse4: 17.2 ( 9.08x)
inv_txfm_add_16x4_dct_dct_0_10bpc_avx2: 12.9 (12.14x)
inv_txfm_add_16x8_dct_dct_0_10bpc_c: 294.1 ( 1.00x)
inv_txfm_add_16x8_dct_dct_0_10bpc_sse4: 27.0 (10.89x)
inv_txfm_add_16x8_dct_dct_0_10bpc_avx2: 17.3 (16.99x)
inv_txfm_add_16x8_dct_dct_0_10bpc_avx512icl: 16.9 (17.44x) => 15.1 (19.44x)
inv_txfm_add_16x16_dct_dct_0_10bpc_c: 583.9 ( 1.00x)
inv_txfm_add_16x16_dct_dct_0_10bpc_sse4: 39.2 (14.89x)
inv_txfm_add_16x16_dct_dct_0_10bpc_avx2: 26.7 (21.88x)
inv_txfm_add_16x16_dct_dct_0_10bpc_avx512icl: 24.7 (23.64x) => 22.4 (26.14x)
inv_txfm_add_16x32_dct_dct_0_10bpc_c: 1122.9 ( 1.00x)
inv_txfm_add_16x32_dct_dct_0_10bpc_sse4: 71.6 (15.69x)
inv_txfm_add_16x32_dct_dct_0_10bpc_avx2: 41.5 (27.05x)
inv_txfm_add_16x32_dct_dct_0_10bpc_avx512icl: 46.7 (24.06x) => 40.5 (27.74x)
inv_txfm_add_16x64_dct_dct_0_10bpc_c: 2211.7 ( 1.00x)
inv_txfm_add_16x64_dct_dct_0_10bpc_sse4: 132.7 (16.67x)
inv_txfm_add_16x64_dct_dct_0_10bpc_avx2: 65.8 (33.62x)
inv_txfm_add_16x64_dct_dct_0_10bpc_avx512icl: 99.8 (22.16x) => 99.3 (22.30x)
diff:
$ git diff
diff --git a/src/x86/itx16_avx512.asm b/src/x86/itx16_avx512.asm
index d973655..1c28cbc 100644
--- a/src/x86/itx16_avx512.asm
+++ b/src/x86/itx16_avx512.asm
@@ -1015,21 +1015,26 @@ cglobal iidentity_8x16_internal_10bpc, 0, 7, 16, dst, stride, c, eob, tx2
add r6d, 384
sar r6d, 9
.dconly2:
- vpbroadcastd m2, [o(dconly_10bpc)]
+ vpbroadcastd m3, [o(dconly_10bpc)]
imul r6d, 181
add r6d, 2176
sar r6d, 12
- vpbroadcastw m1, r6d
- paddsw m1, m2
+ vpbroadcastw m2, r6d
+ paddsw m2, m3
+ lea r2, [strideq*3]
.dconly_loop:
mova ym0, [dstq+strideq*0]
+ mova ym1, [dstq+strideq*2]
vinserti32x8 m0, [dstq+strideq*1], 1
- paddsw m0, m1
- psubusw m0, m2
+ vinserti32x8 m1, [dstq+r2], 1
+ REPX {paddsw x, m2}, m0, m1
+ REPX {psubusw x, m3}, m0, m1
mova [dstq+strideq*0], ym0
vextracti32x8 [dstq+strideq*1], m0, 1
- lea dstq, [dstq+strideq*2]
- sub r3d, 2
+ mova [dstq+strideq*2], ym1
+ vextracti32x8 [dstq+r2 ], m1, 1
+ lea dstq, [dstq+strideq*4]
+ sub r3d, 4
jg .dconly_loop
RET
%endif
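Side note for reviewers unfamiliar with the paddsw/psubusw pair in the loop above: it performs the clamp to [0, 1023] without separate min/max instructions, by biasing the DC term so that signed saturation catches the upper bound and unsigned saturation catches the lower bound. A scalar sketch of the idea; the bias value is my assumption about what dconly_10bpc holds, not something checked against the constant table:

#include <stdint.h>

/* Clamp-via-saturation, scalar model of the paddsw + psubusw pair.
 * bias is assumed to be 0x7fff - pixel_max; this mirrors the usual trick
 * and is not verified against the dconly_10bpc constant. Valid as long as
 * px + dc >= -bias, which should hold for in-range DC values. */
static uint16_t add_clamp_10bpc(uint16_t px, int16_t dc)
{
    const int pixel_max = (1 << 10) - 1;
    const int bias = 0x7fff - pixel_max;
    /* paddsw: signed saturating add; anything above pixel_max saturates
     * to 0x7fff, i.e. pixel_max + bias. */
    int32_t t = px + dc + bias;
    int16_t sat = t > 32767 ? 32767 : t < -32768 ? -32768 : (int16_t)t;
    /* psubusw: unsigned saturating subtract of the bias; anything that
     * went below zero ends up smaller than bias and clamps to 0. */
    uint16_t u = (uint16_t)sat;
    return u < (uint16_t)bias ? 0 : (uint16_t)(u - bias);
}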
Even implementing a straight YMM-only version, identical to AVX2, leaves the same slowdown for both 16x32 and 16x64.