x86: add AVX512-IceLake implementation of HBD 64x64 DCT^2

Ronald S. Bultje requested to merge rbultje/dav1d:itx-avx512icl-hbd-64x64 into master

Also implement a "fast3" path for pass2.dct64 (used when only 1/8th of the coefficients are non-zero), which affects 32x64 as well as 64x64.
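For readers unfamiliar with why a sparse-input fast path helps, here is a rough floating-point sketch. It is not the butterfly-based integer code in this MR: it uses a naive inverse DCT sum, and the tier list `(8, 16, 32, 64)`, the helper names `idct_naive` and `pick_nonzero_count`, and the reading of "1/8th" as "the first 8 of 64 inputs" are all assumptions made for illustration.

```python
import math

N = 64  # transform length (one dct64 column)

def idct_naive(coefs, nonzero=N):
    """Naive floating-point inverse DCT that only sums the first `nonzero` inputs."""
    out = []
    for n in range(N):
        acc = coefs[0] / 2.0 if nonzero > 0 else 0.0
        for k in range(1, nonzero):
            acc += coefs[k] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
        out.append(acc)
    return out

def pick_nonzero_count(coefs):
    """Pick the smallest hypothetical tier that covers every non-zero input."""
    last = max((i for i, c in enumerate(coefs) if c != 0), default=-1)
    for tier in (8, 16, 32, 64):  # hypothetical tiers, not taken from the MR
        if last < tier:
            return tier
    return N

# Only the first 8 of 64 inputs are non-zero (1/8th of the coefficients).
coefs = [0.0] * N
coefs[:8] = [40.0, -12.0, 7.0, 3.0, -2.0, 1.0, 0.0, 5.0]

tier = pick_nonzero_count(coefs)
full = idct_naive(coefs)        # sums all 64 terms per output
fast = idct_naive(coefs, tier)  # sums only the first `tier` terms per output
```

Both calls produce the same output, but the second skips every term whose input is known to be zero. The SIMD code gets a similar saving by dropping the parts of the transform that only ever see zero inputs.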

Before (32x64):

inv_txfm_add_32x64_dct_dct_1_10bpc_c:          51008.6 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4:        3351.9 (15.22x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2:        1419.5 (35.93x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl:    744.8 (68.49x)

After:

inv_txfm_add_32x64_dct_dct_1_10bpc_c:          51019.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4:        3276.1 (15.57x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2:        1420.7 (35.91x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl:    668.3 (76.34x)

(Not sure why the SSE4 speed changed.)
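To make the before/after comparison easier to read, the snippet below recomputes the ratios from the 32x64 numbers above. The cycle counts are copied from the checkasm output. The helper names `speedup_vs_c` and `relative_change` are made up for this example and are not part of checkasm.

```python
# Cycle counts copied from the 32x64 checkasm output above.
before = {"c": 51008.6, "sse4": 3351.9, "avx2": 1419.5, "avx512icl": 744.8}
after = {"c": 51019.5, "sse4": 3276.1, "avx2": 1420.7, "avx512icl": 668.3}

def speedup_vs_c(times):
    """Speedup of each implementation relative to the C reference (the value in parentheses)."""
    return {name: round(times["c"] / t, 2) for name, t in times.items()}

def relative_change(old, new):
    """Percentage change in cycle count from `old` to `new` (negative means faster)."""
    return {name: round(100.0 * (new[name] - old[name]) / old[name], 1) for name in old}

print(speedup_vs_c(after))
print(relative_change(before, after))
```

By this calculation the avx512icl time drops by roughly 10%, the SSE4 time by roughly 2%, and the C and AVX2 times stay within about 0.1%.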

And speed for 64x64:

inv_txfm_add_64x64_dct_dct_0_10bpc_c:           3506.9 ( 1.00x)
inv_txfm_add_64x64_dct_dct_0_10bpc_sse4:         535.6 ( 6.55x)
inv_txfm_add_64x64_dct_dct_0_10bpc_avx2:         223.5 (15.69x)
inv_txfm_add_64x64_dct_dct_0_10bpc_avx512icl:    252.4 (13.89x)
inv_txfm_add_64x64_dct_dct_1_10bpc_c:         108353.7 ( 1.00x)
inv_txfm_add_64x64_dct_dct_1_10bpc_sse4:        6551.9 (16.54x)
inv_txfm_add_64x64_dct_dct_1_10bpc_avx2:        2876.8 (37.66x)
inv_txfm_add_64x64_dct_dct_1_10bpc_avx512icl:   1310.1 (82.70x)
inv_txfm_add_64x64_dct_dct_2_10bpc_c:         108347.6 ( 1.00x)
inv_txfm_add_64x64_dct_dct_2_10bpc_sse4:        7985.4 (13.57x)
inv_txfm_add_64x64_dct_dct_2_10bpc_avx2:        3561.8 (30.42x)
inv_txfm_add_64x64_dct_dct_2_10bpc_avx512icl:   1962.6 (55.20x)
inv_txfm_add_64x64_dct_dct_3_10bpc_c:         108455.5 ( 1.00x)
inv_txfm_add_64x64_dct_dct_3_10bpc_sse4:        9709.0 (11.17x)
inv_txfm_add_64x64_dct_dct_3_10bpc_avx2:        4220.5 (25.70x)
inv_txfm_add_64x64_dct_dct_3_10bpc_avx512icl:   2991.1 (36.26x)
inv_txfm_add_64x64_dct_dct_4_10bpc_c:         108349.9 ( 1.00x)
inv_txfm_add_64x64_dct_dct_4_10bpc_sse4:       11048.0 ( 9.81x)
inv_txfm_add_64x64_dct_dct_4_10bpc_avx2:        4898.1 (22.12x)
inv_txfm_add_64x64_dct_dct_4_10bpc_avx512icl:   3108.1 (34.86x)
