x86: add AVX512-IceLake implementation of HBD 64x64 DCT^2

Also implement a "fast3" path for pass2.dct64 (for the case where 1/8th
of the coefficients are non-zero), which affects 32x64 as well as 64x64.
Speed for 32x64, before:
inv_txfm_add_32x64_dct_dct_1_10bpc_c:          51008.6 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4:        3351.9 (15.22x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2:        1419.5 (35.93x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl:    744.8 (68.49x)
After:
inv_txfm_add_32x64_dct_dct_1_10bpc_c:          51019.5 ( 1.00x)
inv_txfm_add_32x64_dct_dct_1_10bpc_sse4:        3276.1 (15.57x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx2:        1420.7 (35.91x)
inv_txfm_add_32x64_dct_dct_1_10bpc_avx512icl:    668.3 (76.34x)
(Not sure why the SSE4 speed changed.)
And speed for 64x64:
inv_txfm_add_64x64_dct_dct_0_10bpc_c:           3506.9 ( 1.00x)
inv_txfm_add_64x64_dct_dct_0_10bpc_sse4:         535.6 ( 6.55x)
inv_txfm_add_64x64_dct_dct_0_10bpc_avx2:         223.5 (15.69x)
inv_txfm_add_64x64_dct_dct_0_10bpc_avx512icl:    252.4 (13.89x)
inv_txfm_add_64x64_dct_dct_1_10bpc_c:         108353.7 ( 1.00x)
inv_txfm_add_64x64_dct_dct_1_10bpc_sse4:        6551.9 (16.54x)
inv_txfm_add_64x64_dct_dct_1_10bpc_avx2:        2876.8 (37.66x)
inv_txfm_add_64x64_dct_dct_1_10bpc_avx512icl:   1310.1 (82.70x)
inv_txfm_add_64x64_dct_dct_2_10bpc_c:         108347.6 ( 1.00x)
inv_txfm_add_64x64_dct_dct_2_10bpc_sse4:        7985.4 (13.57x)
inv_txfm_add_64x64_dct_dct_2_10bpc_avx2:        3561.8 (30.42x)
inv_txfm_add_64x64_dct_dct_2_10bpc_avx512icl:   1962.6 (55.20x)
inv_txfm_add_64x64_dct_dct_3_10bpc_c:         108455.5 ( 1.00x)
inv_txfm_add_64x64_dct_dct_3_10bpc_sse4:        9709.0 (11.17x)
inv_txfm_add_64x64_dct_dct_3_10bpc_avx2:        4220.5 (25.70x)
inv_txfm_add_64x64_dct_dct_3_10bpc_avx512icl:   2991.1 (36.26x)
inv_txfm_add_64x64_dct_dct_4_10bpc_c:         108349.9 ( 1.00x)
inv_txfm_add_64x64_dct_dct_4_10bpc_sse4:       11048.0 ( 9.81x)
inv_txfm_add_64x64_dct_dct_4_10bpc_avx2:        4898.1 (22.12x)
inv_txfm_add_64x64_dct_dct_4_10bpc_avx512icl:   3108.1 (34.86x)