x86: add AVX512-IceLake implementation of HBD 64x16 DCT^2

inv_txfm_add_64x16_dct_dct_0_10bpc_c: 892.0 ( 1.00x)
inv_txfm_add_64x16_dct_dct_0_10bpc_sse4: 131.5 ( 6.78x)
inv_txfm_add_64x16_dct_dct_0_10bpc_avx2: 63.4 (14.07x)
inv_txfm_add_64x16_dct_dct_0_10bpc_avx512icl: 56.8 (15.71x)
inv_txfm_add_64x16_dct_dct_1_10bpc_c: 29253.7 ( 1.00x)
inv_txfm_add_64x16_dct_dct_1_10bpc_sse4: 1639.7 (17.84x)
inv_txfm_add_64x16_dct_dct_1_10bpc_avx2: 1106.8 (26.43x)
inv_txfm_add_64x16_dct_dct_1_10bpc_avx512icl: 532.9 (54.89x)
inv_txfm_add_64x16_dct_dct_2_10bpc_c: 29249.8 ( 1.00x)
inv_txfm_add_64x16_dct_dct_2_10bpc_sse4: 3065.6 ( 9.54x)
inv_txfm_add_64x16_dct_dct_2_10bpc_avx2: 1791.0 (16.33x)
inv_txfm_add_64x16_dct_dct_2_10bpc_avx512icl: 1108.0 (26.40x)
inv_txfm_add_64x16_dct_dct_3_10bpc_c: 29269.1 ( 1.00x)
inv_txfm_add_64x16_dct_dct_3_10bpc_sse4: 3738.2 ( 7.83x)
inv_txfm_add_64x16_dct_dct_3_10bpc_avx2: 1790.9 (16.34x)
inv_txfm_add_64x16_dct_dct_3_10bpc_avx512icl: 1203.8 (24.31x)
inv_txfm_add_64x16_dct_dct_4_10bpc_c: 29337.7 ( 1.00x)
inv_txfm_add_64x16_dct_dct_4_10bpc_sse4: 3749.7 ( 7.82x)
inv_txfm_add_64x16_dct_dct_4_10bpc_avx2: 1791.0 (16.38x)
inv_txfm_add_64x16_dct_dct_4_10bpc_avx512icl: 1203.8 (24.37x)
Edited by Ronald S. Bultje