x86: Add 8-bit ipred z3 AVX-512 (Ice Lake) asm
The z3 algorithm is column-based in nature, but by leveraging the AVX-512 permutation instructions we can convert it into a row-based implementation, which avoids having to transpose the output. Depending on block width, this is roughly 1.4x to 2.2x faster than the existing AVX2 code:
intra_pred_z3_w4_8bpc_c: 208.9 ( 1.00x)
intra_pred_z3_w4_8bpc_ssse3: 53.0 ( 3.94x)
intra_pred_z3_w4_8bpc_avx2: 37.5 ( 5.57x)
intra_pred_z3_w4_8bpc_avx512icl: 25.9 ( 8.07x)
intra_pred_z3_w8_8bpc_c: 609.3 ( 1.00x)
intra_pred_z3_w8_8bpc_ssse3: 81.5 ( 7.48x)
intra_pred_z3_w8_8bpc_avx2: 59.8 (10.19x)
intra_pred_z3_w8_8bpc_avx512icl: 32.1 (19.01x)
intra_pred_z3_w16_8bpc_c: 1644.7 ( 1.00x)
intra_pred_z3_w16_8bpc_ssse3: 172.5 ( 9.54x)
intra_pred_z3_w16_8bpc_avx2: 106.4 (15.46x)
intra_pred_z3_w16_8bpc_avx512icl: 54.2 (30.33x)
intra_pred_z3_w32_8bpc_c: 3569.7 ( 1.00x)
intra_pred_z3_w32_8bpc_ssse3: 351.4 (10.16x)
intra_pred_z3_w32_8bpc_avx2: 210.7 (16.94x)
intra_pred_z3_w32_8bpc_avx512icl: 99.9 (35.74x)
intra_pred_z3_w64_8bpc_c: 8474.9 ( 1.00x)
intra_pred_z3_w64_8bpc_ssse3: 768.6 (11.03x)
intra_pred_z3_w64_8bpc_avx2: 451.1 (18.79x)
intra_pred_z3_w64_8bpc_avx512icl: 203.5 (41.64x)
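
For reference, the column-to-row conversion described above can be sketched in scalar C. This is only an illustration under simplifying assumptions (no edge upsampling or filtering, a `left` array indexed downwards from 0), not the actual asm. The names `z3_sample`, `z3_columns` and `z3_rows` are hypothetical. The per-column `idx`/`frac` tables in `z3_rows` stand in for the index vectors that a byte permutation such as `vpermb` could consume, which is an assumption about how the gather maps onto SIMD.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: one z3 sample, linearly interpolated between two
 * neighbouring left-edge pixels with a 6-bit fraction. Positions at or
 * beyond max_base are clamped to the last edge pixel. */
static uint8_t z3_sample(const uint8_t *left, int max_base, int base, int frac)
{
    if (base >= max_base)
        return left[max_base];
    return (uint8_t)((left[base] * (64 - frac) +
                      left[base + 1] * frac + 32) >> 6);
}

/* Column-based order: each column x has a fixed fraction and walks down
 * the left edge one pixel per output row. A SIMD version of this loop
 * produces vertical vectors, so the result has to be transposed before
 * it can be stored as rows. */
static void z3_columns(uint8_t *dst, int stride, const uint8_t *left,
                       int w, int h, int dy, int max_base)
{
    for (int x = 0, ypos = dy; x < w; x++, ypos += dy) {
        const int frac = ypos & 0x3E;
        int base = ypos >> 6;
        for (int y = 0; y < h; y++, base++)
            dst[y * stride + x] = z3_sample(left, max_base, base, frac);
    }
}

/* Row-based order: the per-column edge index and fraction are computed
 * once, then every output row gathers left[idx[x] + y] for all x.
 * Each row is produced directly in storage order, so no transpose is
 * needed. */
static void z3_rows(uint8_t *dst, int stride, const uint8_t *left,
                    int w, int h, int dy, int max_base)
{
    int idx[64], frac[64];
    assert(w <= 64); /* widest block in the benchmarks above */
    for (int x = 0; x < w; x++) {
        const int ypos = dy * (x + 1);
        idx[x]  = ypos >> 6;
        frac[x] = ypos & 0x3E;
    }
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            dst[y * stride + x] =
                z3_sample(left, max_base, idx[x] + y, frac[x]);
}
```

Both functions write the same pixels; only the loop order differs. Calling them with the same edge, an arbitrary positive `dy` (in 1/64 pixel units) and the same `max_base` should give byte-identical blocks.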