x86: Add high bit-depth ipred z1 AVX-512 (Ice Lake) asm

Uses permutes (like the 8bpc AVX-512) for w4 and w8, and memory loads
from a temporary stack buffer (like AVX2) for the larger widths as this
ended up being optimal.

intra_pred_z1_w4_16bpc_c: 232.4 ( 1.00x)
intra_pred_z1_w4_16bpc_ssse3: 42.0 ( 5.53x)
intra_pred_z1_w4_16bpc_avx2: 35.6 ( 6.53x)
intra_pred_z1_w4_16bpc_avx512icl: 23.2 (10.03x)
intra_pred_z1_w8_16bpc_c: 425.0 ( 1.00x)
intra_pred_z1_w8_16bpc_ssse3: 55.9 ( 7.60x)
intra_pred_z1_w8_16bpc_avx2: 39.3 (10.82x)
intra_pred_z1_w8_16bpc_avx512icl: 26.6 (15.99x)
intra_pred_z1_w16_16bpc_c: 1143.3 ( 1.00x)
intra_pred_z1_w16_16bpc_ssse3: 125.8 ( 9.09x)
intra_pred_z1_w16_16bpc_avx2: 75.7 (15.11x)
intra_pred_z1_w16_16bpc_avx512icl: 64.3 (17.79x)
intra_pred_z1_w32_16bpc_c: 1651.6 ( 1.00x)
intra_pred_z1_w32_16bpc_ssse3: 229.1 ( 7.21x)
intra_pred_z1_w32_16bpc_avx2: 144.6 (11.42x)
intra_pred_z1_w32_16bpc_avx512icl: 91.9 (17.98x)
intra_pred_z1_w64_16bpc_c: 2957.5 ( 1.00x)
intra_pred_z1_w64_16bpc_ssse3: 557.6 ( 5.30x)
intra_pred_z1_w64_16bpc_avx2: 297.4 ( 9.94x)
intra_pred_z1_w64_16bpc_avx512icl: 192.4 (15.37x)