x86: Add 8-bit ipred z1 AVX-512 (Ice Lake) asm

With AVX-512 we can now keep all the data in registers and use
vpermb/vpermi2b to load the input pixels instead of having to do
memory loads from a stack buffer.

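For illustration only, here is a minimal scalar C model of the idea, not
the actual asm. model_vpermb and model_vpermi2b mimic the byte-permute
semantics of the two instructions on 64-byte registers. z1_row_model is a
hypothetical helper that uses them to pick the two neighboring edge pixels
for each output pixel of one row. The interpolation formula (6-bit
fractional position, rounding with +32) and the clamping of indices to 63
are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of vpermb on a 64-byte register:
 * dst[i] = src[idx[i] & 63]. */
static void model_vpermb(uint8_t dst[64], const uint8_t idx[64],
                         const uint8_t src[64])
{
    uint8_t tmp[64];
    for (int i = 0; i < 64; i++)
        tmp[i] = src[idx[i] & 63];
    memcpy(dst, tmp, 64);
}

/* Scalar model of vpermi2b: bit 6 of each index selects table a (0)
 * or table b (1), the low 6 bits select the byte within that table. */
static void model_vpermi2b(uint8_t dst[64], const uint8_t idx[64],
                           const uint8_t a[64], const uint8_t b[64])
{
    uint8_t tmp[64];
    for (int i = 0; i < 64; i++) {
        const uint8_t *tbl = (idx[i] & 64) ? b : a;
        tmp[i] = tbl[idx[i] & 63];
    }
    memcpy(dst, tmp, 64);
}

/* Hypothetical helper: predict one row from edge pixels that are already
 * held in a 64-byte "register". xpos is the row's position along the edge
 * in 1/64 pixel units (assumption for this sketch). */
static void z1_row_model(uint8_t *dst, int width, const uint8_t edge[64],
                         int xpos)
{
    assert(width >= 1 && width <= 64);
    const int base = xpos >> 6;   /* integer pixel offset */
    const int frac = xpos & 0x3e; /* fractional weight, 0..62 */
    uint8_t idx0[64], idx1[64], p0[64], p1[64];

    for (int x = 0; x < 64; x++) {
        const int i0 = base + x, i1 = base + x + 1;
        idx0[x] = (uint8_t)(i0 > 63 ? 63 : i0); /* clamp: assumption */
        idx1[x] = (uint8_t)(i1 > 63 ? 63 : i1);
    }
    /* Two in-register permutes replace loads from a stack buffer. */
    model_vpermb(p0, idx0, edge);
    model_vpermb(p1, idx1, edge);

    for (int x = 0; x < width; x++)
        dst[x] = (uint8_t)((p0[x] * (64 - frac) + p1[x] * frac + 32) >> 6);
}
```

When the edge is longer than 64 bytes, the same gather could use
model_vpermi2b with the edge split across two tables, which is presumably
why both instructions are mentioned above.
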
intra_pred_z1_w4_8bpc_c: 204.5 ( 1.00x)
intra_pred_z1_w4_8bpc_ssse3: 37.6 ( 5.43x)
intra_pred_z1_w4_8bpc_avx2: 37.8 ( 5.41x)
intra_pred_z1_w4_8bpc_avx512icl: 23.5 ( 8.70x)
intra_pred_z1_w8_8bpc_c: 342.5 ( 1.00x)
intra_pred_z1_w8_8bpc_ssse3: 58.1 ( 5.90x)
intra_pred_z1_w8_8bpc_avx2: 40.9 ( 8.38x)
intra_pred_z1_w8_8bpc_avx512icl: 27.5 (12.44x)
intra_pred_z1_w16_8bpc_c: 1162.9 ( 1.00x)
intra_pred_z1_w16_8bpc_ssse3: 124.8 ( 9.31x)
intra_pred_z1_w16_8bpc_avx2: 79.5 (14.63x)
intra_pred_z1_w16_8bpc_avx512icl: 45.1 (25.76x)
intra_pred_z1_w32_8bpc_c: 1927.0 ( 1.00x)
intra_pred_z1_w32_8bpc_ssse3: 254.9 ( 7.56x)
intra_pred_z1_w32_8bpc_avx2: 160.5 (12.01x)
intra_pred_z1_w32_8bpc_avx512icl: 110.3 (17.47x)
intra_pred_z1_w64_8bpc_c: 2962.4 ( 1.00x)
intra_pred_z1_w64_8bpc_ssse3: 504.7 ( 5.87x)
intra_pred_z1_w64_8bpc_avx2: 284.4 (10.42x)
intra_pred_z1_w64_8bpc_avx512icl: 245.8 (12.05x)