arm64: ipred: 16 bpc NEON implementation of the Z2 function

Relative speedup over unvectorized C code:

                               Cortex A53    A55    A72    A73    A76  Apple M1
intra_pred_z2_w4_16bpc_neon:         2.98   2.98   2.38   2.77   3.19      7.75
intra_pred_z2_w8_16bpc_neon:         3.91   4.22   2.64   3.29   3.73      4.78
intra_pred_z2_w16_16bpc_neon:        4.43   5.12   2.89   3.90   3.50      4.26
intra_pred_z2_w32_16bpc_neon:        5.08   6.36   3.44   4.40   4.05      4.96
intra_pred_z2_w64_16bpc_neon:        4.68   5.97   3.29   4.40   3.68      5.23