arm64: ipred: 16 bpc NEON implementation of the Z1 and Z3 functions

As usual, there are a handful of minor things to fix in the 8 bpc case that I noticed after taking a closer look at it again.

Overall relative speedup over C code:
                               Cortex A53    A55    A72    A73    A76  Apple M1
intra_pred_z1_w4_16bpc_neon:         3.49   2.63   2.83   3.85   3.14      9.00
intra_pred_z1_w8_16bpc_neon:         6.19   4.39   3.65   6.58   4.99      6.50
intra_pred_z1_w16_16bpc_neon:        6.65   4.64   3.97   7.78   4.87      7.00
intra_pred_z1_w32_16bpc_neon:        7.76   5.49   5.17   7.83   5.59      8.24
intra_pred_z1_w64_16bpc_neon:        8.02   5.80   5.33   8.41   5.77      8.70
intra_pred_z3_w4_16bpc_neon:         3.06   2.87   2.17   1.97   2.33      7.75
intra_pred_z3_w8_16bpc_neon:         3.90   3.94   2.97   3.16   2.93      4.43
intra_pred_z3_w16_16bpc_neon:        4.08   4.48   3.31   4.68   3.13      5.00
intra_pred_z3_w32_16bpc_neon:        4.43   4.85   3.50   4.02   3.33      5.62
intra_pred_z3_w64_16bpc_neon:        4.68   5.30   3.72   3.96   3.52      5.78