arm64: ipred: 16 bpc NEON implementation of the Z1 and Z3 functions

Martin Storsjö requested to merge mstorsjo/dav1d:arm64-z1-z3-16bpc into master

As usual, there are a handful of minor things to fix in the 8 bpc case that I noticed after taking a closer look at it again.

Overall relative speedup over C code:

                          Cortex A53    A55    A72    A73    A76   Apple M1
intra_pred_z1_w4_16bpc_neon:    3.49   2.63   2.83   3.85   3.14   9.00
intra_pred_z1_w8_16bpc_neon:    6.19   4.39   3.65   6.58   4.99   6.50
intra_pred_z1_w16_16bpc_neon:   6.65   4.64   3.97   7.78   4.87   7.00
intra_pred_z1_w32_16bpc_neon:   7.76   5.49   5.17   7.83   5.59   8.24
intra_pred_z1_w64_16bpc_neon:   8.02   5.80   5.33   8.41   5.77   8.70
intra_pred_z3_w4_16bpc_neon:    3.06   2.87   2.17   1.97   2.33   7.75
intra_pred_z3_w8_16bpc_neon:    3.90   3.94   2.97   3.16   2.93   4.43
intra_pred_z3_w16_16bpc_neon:   4.08   4.48   3.31   4.68   3.13   5.00
intra_pred_z3_w32_16bpc_neon:   4.43   4.85   3.50   4.02   3.33   5.62
intra_pred_z3_w64_16bpc_neon:   4.68   5.30   3.72   3.96   3.52   5.78
