AArch64: Add Neon implementation of load_tmvs (!1774) · Merge requests · VideoLAN / dav1d

This patch adds a vectorised variant of the mv_projection calculation and a faster initialisation of motion vectors for load_tmvs_neon.
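
For readers unfamiliar with the mv_projection step, here is a minimal scalar sketch in C of the kind of calculation being vectorised. It is not the code from this patch. The names `mv_t`, `clip_int`, `project_component` and `project_mv` are hypothetical. The reciprocal is assumed to be floor(2^14 / den), and the rounding (to nearest, ties away from zero) and the clip range are assumed to follow the AV1 motion vector projection process.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a motion vector; not dav1d's own type. */
typedef struct { int16_t y, x; } mv_t;

static int clip_int(int v, int lo, int hi) {
    return v < lo ? lo : v > hi ? hi : v;
}

/* Scale one component by num/den using a 14-bit fixed-point reciprocal.
 * Assumption: the reciprocal is floor(2^14 / den). */
static int project_component(int v, int num, int den) {
    assert(den > 0 && den < 32);
    assert(num > -32 && num < 32);
    const int frac = num * ((1 << 14) / den);
    /* 64-bit product so the sketch cannot overflow for any int16 input. */
    const int64_t prod = (int64_t)v * frac;
    /* Round to nearest on the magnitude, then restore the sign. */
    const int64_t mag = ((prod < 0 ? -prod : prod) + (1 << 13)) >> 14;
    const int rounded = (int)(prod < 0 ? -mag : mag);
    /* Assumed clip range: +/-((1 << 14) - 1). */
    return clip_int(rounded, -0x3fff, 0x3fff);
}

static mv_t project_mv(mv_t mv, int num, int den) {
    mv_t out;
    out.y = (int16_t)project_component(mv.y, num, den);
    out.x = (int16_t)project_component(mv.x, num, den);
    return out;
}
```

Under these assumptions, `project_mv((mv_t){ .y = 100, .x = -40 }, 2, 4)` yields y = 50 and x = -20. Each component is computed independently with the same multiply, round and clip sequence, which is what makes the calculation a natural fit for SIMD lanes.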

Relative performance of load_tmvs_neon measured with checkasm on some Neoverse and Cortex CPU cores, compared to the C reference compiled with GCC-13 and Clang-19:

                     GCC    Clang
 AWS Graviton 4:   1.62x    1.59x
 Cortex-X4:        1.45x    1.46x
 Cortex-X3:        1.68x    1.69x
 Cortex-X1:        1.55x    1.52x
 Cortex-A720:      1.54x    1.57x
 Cortex-A715:      1.47x    1.55x
 Cortex-A78:       1.21x    1.18x
 Cortex-A76:       1.38x    1.37x
 Cortex-A72:       1.08x    1.11x
 Cortex-A520:      0.97x    1.18x
 Cortex-A510:      0.99x    1.14x
 Cortex-A55:       1.16x    1.23x

This patch increases the .text size by ~660 bytes, but the new code is still about 0.5 KiB smaller than the reference implementation.
