arm64: refmvs: Add NEON implementation of save_tmvs

                  Cortex A53       A55       A72       A73       A76  Apple M1
save_tmvs_c:        116768.4  122653.1   82587.7   90445.0   45386.8     242.1
save_tmvs_neon:      79184.7   79889.9   54720.2   54522.6   29919.6     216.4

Relative speedup compared with C:

                  Cortex A53       A55       A72       A73       A76  Apple M1
save_tmvs_neon:         1.47      1.54      1.51      1.66      1.52      1.12

The second commit changes the implementation to process two blocks at a time, as the x86 implementation does. However, that gives only marginal gains on some cores and actually makes the code slower on others.