AArch64: Optimize the init of DotProd+ 2D subpel filters
Removed some unnecessary vector register copies from the initial
horizontal filter parts of the HV subpel filters. The performance
improvements are larger for the smaller filter block sizes.
The narrowing shifts at the end of the *filter8* were also rewritten,
because the previous form was only beneficial for the Cortex-A55 among
the DotProd-capable CPU cores. On other out-of-order or newer CPUs the
UZP1+SHRN instruction combination is better.
Relative performance of micro benchmarks (lower is better):
Cortex-A55:
mct regular w4: 0.980x mct sharp w4: 0.983x
mct regular w8: 1.007x mct sharp w8: 1.012x
mct regular w16: 1.007x mct sharp w16: 1.005x
Cortex-A510:
mct regular w4: 0.935x mct sharp w4: 0.927x
mct regular w8: 0.984x mct sharp w8: 0.983x
mct regular w16: 0.986x mct sharp w16: 0.987x
Cortex-A78:
mct regular w4: 0.974x mct sharp w4: 0.971x
mct regular w8: 0.988x mct sharp w8: 0.987x
mct regular w16: 0.991x mct sharp w16: 0.979x
Cortex-A715:
mct regular w4: 0.958x mct sharp w4: 0.974x
mct regular w8: 0.993x mct sharp w8: 0.991x
mct regular w16: 0.998x mct sharp w16: 0.997x
Cortex-X1:
mct regular w4: 0.983x mct sharp w4: 0.974x
mct regular w8: 0.993x mct sharp w8: 0.990x
mct regular w16: 0.996x mct sharp w16: 0.995x
Cortex-X3:
mct regular w4: 0.953x mct sharp w4: 0.981x
mct regular w8: 0.993x mct sharp w8: 0.993x
mct regular w16: 0.997x mct sharp w16: 0.995x