AArch64: Simplify DotProd path of subpel filters
The purpose of this merge request is to transform the DotProd code path to be similar to our upcoming i8mm version.
The modifications include comment fixes, instruction reorderings, TBL rewrites, load/store refactoring and macro simplifications. These lead to a simpler filter_8tap_fn macro with fewer branches, which reduces the line count and makes the code easier to understand. We also tried to avoid introducing any performance regressions.
Some i8mm tunings were back-ported to this DotProd path as well, most notably:
- horizontal filters that used 2-register TBL instructions are simplified to use only 1-register TBLs, which improves performance on small cores like Cortex-A510 and newer.
- the accumulators of vertical filters are initialised in a way that makes it possible for CPUs to use zero-latency move instructions.
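
To illustrate the first point, here is a minimal scalar sketch in C of why a single 16-byte source register is enough for eight outputs of an 8-tap horizontal filter, so that only 1-register table lookups are needed. It is not taken from the patch. The helper names (`tbl1`, `dot4`, `h8tap_tbl1`, `h8tap_ref`, `h8tap_selfcheck`), the index tables and the filter taps are all hypothetical and chosen for illustration. The model multiplies unsigned pixels by signed coefficients directly; the DotProd SDOT instruction takes signed inputs on both sides, so real DotProd code has to account for that, which this sketch ignores.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of a 1-register TBL: out[i] = table[idx[i]], or 0 when the
 * index falls outside the 16-byte table. */
static void tbl1(uint8_t out[16], const uint8_t table[16],
                 const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = idx[i] < 16 ? table[idx[i]] : 0;
}

/* Scalar model of a dot product with a shared coefficient group: each of the
 * four 32-bit lanes accumulates four byte products against coef[0..3]. */
static void dot4(int32_t acc[4], const uint8_t px[16], const int8_t coef[4])
{
    for (int lane = 0; lane < 4; lane++)
        for (int k = 0; k < 4; k++)
            acc[lane] += px[4 * lane + k] * coef[k];
}

/* Eight outputs of an 8-tap filter from ONE 16-byte load.
 * Outputs 0..7 read src[0..14], so every index stays below 16. */
void h8tap_tbl1(int32_t out[8], const uint8_t src[16], const int8_t f[8])
{
    /* Hypothetical permute tables: lane i of table t holds src[4t+i .. 4t+i+3]. */
    static const uint8_t idx[3][16] = {
        { 0, 1,  2,  3, 1,  2,  3,  4,  2,  3,  4,  5,  3,  4,  5,  6 },
        { 4, 5,  6,  7, 5,  6,  7,  8,  6,  7,  8,  9,  7,  8,  9, 10 },
        { 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 },
    };
    uint8_t p0[16], p1[16], p2[16];

    tbl1(p0, src, idx[0]);
    tbl1(p1, src, idx[1]);
    tbl1(p2, src, idx[2]);

    memset(out, 0, 8 * sizeof(*out));
    dot4(out,     p0, f);      /* outputs 0..3, taps 0..3 */
    dot4(out,     p1, f + 4);  /* outputs 0..3, taps 4..7 */
    dot4(out + 4, p1, f);      /* outputs 4..7, taps 0..3 */
    dot4(out + 4, p2, f + 4);  /* outputs 4..7, taps 4..7 */
}

/* Straightforward reference: out[x] = sum over k of src[x + k] * f[k]. */
void h8tap_ref(int32_t out[8], const uint8_t src[16], const int8_t f[8])
{
    for (int x = 0; x < 8; x++) {
        int32_t sum = 0;
        for (int k = 0; k < 8; k++)
            sum += src[x + k] * f[k];
        out[x] = sum;
    }
}

/* Returns 1 when the permute-based version matches the reference. */
int h8tap_selfcheck(void)
{
    /* Hypothetical taps, not one of the codec's filter tables. */
    static const int8_t f[8] = { -1, 3, -7, 69, 69, -7, 3, -1 };
    uint8_t src[16];
    int32_t a[8], b[8];

    for (int i = 0; i < 16; i++)
        src[i] = (uint8_t)(i * 37 + 11);

    h8tap_tbl1(a, src, f);
    h8tap_ref(b, src, f);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

Calling `h8tap_selfcheck()` compares the permute-based version against the plain loop. The middle permute is shared between the low and high output halves, so three single-register lookups cover all eight outputs in this model.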
Details can be seen in the commit messages.
Edited by Arpad Panyik