arm64: looprestoration: Rewrite the wiener functions
Make them operate in a more cache-friendly manner, interleaving horizontal and vertical filtering (reducing the amount of stack used from 51 KB to 4 KB), similar to what was done for x86 in 78d27b7d.
This also adds separate 5-tap versions of the filters and unrolls the vertical filter a bit more (which could probably have been done without the rewrite).
This does, however, increase the compiled code size by around 3.5 KB.
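For illustration only, here is a minimal C sketch of the interleaving idea: instead of filtering every row horizontally into one large intermediate buffer and then running the vertical pass, only the last 7 horizontally filtered rows are kept in a small ring buffer. This is not the NEON code from this commit. The names `filter_two_pass`, `filter_interleaved`, `filter_h_row`, `filter_v_row` and the `MAX_W` limit are hypothetical, and the rounding, clipping and edge handling of the real wiener filter are left out (edges are simply replicated).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TAPS  7
#define MAX_W 64 /* hypothetical width limit for this sketch */

static int clampi(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Horizontal pass over a single row, replicating edge pixels. */
static void filter_h_row(int32_t *mid, const uint8_t *row, int w,
                         const int16_t fh[TAPS]) {
    for (int x = 0; x < w; x++) {
        int32_t sum = 0;
        for (int k = 0; k < TAPS; k++)
            sum += fh[k] * row[clampi(x + k - TAPS / 2, 0, w - 1)];
        mid[x] = sum;
    }
}

/* Vertical pass for one output row; rows[k] is the k-th intermediate row. */
static void filter_v_row(int32_t *dst, const int32_t *const *rows, int w,
                         const int16_t fv[TAPS]) {
    for (int x = 0; x < w; x++) {
        int32_t sum = 0;
        for (int k = 0; k < TAPS; k++)
            sum += fv[k] * rows[k][x];
        dst[x] = sum;
    }
}

/* Two-pass reference: the intermediate buffer holds all h rows. */
void filter_two_pass(int32_t *dst, const uint8_t *src, int w, int h,
                     const int16_t fh[TAPS], const int16_t fv[TAPS]) {
    int32_t *mid = malloc(sizeof(*mid) * (size_t)w * (size_t)h);
    assert(mid != NULL);
    for (int y = 0; y < h; y++)
        filter_h_row(mid + y * w, src + y * w, w, fh);
    for (int y = 0; y < h; y++) {
        const int32_t *rows[TAPS];
        for (int k = 0; k < TAPS; k++)
            rows[k] = mid + clampi(y + k - TAPS / 2, 0, h - 1) * w;
        filter_v_row(dst + y * w, rows, w, fv);
    }
    free(mid);
}

/* Interleaved: only TAPS intermediate rows live in a ring buffer. */
void filter_interleaved(int32_t *dst, const uint8_t *src, int w, int h,
                        const int16_t fh[TAPS], const int16_t fv[TAPS]) {
    int32_t ring[TAPS][MAX_W];
    int next = 0; /* next source row still to be filtered horizontally */
    assert(w <= MAX_W);
    for (int y = 0; y < h; y++) {
        /* Filter new rows horizontally until row y + 3 (clamped) is ready. */
        int last = clampi(y + TAPS / 2, 0, h - 1);
        for (; next <= last; next++)
            filter_h_row(ring[next % TAPS], src + next * w, w, fh);
        /* Source row r lives in slot r % TAPS; pick the 7 rows around y. */
        const int32_t *rows[TAPS];
        for (int k = 0; k < TAPS; k++)
            rows[k] = ring[clampi(y + k - TAPS / 2, 0, h - 1) % TAPS];
        filter_v_row(dst + y * w, rows, w, fv);
    }
}
```

Both functions produce the same output in this sketch. The interleaved one needs intermediate storage for only 7 rows regardless of the block height, which is the kind of saving behind the stack reduction described above, and each intermediate row is consumed shortly after it is produced.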
Before:                 Cortex A53        A72        A73
wiener_5tap_8bpc_neon:    136855.6    91446.2    87363.6
wiener_7tap_8bpc_neon:    136861.6    91454.9    87374.5
wiener_5tap_10bpc_neon:   167685.3   114720.3   116522.1
wiener_5tap_12bpc_neon:   167677.5   114724.7   116511.9
wiener_7tap_10bpc_neon:   167681.6   114738.5   116567.0
wiener_7tap_12bpc_neon:   167673.8   114720.8   116515.4
After:
wiener_5tap_8bpc_neon:     87102.1    60460.6    66803.8
wiener_7tap_8bpc_neon:    110831.7    78489.0    82015.9
wiener_5tap_10bpc_neon:   109999.2    90259.0    89238.0
wiener_5tap_12bpc_neon:   109978.3    90255.7    89220.7
wiener_7tap_10bpc_neon:   137877.6   107578.5   103435.6
wiener_7tap_12bpc_neon:   137868.8   107568.9   103390.4
Edited by Martin Storsjö