arm64: looprestoration: Rewrite the SGR functions
Make them operate in a more cache-friendly manner, interleaving the various passes and merging some of the functions that operate on data in similar patterns.
This reduces the amount of stack used from 207 KB to 14 KB for sgr_3x3, from 207 KB to 16 KB for sgr_5x5, and from 255 KB to 33 KB for sgr_mix.
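As a rough illustration of why interleaving passes shrinks the intermediate storage, here is a minimal sketch in C. It is not the code from this commit: it uses a plain 3x3 box sum rather than the real SGR filter, and the names (hsum3_row, vsum3_row, box3_two_pass, box3_interleaved, MAX_W, MAX_H) are hypothetical. The non-interleaved version keeps a full block of first-pass output, while the interleaved one keeps only three rows in a ring buffer and consumes each row shortly after producing it.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_W = 64, MAX_H = 64 }; /* hypothetical limits for this sketch */

/* Pass 1: horizontal 3-tap sum of one row, with edge pixels replicated. */
static void hsum3_row(int32_t *dst, const uint8_t *src, int w) {
    for (int x = 0; x < w; x++) {
        const int l = x > 0 ? x - 1 : 0;
        const int r = x < w - 1 ? x + 1 : w - 1;
        dst[x] = src[l] + src[x] + src[r];
    }
}

/* Pass 2: vertical sum of three rows of pass-1 output. */
static void vsum3_row(int32_t *dst, const int32_t *a, const int32_t *b,
                      const int32_t *c, int w) {
    for (int x = 0; x < w; x++)
        dst[x] = a[x] + b[x] + c[x];
}

/* Non-interleaved: pass 1 over the whole block into a full-size
 * intermediate buffer, then pass 2 over that buffer. */
static void box3_two_pass(int32_t *dst, const uint8_t *src, int w, int h) {
    int32_t tmp[MAX_H][MAX_W]; /* h rows of intermediate data */
    assert(w >= 1 && w <= MAX_W && h >= 1 && h <= MAX_H);
    for (int y = 0; y < h; y++)
        hsum3_row(tmp[y], src + y * w, w);
    for (int y = 0; y < h; y++) {
        const int up = y > 0 ? y - 1 : 0;
        const int dn = y < h - 1 ? y + 1 : h - 1;
        vsum3_row(dst + y * w, tmp[up], tmp[y], tmp[dn], w);
    }
}

/* Interleaved: only three rows of pass-1 output are live at any time.
 * Row r is stored in ring[r % 3]; computing row y + 1 overwrites the
 * slot of row y - 2, which is no longer needed. */
static void box3_interleaved(int32_t *dst, const uint8_t *src, int w, int h) {
    int32_t ring[3][MAX_W]; /* 3 rows of intermediate data */
    assert(w >= 1 && w <= MAX_W && h >= 1);
    hsum3_row(ring[0], src, w);
    for (int y = 0; y < h; y++) {
        if (y + 1 < h)
            hsum3_row(ring[(y + 1) % 3], src + (y + 1) * w, w);
        const int up = y > 0 ? y - 1 : 0;
        const int dn = y < h - 1 ? y + 1 : h - 1;
        vsum3_row(dst + y * w, ring[up % 3], ring[y % 3], ring[dn % 3], w);
    }
}
```

Both functions produce the same output; the interleaved one needs intermediate storage proportional to the row width rather than to the block area, and each intermediate row is read while it is likely still in cache.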
This does, however, increase the size of the binary by about 12 KB. (The executable code generated from assembly actually shrinks slightly, but the higher-level logic in C is quite nontrivial.)
This is somewhat similar to what was done for x86 in fe2bb774.
Benchmarks from checkasm:
Before:              Cortex A53       A55       A72       A73       A76  Apple M1
sgr_3x3_8bpc_neon:     493005.0  483133.2  365056.3  345197.9  202819.1     537.3
sgr_5x5_8bpc_neon:     353152.6  349614.3  268962.2  248431.8  142302.4     385.9
sgr_mix_8bpc_neon:     829903.9  815910.9  622858.5  577238.0  333362.9     881.7
sgr_3x3_10bpc_neon:    504778.6  499851.6  379203.1  346695.2  199738.7     537.0
sgr_5x5_10bpc_neon:    363111.9  362489.7  267903.1  247506.5  138417.2     351.3
sgr_mix_10bpc_neon:    853053.7  846768.8  628349.6  584553.8  328399.5     843.6

After:
sgr_3x3_8bpc_neon:     387949.9  384216.4  294423.7  301968.2  184643.1     492.4
sgr_5x5_8bpc_neon:     259854.7  257233.2  193983.7  198388.4  128497.0     341.2
sgr_mix_8bpc_neon:     606401.5  595661.3  457209.7  462721.8  281906.7     738.6
sgr_3x3_10bpc_neon:    392472.7  394100.5  296048.1  304339.4  184271.4     471.3
sgr_5x5_10bpc_neon:    257248.3  257651.1  197552.5  199655.1  130739.7     322.9
sgr_mix_10bpc_neon:    605263.3  611197.4  441789.3  461339.2  286320.1     721.4

Speedup vs before:
                         27-41%    25-40%    23-42%    13-26%     5-18%     8-19%