arm64: looprestoration: Apply simplifications to align with C code

This applies the simplifications that were made to the C code and
the x86 assembly in 4613d3a5 to the arm64 implementation as well.

This gives a minor speedup, typically around 1-3% in the benchmarks
below.

Before:             Cortex A53         A55         A72         A73         A76    Apple M3
sgr_3x3_8bpc_neon:    368583.2    363654.2    279958.1    272065.1    169353.3       354.6
sgr_5x5_8bpc_neon:    258570.7    255018.5    200410.6    199478.3    117968.3       260.9
sgr_mix_8bpc_neon:    603698.1    577383.3    482468.3    436540.4    256632.9       541.8
After:
sgr_3x3_8bpc_neon:    367873.2    357884.1    275462.4    268363.9    165909.8       346.0
sgr_5x5_8bpc_neon:    254988.4    248184.2    190875.1    196939.1    120517.2       252.1
sgr_mix_8bpc_neon:    589204.7    563565.8    414025.6    427702.2    251651.2       533.4