AArch64: Optimize put_neon function
Optimize the put_neon function; details are in the commit messages.
Relative runtime of the micro benchmarks with all commits applied (lower is better):
Cortex-A55
w2: 0.991x
w4: 0.992x
w8: 0.999x
w16: 0.875x
w32: 0.775x
w64: 0.914x
w128: 0.998x
Cortex-A510
w2: 0.159x
w4: 0.080x
w8: 0.583x
w16: 0.588x
w32: 0.966x
w64: 1.111x
w128: 0.957x
Cortex-A76
w2: 0.903x
w4: 0.683x
w8: 0.944x
w16: 0.948x
w32: 0.919x
w64: 0.855x
w128: 0.991x
Cortex-A78
w32: 0.867x
w64: 0.820x
w128: 1.011x
Cortex-A715
w32: 0.834x
w64: 0.778x
w128: 1.000x
Cortex-X1
w32: 0.809x
w64: 0.762x
w128: 1.000x
Cortex-X3
w32: 0.733x
w64: 0.720x
w128: 0.999x