AArch64: Optimize prep_neon function
Optimize the prep_neon function. Details are in the commit messages.
Relative performance of micro-benchmarks with all commits applied (lower is better):
Cortex-A55
mct_w4: 0.795x
mct_w8: 0.913x
mct_w16: 0.912x
mct_w32: 0.838x
mct_w64: 1.025x
mct_w128: 1.002x
Cortex-A510
mct_w4: 0.760x
mct_w8: 0.636x
mct_w16: 0.640x
mct_w32: 0.854x
mct_w64: 0.864x
mct_w128: 0.995x
Cortex-A72
mct_w4: 0.616x
mct_w8: 0.854x
mct_w16: 0.756x
mct_w32: 1.052x
mct_w64: 1.044x
mct_w128: 0.702x
Cortex-A76
mct_w4: 0.837x
mct_w8: 0.797x
mct_w16: 0.841x
mct_w32: 0.804x
mct_w64: 0.948x
mct_w128: 0.904x
Cortex-A78
mct_w16: 0.542x
mct_w32: 0.725x
mct_w64: 0.741x
mct_w128: 0.745x
Cortex-A715
mct_w16: 0.561x
mct_w32: 0.720x
mct_w64: 0.740x
mct_w128: 0.748x
Cortex-X1
mct_w32: 0.886x
mct_w64: 0.882x
mct_w128: 0.917x
Cortex-X3
mct_w32: 0.835x
mct_w64: 0.803x
mct_w128: 0.808x
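
For readers interpreting the figures above, here is a minimal sketch of how such a ratio can be read. It assumes each value is the optimized runtime divided by the baseline runtime, which the description implies ("lower is better") but does not state outright. The helper names and the cycle counts are hypothetical and are not taken from the benchmark tool.

```python
def relative_time(new_cycles: float, old_cycles: float) -> float:
    """Optimized runtime divided by baseline runtime (assumed definition).

    Values below 1.0 mean the optimized code is faster.
    """
    if old_cycles <= 0:
        raise ValueError("baseline cycle count must be positive")
    return new_cycles / old_cycles


def percent_saved(ratio: float) -> float:
    """Convert a relative-time ratio into the percentage of time saved."""
    return (1.0 - ratio) * 100.0


# Hypothetical cycle counts, chosen only to illustrate the arithmetic.
ratio = relative_time(new_cycles=79.5, old_cycles=100.0)
print(f"{ratio:.3f}x -> {percent_saved(ratio):.1f}% less time")
```

Under this reading, a ratio slightly above 1.0 (for example 1.025x) would indicate a small slowdown for that block width on that core.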