AArch64: Optimize prep_neon function (!1660) · Merge requests · VideoLAN / dav1d

Optimize the prep_neon function. Details are in the commit messages.

Relative performance in the micro benchmarks with all commits applied (lower is better):

Cortex-A55


mct_w4:   0.795x
mct_w8:   0.913x
mct_w16:  0.912x
mct_w32:  0.838x
mct_w64:  1.025x
mct_w128: 1.002x

Cortex-A510


mct_w4:   0.760x
mct_w8:   0.636x
mct_w16:  0.640x
mct_w32:  0.854x
mct_w64:  0.864x
mct_w128: 0.995x

Cortex-A72


mct_w4:   0.616x
mct_w8:   0.854x
mct_w16:  0.756x
mct_w32:  1.052x
mct_w64:  1.044x
mct_w128: 0.702x

Cortex-A76


mct_w4:   0.837x
mct_w8:   0.797x
mct_w16:  0.841x
mct_w32:  0.804x
mct_w64:  0.948x
mct_w128: 0.904x

Cortex-A78


mct_w16:  0.542x
mct_w32:  0.725x
mct_w64:  0.741x
mct_w128: 0.745x

Cortex-A715


mct_w16:  0.561x
mct_w32:  0.720x
mct_w64:  0.740x
mct_w128: 0.748x

Cortex-X1


mct_w32:  0.886x
mct_w64:  0.882x
mct_w128: 0.917x

Cortex-X3


mct_w32:  0.835x
mct_w64:  0.803x
mct_w128: 0.808x
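
For reading the ratios above, here is a minimal sketch that converts a relative figure into a percentage of time saved and an equivalent speedup factor. It assumes each value is the optimized runtime divided by the baseline runtime, which the "lower is better" note suggests but the description does not state. The helper names are hypothetical and not part of dav1d or its benchmark tooling.

```python
# Assumption: ratio = optimized_time / baseline_time (lower is better).
# These helpers are illustrative only; they are not part of dav1d.

def time_saved_percent(ratio: float) -> float:
    """Percent reduction in runtime; negative means a slowdown."""
    return (1.0 - ratio) * 100.0


def speedup_factor(ratio: float) -> float:
    """Equivalent speedup, e.g. a ratio of 0.5 corresponds to 2x."""
    return 1.0 / ratio


# Cortex-A55 figures copied from the listing above.
cortex_a55 = {
    "mct_w4": 0.795,
    "mct_w8": 0.913,
    "mct_w16": 0.912,
    "mct_w32": 0.838,
    "mct_w64": 1.025,
    "mct_w128": 1.002,
}

for name, ratio in cortex_a55.items():
    print(f"{name}: {time_saved_percent(ratio):+.1f}% time saved, "
          f"{speedup_factor(ratio):.3f}x speedup")
```

Under this reading, values above 1.0 (such as mct_w64 on Cortex-A55) show up as a negative percentage, meaning a small regression for that block width.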
