AArch64: Optimize lane load/store in MC functions
Partial register writes make an instruction depend on the previous contents of its destination register. This can create long dependency chains and reduce performance on out-of-order CPUs. This patch removes most of these cases in the MC functions by writing the full register before any subsequent lane loads.
Most lane-extracting stores are also replaced with FP scalar stores where lane 0 is the one being stored.
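As an illustration only (not the code from this patch), the two rewrites can be modelled in plain C. The `vreg` type and all helper names below are hypothetical. Each helper mimics the register semantics of the AArch64 instruction named in its comment, assuming 32-bit elements:

```c
#include <stdint.h>
#include <string.h>

/* Software model of one 128-bit NEON register as four 32-bit lanes.
 * Hypothetical type, used only to illustrate the semantics. */
typedef struct { uint32_t s[4]; } vreg;

/* Models `ld1 {vN.s}[lane], [ptr]`: writes one lane and keeps the others,
 * so the result depends on the previous value of the register. */
static vreg ld1_lane(vreg old, int lane, const void *ptr)
{
    vreg v = old;                               /* merge with old contents */
    memcpy(&v.s[lane], ptr, sizeof(uint32_t));
    return v;
}

/* Models `ldr sN, [ptr]`: loads lane 0 and zeroes the upper 96 bits,
 * so the result does not depend on the previous value of the register. */
static vreg ldr_s(const void *ptr)
{
    vreg v = { { 0, 0, 0, 0 } };
    memcpy(&v.s[0], ptr, sizeof(uint32_t));
    return v;
}

/* Models `st1 {vN.s}[0], [ptr]` (lane-extracting store of lane 0). */
static void st1_lane0(vreg v, void *ptr)
{
    memcpy(ptr, &v.s[0], sizeof(uint32_t));
}

/* Models `str sN, [ptr]` (FP scalar store): writes the same four bytes. */
static void str_s(vreg v, void *ptr)
{
    memcpy(ptr, &v.s[0], sizeof(uint32_t));
}

/* Before: both loads merge into whatever the register held earlier. */
static vreg load_two_rows_before(vreg stale, const uint8_t *row0,
                                 const uint8_t *row1)
{
    vreg v = ld1_lane(stale, 0, row0);
    return ld1_lane(v, 1, row1);
}

/* After: the first load writes the full register, the second fills lane 1. */
static vreg load_two_rows_after(const uint8_t *row0, const uint8_t *row1)
{
    vreg v = ldr_s(row0);
    return ld1_lane(v, 1, row1);
}
```

In the "before" sequence the result still carries the stale upper lanes, which is why a core generally has to wait for the previous writer of the register. In the "after" sequence the first load overwrites all 128 bits, and lanes 0 and 1 end up identical in both versions.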
Relative runtime of the micro benchmarks after this patch on some Neoverse and Cortex CPU cores (lower is better, 1.000x means unchanged):

8bpc neon            V2      V1      X3      X1      A715    A78     A76
avg w8:              0.942x  1.030x  0.936x  0.935x  1.000x  0.877x  0.976x
w_avg w8:            0.908x  0.913x  0.919x  0.914x  0.999x  0.905x  0.910x
mask w8:             0.937x  0.905x  0.929x  0.907x  1.009x  0.921x  0.868x
w_mask 420 w4:       0.969x  0.968x  0.951x  0.962x  0.995x  0.976x  0.958x
w_mask 420 w8:       0.979x  0.935x  0.936x  0.935x  0.996x  0.948x  0.959x
blend w4:            0.721x  0.841x  0.764x  0.822x  0.772x  0.826x  0.883x
blend w8:            0.692x  0.733x  0.686x  0.730x  0.828x  0.723x  0.762x
blend h w2:          0.738x  0.776x  0.746x  0.775x  0.683x  0.827x  0.851x
blend h w4:          0.858x  0.942x  0.880x  0.933x  0.784x  0.924x  0.965x
blend h w8:          0.804x  0.807x  0.806x  0.805x  0.814x  0.810x  0.748x
blend v w2:          0.898x  0.931x  0.903x  0.949x  0.784x  0.867x  0.875x
blend v w4:          0.935x  0.905x  0.933x  0.922x  0.763x  0.777x  0.807x
blend v w8:          0.803x  0.802x  0.804x  0.815x  0.674x  0.677x  0.678x

16bpc neon           V2      V1      X3      X1      A715    A78     A76
avg w4:              0.899x  0.967x  0.897x  0.948x  1.002x  0.901x  0.884x
w_avg w4:            0.952x  0.951x  0.936x  0.946x  0.997x  0.937x  0.925x
mask w4:             0.893x  0.958x  0.887x  0.948x  1.003x  0.938x  0.934x
w_mask 420 w4:       0.933x  0.932x  0.932x  0.939x  1.000x  0.910x  0.955x
w_mask 420 w8:       0.966x  0.962x  0.967x  0.961x  1.000x  0.990x  1.010x
blend w4:            0.367x  0.361x  0.370x  0.352x  0.418x  0.394x  0.476x
blend h w2:          0.365x  0.445x  0.369x  0.437x  0.416x  0.576x  0.699x
blend h w4:          0.343x  0.402x  0.342x  0.398x  0.418x  0.525x  0.603x
blend v w2:          0.464x  0.460x  0.460x  0.447x  0.494x  0.446x  0.503x
blend v w4:          0.432x  0.424x  0.437x  0.416x  0.433x  0.427x  0.534x
blend v w8:          0.936x  0.847x  0.949x  0.848x  1.007x  0.811x  0.785x

bilinear 8bpc neon   V2      V1      X3      X1      A715    A78     A76
mct w4 0:            0.982x  0.983x  0.955x  1.029x  0.784x  0.817x  0.814x
mc w2 h:             0.277x  0.333x  0.275x  0.325x  0.299x  0.435x  0.518x
mct w4 h:            0.835x  0.862x  0.814x  0.887x  1.074x  0.899x  0.884x
mc w2 v:             0.887x  0.966x  0.894x  0.945x  0.808x  0.953x  0.997x
mc w4 v:             0.762x  0.899x  0.766x  0.867x  0.695x  0.915x  1.017x
mct w4 v:            0.700x  0.812x  0.740x  0.777x  0.777x  0.824x  0.853x
mc w2 hv:            0.928x  0.985x  0.929x  0.978x  0.789x  0.969x  1.010x
mct w4 hv:           0.887x  0.913x  0.912x  0.920x  1.001x  0.922x  0.937x

bilinear 16bpc neon  V2      V1      X3      X1      A715    A78     A76
mc w2 0:             0.991x  1.032x  0.993x  0.970x  0.878x  0.925x  0.999x
mct w4 0:            0.811x  0.730x  0.797x  0.680x  0.808x  0.711x  0.805x
mc w4 h:             0.885x  0.901x  0.895x  0.905x  1.003x  0.909x  0.910x
mct w4 h:            0.902x  0.914x  0.898x  0.896x  1.000x  0.897x  0.934x
mc w2 v:             0.888x  0.966x  0.913x  0.955x  0.824x  0.958x  1.005x
mc w4 v:             0.897x  0.894x  0.903x  0.902x  1.001x  0.895x  0.895x
mct w4 v:            0.924x  0.908x  0.921x  0.901x  1.001x  0.904x  0.918x
mc w4 hv:            0.927x  0.925x  0.924x  0.933x  1.000x  0.936x  0.959x
mct w4 hv:           0.923x  0.944x  0.923x  0.944x  0.999x  0.931x  0.956x

8tap 8bpc neon       V2      V1      X3      X1      A715    A78     A76
mct regular w4 0:    0.829x  0.854x  0.735x  0.861x  0.769x  0.766x  0.840x
mc regular w2 h:     0.984x  1.008x  0.983x  1.012x  0.986x  0.989x  0.995x
mc sharp w2 h:       0.987x  1.008x  0.986x  1.011x  0.985x  0.989x  0.995x
mc regular w4 h:     0.907x  0.911x  0.916x  0.908x  0.997x  0.936x  0.932x
mc sharp w4 h:       0.916x  0.914x  0.918x  0.913x  0.999x  0.939x  0.905x
mct regular w4 h:    0.992x  0.979x  0.993x  0.971x  1.000x  0.986x  0.976x
mct sharp w4 h:      0.991x  0.979x  0.989x  0.984x  1.001x  0.979x  0.983x
mc regular w2 v:     1.002x  1.001x  1.005x  1.000x  1.000x  0.998x  0.983x
mc sharp w2 v:       1.005x  1.001x  1.009x  0.998x  0.994x  0.997x  0.989x
mc regular w4 v:     0.985x  0.998x  0.991x  0.998x  1.000x  1.000x  0.983x
mc sharp w4 v:       1.005x  1.002x  1.006x  1.002x  0.998x  0.991x  0.999x
mct regular w4 v:    0.966x  0.967x  0.961x  0.974x  0.996x  0.954x  0.982x
mct sharp w4 v:      0.970x  0.944x  0.967x  0.944x  0.997x  0.951x  0.966x
mc regular w2 hv:    0.993x  0.993x  0.994x  0.987x  0.993x  0.985x  0.999x
mc sharp w2 hv:      0.994x  0.996x  0.992x  0.998x  0.997x  0.999x  0.999x
mc regular w4 hv:    0.964x  0.958x  0.964x  0.960x  0.982x  0.938x  0.958x
mc sharp w4 hv:      0.982x  0.981x  0.980x  0.982x  0.995x  0.986x  0.941x
mct regular w4 hv:   0.993x  0.994x  0.992x  0.994x  0.996x  0.992x  0.988x
mct sharp w4 hv:     0.993x  0.996x  0.991x  0.996x  0.954x  0.992x  1.011x

8tap 16bpc neon      V2      V1      X3      X1      A715    A78     A76
mc regular w2 0:     0.869x  1.059x  0.874x  0.956x  0.883x  0.932x  1.000x
mct regular w4 0:    0.348x  0.369x  0.354x  0.377x  0.560x  0.409x  0.648x
mc regular w2 h:     0.996x  0.988x  0.992x  0.985x  0.989x  0.991x  1.006x
mc sharp w2 h:       0.996x  0.989x  0.979x  0.991x  0.987x  0.988x  0.997x
mc regular w4 h:     0.957x  0.937x  0.957x  0.948x  0.961x  0.927x  0.994x
mc sharp w4 h:       0.966x  0.940x  0.962x  0.954x  0.985x  0.929x  0.970x
mct regular w4 h:    0.922x  0.942x  0.932x  0.933x  1.007x  0.938x  0.905x
mct sharp w4 h:      0.919x  0.943x  0.919x  0.931x  0.971x  0.943x  0.929x
mc regular w2 v:     1.000x  0.997x  1.001x  1.003x  1.001x  0.999x  0.984x
mc sharp w2 v:       1.000x  0.999x  1.000x  0.999x  1.000x  1.000x  0.993x
mc regular w4 v:     0.936x  0.941x  0.936x  0.939x  0.999x  0.928x  0.981x
mc sharp w4 v:       0.955x  0.961x  0.949x  0.956x  0.999x  0.947x  0.953x
mct regular w4 v:    0.977x  0.966x  0.979x  0.968x  0.990x  0.972x  0.972x
mct sharp w4 v:      0.973x  0.965x  0.981x  0.963x  0.994x  0.977x  0.974x
mc regular w2 hv:    0.995x  1.001x  0.995x  0.995x  0.995x  1.000x  0.981x
mc sharp w2 hv:      0.993x  1.012x  0.993x  0.988x  0.996x  0.992x  1.008x
mc regular w4 hv:    0.938x  0.943x  0.939x  0.943x  0.986x  0.943x  0.997x
mc sharp w4 hv:      0.969x  0.959x  0.970x  0.974x  0.986x  0.993x  0.997x
mct regular w4 hv:   0.942x  0.970x  0.951x  0.960x  0.977x  0.958x  1.018x
mct sharp w4 hv:     0.923x  0.958x  0.934x  0.955x  0.973x  0.946x  0.986x