AArch64: Optimize lane load/store in MC functions

Arpad Panyik requested to merge arpadpanyik-arm/dav1d:mc_lane_neon into master

Partial register writes can create long dependency chains, which can reduce performance on out-of-order CPUs. This patch removes most such cases in the MC functions by filling the full register before any further lane-load instructions write to it.

Most lane-extracting stores can also be optimized: when lane 0 is the one being extracted, an FP scalar store can be used instead.

Relative runtime of microbenchmarks after this patch on some Neoverse and Cortex CPU cores (values below 1.000x are faster):

8bpc neon                V2      V1      X3      X1    A715     A78     A76
 avg        w8:       0.942x  1.030x  0.936x  0.935x  1.000x  0.877x  0.976x
 w_avg      w8:       0.908x  0.913x  0.919x  0.914x  0.999x  0.905x  0.910x
 mask       w8:       0.937x  0.905x  0.929x  0.907x  1.009x  0.921x  0.868x
 w_mask 420 w4:       0.969x  0.968x  0.951x  0.962x  0.995x  0.976x  0.958x
 w_mask 420 w8:       0.979x  0.935x  0.936x  0.935x  0.996x  0.948x  0.959x
 blend      w4:       0.721x  0.841x  0.764x  0.822x  0.772x  0.826x  0.883x
 blend      w8:       0.692x  0.733x  0.686x  0.730x  0.828x  0.723x  0.762x
 blend    h w2:       0.738x  0.776x  0.746x  0.775x  0.683x  0.827x  0.851x
 blend    h w4:       0.858x  0.942x  0.880x  0.933x  0.784x  0.924x  0.965x
 blend    h w8:       0.804x  0.807x  0.806x  0.805x  0.814x  0.810x  0.748x
 blend    v w2:       0.898x  0.931x  0.903x  0.949x  0.784x  0.867x  0.875x
 blend    v w4:       0.935x  0.905x  0.933x  0.922x  0.763x  0.777x  0.807x
 blend    v w8:       0.803x  0.802x  0.804x  0.815x  0.674x  0.677x  0.678x
16bpc neon               V2      V1      X3      X1    A715     A78     A76
 avg        w4:       0.899x  0.967x  0.897x  0.948x  1.002x  0.901x  0.884x
 w_avg      w4:       0.952x  0.951x  0.936x  0.946x  0.997x  0.937x  0.925x
 mask       w4:       0.893x  0.958x  0.887x  0.948x  1.003x  0.938x  0.934x
 w_mask 420 w4:       0.933x  0.932x  0.932x  0.939x  1.000x  0.910x  0.955x
 w_mask 420 w8:       0.966x  0.962x  0.967x  0.961x  1.000x  0.990x  1.010x
 blend      w4:       0.367x  0.361x  0.370x  0.352x  0.418x  0.394x  0.476x
 blend    h w2:       0.365x  0.445x  0.369x  0.437x  0.416x  0.576x  0.699x
 blend    h w4:       0.343x  0.402x  0.342x  0.398x  0.418x  0.525x  0.603x
 blend    v w2:       0.464x  0.460x  0.460x  0.447x  0.494x  0.446x  0.503x
 blend    v w4:       0.432x  0.424x  0.437x  0.416x  0.433x  0.427x  0.534x
 blend    v w8:       0.936x  0.847x  0.949x  0.848x  1.007x  0.811x  0.785x
bilinear 8bpc neon       V2      V1      X3      X1    A715     A78     A76
 mct     w4  0:       0.982x  0.983x  0.955x  1.029x  0.784x  0.817x  0.814x
 mc      w2  h:       0.277x  0.333x  0.275x  0.325x  0.299x  0.435x  0.518x
 mct     w4  h:       0.835x  0.862x  0.814x  0.887x  1.074x  0.899x  0.884x
 mc      w2  v:       0.887x  0.966x  0.894x  0.945x  0.808x  0.953x  0.997x
 mc      w4  v:       0.762x  0.899x  0.766x  0.867x  0.695x  0.915x  1.017x
 mct     w4  v:       0.700x  0.812x  0.740x  0.777x  0.777x  0.824x  0.853x
 mc      w2 hv:       0.928x  0.985x  0.929x  0.978x  0.789x  0.969x  1.010x
 mct     w4 hv:       0.887x  0.913x  0.912x  0.920x  1.001x  0.922x  0.937x
bilinear 16bpc neon      V2      V1      X3      X1    A715     A78     A76
 mc      w2  0:       0.991x  1.032x  0.993x  0.970x  0.878x  0.925x  0.999x
 mct     w4  0:       0.811x  0.730x  0.797x  0.680x  0.808x  0.711x  0.805x
 mc      w4  h:       0.885x  0.901x  0.895x  0.905x  1.003x  0.909x  0.910x
 mct     w4  h:       0.902x  0.914x  0.898x  0.896x  1.000x  0.897x  0.934x
 mc      w2  v:       0.888x  0.966x  0.913x  0.955x  0.824x  0.958x  1.005x
 mc      w4  v:       0.897x  0.894x  0.903x  0.902x  1.001x  0.895x  0.895x
 mct     w4  v:       0.924x  0.908x  0.921x  0.901x  1.001x  0.904x  0.918x
 mc      w4 hv:       0.927x  0.925x  0.924x  0.933x  1.000x  0.936x  0.959x
 mct     w4 hv:       0.923x  0.944x  0.923x  0.944x  0.999x  0.931x  0.956x
8tap 8bpc neon           V2      V1      X3      X1    A715     A78     A76
 mct regular w4  0:   0.829x  0.854x  0.735x  0.861x  0.769x  0.766x  0.840x
 mc  regular w2  h:   0.984x  1.008x  0.983x  1.012x  0.986x  0.989x  0.995x
 mc  sharp   w2  h:   0.987x  1.008x  0.986x  1.011x  0.985x  0.989x  0.995x
 mc  regular w4  h:   0.907x  0.911x  0.916x  0.908x  0.997x  0.936x  0.932x
 mc  sharp   w4  h:   0.916x  0.914x  0.918x  0.913x  0.999x  0.939x  0.905x
 mct regular w4  h:   0.992x  0.979x  0.993x  0.971x  1.000x  0.986x  0.976x
 mct sharp   w4  h:   0.991x  0.979x  0.989x  0.984x  1.001x  0.979x  0.983x
 mc  regular w2  v:   1.002x  1.001x  1.005x  1.000x  1.000x  0.998x  0.983x
 mc  sharp   w2  v:   1.005x  1.001x  1.009x  0.998x  0.994x  0.997x  0.989x
 mc  regular w4  v:   0.985x  0.998x  0.991x  0.998x  1.000x  1.000x  0.983x
 mc  sharp   w4  v:   1.005x  1.002x  1.006x  1.002x  0.998x  0.991x  0.999x
 mct regular w4  v:   0.966x  0.967x  0.961x  0.974x  0.996x  0.954x  0.982x
 mct sharp   w4  v:   0.970x  0.944x  0.967x  0.944x  0.997x  0.951x  0.966x
 mc  regular w2 hv:   0.993x  0.993x  0.994x  0.987x  0.993x  0.985x  0.999x
 mc  sharp   w2 hv:   0.994x  0.996x  0.992x  0.998x  0.997x  0.999x  0.999x
 mc  regular w4 hv:   0.964x  0.958x  0.964x  0.960x  0.982x  0.938x  0.958x
 mc  sharp   w4 hv:   0.982x  0.981x  0.980x  0.982x  0.995x  0.986x  0.941x
 mct regular w4 hv:   0.993x  0.994x  0.992x  0.994x  0.996x  0.992x  0.988x
 mct sharp   w4 hv:   0.993x  0.996x  0.991x  0.996x  0.954x  0.992x  1.011x
8tap 16bpc neon          V2      V1      X3      X1    A715     A78     A76
 mc  regular w2  0:   0.869x  1.059x  0.874x  0.956x  0.883x  0.932x  1.000x
 mct regular w4  0:   0.348x  0.369x  0.354x  0.377x  0.560x  0.409x  0.648x
 mc  regular w2  h:   0.996x  0.988x  0.992x  0.985x  0.989x  0.991x  1.006x
 mc  sharp   w2  h:   0.996x  0.989x  0.979x  0.991x  0.987x  0.988x  0.997x
 mc  regular w4  h:   0.957x  0.937x  0.957x  0.948x  0.961x  0.927x  0.994x
 mc  sharp   w4  h:   0.966x  0.940x  0.962x  0.954x  0.985x  0.929x  0.970x
 mct regular w4  h:   0.922x  0.942x  0.932x  0.933x  1.007x  0.938x  0.905x
 mct sharp   w4  h:   0.919x  0.943x  0.919x  0.931x  0.971x  0.943x  0.929x
 mc  regular w2  v:   1.000x  0.997x  1.001x  1.003x  1.001x  0.999x  0.984x
 mc  sharp   w2  v:   1.000x  0.999x  1.000x  0.999x  1.000x  1.000x  0.993x
 mc  regular w4  v:   0.936x  0.941x  0.936x  0.939x  0.999x  0.928x  0.981x
 mc  sharp   w4  v:   0.955x  0.961x  0.949x  0.956x  0.999x  0.947x  0.953x
 mct regular w4  v:   0.977x  0.966x  0.979x  0.968x  0.990x  0.972x  0.972x
 mct sharp   w4  v:   0.973x  0.965x  0.981x  0.963x  0.994x  0.977x  0.974x
 mc  regular w2 hv:   0.995x  1.001x  0.995x  0.995x  0.995x  1.000x  0.981x
 mc  sharp   w2 hv:   0.993x  1.012x  0.993x  0.988x  0.996x  0.992x  1.008x
 mc  regular w4 hv:   0.938x  0.943x  0.939x  0.943x  0.986x  0.943x  0.997x
 mc  sharp   w4 hv:   0.969x  0.959x  0.970x  0.974x  0.986x  0.993x  0.997x
 mct regular w4 hv:   0.942x  0.970x  0.951x  0.960x  0.977x  0.958x  1.018x
 mct sharp   w4 hv:   0.923x  0.958x  0.934x  0.955x  0.973x  0.946x  0.986x
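
For reading the table, a relative runtime can be converted into a speedup factor or a percentage of runtime saved. The snippet below is only a minimal sketch: it assumes each entry is the runtime after the patch divided by the runtime before it, and the helper names `speedup` and `time_saved_percent` are made up for illustration. The two sample values are copied from the table (16bpc `blend w4` in the V2 column and 8bpc `avg w8` in the V1 column).

```python
# Assumption: each table entry is (runtime after patch) / (runtime before patch).
# The helper names below are hypothetical and not part of dav1d or its tooling.


def speedup(relative_runtime: float) -> float:
    """Convert a relative runtime (after / before) into a speedup factor."""
    if relative_runtime <= 0:
        raise ValueError("relative runtime must be positive")
    return 1.0 / relative_runtime


def time_saved_percent(relative_runtime: float) -> float:
    """Percentage of the original runtime saved (negative means slower)."""
    return (1.0 - relative_runtime) * 100.0


# Sample entries copied from the table above.
samples = {
    "16bpc blend w4 (V2)": 0.367,
    "8bpc avg w8 (V1)": 1.030,
}

for name, rel in samples.items():
    print(
        f"{name}: {speedup(rel):.2f}x speedup, "
        f"{time_saved_percent(rel):+.1f}% runtime saved"
    )
```

Under that assumption, an entry well below 1.000x (such as the 16bpc blend rows) corresponds to a large speedup, while an entry slightly above 1.000x is a small slowdown on that core.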
