AArch64: Optimize the init of DotProd+ 2D subpel filters (!1665) · Merge requests · VideoLAN / dav1d

Arpad Panyik requested to merge arpadpanyik-arm/dav1d:mc_sbd_dotprod_init_hv into master May 09, 2024

Removed some unnecessary vector register copies from the initial horizontal filter parts of the HV subpel filters. The performance gains are largest for the smaller filter block sizes.

The narrowing shifts at the end of *filter8* were also rewritten, because the previous sequence was only beneficial on the Cortex-A55 among the DotProd-capable CPU cores. On the other out-of-order or newer CPUs the UZP1+SHRN instruction combination is better.
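To illustrate why a UZP1 followed by a single narrowing shift can stand in for narrowing each vector separately, here is a scalar Python model of the lane semantics. It is a sketch, not the code from this merge request: the function names are hypothetical, and the "separate" sequence (SHRN/SHRN2 followed by a truncation to bytes) is an assumed comparison chosen only to show that the upper 16 bits of each 32-bit lane do not affect the final bytes when the shift is at most 8.

```python
def shrn(lanes, shift, src_bits):
    """Model SHRN: shift each lane right, keep the low half of the lane width."""
    src_mask = (1 << src_bits) - 1
    half_mask = (1 << (src_bits // 2)) - 1
    return [((x & src_mask) >> shift) & half_mask for x in lanes]


def uzp1_8h(vn_4s, vm_4s):
    """Model UZP1 Vd.8H, Vn.8H, Vm.8H on registers that hold 4 x 32-bit lanes.

    The even-numbered halfwords are the low 16 bits of each 32-bit lane.
    """
    return [x & 0xFFFF for x in vn_4s + vm_4s]


def narrow_separately(a_4s, b_4s, shift):
    """Hypothetical baseline: SHRN + SHRN2 (32 -> 16 bits), then keep the low byte."""
    halfwords = shrn(a_4s, shift, 32) + shrn(b_4s, shift, 32)
    return [h & 0xFF for h in halfwords]


def narrow_via_uzp1(a_4s, b_4s, shift):
    """UZP1 merges both vectors first, then one SHRN (16 -> 8 bits) finishes."""
    return shrn(uzp1_8h(a_4s, b_4s), shift, 16)


a = [0x12345678, -1, 0x7FFFFFFF, -2147483648]
b = [0, 1, 0x00FF00FF, -12345]
for shift in range(1, 9):  # SHRN to bytes accepts shifts 1..8
    assert narrow_separately(a, b, shift) == narrow_via_uzp1(a, b, shift)
```

The model only shows that the two orderings produce the same bytes. Which ordering is faster depends on the core, which is what the measurements below address.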

Relative performance of micro benchmarks (lower is better):

Cortex-A55:

  mct regular w4:  0.980x    mct sharp w4:    0.983x
  mct regular w8:  1.007x    mct sharp w8:    1.012x
  mct regular w16: 1.007x    mct sharp w16:   1.005x

Cortex-A510:

  mct regular w4:  0.935x    mct sharp w4:    0.927x
  mct regular w8:  0.984x    mct sharp w8:    0.983x
  mct regular w16: 0.986x    mct sharp w16:   0.987x

Cortex-A78:

  mct regular w4:  0.974x    mct sharp w4:    0.971x
  mct regular w8:  0.988x    mct sharp w8:    0.987x
  mct regular w16: 0.991x    mct sharp w16:   0.979x

Cortex-A715:

  mct regular w4:  0.958x    mct sharp w4:    0.974x
  mct regular w8:  0.993x    mct sharp w8:    0.991x
  mct regular w16: 0.998x    mct sharp w16:   0.997x

Cortex-X1:

  mct regular w4:  0.983x    mct sharp w4:    0.974x
  mct regular w8:  0.993x    mct sharp w8:    0.990x
  mct regular w16: 0.996x    mct sharp w16:   0.995x

Cortex-X3:

  mct regular w4:  0.953x    mct sharp w4:    0.981x
  mct regular w8:  0.993x    mct sharp w8:    0.993x
  mct regular w16: 0.997x    mct sharp w16:   0.995x
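The ratios above can be read as time savings. A minimal sketch, assuming each ratio is new time divided by old time (the tables only state "lower is better"), using the "mct regular" rows for two of the cores; the helper name is hypothetical:

```python
# "mct regular" ratios copied from the Cortex-A55 and Cortex-A510 tables.
RATIOS = {
    "Cortex-A55": {"w4": 0.980, "w8": 1.007, "w16": 1.007},
    "Cortex-A510": {"w4": 0.935, "w8": 0.984, "w16": 0.986},
}


def time_saved_percent(ratio):
    """Assumes ratio = new_time / old_time, so values below 1.0 are faster."""
    return (1.0 - ratio) * 100.0


for core, widths in RATIOS.items():
    for width, ratio in widths.items():
        # Negative values mean the new code is slower for that block width.
        print(f"{core} {width}: {time_saved_percent(ratio):+.1f}% time saved")
```

Under that assumption, 0.935x on the Cortex-A510 at w4 corresponds to about 6.5% less time, while 1.007x on the Cortex-A55 at w8 and w16 is a slowdown of about 0.7%.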
