AArch64: Add USMMLA impl. for SBD 6-tap H/HV filters

Arpad Panyik requested to merge arpadpanyik-arm/dav1d:mc_sbd_usmmla into master

Add a 6-tap variant of the standard bit-depth (SBD) horizontal subpel filters using the Armv8.6 I8MM USMMLA matrix multiply instruction. This patch also extends the HV filter with a 6-tap horizontal pass using USMMLA.
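
For readers unfamiliar with USMMLA, the following is a minimal scalar sketch of why it fits a 6-tap horizontal filter. USMMLA multiplies a 2x8 matrix of unsigned bytes by the transpose of a 2x8 matrix of signed bytes and accumulates into a 2x2 matrix of 32-bit integers. The row layout used here (two sample windows offset by two pixels, two copies of the taps offset by one position) is an assumption for illustration and is not copied from the patch. The function names and tap values are hypothetical, and rounding, shifting and narrowing are not modelled.

```python
def usmmla(acc, a_u8, b_s8):
    """Scalar model of USMMLA: acc(2x2) += A(2x8, unsigned) * B(2x8, signed)^T.

    All matrices are flat, row-major lists. No 32-bit wraparound is
    modelled; the values used below stay far inside the int32 range.
    """
    assert len(acc) == 4 and len(a_u8) == 16 and len(b_s8) == 16
    assert all(0 <= v <= 255 for v in a_u8)
    assert all(-128 <= v <= 127 for v in b_s8)
    out = list(acc)
    for i in range(2):          # row of A
        for j in range(2):      # row of B
            out[2 * i + j] += sum(
                a_u8[8 * i + k] * b_s8[8 * j + k] for k in range(8)
            )
    return out


def filter_h_6tap_x4(src, taps):
    """Four adjacent 6-tap outputs from one modelled USMMLA (hypothetical layout).

    src  : at least 10 unsigned 8-bit samples
    taps : 6 signed 8-bit filter coefficients
    """
    assert len(src) >= 10 and len(taps) == 6
    # A: row 0 covers src[0..7], row 1 covers src[2..9].
    a = list(src[0:8]) + list(src[2:10])
    # B: row 0 is the taps padded with two zeros,
    #    row 1 is the same taps shifted right by one position.
    b = list(taps) + [0, 0] + [0] + list(taps) + [0]
    # Result order: outputs for pixels 0, 1, 2, 3.
    return usmmla([0, 0, 0, 0], a, b)


if __name__ == "__main__":
    src = [3, 250, 0, 17, 128, 255, 9, 64, 200, 1]
    taps = [2, -9, 39, 39, -9, 2]  # illustrative values, not dav1d's tables
    direct = [sum(src[x + k] * taps[k] for k in range(6)) for x in range(4)]
    assert filter_h_6tap_x4(src, taps) == direct
```

With this layout a 6-tap filter leaves room in each 8-byte row for the one-position shift, so a single instruction yields four output pixels.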

This MR also contains a typo fix in the SBD 6-tap 2D/HV subpel filter.

Benchmarks show an FPS increase of up to 6-7%, depending on the input video and the CPU used.

This patch will increase the size of .text by around 1.2 KiB.


Relative runtime of microbenchmarks after this patch on Neoverse and Cortex CPU cores (lower is faster):

regular      V2      V1      X3    A720    A715    A520    A510
  w8 hv:  0.860x  0.895x  0.870x  0.896x  0.896x  0.938x  0.936x
 w16 hv:  0.829x  0.886x  0.865x  0.908x  0.906x  0.946x  0.944x
 w32 hv:  0.837x  0.883x  0.862x  0.914x  0.915x  0.953x  0.949x
 w64 hv:  0.840x  0.883x  0.862x  0.914x  0.914x  0.955x  0.952x

  w8 h:   0.746x  0.754x  0.747x  0.723x  0.724x  0.874x  0.866x
 w16 h:   0.749x  0.764x  0.745x  0.731x  0.731x  0.858x  0.852x
 w32 h:   0.739x  0.754x  0.738x  0.729x  0.729x  0.839x  0.837x
 w64 h:   0.736x  0.749x  0.733x  0.725x  0.726x  0.847x  0.836x

Some benchmark results for the USMMLA version against the USDOT version:

Models 1080p:

AWS Graviton 3:  193.69 fps  ->  200.04 fps ( +3.28% )
AWS Graviton 4:  246.09 fps  ->  255.00 fps ( +3.62% )

Balloons 1080p:

AWS Graviton 3:  176.33 fps  ->  180.61 fps ( +2.43% )
AWS Graviton 4:  225.18 fps  ->  231.24 fps ( +2.69% )

Mountain Bike 1080p:

AWS Graviton 3:  144.80 fps  ->  147.89 fps ( +2.13% )
AWS Graviton 4:  183.44 fps  ->  187.38 fps ( +2.15% )

Nature 1080p:

AWS Graviton 3:  140.22 fps  ->  142.42 fps ( +1.57% )
AWS Graviton 4:  178.17 fps  ->  181.14 fps ( +1.67% )

Vision Pro 1080p:

AWS Graviton 3:  200.66 fps  ->  205.25 fps ( +2.29% )
AWS Graviton 4:  260.92 fps  ->  266.68 fps ( +2.21% )

Bosphorus:

AWS Graviton 3 - 720p:   540.45 fps  ->  580.43 fps ( +7.40% )
AWS Graviton 4 - 720p:   680.19 fps  ->  724.72 fps ( +6.55% )

AWS Graviton 3 - 1080p:  242.12 fps  ->  256.48 fps ( +5.93% )
AWS Graviton 4 - 1080p:  305.77 fps  ->  326.12 fps ( +6.66% )

AWS Graviton 3 - 2160p:   60.60 fps  ->   63.37 fps ( +4.57% )
AWS Graviton 4 - 2160p:   76.71 fps  ->   80.69 fps ( +5.19% )

The Bosphorus videos were encoded with aomenc (3.7.1+):

aomenc --good --cpu-used=5 -w 1280 -h  720 --bit-depth=8 --ivf -o Bosphorus_720p_8bit.ivf  Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m
aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=8 --ivf -o Bosphorus_1080p_8bit.ivf Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m
aomenc --good --cpu-used=5 -w 3840 -h 2160 --bit-depth=8 --ivf -o Bosphorus_2160p_8bit.ivf Bosphorus_3840x2160_120fps_420_8bit_YUV.y4m
Edited by Arpad Panyik
