
AArch64: Add HBD subpel filters using 128-bit SVE2

Arpad Panyik requested to merge arpadpanyik-arm/dav1d:mc_hbd_sve2_128b into master

Add an Armv9.0-A SVE2 code path for high bitdepth convolutions. Only 2D convolutions have 6-tap specialisations of their vertical passes. All other convolutions are 4- or 8-tap filters which fit well with the 4-element 16-bit SDOT instruction of SVE2.
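To illustrate why 4- and 8-tap filters map naturally onto a 4-element dot product, here is a minimal scalar sketch in C. It only models the arithmetic of one accumulator lane (four 16-bit products summed into a wide accumulator); it is not the assembly from this patch. The names `sdot4_s16`, `filter4` and `filter8` are hypothetical and chosen for illustration only.

```c
#include <stdint.h>

/* Scalar model of one accumulator lane of a 4-element, 16-bit signed dot
 * product: acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]. */
int64_t sdot4_s16(int64_t acc, const int16_t a[4], const int16_t b[4])
{
    for (int i = 0; i < 4; i++)
        acc += (int64_t)a[i] * (int64_t)b[i];
    return acc;
}

/* Hypothetical helper: one 4-tap filter output is a single dot product. */
int64_t filter4(const int16_t *src, const int16_t taps[4])
{
    return sdot4_s16(0, src, taps);
}

/* Hypothetical helper: one 8-tap filter output is two chained dot products,
 * the second accumulating on top of the first. */
int64_t filter8(const int16_t *src, const int16_t taps[8])
{
    int64_t acc = sdot4_s16(0, src, taps);
    return sdot4_s16(acc, src + 4, taps + 4);
}
```

With eight source samples all equal to 512 and an illustrative tap set that sums to 128 (for example `{ -1, 3, -7, 127, 8, -3, 1, 0 }`, not taken from the patch), `filter8` returns 512 × 128 = 65536, which is the expected unnormalised result for a flat signal.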

Benchmarks show up to a 17% FPS increase, depending on the input video and the CPU used.

This patch increases the size of the .text section by around 8 KiB.


Performance relative to the C reference on some Cortex-A/X CPUs:

Regular HV subpel filter micro benchmarks

regular         A715    A720      X3      X4    A510    A520
w4 hv neon:    3.93x   4.10x   5.21x   5.17x   3.57x   5.27x
w4 hv sve2:    4.99x   5.14x   6.00x   6.05x   4.33x   3.99x
w8 hv neon:    1.72x   1.67x   1.98x   2.18x   2.95x   2.94x
w8 hv sve2:    2.12x   2.29x   2.52x   2.62x   2.60x   2.60x
w16 hv neon:   1.59x   1.53x   1.83x   1.89x   2.35x   2.24x
w16 hv sve2:   1.94x   2.12x   2.33x   2.18x   2.06x   2.06x
w32 hv neon:   1.49x   1.50x   1.66x   1.76x   2.10x   2.16x
w32 hv sve2:   1.81x   2.09x   2.11x   2.09x   1.84x   1.87x
w64 hv neon:   1.52x   1.50x   1.55x   1.71x   1.95x   2.05x
w64 hv sve2:   1.84x   2.08x   1.97x   1.98x   1.74x   1.77x

Regular H subpel filter micro benchmarks

regular         A715    A720      X3      X4    A510    A520
w4 h neon:     5.35x   5.47x   7.39x   5.78x   3.92x   5.19x
w4 h sve2:     7.91x   8.35x  11.95x  10.33x   5.81x   5.42x
w8 h neon:     4.49x   4.43x   6.50x   4.87x   7.18x   6.17x
w8 h sve2:     6.09x   6.22x   9.59x   7.70x   7.89x   6.83x
w16 h neon:    2.53x   2.52x   2.34x   1.86x   2.71x   2.75x
w16 h sve2:    3.41x   3.47x   3.53x   3.25x   2.89x   2.96x
w32 h neon:    2.07x   2.08x   1.97x   1.56x   2.17x   2.21x
w32 h sve2:    2.76x   2.84x   2.94x   2.75x   2.24x   2.29x
w64 h neon:    1.86x   1.86x   1.76x   1.41x   1.87x   1.88x
w64 h sve2:    2.47x   2.54x   2.65x   2.46x   1.94x   1.94x

Regular V subpel filter micro benchmarks

regular         A715    A720      X3      X4    A510    A520
w4 v neon:     5.22x   5.17x   6.36x   5.60x   4.23x   7.30x
w4 v sve2:     5.86x   5.90x   7.81x   7.16x   4.86x   4.15x
w8 v neon:     4.83x   4.79x   6.96x   6.45x   4.74x   8.40x
w8 v sve2:     5.25x   5.23x   7.76x   6.79x   4.84x   4.13x
w16 v neon:    2.59x   2.60x   2.93x   2.47x   1.80x   4.16x
w16 v sve2:    2.85x   2.88x   3.36x   2.73x   1.86x   2.00x
w32 v neon:    2.12x   2.13x   2.33x   2.03x   1.34x   3.11x
w32 v sve2:    2.36x   2.40x   2.73x   2.32x   1.41x   1.48x
w64 v neon:    1.94x   1.92x   2.02x   1.78x   1.12x   2.59x
w64 v sve2:    2.16x   2.15x   2.37x   2.03x   1.17x   1.22x

prep_sve micro benchmarks

regular         A715    A720      X3      X4    A510    A520
w4 0 neon:     1.75x   1.71x   1.44x   1.56x   3.18x   2.87x
w4 0 sve2:     4.28x   4.39x   5.72x   6.42x   5.50x   4.68x
w8 0 neon:     3.05x   3.04x   4.44x   4.64x   3.84x   3.52x
w8 0 sve2:     3.85x   3.80x   5.45x   6.01x   4.92x   4.26x
w16 0 neon:    2.92x   2.93x   3.82x   3.23x   4.58x   4.44x
w16 0 sve2:    4.29x   4.27x   4.25x   4.15x   5.58x   5.29x
w32 0 neon:    2.73x   2.76x   3.50x   2.67x   4.44x   4.26x
w32 0 sve2:    4.09x   4.10x   3.75x   3.39x   5.67x   5.22x
w64 0 neon:    2.73x   2.70x   3.27x   3.14x   4.57x   4.68x
w64 0 sve2:    4.06x   3.97x   3.54x   3.18x   6.36x   6.25x

Sharp HV subpel filter micro benchmarks

sharp           A715    A720      X3      X4    A510    A520
w4 hv neon:    3.54x   3.64x   4.43x   4.45x   3.03x   4.72x
w4 hv sve2:    4.30x   4.55x   5.38x   5.26x   4.04x   3.76x
w8 hv neon:    1.30x   1.25x   1.51x   1.60x   2.44x   2.43x
w8 hv sve2:    1.86x   2.06x   2.09x   2.18x   2.37x   2.39x
w16 hv neon:   1.19x   1.16x   1.43x   1.36x   1.95x   1.98x
w16 hv sve2:   1.68x   1.91x   1.94x   1.84x   1.89x   1.94x
w32 hv neon:   1.13x   1.12x   1.30x   1.29x   1.75x   1.81x
w32 hv sve2:   1.58x   1.84x   1.75x   1.74x   1.70x   1.76x
w64 hv neon:   1.13x   1.13x   1.21x   1.25x   1.65x   1.69x
w64 hv sve2:   1.57x   1.84x   1.62x   1.67x   1.62x   1.65x

Sharp H subpel filter micro benchmarks

sharp           A715    A720      X3      X4    A510    A520
w4 h neon:     5.38x   5.49x   7.46x   5.74x   3.93x   5.23x
w4 h sve2:     7.86x   8.37x  11.99x  10.38x   5.81x   5.40x
w8 h neon:     3.46x   3.49x   5.36x   4.64x   6.40x   5.62x
w8 h sve2:     5.95x   6.23x   9.61x   7.76x   7.86x   6.89x
w16 h neon:    1.99x   1.97x   2.07x   1.91x   2.43x   2.51x
w16 h sve2:    3.42x   3.46x   3.75x   3.23x   2.89x   2.98x
w32 h neon:    1.67x   1.62x   1.66x   1.63x   1.95x   2.01x
w32 h sve2:    2.86x   2.84x   2.94x   2.72x   2.21x   2.29x
w64 h neon:    1.45x   1.45x   1.51x   1.48x   1.69x   1.70x
w64 h sve2:    2.47x   2.54x   2.64x   2.46x   1.93x   1.95x

Sharp V subpel filter micro benchmarks

sharp           A715    A720      X3      X4    A510    A520
w4 v neon:     4.07x   4.01x   5.15x   4.74x   3.38x   6.56x
w4 v sve2:     5.88x   5.86x   7.81x   7.15x   4.85x   4.39x
w8 v neon:     3.64x   3.59x   5.38x   4.92x   3.59x   7.23x
w8 v sve2:     5.23x   5.19x   7.77x   6.66x   4.81x   4.13x
w16 v neon:    1.93x   1.95x   2.25x   1.92x   1.35x   3.46x
w16 v sve2:    2.85x   2.88x   3.36x   2.71x   1.86x   1.94x
w32 v neon:    1.57x   1.58x   1.78x   1.60x   1.01x   2.67x
w32 v sve2:    2.36x   2.39x   2.73x   2.35x   1.41x   1.50x
w64 v neon:    1.44x   1.42x   1.54x   1.43x   0.85x   2.19x
w64 v sve2:    2.17x   2.15x   2.37x   2.06x   1.18x   1.25x
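
Since both the Neon and SVE2 figures above are speedups over the same C reference, the SVE2-over-Neon speedup for a given row and CPU can be read off as their quotient. A minimal sketch of that arithmetic, with the hypothetical helper name `sve2_over_neon`:

```c
/* Hypothetical helper: both arguments are speedups over the same C
 * baseline, so their quotient is the speedup of SVE2 over Neon.
 * A result below 1.0 means the SVE2 path is slower than Neon. */
double sve2_over_neon(double sve2_vs_c, double neon_vs_c)
{
    return sve2_vs_c / neon_vs_c;
}
```

For example, the regular `w4 hv` row on A715 gives 4.99 / 3.93, roughly 1.27x, while the same row on A520 gives 3.99 / 5.27, roughly 0.76x.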

Some benchmark results for the SVE2 version against Neon:

Georgia HDR:

AWS Graviton 4 -  720p:  241.63 fps  ->  266.03 fps ( +10.1% )
AWS Graviton 4 - 1080p:  113.06 fps  ->  124.85 fps ( +10.4% )
AWS Graviton 4 - 2160p:   30.26 fps  ->   33.09 fps ( +9.35% )


RED Reel HDR:

AWS Graviton 4 -  720p:  232.84 fps  ->  247.88 fps ( +6.46% )
AWS Graviton 4 - 1080p:  115.73 fps  ->  122.66 fps ( +5.99% )
AWS Graviton 4 - 2160p:   29.82 fps  ->   31.13 fps ( +4.39% )

Bosphorus:

AWS Graviton 4 -  720p:  429.74 fps  ->  503.43 fps ( +17.1% )
AWS Graviton 4 - 1080p:  188.38 fps  ->  218.43 fps ( +16.0% )
AWS Graviton 4 - 2160p:   38.57 fps  ->   44.94 fps ( +16.5% )

The Bosphorus videos were encoded with aomenc (3.7.1+):

aomenc --good --cpu-used=5 -w 1280 -h  720 --bit-depth=10 --input-bit-depth=10 --ivf -o Bosphorus_720p_10bit.ivf  Bosphorus_3840x2160_120fps_420_10bit_YUV_Y4M.y4m
aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=10 --input-bit-depth=10 --ivf -o Bosphorus_1080p_10bit.ivf Bosphorus_3840x2160_120fps_420_10bit_YUV_Y4M.y4m
aomenc --good --cpu-used=5 -w 3840 -h 2160 --bit-depth=10 --input-bit-depth=10 --ivf -o Bosphorus_2160p_10bit.ivf Bosphorus_3840x2160_120fps_420_10bit_YUV_Y4M.y4m
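
The percentages in the FPS results above are consistent with a plain relative change between the Neon and SVE2 frame rates. A minimal sketch of that calculation, with the hypothetical helper name `fps_gain_percent`:

```c
/* Hypothetical helper: relative FPS change in percent when going from the
 * Neon build to the SVE2 build. */
double fps_gain_percent(double neon_fps, double sve2_fps)
{
    return (sve2_fps / neon_fps - 1.0) * 100.0;
}
```

For the Bosphorus 720p result, `fps_gain_percent(429.74, 503.43)` evaluates to about 17.1, matching the reported figure.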
