AArch64: Add HBD subpel filters using 128-bit SVE2

Add an Armv9.0-A SVE2 code path for high bitdepth convolutions. Only
2D convolutions have 6-tap specialisations of their vertical passes.
All other convolutions are 4- or 8-tap filters which fit well with
the 4-element 16-bit SDOT instruction of SVE2.
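
To illustrate why 4- and 8-tap filters map well onto SDOT, here is a
minimal scalar C model of one 64-bit lane of the 16-bit SDOT. It is only
a sketch, not the assembly added by this patch: the helper names
(sdot4_s16, filter8_s16) and any coefficients passed in are hypothetical,
and rounding and narrowing of the result are left out.

```c
#include <stdint.h>

/* Scalar model of one 64-bit lane of a 16-bit SDOT: four signed 16-bit
 * products are summed and added to a 64-bit accumulator.
 * (Hypothetical helper, for illustration only.) */
static int64_t sdot4_s16(int64_t acc, const int16_t a[4], const int16_t b[4])
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * b[i]; /* each product fits in 32 bits */
    return acc;
}

/* One output sample of an 8-tap filter expressed as two chained 4-element
 * dot products. A 4-tap filter would need only the first call.
 * (Hypothetical helper; no rounding or shifting is applied.) */
static int64_t filter8_s16(const int16_t src[8], const int16_t coef[8])
{
    int64_t acc = 0;
    acc = sdot4_s16(acc, src, coef);         /* taps 0..3 */
    acc = sdot4_s16(acc, src + 4, coef + 4); /* taps 4..7 */
    return acc;
}
```

In this model a 6-tap filter would still occupy two dot products with
two zero coefficients, so it gains nothing over the 8-tap form.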
Benchmarks show up to a 17% FPS increase, depending on the input video
and the CPU used.

This patch increases the .text size by around 8 KiB.
Performance relative to the C reference on some Cortex-A/X CPUs:

Regular HV subpel filter micro benchmarks
regular        A715   A720     X3     X4   A510   A520
w4 hv neon:   3.93x  4.10x  5.21x  5.17x  3.57x  5.27x
w4 hv sve2:   4.99x  5.14x  6.00x  6.05x  4.33x  3.99x
w8 hv neon:   1.72x  1.67x  1.98x  2.18x  2.95x  2.94x
w8 hv sve2:   2.12x  2.29x  2.52x  2.62x  2.60x  2.60x
w16 hv neon:  1.59x  1.53x  1.83x  1.89x  2.35x  2.24x
w16 hv sve2:  1.94x  2.12x  2.33x  2.18x  2.06x  2.06x
w32 hv neon:  1.49x  1.50x  1.66x  1.76x  2.10x  2.16x
w32 hv sve2:  1.81x  2.09x  2.11x  2.09x  1.84x  1.87x
w64 hv neon:  1.52x  1.50x  1.55x  1.71x  1.95x  2.05x
w64 hv sve2:  1.84x  2.08x  1.97x  1.98x  1.74x  1.77x

Regular H subpel filter micro benchmarks
regular        A715   A720     X3     X4   A510   A520
w4 h neon:    5.35x  5.47x  7.39x  5.78x  3.92x  5.19x
w4 h sve2:    7.91x  8.35x 11.95x 10.33x  5.81x  5.42x
w8 h neon:    4.49x  4.43x  6.50x  4.87x  7.18x  6.17x
w8 h sve2:    6.09x  6.22x  9.59x  7.70x  7.89x  6.83x
w16 h neon:   2.53x  2.52x  2.34x  1.86x  2.71x  2.75x
w16 h sve2:   3.41x  3.47x  3.53x  3.25x  2.89x  2.96x
w32 h neon:   2.07x  2.08x  1.97x  1.56x  2.17x  2.21x
w32 h sve2:   2.76x  2.84x  2.94x  2.75x  2.24x  2.29x
w64 h neon:   1.86x  1.86x  1.76x  1.41x  1.87x  1.88x
w64 h sve2:   2.47x  2.54x  2.65x  2.46x  1.94x  1.94x

Regular V subpel filter micro benchmarks
regular        A715   A720     X3     X4   A510   A520
w4 v neon:    5.22x  5.17x  6.36x  5.60x  4.23x  7.30x
w4 v sve2:    5.86x  5.90x  7.81x  7.16x  4.86x  4.15x
w8 v neon:    4.83x  4.79x  6.96x  6.45x  4.74x  8.40x
w8 v sve2:    5.25x  5.23x  7.76x  6.79x  4.84x  4.13x
w16 v neon:   2.59x  2.60x  2.93x  2.47x  1.80x  4.16x
w16 v sve2:   2.85x  2.88x  3.36x  2.73x  1.86x  2.00x
w32 v neon:   2.12x  2.13x  2.33x  2.03x  1.34x  3.11x
w32 v sve2:   2.36x  2.40x  2.73x  2.32x  1.41x  1.48x
w64 v neon:   1.94x  1.92x  2.02x  1.78x  1.12x  2.59x
w64 v sve2:   2.16x  2.15x  2.37x  2.03x  1.17x  1.22x

prep_sve micro benchmarks
regular        A715   A720     X3     X4   A510   A520
w4 0 neon:    1.75x  1.71x  1.44x  1.56x  3.18x  2.87x
w4 0 sve2:    4.28x  4.39x  5.72x  6.42x  5.50x  4.68x
w8 0 neon:    3.05x  3.04x  4.44x  4.64x  3.84x  3.52x
w8 0 sve2:    3.85x  3.80x  5.45x  6.01x  4.92x  4.26x
w16 0 neon:   2.92x  2.93x  3.82x  3.23x  4.58x  4.44x
w16 0 sve2:   4.29x  4.27x  4.25x  4.15x  5.58x  5.29x
w32 0 neon:   2.73x  2.76x  3.50x  2.67x  4.44x  4.26x
w32 0 sve2:   4.09x  4.10x  3.75x  3.39x  5.67x  5.22x
w64 0 neon:   2.73x  2.70x  3.27x  3.14x  4.57x  4.68x
w64 0 sve2:   4.06x  3.97x  3.54x  3.18x  6.36x  6.25x

Sharp HV subpel filter micro benchmarks
sharp          A715   A720     X3     X4   A510   A520
w4 hv neon:   3.54x  3.64x  4.43x  4.45x  3.03x  4.72x
w4 hv sve2:   4.30x  4.55x  5.38x  5.26x  4.04x  3.76x
w8 hv neon:   1.30x  1.25x  1.51x  1.60x  2.44x  2.43x
w8 hv sve2:   1.86x  2.06x  2.09x  2.18x  2.37x  2.39x
w16 hv neon:  1.19x  1.16x  1.43x  1.36x  1.95x  1.98x
w16 hv sve2:  1.68x  1.91x  1.94x  1.84x  1.89x  1.94x
w32 hv neon:  1.13x  1.12x  1.30x  1.29x  1.75x  1.81x
w32 hv sve2:  1.58x  1.84x  1.75x  1.74x  1.70x  1.76x
w64 hv neon:  1.13x  1.13x  1.21x  1.25x  1.65x  1.69x
w64 hv sve2:  1.57x  1.84x  1.62x  1.67x  1.62x  1.65x

Sharp H subpel filter micro benchmarks
sharp          A715   A720     X3     X4   A510   A520
w4 h neon:    5.38x  5.49x  7.46x  5.74x  3.93x  5.23x
w4 h sve2:    7.86x  8.37x 11.99x 10.38x  5.81x  5.40x
w8 h neon:    3.46x  3.49x  5.36x  4.64x  6.40x  5.62x
w8 h sve2:    5.95x  6.23x  9.61x  7.76x  7.86x  6.89x
w16 h neon:   1.99x  1.97x  2.07x  1.91x  2.43x  2.51x
w16 h sve2:   3.42x  3.46x  3.75x  3.23x  2.89x  2.98x
w32 h neon:   1.67x  1.62x  1.66x  1.63x  1.95x  2.01x
w32 h sve2:   2.86x  2.84x  2.94x  2.72x  2.21x  2.29x
w64 h neon:   1.45x  1.45x  1.51x  1.48x  1.69x  1.70x
w64 h sve2:   2.47x  2.54x  2.64x  2.46x  1.93x  1.95x

Sharp V subpel filter micro benchmarks
sharp          A715   A720     X3     X4   A510   A520
w4 v neon:    4.07x  4.01x  5.15x  4.74x  3.38x  6.56x
w4 v sve2:    5.88x  5.86x  7.81x  7.15x  4.85x  4.39x
w8 v neon:    3.64x  3.59x  5.38x  4.92x  3.59x  7.23x
w8 v sve2:    5.23x  5.19x  7.77x  6.66x  4.81x  4.13x
w16 v neon:   1.93x  1.95x  2.25x  1.92x  1.35x  3.46x
w16 v sve2:   2.85x  2.88x  3.36x  2.71x  1.86x  1.94x
w32 v neon:   1.57x  1.58x  1.78x  1.60x  1.01x  2.67x
w32 v sve2:   2.36x  2.39x  2.73x  2.35x  1.41x  1.50x
w64 v neon:   1.44x  1.42x  1.54x  1.43x  0.85x  2.19x
w64 v sve2:   2.17x  2.15x  2.37x  2.06x  1.18x  1.25x

Some benchmark results for the SVE2 version against Neon:
AWS Graviton 4 - 720p: 241.63 fps -> 266.03 fps ( +10.1% )
AWS Graviton 4 - 1080p: 113.06 fps -> 124.85 fps ( +10.4% )
AWS Graviton 4 - 2160p: 30.26 fps -> 33.09 fps ( +9.35% )
AWS Graviton 4 - 720p: 232.84 fps -> 247.88 fps ( +6.46% )
AWS Graviton 4 - 1080p: 115.73 fps -> 122.66 fps ( +5.99% )
AWS Graviton 4 - 2160p: 29.82 fps -> 31.13 fps ( +4.39% )
AWS Graviton 4 - 720p: 429.74 fps -> 503.43 fps ( +17.1% )
AWS Graviton 4 - 1080p: 188.38 fps -> 218.43 fps ( +16.0% )
AWS Graviton 4 - 2160p: 38.57 fps -> 44.94 fps ( +16.5% )
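
The percentages above appear to be the relative FPS change from the Neon
build to the SVE2 build. A small C sketch of that arithmetic follows. The
function name fps_gain_percent is hypothetical and is not part of any
benchmark tooling.

```c
/* Relative change in percent from a baseline FPS to a new FPS.
 * (Hypothetical helper; assumes baseline_fps is non-zero.) */
static double fps_gain_percent(double baseline_fps, double new_fps)
{
    return (new_fps / baseline_fps - 1.0) * 100.0;
}
```

For example, fps_gain_percent(241.63, 266.03) gives roughly 10.1, which
matches the first 720p line.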
Bosphorus videos were encoded by aomenc (3.7.1+):
aomenc --good --cpu-used=5 -w 1280 -h 720 --bit-depth=10 --input-bit-depth=10 --ivf -o Bosphorus_720p_10bit.ivf Bosphorus_3840x2160_120fps_420_10bit_YUV_Y4M.y4m
aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=10 --input-bit-depth=10 --ivf -o Bosphorus_1080p_10bit.ivf Bosphorus_3840x2160_120fps_420_10bit_YUV_Y4M.y4m
aomenc --good --cpu-used=5 -w 3840 -h 2160 --bit-depth=10 --input-bit-depth=10 --ivf -o Bosphorus_2160p_10bit.ivf Bosphorus_3840x2160_120fps_420_10bit_YUV_Y4M.y4m
Edited by Arpad Panyik