- Mar 22, 2024
-
-
Henrik Gramner authored
-
Henrik Gramner authored
-
- Mar 21, 2024
-
-
Henrik Gramner authored
6-tap filters are sufficient in the majority of cases and are quite a bit faster than the equivalent 8-tap filters.
-
Henrik Gramner authored
-
- Mar 18, 2024
-
-
When dav1d is built with HWASan, the build fails because globals are tagged and the normal adrp/add instruction sequence does not have enough range to take the tagged address. Therefore, use an alternative instruction sequence when HWASan is enabled, which is the same as what the compiler generates.
-
- Mar 15, 2024
-
-
-
Henrik Gramner authored
-
- Mar 09, 2024
-
-
Jean-Baptiste Kempf authored
-
- Mar 08, 2024
-
-
Jean-Baptiste Kempf authored
-
Matthias Dressel authored
-
Martin Storsjö authored
These shifts used the wrong element size; this was only noticed in some Argon tests.
-
Nathan E. Egge authored
-
Nathan E. Egge authored
-
Nathan E. Egge authored
arm32: 2 byte alignment saves 136 bytes
arm64: 4 byte alignment saves 1200 bytes
-
Nathan E. Egge authored
-
Nathan E. Egge authored
-
Nathan E. Egge authored
When -Dtrim_dsp=true, this commit saves 740 bytes.

inv_txfm_add_4x4_wht_wht_0_12bpc_c:     192.4 ( 1.00x)
inv_txfm_add_4x4_wht_wht_0_12bpc_neon:   46.1 ( 4.17x)
inv_txfm_add_4x4_wht_wht_1_12bpc_c:     192.4 ( 1.00x)
inv_txfm_add_4x4_wht_wht_1_12bpc_neon:   45.7 ( 4.21x)
-
Nathan E. Egge authored
When -Dtrim_dsp=true, this commit saves 940 bytes.

inv_txfm_add_4x4_wht_wht_0_12bpc_c:     145.2 ( 1.00x)
inv_txfm_add_4x4_wht_wht_0_12bpc_neon:   42.9 ( 3.39x)
inv_txfm_add_4x4_wht_wht_1_12bpc_c:     145.4 ( 1.00x)
inv_txfm_add_4x4_wht_wht_1_12bpc_neon:   42.9 ( 3.39x)
-
- Mar 07, 2024
-
-
When decoding a stream with a width of less than 4 pixels, this could cause a segfault if the frame buffer was allocated on a page boundary.
-
- Mar 05, 2024
-
-
The 8-tap sub-pel filters used for motion vector interpolation are: regular, smooth, sharp. The regular and smooth filter kernels are zero-padded, so they are effectively 6-tap filters (some of them are 5-tap or even 4-tap). This patch specialises the high bit-depth versions of the put_8tap_neon and prep_8tap_neon functions for 6-tap filters, avoiding a lot of redundant work to multiply by and add zero. Wherever sharp filtering is used, the 8-tap path is always selected. Benchmarks show an FPS uplift of 0.5-10.8%, depending heavily on the input video source. Binary size increase is ~8.5 KiB.
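The idea of skipping zero-padded taps can be sketched in plain C. This is only an illustration, not the NEON code from the patch: `example_kernel`, `example_row`, `filter_8tap` and `filter_6tap` are hypothetical names, and the kernel values are made up so that the outer taps are zero and the taps sum to 64; they are not claimed to match dav1d's filter tables.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical kernel: outer taps are zero, taps sum to 64. */
static const int8_t example_kernel[8] = { 0, 2, -10, 40, 40, -10, 2, 0 };

/* Hypothetical row of 8-bit samples used to compare both paths. */
static const uint8_t example_row[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };

/* Generic 8-tap path: taps cover src[-3] .. src[4]. */
static int filter_8tap(const uint8_t *src, const int8_t *k)
{
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += k[i] * src[i - 3];
    return sum;
}

/* 6-tap path: only valid when both outer taps are zero.
 * It reads src[-2] .. src[3] and skips two multiply-adds. */
static int filter_6tap(const uint8_t *src, const int8_t *k)
{
    assert(k[0] == 0 && k[7] == 0);
    int sum = 0;
    for (int i = 1; i < 7; i++)
        sum += k[i] * src[i - 3];
    return sum;
}
```

For a kernel whose outer taps are zero, both functions return the same sum, which is why a dispatcher can pick the shorter path for such kernels and keep the 8-tap path for kernels that use all eight taps.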
-
Optimize the 6-tap standard bit-depth horizontal-vertical combined convolution to avoid unnecessary reads and horizontal convolution steps at the beginning and end of the algorithm. This also saves some instructions in the final binary. Performance of this function increases by up to 5.5% depending on block size.
-
- Mar 04, 2024
-
-
Martin Storsjö authored
-
Martin Storsjö authored
First check if the assembler supports the ".arch" directive, and what architecture levels are supported.

In principle, we'd only need to check for support for ".arch armv8.2-a", since that's enough for enabling the i8mm and sve2 extensions. However, recent Clang versions (before version 17) weren't able to enable the dotprod and i8mm extensions via the ".arch_extension" directives, so check for support for armv8.4-a and armv8.6-a as well, which enable dotprod and i8mm implicitly.

This allows assembling these instructions on most commonly available GCC and Clang based toolchains, while still allowing toggling support for the instruction sets on and off within the source files.

Within assembly, we disable these extensions by default, so that instructions enabled within these extension sets can't be used by accident in unintended functions. Code meaning to use these extensions can be assembled like this:

    #if HAVE_SVE
    ENABLE_SVE
    // code
    DISABLE_SVE
    #endif
-
- Feb 29, 2024
-
-
Henrik Gramner authored
Prints a list of cpuflags available for the current architecture. Flags which are supported on the current system will be printed in green, and flags which are unsupported in red with a ~ prefix.
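A minimal sketch of how one entry of such a list could be formatted, assuming ANSI colour escape sequences are used for the green and red output (the description above only names the colours). `format_cpu_flag` and the `SKETCH_*` macros are hypothetical names for this example, not identifiers from the project.

```c
#include <stdio.h>

/* ANSI SGR colour sequences; their use here is an assumption. */
#define SKETCH_GREEN "\x1b[32m"
#define SKETCH_RED   "\x1b[31m"
#define SKETCH_RESET "\x1b[0m"

/* Writes one flag name into out: green if supported on this system,
 * otherwise red with a '~' prefix. Returns snprintf's result, i.e. the
 * number of characters the full entry needs (excluding the terminator). */
static int format_cpu_flag(char *out, size_t size, const char *name,
                           int supported)
{
    return snprintf(out, size, "%s%s%s" SKETCH_RESET,
                    supported ? SKETCH_GREEN : SKETCH_RED,
                    supported ? "" : "~",
                    name);
}
```

Calling this once per known flag and printing the results separated by spaces would give a listing in the style described; the flag names themselves would come from whatever table the architecture provides.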
-
- Feb 28, 2024
-
-
Martin Storsjö authored
-
Martin Storsjö authored
This allows testing all modern aarch64 CPU features, including those that the hardware-based test runners might not support. Especially for SVE, this allows testing all valid vector lengths, which might not exist in hardware form yet.
-
Martin Storsjö authored
This one contains aarch64 cross tools, for use with QEMU.
-
Add run-time CPU feature detection for DotProd, i8mm, SVE and SVE2. SVE and SVE2 are AArch64-only features.
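On AArch64 Linux, one common way to do this kind of detection is to inspect the hwcap words reported by the kernel. The sketch below only shows the bit-to-flag mapping and is not taken from the commit: the `FLAG_*` values and `flags_from_hwcap` are hypothetical, and the `SKETCH_HWCAP*` constants are copies of the AArch64 Linux bit positions as I understand them, duplicated here so that the example builds on any host.

```c
/* Hypothetical flag bits for this sketch; the project's own names and
 * values may differ. */
enum {
    FLAG_DOTPROD = 1 << 0,
    FLAG_I8MM    = 1 << 1,
    FLAG_SVE     = 1 << 2,
    FLAG_SVE2    = 1 << 3,
};

/* Assumed AArch64 Linux hwcap bit positions (HWCAP_ASIMDDP, HWCAP_SVE,
 * HWCAP2_SVE2, HWCAP2_I8MM), copied locally for portability. */
#define SKETCH_HWCAP_ASIMDDP (1UL << 20)
#define SKETCH_HWCAP_SVE     (1UL << 22)
#define SKETCH_HWCAP2_SVE2   (1UL << 1)
#define SKETCH_HWCAP2_I8MM   (1UL << 13)

/* Translates the two hwcap words into the sketch's flag bits. */
static unsigned flags_from_hwcap(unsigned long hwcap, unsigned long hwcap2)
{
    unsigned flags = 0;
    if (hwcap & SKETCH_HWCAP_ASIMDDP) flags |= FLAG_DOTPROD;
    if (hwcap2 & SKETCH_HWCAP2_I8MM)  flags |= FLAG_I8MM;
    if (hwcap & SKETCH_HWCAP_SVE)     flags |= FLAG_SVE;
    if (hwcap2 & SKETCH_HWCAP2_SVE2)  flags |= FLAG_SVE2;
    return flags;
}
```

On a real AArch64 Linux system the inputs would typically come from `getauxval(AT_HWCAP)` and `getauxval(AT_HWCAP2)` in `<sys/auxv.h>`; other operating systems expose the same information through different interfaces.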
-
- Feb 27, 2024
-
-
Nathan E. Egge authored
Kendryte K230
                                                  Before          After
inv_txfm_add_16x16_adst_adst_0_8bpc_rvv:          1804.9 (8.45x)  1374.3 (11.18x)
inv_txfm_add_16x16_adst_adst_1_8bpc_rvv:          1805.2 (8.45x)  1374.3 (11.17x)
inv_txfm_add_16x16_adst_dct_0_8bpc_rvv:           1626.6 (8.92x)  1185.8 (12.22x)
inv_txfm_add_16x16_adst_dct_1_8bpc_rvv:           1626.5 (8.91x)  1185.9 (12.22x)
inv_txfm_add_16x16_adst_flipadst_0_8bpc_rvv:      1824.2 (8.38x)  1372.1 (11.22x)
inv_txfm_add_16x16_adst_flipadst_1_8bpc_rvv:      1824.2 (8.37x)  1372.2 (11.21x)
inv_txfm_add_16x16_dct_adst_0_8bpc_rvv:           1627.3 (8.94x)  1283.5 (11.29x)
inv_txfm_add_16x16_dct_adst_1_8bpc_rvv:           1627.2 (8.95x)  1283.2 (11.29x)
inv_txfm_add_16x16_dct_dct_0_8bpc_rvv:            1449.3 (1.08x)  1095.2 ( 1.44x)
inv_txfm_add_16x16_dct_dct_1_8bpc_rvv:            1449.1 (9.52x)  1095.1 (12.45x)
inv_txfm_add_16x16_dct_flipadst_0_8bpc_rvv:       1643.0 (8.87x)  1283.5 (11.29x)
inv_txfm_add_16x16_dct_flipadst_1_8bpc_rvv:       1643.3 (8.87x)  1283.3 (11.30x)
inv_txfm_add_16x16_dct_identity_0_8bpc_rvv:       1155.4 (9.23x)   805.9 (13.17x)
inv_txfm_add_16x16_dct_identity_1_8bpc_rvv:       1155.4 (9.24x)   805.9 (13.17x)
inv_txfm_add_16x16_flipadst_adst_0_8bpc_rvv:      1812.2 (8.43x)  1370.9 (11.23x)
inv_txfm_add_16x16_flipadst_adst_1_8bpc_rvv:      1811.7 (8.44x)  1370.8 (11.24x)
inv_txfm_add_16x16_flipadst_dct_0_8bpc_rvv:       1637.2 (8.88x)  1190.8 (12.19x)
inv_txfm_add_16x16_flipadst_dct_1_8bpc_rvv:       1637.6 (8.87x)  1190.9 (12.19x)
inv_txfm_add_16x16_flipadst_flipadst_0_8bpc_rvv:  1831.1 (8.34x)  1374.7 (11.21x)
inv_txfm_add_16x16_flipadst_flipadst_1_8bpc_rvv:  1830.8 (8.35x)  1374.5 (11.22x)
inv_txfm_add_16x16_identity_dct_0_8bpc_rvv:       1156.2 (8.67x)   948.6 (10.49x)
inv_txfm_add_16x16_identity_dct_1_8bpc_rvv:       1156.3 (8.68x)   948.6 (10.49x)
inv_txfm_add_16x16_identity_identity_0_8bpc_rvv:   879.3 (7.81x)   673.5 (10.28x)
inv_txfm_add_16x16_identity_identity_1_8bpc_rvv:   879.3 (7.81x)   673.5 (10.28x)
-
Nathan E. Egge authored
Kendryte K230
                                                 Before          After
inv_txfm_add_8x16_adst_adst_0_8bpc_rvv:          853.9 ( 9.00x)  698.3 (11.03x)
inv_txfm_add_8x16_adst_adst_1_8bpc_rvv:          853.8 ( 9.00x)  698.3 (11.03x)
inv_txfm_add_8x16_adst_dct_0_8bpc_rvv:           763.0 ( 9.55x)  609.2 (12.00x)
inv_txfm_add_8x16_adst_dct_1_8bpc_rvv:           763.1 ( 9.55x)  609.3 (11.94x)
inv_txfm_add_8x16_adst_flipadst_0_8bpc_rvv:      857.1 ( 8.99x)  701.6 (11.00x)
inv_txfm_add_8x16_adst_flipadst_1_8bpc_rvv:      856.8 ( 8.98x)  701.3 (10.97x)
inv_txfm_add_8x16_adst_identity_0_8bpc_rvv:      622.9 ( 9.22x)  468.5 (12.36x)
inv_txfm_add_8x16_adst_identity_1_8bpc_rvv:      622.9 ( 9.23x)  468.6 (12.37x)
inv_txfm_add_8x16_dct_adst_0_8bpc_rvv:           770.1 ( 9.32x)  655.1 (10.93x)
inv_txfm_add_8x16_dct_adst_1_8bpc_rvv:           770.1 ( 9.34x)  655.4 (10.93x)
inv_txfm_add_8x16_dct_dct_0_8bpc_rvv:            679.8 ( 1.23x)  566.1 ( 1.48x)
inv_txfm_add_8x16_dct_dct_1_8bpc_rvv:            679.8 ( 9.98x)  566.5 (11.89x)
inv_txfm_add_8x16_dct_flipadst_0_8bpc_rvv:       771.1 ( 9.34x)  667.4 (10.75x)
inv_txfm_add_8x16_dct_flipadst_1_8bpc_rvv:       771.1 ( 9.34x)  667.3 (10.76x)
inv_txfm_add_8x16_dct_identity_0_8bpc_rvv:       532.3 ( 9.84x)  422.1 (12.42x)
inv_txfm_add_8x16_dct_identity_1_8bpc_rvv:       532.4 ( 9.85x)  422.2 (12.40x)
inv_txfm_add_8x16_flipadst_adst_0_8bpc_rvv:      858.4 ( 8.98x)  699.2 (11.03x)
inv_txfm_add_8x16_flipadst_adst_1_8bpc_rvv:      858.5 ( 8.98x)  699.3 (11.03x)
inv_txfm_add_8x16_flipadst_dct_0_8bpc_rvv:       768.6 ( 9.52x)  609.7 (11.97x)
inv_txfm_add_8x16_flipadst_dct_1_8bpc_rvv:       768.4 ( 9.52x)  609.6 (11.97x)
inv_txfm_add_8x16_flipadst_flipadst_0_8bpc_rvv:  866.5 ( 8.91x)  706.5 (10.92x)
inv_txfm_add_8x16_flipadst_flipadst_1_8bpc_rvv:  866.4 ( 8.92x)  706.6 (10.95x)
inv_txfm_add_8x16_flipadst_identity_0_8bpc_rvv:  621.9 ( 9.28x)  464.6 (12.46x)
inv_txfm_add_8x16_flipadst_identity_1_8bpc_rvv:  621.8 ( 9.28x)  464.6 (12.46x)
inv_txfm_add_8x16_identity_adst_0_8bpc_rvv:      584.9 ( 9.78x)  564.1 (10.12x)
inv_txfm_add_8x16_identity_adst_1_8bpc_rvv:      584.8 ( 9.78x)  563.9 (10.12x)
inv_txfm_add_8x16_identity_dct_0_8bpc_rvv:       495.0 (10.75x)  474.6 (11.13x)
inv_txfm_add_8x16_identity_dct_1_8bpc_rvv:       494.3 (10.75x)  474.7 (11.12x)
inv_txfm_add_8x16_identity_flipadst_0_8bpc_rvv:  588.1 ( 9.76x)  568.1 (10.07x)
inv_txfm_add_8x16_identity_flipadst_1_8bpc_rvv:  588.7 ( 9.74x)  568.0 (10.07x)
inv_txfm_add_8x16_identity_identity_0_8bpc_rvv:  349.5 (10.78x)  328.8 (11.46x)
inv_txfm_add_8x16_identity_identity_1_8bpc_rvv:  349.4 (10.79x)  328.7 (11.46x)
-
Nathan E. Egge authored
Kendryte K230
                                                 Before         After
inv_txfm_add_4x16_adst_adst_0_8bpc_rvv:          429.9 (7.45x)  381.3 (8.58x)
inv_txfm_add_4x16_adst_adst_1_8bpc_rvv:          430.0 (7.45x)  381.3 (8.57x)
inv_txfm_add_4x16_adst_dct_0_8bpc_rvv:           381.0 (8.01x)  332.5 (9.19x)
inv_txfm_add_4x16_adst_dct_1_8bpc_rvv:           381.0 (8.00x)  332.5 (9.19x)
inv_txfm_add_4x16_adst_flipadst_0_8bpc_rvv:      432.8 (7.42x)  384.5 (8.52x)
inv_txfm_add_4x16_adst_flipadst_1_8bpc_rvv:      432.8 (7.42x)  384.4 (8.52x)
inv_txfm_add_4x16_adst_identity_0_8bpc_rvv:      304.6 (7.32x)  249.8 (9.18x)
inv_txfm_add_4x16_adst_identity_1_8bpc_rvv:      304.5 (7.32x)  249.8 (9.18x)
inv_txfm_add_4x16_dct_adst_0_8bpc_rvv:           407.2 (7.68x)  371.4 (8.57x)
inv_txfm_add_4x16_dct_adst_1_8bpc_rvv:           407.1 (7.68x)  371.5 (8.58x)
inv_txfm_add_4x16_dct_dct_0_8bpc_rvv:            357.9 (1.27x)  323.1 (1.41x)
inv_txfm_add_4x16_dct_dct_1_8bpc_rvv:            357.9 (8.29x)  322.9 (9.16x)
inv_txfm_add_4x16_dct_flipadst_0_8bpc_rvv:       410.0 (7.62x)  376.6 (8.45x)
inv_txfm_add_4x16_dct_flipadst_1_8bpc_rvv:       410.0 (7.62x)  376.5 (8.47x)
inv_txfm_add_4x16_dct_identity_0_8bpc_rvv:       275.2 (7.79x)  240.5 (9.21x)
inv_txfm_add_4x16_dct_identity_1_8bpc_rvv:       275.3 (7.78x)  240.6 (9.19x)
inv_txfm_add_4x16_flipadst_adst_0_8bpc_rvv:      430.5 (7.51x)  382.6 (8.60x)
inv_txfm_add_4x16_flipadst_adst_1_8bpc_rvv:      430.1 (7.52x)  382.8 (8.60x)
inv_txfm_add_4x16_flipadst_dct_0_8bpc_rvv:       381.1 (8.09x)  333.8 (9.21x)
inv_txfm_add_4x16_flipadst_dct_1_8bpc_rvv:       381.0 (8.08x)  333.7 (9.21x)
inv_txfm_add_4x16_flipadst_flipadst_0_8bpc_rvv:  433.0 (7.48x)  385.7 (8.55x)
inv_txfm_add_4x16_flipadst_flipadst_1_8bpc_rvv:  433.0 (7.48x)  385.7 (8.55x)
inv_txfm_add_4x16_flipadst_identity_0_8bpc_rvv:  298.6 (7.57x)  250.8 (9.28x)
inv_txfm_add_4x16_flipadst_identity_1_8bpc_rvv:  298.6 (7.57x)  250.9 (9.27x)
inv_txfm_add_4x16_identity_adst_0_8bpc_rvv:      361.5 (7.93x)  347.3 (8.35x)
inv_txfm_add_4x16_identity_adst_1_8bpc_rvv:      361.4 (7.93x)  347.4 (8.35x)
inv_txfm_add_4x16_identity_dct_0_8bpc_rvv:       310.9 (8.69x)  297.8 (9.02x)
inv_txfm_add_4x16_identity_dct_1_8bpc_rvv:       311.0 (8.69x)  297.8 (9.02x)
inv_txfm_add_4x16_identity_flipadst_0_8bpc_rvv:  364.1 (7.88x)  350.5 (8.29x)
inv_txfm_add_4x16_identity_flipadst_1_8bpc_rvv:  364.2 (7.88x)  350.4 (8.31x)
inv_txfm_add_4x16_identity_identity_0_8bpc_rvv:  229.7 (8.22x)  211.4 (9.11x)
inv_txfm_add_4x16_identity_identity_1_8bpc_rvv:  229.7 (8.21x)  211.2 (9.12x)
-
- Feb 26, 2024
-
-
Nathan E. Egge authored
-
Nathan E. Egge authored
This commit enables msac NEON assembly optimizations when building with MSVC targeting ARM. Note that the test for __APPLE__ is redundant; it is added for consistency.
-
- Feb 22, 2024
-
-
Matthias Dressel authored
-
Matthias Dressel authored
Now contains clang 17.
-
Matthias Dressel authored
gcovr 7.0 fixed a config file parsing bug [0]. Valid options are 'all', 'negative_hits.warn', 'negative_hits.warn_once_per_file'.

[0] https://github.com/gcovr/gcovr/pull/816
-
Nathan E. Egge authored
-
-
Arpad Panyik authored
The 6-tap sub-pel filter specialisation uses different code paths for sharp (8-tap) and regular/smooth (6-tap) filtering kernels. This patch enables benchmarking for the different code paths.
-
Arpad Panyik authored
The 8-tap sub-pel filters used for motion vector interpolation are: regular, smooth, sharp. The regular and smooth filter kernels are zero-padded, so they are effectively 6-tap filters (some of them are 5-tap or even 4-tap). This patch specialises the put_8tap_neon and prep_8tap_neon functions for 6-tap filters, avoiding a lot of redundant work to multiply by and add zero. Wherever sharp filtering is used, the 8-tap path is always selected.

Benchmarking this on a broad range of recent CPUs shows a 7-15% FPS uplift.

Get raw sample video:
    https://ultravideo.fi/video/Bosphorus_1920x1080_120fps_420_8bit_YUV_RAW.7z

Encode using:
    aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=8 --ivf -o Bosphorus_1080p_8bit.ivf Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m
-