Commits · 3d98a242a055438ca76020434a530ebe074fa892 · Nick Galloway / dav1d

Mar 22, 2024
- x86: Add 6-tap variants of high bit-depth mc AVX2 functions · 3d98a242
  Henrik Gramner authored 1 year ago
  
  3d98a242
- x86: Add minor high bit-depth mc 8-tap AVX2 improvements · b3323a8c
  Henrik Gramner authored 1 year ago
  
  b3323a8c
Mar 21, 2024
- x86: Add 6-tap variants of 8bpc mc AVX2 functions · 9849ede1
  Henrik Gramner authored 1 year ago
```
6-tap filters are sufficient in the majority of cases, and are
quite a bit faster than the equivalent 8-tap filters.
```
  9849ede1
- x86: Add minor 8bpc mc 8-tap AVX2 improvements · 02c2033a
  Henrik Gramner authored 1 year ago
  
  02c2033a
Mar 18, 2024

arm64: Use different instruction sequence for taking global address with HWASan · 8e084264

Peter Collingbourne authored 1 year ago and Martin Storsjö committed 1 year ago

When dav1d is built with HWASan, the build fails because globals are
tagged and the normal adrp/add instruction sequence does not have
enough range to take the tagged address. Therefore, use an alternative
instruction sequence when HWASan is enabled, which is the same as
what the compiler generates.
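
The range problem described above can be illustrated with some address arithmetic. This is only a sketch, not dav1d code: it assumes that adrp encodes a signed 21-bit offset counted in 4 KiB pages (roughly +/-4 GiB around the current instruction) and that HWASan stores its tag in the top byte of the address. The helper names `adrp_reachable` and `hwasan_tag` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: adrp encodes a signed 21-bit offset counted in 4 KiB pages,
 * so the target page must lie within [-2^20, 2^20) pages of the PC page. */
static int adrp_reachable(uint64_t pc, uint64_t target) {
    const int64_t page_delta = (int64_t)(target >> 12) - (int64_t)(pc >> 12);
    return page_delta >= -(INT64_C(1) << 20) && page_delta < (INT64_C(1) << 20);
}

/* Assumption: the HWASan tag occupies address bits 56..63.
 * Expects an untagged address as input. */
static uint64_t hwasan_tag(uint64_t addr, uint8_t tag) {
    assert((addr >> 56) == 0);
    return addr | ((uint64_t)tag << 56);
}
```

With a hypothetical code address of 0x400000 and a global at 0x500000, the untagged global is reachable, while the same global carrying any non-zero tag is far outside the range, which is why a different instruction sequence is needed.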

8e084264

Mar 15, 2024
- x86: Update x86inc.asm · 645da277
  Henrik Gramner authored 1 year ago
```
x86inc.asm@8494a52b
x86inc.asm@04f14f43
```
  645da277
- ci: Make checkasm work on the x86-32 build · 8b461668
  Henrik Gramner authored 1 year ago
  
  8b461668
Mar 09, 2024
- NEWS: Forgotten intro sentence · 872e470e
  Jean-Baptiste Kempf authored 1 year ago
  
  872e470e
Mar 08, 2024
- Update to 1.4.1 · 162fb6d8
  Jean-Baptiste Kempf authored 1 year ago
  
  162fb6d8
- Update THANKS.md · b9312c8d
  Matthias Dressel authored 1 year ago
  
  b9312c8d
- arm32: Fix right shifts in the 16bpc iwht implementation · 024b260c
  Martin Storsjö authored 1 year ago
```
These shifts used the wrong element size; this was only noticed in
some argon tests.
```
  024b260c
- arm32/msac: Trim C functions, saves 1024 bytes · 0fff614a
  Nathan E. Egge authored 1 year ago
  
  0fff614a
- arm64/msac: Trim C functions, saves 1392 bytes · b9f53330
  Nathan E. Egge authored 1 year ago
  
  b9f53330
- arm: Use -fno-align-functions when building · b5b394cd
  Nathan E. Egge authored 1 year ago
```
arm32: 2 byte alignment saves 136 bytes
arm64: 4 byte alignment saves 1200 bytes
```
  b5b394cd
- arm32/itx: Trim dav1d_inv_wht4_1d_c, saves 68 bytes · 61d16e07
  Nathan E. Egge authored 1 year ago
  
  61d16e07
- arm64/itx: Trim dav1d_inv_wht4_1d_c, saves 92 bytes · 485413b0
  Nathan E. Egge authored 1 year ago
  
  485413b0
- arm32/itx16: Add 4x4 12bpc NEON wht_wht transform · ec695854
  Nathan E. Egge authored 1 year ago
```
When -Dtrim_dsp=true, this commit saves 740 bytes.

inv_txfm_add_4x4_wht_wht_0_12bpc_c:       192.4 ( 1.00x)
inv_txfm_add_4x4_wht_wht_0_12bpc_neon:     46.1 ( 4.17x)
inv_txfm_add_4x4_wht_wht_1_12bpc_c:       192.4 ( 1.00x)
inv_txfm_add_4x4_wht_wht_1_12bpc_neon:     45.7 ( 4.21x)
```
  ec695854
- arm64/itx16: Add 4x4 12bpc NEON wht_wht transform · 3b852b15
  Nathan E. Egge authored 1 year ago
```
When -Dtrim_dsp=true, this commit saves 940 bytes.

inv_txfm_add_4x4_wht_wht_0_12bpc_c:       145.2 ( 1.00x)
inv_txfm_add_4x4_wht_wht_0_12bpc_neon:     42.9 ( 3.39x)
inv_txfm_add_4x4_wht_wht_1_12bpc_c:       145.4 ( 1.00x)
inv_txfm_add_4x4_wht_wht_1_12bpc_neon:     42.9 ( 3.39x)
```
  3b852b15
Mar 07, 2024
- x86: Fix out-of-bounds read in 8bpc SSE2/SSSE3 wiener_filter · 006ca01d
  Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago
```
When decoding a stream with a width of less than 4 pixels this could
cause a segfault if the frame buffer was allocated on a page boundary.
```
  006ca01d
Mar 05, 2024

AArch64: Specialise HBD Neon convolutions for 6-tap filters · 932b323c

Arpad Panyik authored 1 year ago and Martin Storsjö committed 1 year ago

The 8-tap sub-pel filters used for motion vector interpolation are:
regular, smooth, sharp. The regular and smooth filter kernels are
zero-padded, so they are effectively 6-tap filters (some of them are
5-tap or even 4-tap).

This patch specialises the high bit-depth versions of the put_8tap_neon
and prep_8tap_neon functions for 6-tap filters, avoiding a lot of
redundant multiplications by zero and additions of zero. Wherever sharp
filtering is used, the 8-tap path is always selected.
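
The equivalence this relies on can be sketched in scalar C. This is not the Neon implementation and not dav1d's filter tables: the two kernels below are illustrative values that sum to 64, and the function names are hypothetical. The point is only that when the outer taps are zero, a 6-tap loop gives the same sum while reading two fewer samples.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative kernels (assumed values, not copied from dav1d). */
static const int8_t example_padded[8] = {  0, 1, -7, 38, 38, -7, 1,  0 };
static const int8_t example_sharp[8]  = { -1, 3, -8, 38, 38, -8, 3, -1 };

/* Reference 8-tap filter: the taps cover src[-3] .. src[4]. */
static int filter_8tap(const uint8_t *src, const int8_t f[8]) {
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += f[i] * src[i - 3];
    return sum;
}

/* True when the two outermost taps are zero padding. */
static int is_6tap(const int8_t f[8]) {
    return f[0] == 0 && f[7] == 0;
}

/* Specialised path: skips the outer taps and reads only src[-2] .. src[3]. */
static int filter_6tap(const uint8_t *src, const int8_t f[8]) {
    assert(is_6tap(f));
    int sum = 0;
    for (int i = 1; i < 7; i++)
        sum += f[i] * src[i - 3];
    return sum;
}
```

For a row of samples 10, 20, ..., 80 with `src` pointing at the fourth sample, both functions return the same sum for `example_padded`, whereas `example_sharp` fails the `is_6tap` check and has to use the 8-tap loop.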

Benchmarks show an FPS uplift of 0.5-10.8%, depending heavily on the
input video source. The binary size increases by ~8.5 KiB.

932b323c

AArch64: Optimize 6-tap SBD HV Neon convolution · b0a329d6

Arpad Panyik authored 1 year ago and Arpad Panyik committed 1 year ago

Optimize the 6-tap standard bit-depth horizontal-vertical combined
convolution to avoid unnecessary reads and horizontal convolution
steps at the beginning and end of the algorithm. This also saves some
instructions in the final binary.

Performance of this function increases by up to 5.5% depending on
block size.

b0a329d6

Mar 04, 2024

checkasm: aarch64: Print the SVE vector length, if available · fd60097e
Martin Storsjö authored 1 year ago

fd60097e

aarch64: Check for assembler support for various aarch64 extensions · e1f80dec

Martin Storsjö authored 1 year ago

First check if the assembler supports the ".arch" directive, and
what architecture levels are supported.

In principle, we'd only need to check for support for ".arch armv8.2-a",
since that's enough for enabling the i8mm and sve2 extensions.

However, recent Clang versions (before version 17) weren't able to
enable the dotprod and i8mm extensions via the ".arch_extension"
directive, so check for support for armv8.4-a and armv8.6-a as well,
which enable dotprod and i8mm implicitly.

This allows assembling these instructions on most commonly available
GCC and Clang based toolchains, while still allowing toggling support
for the instruction sets on and off within the source files.

Within assembly, we disable these extensions by default, so that
instructions enabled within these extension sets can't be used
by accident in unintended functions. Code meaning to use these
extensions can be assembled like this:

    #if HAVE_SVE
    ENABLE_SVE
    // code
    DISABLE_SVE
    #endif

e1f80dec

Feb 29, 2024

checkasm: Add --list-cpuflags option · 85a10359

Henrik Gramner authored 1 year ago

Prints a list of cpuflags available for the current architecture.

Flags which are supported on the current system will be printed in
green, and flags which are unsupported in red with a ~ prefix.
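
A minimal sketch of how such a listing entry could be formatted follows. It is not checkasm's code: the `CpuFlagEntry` type, the `format_flag` helper and the choice of ANSI escape sequences are assumptions made for illustration, and the exact output format of the real option may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define COLOR_GREEN "\x1b[32m"
#define COLOR_RED   "\x1b[31m"
#define COLOR_RESET "\x1b[0m"

/* Hypothetical table entry: a flag name and the bit that represents it. */
typedef struct {
    const char *name;
    unsigned bit;
} CpuFlagEntry;

/* Writes the flag name in green if its bit is set in supported_mask,
 * otherwise in red with a '~' prefix. Returns the snprintf() result. */
static int format_flag(char *buf, size_t size, const CpuFlagEntry *e,
                       unsigned supported_mask) {
    assert(buf && e && size > 0);
    const int supported = (supported_mask & e->bit) != 0;
    return snprintf(buf, size, "%s%s%s%s",
                    supported ? COLOR_GREEN : COLOR_RED,
                    supported ? "" : "~",
                    e->name, COLOR_RESET);
}
```

Keeping the formatting in a helper that writes to a buffer makes it straightforward to check the output without a terminal.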

85a10359

Feb 28, 2024
- ci: Add an aarch64 cross compile CI job with a recent Clang · 0d2e83cc
  Martin Storsjö authored 1 year ago
  
  0d2e83cc
- ci: Test aarch64 with QEMU, with varying SVE vector lengths · 302334a6
  Martin Storsjö authored 1 year ago
```
This allows testing all modern aarch64 CPU features that the
HW-based test runners might not support.

Especially for SVE, this allows testing all valid vector lengths,
which might not exist in hardware form yet.
```
  302334a6
- ci: Bump to the latest dav1d-debian-unstable image · 39be9fb4
  Martin Storsjö authored 1 year ago
```
This one contains aarch64 cross tools, for use with QEMU.
```
  39be9fb4
- Extend Arm and AArch64 run-time CPU feature detection · acc1121d
  Arpad Panyik authored 1 year ago and Martin Storsjö committed 1 year ago
```
Add run-time CPU feature detection for DotProd, i8mm, SVE and SVE2.
SVE and SVE2 are AArch64-only features.
```
  acc1121d
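
One common way to do this kind of run-time detection on Linux is to read the hwcap words with `getauxval()`. The sketch below covers only Linux on AArch64 and is not taken from the commit: the `SKETCH_*` names are hypothetical, and the bit positions are the ones assumed here for the Linux AArch64 hwcaps (ASIMDDP and SVE in `AT_HWCAP`, SVE2 and I8MM in `AT_HWCAP2`).

```c
#include <assert.h>

/* Hypothetical flag bits for this sketch (not dav1d's own values). */
enum {
    SKETCH_FLAG_DOTPROD = 1 << 0,
    SKETCH_FLAG_I8MM    = 1 << 1,
    SKETCH_FLAG_SVE     = 1 << 2,
    SKETCH_FLAG_SVE2    = 1 << 3,
};

/* Assumed Linux AArch64 hwcap bit positions. */
#define SKETCH_HWCAP_ASIMDDP (1UL << 20)
#define SKETCH_HWCAP_SVE     (1UL << 22)
#define SKETCH_HWCAP2_SVE2   (1UL << 1)
#define SKETCH_HWCAP2_I8MM   (1UL << 13)

/* Pure decoding step, kept separate so it can be tested on any host. */
static unsigned flags_from_hwcap(unsigned long hwcap, unsigned long hwcap2) {
    unsigned flags = 0;
    if (hwcap  & SKETCH_HWCAP_ASIMDDP) flags |= SKETCH_FLAG_DOTPROD;
    if (hwcap  & SKETCH_HWCAP_SVE)     flags |= SKETCH_FLAG_SVE;
    if (hwcap2 & SKETCH_HWCAP2_SVE2)   flags |= SKETCH_FLAG_SVE2;
    if (hwcap2 & SKETCH_HWCAP2_I8MM)   flags |= SKETCH_FLAG_I8MM;
    return flags;
}

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
/* Queries the kernel-provided hwcap words on Linux/AArch64. */
static unsigned detect_flags(void) {
    return flags_from_hwcap(getauxval(AT_HWCAP), getauxval(AT_HWCAP2));
}
#else
/* Other hosts: report nothing rather than misreading unrelated bits. */
static unsigned detect_flags(void) {
    return 0;
}
#endif
```

Other operating systems expose the same information through different interfaces, so a real implementation needs one detection path per platform.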
Feb 27, 2024

riscv64/itx: Add 16x16 8bpc eob test · b7963a73

Nathan E. Egge authored 1 year ago

Kendryte K230                                      Before            After

inv_txfm_add_16x16_adst_adst_0_8bpc_rvv: 1804.9 (8.45x) 1374.3 (11.18x)
inv_txfm_add_16x16_adst_adst_1_8bpc_rvv: 1805.2 (8.45x) 1374.3 (11.17x)
inv_txfm_add_16x16_adst_dct_0_8bpc_rvv: 1626.6 (8.92x) 1185.8 (12.22x)
inv_txfm_add_16x16_adst_dct_1_8bpc_rvv: 1626.5 (8.91x) 1185.9 (12.22x)
inv_txfm_add_16x16_adst_flipadst_0_8bpc_rvv: 1824.2 (8.38x) 1372.1 (11.22x)
inv_txfm_add_16x16_adst_flipadst_1_8bpc_rvv: 1824.2 (8.37x) 1372.2 (11.21x)
inv_txfm_add_16x16_dct_adst_0_8bpc_rvv: 1627.3 (8.94x) 1283.5 (11.29x)
inv_txfm_add_16x16_dct_adst_1_8bpc_rvv: 1627.2 (8.95x) 1283.2 (11.29x)
inv_txfm_add_16x16_dct_dct_0_8bpc_rvv: 1449.3 (1.08x) 1095.2 ( 1.44x)
inv_txfm_add_16x16_dct_dct_1_8bpc_rvv: 1449.1 (9.52x) 1095.1 (12.45x)
inv_txfm_add_16x16_dct_flipadst_0_8bpc_rvv: 1643.0 (8.87x) 1283.5 (11.29x)
inv_txfm_add_16x16_dct_flipadst_1_8bpc_rvv: 1643.3 (8.87x) 1283.3 (11.30x)
inv_txfm_add_16x16_dct_identity_0_8bpc_rvv: 1155.4 (9.23x) 805.9 (13.17x)
inv_txfm_add_16x16_dct_identity_1_8bpc_rvv: 1155.4 (9.24x) 805.9 (13.17x)
inv_txfm_add_16x16_flipadst_adst_0_8bpc_rvv: 1812.2 (8.43x) 1370.9 (11.23x)
inv_txfm_add_16x16_flipadst_adst_1_8bpc_rvv: 1811.7 (8.44x) 1370.8 (11.24x)
inv_txfm_add_16x16_flipadst_dct_0_8bpc_rvv: 1637.2 (8.88x) 1190.8 (12.19x)
inv_txfm_add_16x16_flipadst_dct_1_8bpc_rvv: 1637.6 (8.87x) 1190.9 (12.19x)
inv_txfm_add_16x16_flipadst_flipadst_0_8bpc_rvv: 1831.1 (8.34x) 1374.7 (11.21x)
inv_txfm_add_16x16_flipadst_flipadst_1_8bpc_rvv: 1830.8 (8.35x) 1374.5 (11.22x)
inv_txfm_add_16x16_identity_dct_0_8bpc_rvv: 1156.2 (8.67x) 948.6 (10.49x)
inv_txfm_add_16x16_identity_dct_1_8bpc_rvv: 1156.3 (8.68x) 948.6 (10.49x)
inv_txfm_add_16x16_identity_identity_0_8bpc_rvv: 879.3 (7.81x) 673.5 (10.28x)
inv_txfm_add_16x16_identity_identity_1_8bpc_rvv: 879.3 (7.81x) 673.5 (10.28x)

b7963a73

riscv64/itx: Add 8x16 8bpc eob test · 70122512

Nathan E. Egge authored 1 year ago

Kendryte K230                                      Before            After

inv_txfm_add_8x16_adst_adst_0_8bpc_rvv: 853.9 ( 9.00x) 698.3 (11.03x)
inv_txfm_add_8x16_adst_adst_1_8bpc_rvv: 853.8 ( 9.00x) 698.3 (11.03x)
inv_txfm_add_8x16_adst_dct_0_8bpc_rvv: 763.0 ( 9.55x) 609.2 (12.00x)
inv_txfm_add_8x16_adst_dct_1_8bpc_rvv: 763.1 ( 9.55x) 609.3 (11.94x)
inv_txfm_add_8x16_adst_flipadst_0_8bpc_rvv: 857.1 ( 8.99x) 701.6 (11.00x)
inv_txfm_add_8x16_adst_flipadst_1_8bpc_rvv: 856.8 ( 8.98x) 701.3 (10.97x)
inv_txfm_add_8x16_adst_identity_0_8bpc_rvv: 622.9 ( 9.22x) 468.5 (12.36x)
inv_txfm_add_8x16_adst_identity_1_8bpc_rvv: 622.9 ( 9.23x) 468.6 (12.37x)
inv_txfm_add_8x16_dct_adst_0_8bpc_rvv: 770.1 ( 9.32x) 655.1 (10.93x)
inv_txfm_add_8x16_dct_adst_1_8bpc_rvv: 770.1 ( 9.34x) 655.4 (10.93x)
inv_txfm_add_8x16_dct_dct_0_8bpc_rvv: 679.8 ( 1.23x) 566.1 ( 1.48x)
inv_txfm_add_8x16_dct_dct_1_8bpc_rvv: 679.8 ( 9.98x) 566.5 (11.89x)
inv_txfm_add_8x16_dct_flipadst_0_8bpc_rvv: 771.1 ( 9.34x) 667.4 (10.75x)
inv_txfm_add_8x16_dct_flipadst_1_8bpc_rvv: 771.1 ( 9.34x) 667.3 (10.76x)
inv_txfm_add_8x16_dct_identity_0_8bpc_rvv: 532.3 ( 9.84x) 422.1 (12.42x)
inv_txfm_add_8x16_dct_identity_1_8bpc_rvv: 532.4 ( 9.85x) 422.2 (12.40x)
inv_txfm_add_8x16_flipadst_adst_0_8bpc_rvv: 858.4 ( 8.98x) 699.2 (11.03x)
inv_txfm_add_8x16_flipadst_adst_1_8bpc_rvv: 858.5 ( 8.98x) 699.3 (11.03x)
inv_txfm_add_8x16_flipadst_dct_0_8bpc_rvv: 768.6 ( 9.52x) 609.7 (11.97x)
inv_txfm_add_8x16_flipadst_dct_1_8bpc_rvv: 768.4 ( 9.52x) 609.6 (11.97x)
inv_txfm_add_8x16_flipadst_flipadst_0_8bpc_rvv: 866.5 ( 8.91x) 706.5 (10.92x)
inv_txfm_add_8x16_flipadst_flipadst_1_8bpc_rvv: 866.4 ( 8.92x) 706.6 (10.95x)
inv_txfm_add_8x16_flipadst_identity_0_8bpc_rvv: 621.9 ( 9.28x) 464.6 (12.46x)
inv_txfm_add_8x16_flipadst_identity_1_8bpc_rvv: 621.8 ( 9.28x) 464.6 (12.46x)
inv_txfm_add_8x16_identity_adst_0_8bpc_rvv: 584.9 ( 9.78x) 564.1 (10.12x)
inv_txfm_add_8x16_identity_adst_1_8bpc_rvv: 584.8 ( 9.78x) 563.9 (10.12x)
inv_txfm_add_8x16_identity_dct_0_8bpc_rvv: 495.0 (10.75x) 474.6 (11.13x)
inv_txfm_add_8x16_identity_dct_1_8bpc_rvv: 494.3 (10.75x) 474.7 (11.12x)
inv_txfm_add_8x16_identity_flipadst_0_8bpc_rvv: 588.1 ( 9.76x) 568.1 (10.07x)
inv_txfm_add_8x16_identity_flipadst_1_8bpc_rvv: 588.7 ( 9.74x) 568.0 (10.07x)
inv_txfm_add_8x16_identity_identity_0_8bpc_rvv: 349.5 (10.78x) 328.8 (11.46x)
inv_txfm_add_8x16_identity_identity_1_8bpc_rvv: 349.4 (10.79x) 328.7 (11.46x)

70122512

riscv64/itx: Add 4x16 8bpc eob test · afeeb3cc

Nathan E. Egge authored 1 year ago

Kendryte K230                                      Before            After

inv_txfm_add_4x16_adst_adst_0_8bpc_rvv: 429.9 (7.45x) 381.3 (8.58x)
inv_txfm_add_4x16_adst_adst_1_8bpc_rvv: 430.0 (7.45x) 381.3 (8.57x)
inv_txfm_add_4x16_adst_dct_0_8bpc_rvv: 381.0 (8.01x) 332.5 (9.19x)
inv_txfm_add_4x16_adst_dct_1_8bpc_rvv: 381.0 (8.00x) 332.5 (9.19x)
inv_txfm_add_4x16_adst_flipadst_0_8bpc_rvv: 432.8 (7.42x) 384.5 (8.52x)
inv_txfm_add_4x16_adst_flipadst_1_8bpc_rvv: 432.8 (7.42x) 384.4 (8.52x)
inv_txfm_add_4x16_adst_identity_0_8bpc_rvv: 304.6 (7.32x) 249.8 (9.18x)
inv_txfm_add_4x16_adst_identity_1_8bpc_rvv: 304.5 (7.32x) 249.8 (9.18x)
inv_txfm_add_4x16_dct_adst_0_8bpc_rvv: 407.2 (7.68x) 371.4 (8.57x)
inv_txfm_add_4x16_dct_adst_1_8bpc_rvv: 407.1 (7.68x) 371.5 (8.58x)
inv_txfm_add_4x16_dct_dct_0_8bpc_rvv: 357.9 (1.27x) 323.1 (1.41x)
inv_txfm_add_4x16_dct_dct_1_8bpc_rvv: 357.9 (8.29x) 322.9 (9.16x)
inv_txfm_add_4x16_dct_flipadst_0_8bpc_rvv: 410.0 (7.62x) 376.6 (8.45x)
inv_txfm_add_4x16_dct_flipadst_1_8bpc_rvv: 410.0 (7.62x) 376.5 (8.47x)
inv_txfm_add_4x16_dct_identity_0_8bpc_rvv: 275.2 (7.79x) 240.5 (9.21x)
inv_txfm_add_4x16_dct_identity_1_8bpc_rvv: 275.3 (7.78x) 240.6 (9.19x)
inv_txfm_add_4x16_flipadst_adst_0_8bpc_rvv: 430.5 (7.51x) 382.6 (8.60x)
inv_txfm_add_4x16_flipadst_adst_1_8bpc_rvv: 430.1 (7.52x) 382.8 (8.60x)
inv_txfm_add_4x16_flipadst_dct_0_8bpc_rvv: 381.1 (8.09x) 333.8 (9.21x)
inv_txfm_add_4x16_flipadst_dct_1_8bpc_rvv: 381.0 (8.08x) 333.7 (9.21x)
inv_txfm_add_4x16_flipadst_flipadst_0_8bpc_rvv: 433.0 (7.48x) 385.7 (8.55x)
inv_txfm_add_4x16_flipadst_flipadst_1_8bpc_rvv: 433.0 (7.48x) 385.7 (8.55x)
inv_txfm_add_4x16_flipadst_identity_0_8bpc_rvv: 298.6 (7.57x) 250.8 (9.28x)
inv_txfm_add_4x16_flipadst_identity_1_8bpc_rvv: 298.6 (7.57x) 250.9 (9.27x)
inv_txfm_add_4x16_identity_adst_0_8bpc_rvv: 361.5 (7.93x) 347.3 (8.35x)
inv_txfm_add_4x16_identity_adst_1_8bpc_rvv: 361.4 (7.93x) 347.4 (8.35x)
inv_txfm_add_4x16_identity_dct_0_8bpc_rvv: 310.9 (8.69x) 297.8 (9.02x)
inv_txfm_add_4x16_identity_dct_1_8bpc_rvv: 311.0 (8.69x) 297.8 (9.02x)
inv_txfm_add_4x16_identity_flipadst_0_8bpc_rvv: 364.1 (7.88x) 350.5 (8.29x)
inv_txfm_add_4x16_identity_flipadst_1_8bpc_rvv: 364.2 (7.88x) 350.4 (8.31x)
inv_txfm_add_4x16_identity_identity_0_8bpc_rvv: 229.7 (8.22x) 211.4 (9.11x)
inv_txfm_add_4x16_identity_identity_1_8bpc_rvv: 229.7 (8.21x) 211.2 (9.12x)

afeeb3cc

Feb 26, 2024
- riscv/checkasm: Print the RVV vector length, if available · 52948bbf
  Nathan E. Egge authored 1 year ago
  
  52948bbf
- arm/msac: Enable NEON optimizations on more platforms · 8c209190
  Nathan E. Egge authored 1 year ago
```
This commit enables msac NEON assembly optimizations when building with
MSVC targeting ARM.
Note that the test for __APPLE__ is redundant and was added for consistency.
```
  8c209190
Feb 22, 2024

CI: Add riscv64 clang build · 9d57a654
Matthias Dressel authored 1 year ago

9d57a654

CI: Update image · bada810c
Matthias Dressel authored 1 year ago
```
Now contains clang 17.
```
bada810c

gcovr: Fix config file · 91ddba0b

Matthias Dressel authored 1 year ago

gcovr 7.0 fixed a config file parsing bug [0].
Valid options are 'all', 'negative_hits.warn',
'negative_hits.warn_once_per_file'.

[0] https://github.com/gcovr/gcovr/pull/816

91ddba0b

riscv64/itx: Fix build issues with clang · 2ab2ec38
Nathan E. Egge authored 1 year ago

2ab2ec38

x86inc: Fix warnings with old nasm versions · 36184ce0
Henrik Gramner authored 1 year ago and Henrik Gramner committed 1 year ago

36184ce0

AArch64: Enable benchmarks for 8-tap sharp filters · f1d42ae8

Arpad Panyik authored 1 year ago

The 6-tap sub-pel filter specialisation uses different code paths for
sharp (8-tap) and regular/smooth (6-tap) filtering kernels.

This patch enables benchmarking for the different code paths.

f1d42ae8

AArch64: Specialise Neon convolutions for 6-tap filters · e51f4377

Arpad Panyik authored 1 year ago

The 8-tap sub-pel filters used for motion vector interpolation are:
regular, smooth, sharp. The regular and smooth filter kernels are
zero-padded, so they are effectively 6-tap filters (some of them are
5-tap or even 4-tap).

This patch specialises the put_8tap_neon and prep_8tap_neon functions
for 6-tap filters, avoiding a lot of redundant multiplications by zero
and additions of zero. Wherever sharp filtering is used, the 8-tap path
is always selected.
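
The path selection described above might look like the following in C. This is a hedged sketch rather than the actual dispatch code: the enum and function names are hypothetical, and it assumes that a block takes the 8-tap path as soon as either the horizontal or the vertical filter is sharp.

```c
#include <assert.h>

/* Hypothetical names for the three sub-pel filter types. */
enum FilterType { FILTER_REGULAR, FILTER_SMOOTH, FILTER_SHARP };

/* The two code paths a block can take. */
enum ConvPath { PATH_6TAP, PATH_8TAP };

/* Assumption: sharp filtering in either direction forces the 8-tap path;
 * regular and smooth kernels are zero-padded and can use the 6-tap path. */
static enum ConvPath select_path(enum FilterType h, enum FilterType v) {
    assert(h <= FILTER_SHARP && v <= FILTER_SHARP);
    return (h == FILTER_SHARP || v == FILTER_SHARP) ? PATH_8TAP : PATH_6TAP;
}
```

Under this assumption, four of the nine horizontal/vertical combinations (those using only regular and smooth) take the faster path.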

Benchmarking this on a broad range of recent CPUs shows a 7-15% FPS
uplift.

Get raw sample video:
https://ultravideo.fi/video/Bosphorus_1920x1080_120fps_420_8bit_YUV_RAW.7z

Encode using:
aomenc --good --cpu-used=5 -w 1920 -h 1080 --bit-depth=8 --ivf -o Bosphorus_1080p_8bit.ivf Bosphorus_1920x1080_120fps_420_8bit_YUV.y4m

e51f4377