Commits · master · Christophe Gisquet / dav1d

Jun 18, 2021
- x86: Fix warp_affine_8x8t_16bpc_ssse3 on 64-bit Windows + LLVM · b3d988fa
  Henrik Gramner authored 3 years ago and Henrik Gramner committed 3 years ago
```
The stack size calculation ended up being incorrect when the stack
alignment was larger than 16 due to auto-generated alignment padding.
```
  b3d988fa
- x86: Add bpc suffix to itx functions · 770c9c83
  Matthias Dressel authored 3 years ago
  
  770c9c83
Jun 17, 2021
- x86: Add high bitdepth warp8x8 SSSE3 asm · 3ff8f571
  Henrik Gramner authored 3 years ago
  
  3ff8f571
Jun 15, 2021

x86inc: Support memory operands in src1 in 3-operand instructions · e6497c2a

Henrik Gramner authored 3 years ago and Henrik Gramner committed 3 years ago

Particularly in code that makes heavy use of macros it's possible
to end up with 3-operand instructions with a memory operand in src1.
In the case of SSE this works fine due to automatic move insertions,
but in AVX that fails since memory operands are only allowed in src2.

The main purpose of this feature is to minimize the amount of code
changes required to facilitate conversion of existing SSE code to AVX.

e6497c2a

Disable TMVP code when disabled in sequence header · b52be250
Christophe Gisquet authored 3 years ago and Ronald S. Bultje committed 3 years ago
```
Speeds up decoding by ~4% when TMVP is disabled in the sequence header.
```
b52be250
Add SSSE3 HBD filmgrain assembly optimizations · af16b652
Ronald S. Bultje authored 3 years ago

af16b652

Jun 12, 2021

arm32: filmgrain: Add NEON implementation of fgy and fguv for 16 bpc · ddbbfde1

Martin Storsjö authored 3 years ago and Jean-Baptiste Kempf committed 3 years ago

Relative speedup over C code:
                                Cortex A7     A8     A9    A53    A72    A73
fguv_32x32xn_16bpc_420_csfl0_neon:   3.47   1.72   2.99   4.18   2.68   6.19
fguv_32x32xn_16bpc_420_csfl1_neon:   3.24   1.36   2.58   3.78   2.73   5.27
fguv_32x32xn_16bpc_422_csfl0_neon:   3.57   2.07   3.05   4.32   2.74   6.20
fguv_32x32xn_16bpc_422_csfl1_neon:   3.33   1.44   2.62   3.89   2.71   5.28
fguv_32x32xn_16bpc_444_csfl0_neon:   3.48   1.69   3.06   4.48   2.97   6.69
fguv_32x32xn_16bpc_444_csfl1_neon:   3.06   1.16   2.36   3.85   2.75   5.19
fgy_32x32xn_16bpc_neon:              2.89   1.05   2.29   3.49   2.49   3.15

Absolute numbers:
                                   Cortex A7        A8        A9       A53       A72       A73
fguv_32x32xn_16bpc_420_csfl0_neon:    6237.3   12701.0    6687.1    4525.8    3220.8    3195.4
fguv_32x32xn_16bpc_420_csfl1_neon:    5143.2   11684.8    5926.4    3857.2    2604.7    2556.5
fguv_32x32xn_16bpc_422_csfl0_neon:    6347.3   11005.2    6797.5    4582.4    3300.4    3250.5
fguv_32x32xn_16bpc_422_csfl1_neon:    5275.2   11594.8    5992.6    3931.1    2668.7    2607.3
fguv_32x32xn_16bpc_444_csfl0_neon:    5181.6   11310.0    5575.4    3629.7    2383.8    2530.0
fguv_32x32xn_16bpc_444_csfl1_neon:    4081.9   10958.8    4868.5    2962.9    1870.3    2034.2
fgy_32x32xn_16bpc_neon:              15439.1   43129.0   19406.6   11542.3    7463.9    7827.8

Corresponding numbers for arm64:
                                  Cortex A53       A72       A73
fguv_32x32xn_16bpc_420_csfl0_neon:    4019.2    3247.4    3259.6
fguv_32x32xn_16bpc_420_csfl1_neon:    3460.1    2628.7    2640.8
fguv_32x32xn_16bpc_422_csfl0_neon:    4034.4    3329.9    3287.5
fguv_32x32xn_16bpc_422_csfl1_neon:    3468.3    2749.3    2686.6
fguv_32x32xn_16bpc_444_csfl0_neon:    3117.7    2447.4    2539.8
fguv_32x32xn_16bpc_444_csfl1_neon:    2641.2    1977.2    2132.8
fgy_32x32xn_16bpc_neon:               9873.5    7605.7    7656.2

ddbbfde1

Jun 11, 2021

Add 10/12-bit deblock SSSE3 implementation · f7043e47
Ronald S. Bultje authored 3 years ago
```
Currently 64-bit only.
```
f7043e47

arm32: filmgrain: Add NEON implementations of fgy and fguv for 8 bpc · c187e704

Martin Storsjö authored 3 years ago

Relative speedup over C code:
                               Cortex A7     A8     A9    A53    A72    A73
fguv_32x32xn_8bpc_420_csfl0_neon:   4.20   2.19   3.48   4.93   3.60   5.93
fguv_32x32xn_8bpc_420_csfl1_neon:   3.92   1.52   2.84   4.34   3.82   5.93
fguv_32x32xn_8bpc_422_csfl0_neon:   4.27   2.13   3.58   5.02   4.04   5.95
fguv_32x32xn_8bpc_422_csfl1_neon:   3.99   1.56   2.91   4.43   3.89   6.00
fguv_32x32xn_8bpc_444_csfl0_neon:   4.48   2.08   3.89   5.66   4.07   6.51
fguv_32x32xn_8bpc_444_csfl1_neon:   4.45   1.41   2.99   5.28   3.63   6.09
fgy_32x32xn_8bpc_neon:              3.61   1.10   2.62   4.35   3.06   3.74

Absolute numbers:
                                  Cortex A7        A8        A9       A53       A72       A73
fguv_32x32xn_8bpc_420_csfl0_neon:    5318.8   11167.7    6024.6    3909.9    2945.2    2993.5
fguv_32x32xn_8bpc_420_csfl1_neon:    4351.0   10929.7    5269.5    3316.8    2166.5    2256.9
fguv_32x32xn_8bpc_422_csfl0_neon:    5387.9   11746.7    6080.0    3945.8    2988.1    3046.3
fguv_32x32xn_8bpc_422_csfl1_neon:    4396.0   11083.2    5300.8    3354.9    2216.4    2291.4
fguv_32x32xn_8bpc_444_csfl0_neon:    4347.9   10595.0    5134.4    3079.1    2277.7    2392.9
fguv_32x32xn_8bpc_444_csfl1_neon:    3295.0   10518.2    4442.6    2476.3    1716.3    1829.2
fgy_32x32xn_8bpc_neon:              12376.2   41046.9   17259.7    9153.1    6610.4    7005.3

Corresponding numbers for arm64:
                                 Cortex A53       A72       A73
fguv_32x32xn_8bpc_420_csfl0_neon:    3822.9    2920.0    2935.7
fguv_32x32xn_8bpc_420_csfl1_neon:    3209.7    2231.7    2335.4
fguv_32x32xn_8bpc_422_csfl0_neon:    3807.9    2886.5    2966.7
fguv_32x32xn_8bpc_422_csfl1_neon:    3197.1    2187.9    2355.9
fguv_32x32xn_8bpc_444_csfl0_neon:    2757.8    2227.4    2334.4
fguv_32x32xn_8bpc_444_csfl1_neon:    2244.6    1719.1    1786.7
fgy_32x32xn_8bpc_neon:               8192.2    6563.3    6969.1

c187e704

Jun 10, 2021

checkasm: Validate the benchmark call configurations even if not benchmarking · ea9c5afa

Martin Storsjö authored 3 years ago

This should help catch issues like the one fixed in
185194be, by making sure that we
call the benchmarked function at least once with the given parameters,
even if not benchmarking. Otherwise the benchmark codepath is
essentially dead untested code until somebody works on that piece
of code.

ea9c5afa
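
The idea in this commit message can be sketched in C as follows. This is a minimal illustration, not checkasm's actual macro: `filter_under_test`, `run_bench_config` and the iteration count are hypothetical names and values chosen for the example.

```c
/* Hypothetical stand-ins; none of these names come from checkasm itself. */
static int calls = 0;   /* counts invocations of the function under test */

static void filter_under_test(int w, int h) {
    (void)w;
    (void)h;
    calls++;
}

/* Run one benchmark configuration. When benchmarking, the function is
 * called many times; otherwise it is still called exactly once, so the
 * benchmark arguments are exercised on every test run. */
static void run_bench_config(int benchmarking, int w, int h) {
    const int iterations = benchmarking ? 1000 : 1;
    for (int i = 0; i < iterations; i++)
        filter_under_test(w, h);
}
```

With this shape, a bad benchmark configuration (for example out-of-range input) can fail in an ordinary test run instead of only when someone benchmarks.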

arm64: filmgrain16: Add a NEON implementation of fguv_32x32xn for 16 bpc · 5c5860ab

Martin Storsjö authored 3 years ago and Jean-Baptiste Kempf committed 3 years ago

Relative speedup over C code:
                               Cortex A53    A72    A73   Apple M1
fguv_32x32xn_16bpc_420_csfl0_neon:   4.57   2.08   3.57   7.61
fguv_32x32xn_16bpc_420_csfl1_neon:   4.92   2.89   3.96   4.26
fguv_32x32xn_16bpc_422_csfl0_neon:   4.59   2.14   3.61   5.88
fguv_32x32xn_16bpc_422_csfl1_neon:   4.92   2.90   3.90   5.00
fguv_32x32xn_16bpc_444_csfl0_neon:   3.64   1.89   2.86   4.72
fguv_32x32xn_16bpc_444_csfl1_neon:   3.59   2.26   2.76   3.22

5c5860ab

arm64: filmgrain: Back up and restore one register fewer in fguv 8bpc · 1e4472c8
Martin Storsjö authored 3 years ago and Jean-Baptiste Kempf committed 3 years ago

1e4472c8
arm64: filmgrain: Stray cosmetic fixes · f65d3271
Martin Storsjö authored 3 years ago and Jean-Baptiste Kempf committed 3 years ago

f65d3271

arm64: filmgrain: Do the right amount of gathers for subsampled fguv · 64926847

Martin Storsjö authored 3 years ago and Jean-Baptiste Kempf committed 3 years ago

Previously we did 32 gathers even though only 16 are
needed.

Before:                          Cortex A53      A72      A73   Apple M1
fguv_32x32xn_8bpc_420_csfl0_neon:    5352.1   3985.0   4068.9   8.3
fguv_32x32xn_8bpc_420_csfl1_neon:    4738.2   3297.8   3633.0   8.2
fguv_32x32xn_8bpc_422_csfl0_neon:    5386.0   4036.8   4093.5   8.3
fguv_32x32xn_8bpc_422_csfl1_neon:    4779.9   3392.6   3641.6   8.2
fguv_32x32xn_8bpc_444_csfl0_neon:    3068.4   2422.0   2436.5   4.9
fguv_32x32xn_8bpc_444_csfl1_neon:    2558.3   1908.4   1926.6   4.4
After:
fguv_32x32xn_8bpc_420_csfl0_neon:    4330.4   3118.5   3224.6   5.3
fguv_32x32xn_8bpc_420_csfl1_neon:    3731.8   2416.9   2619.6   4.7
fguv_32x32xn_8bpc_422_csfl0_neon:    4364.7   3129.3   3247.6   5.4
fguv_32x32xn_8bpc_422_csfl1_neon:    3762.5   2450.2   2661.8   4.7
fguv_32x32xn_8bpc_444_csfl0_neon:    3075.1   2376.4   2429.4   4.9
fguv_32x32xn_8bpc_444_csfl1_neon:    2564.5   1865.9   1952.8   4.4

64926847
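
A scalar sketch of why only 16 gathers are needed in the subsampled case: a 32-pixel-wide luma block covers 16 chroma pixels per row when chroma is horizontally subsampled (4:2:0 and 4:2:2), and 32 for 4:4:4. The function name and the simple table lookup are assumptions made for illustration; the NEON code is structured differently.

```c
#include <stdint.h>

/* Look up one row of scaling values and return how many lookups
 * ("gathers") were done. ss_x is 1 for horizontally subsampled chroma
 * (4:2:0, 4:2:2) and 0 for 4:4:4. Hypothetical helper, not dav1d code. */
static int gather_row(uint8_t *out, const uint8_t *src,
                      const uint8_t scaling[256], int ss_x) {
    const int n = 32 >> ss_x;   /* 16 when subsampled, 32 otherwise */
    for (int i = 0; i < n; i++)
        out[i] = scaling[src[i]];
    return n;
}
```

This matches the benchmark pattern above, where the 420 and 422 cases get faster and the 444 cases stay the same.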

Jun 09, 2021
- mc: add HBD/SSSE3 mc.emu_edge optimizations · 1156c044
  Ronald S. Bultje authored 3 years ago
  
  1156c044
- x86: Add high bitdepth wiener filter SSSE3 asm · 193db389
  Victorien Le Couviour--Tuffet authored 3 years ago
  
  193db389
Jun 07, 2021
- checkasm: Make sure that all pixels are in range before benchmarking · 185194be
  Martin Storsjö authored 3 years ago
```
The pixel data as initialized by the test above only have proper
pixels up to whatever random 'w' it used last.
```
  185194be
Jun 05, 2021
- checkasm: allow 1 <= h <= 2 in fgy_32x32xn unit test · e00e7411
  Ronald S. Bultje authored 3 years ago
  
  e00e7411
- arm64: filmgrain16: Use sqrdmulh for the scaling*grain multiplication · 3e044a7d
  Martin Storsjö authored 3 years ago
```
Before:               Cortex A53      A72      A73  Apple M1
fgy_32x32xn_16bpc_neon:  10396.8   8150.8   8718.3  19.5
After:
fgy_32x32xn_16bpc_neon:   9665.1   7558.8   7652.8  19.5
```
  3e044a7d
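
The commit above gives no detail on how `sqrdmulh` is applied, so the following is only a plausible scalar model. It assumes the scaling value is pre-shifted left by `15 - shift`, after which one saturating rounding doubling multiply yields the same result as a rounded right shift of the full product. The value ranges (scaling 0..255, shift 8..11, grain within 12 bits signed) are assumptions taken from AV1 film grain parameters; all function names are hypothetical.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of AArch64 SQRDMULH:
 * (2*a*b + 2^15) >> 16 with saturation. Relies on arithmetic right
 * shift of negative values, which mainstream compilers provide. */
static int16_t sqrdmulh_s16(int16_t a, int16_t b) {
    int64_t r = (2 * (int64_t)a * b + (1 << 15)) >> 16;
    if (r > INT16_MAX) r = INT16_MAX;   /* only when a == b == INT16_MIN */
    return (int16_t)r;
}

/* Reference: rounded right shift of the full-width product. */
static int round2(int x, int shift) {
    return (x + (1 << (shift - 1))) >> shift;
}

/* Same result from a single 16-bit multiply: pre-shift the scaling value
 * so that the implicit ">> 16" of SQRDMULH performs the rounding. */
static int noise_via_sqrdmulh(int scaling, int grain, int shift) {
    const int16_t scaled = (int16_t)(scaling << (15 - shift));
    return sqrdmulh_s16(scaled, (int16_t)grain);
}
```

The equivalence holds because `2 * (scaling << (15 - shift)) * grain + 2^15` equals `2^(16 - shift) * (scaling * grain + 2^(shift - 1))`, so shifting right by 16 is the same as `round2(scaling * grain, shift)`.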
Jun 04, 2021
- Do avx2/hbd scaling*grain multiplication in 16bit instead of 32bit · a8b13fc1
  Ronald S. Bultje authored 3 years ago
  
  a8b13fc1
Jun 02, 2021

checkasm: Remove an unused variable/parameter · 90dad3ee

Martin Storsjö authored 3 years ago

Clang 13 got support for warning about variables that are set but
not used. We disable warnings for unused parameters, but in this case,
the parameter variable is updated within the function too, which
Clang warns about.

90dad3ee
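
A minimal C illustration of the pattern described above. The function and parameter names are hypothetical, and the flag name in the comment (`-Wunused-but-set-parameter`) is my understanding of which Clang diagnostic is meant, not something the commit states.

```c
/* 'count' is assigned inside the function but its value is never read.
 * Suppressing unused-parameter warnings does not cover this case; Clang 13
 * can report it (as far as I know via -Wunused-but-set-parameter). */
static int sum_with_stray_param(const int *v, int n, int count) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    count = n;   /* set but never used */
    return s;
}

/* The fix sketched here simply drops the parameter. */
static int sum_fixed(const int *v, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}
```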

May 31, 2021
- x86: Add high bitdepth prep_8tap SSSE3 asm · d2a3f5b4
  Henrik Gramner authored 3 years ago and Henrik Gramner committed 3 years ago
  
  d2a3f5b4
- x86: Add high bitdepth put_8tap SSSE3 asm · 9e38fd56
  Henrik Gramner authored 3 years ago and Henrik Gramner committed 3 years ago
  
  9e38fd56
- x86: Add high bitdepth put_bilin/prep_bilin SSSE3 asm · e476d7cb
  Henrik Gramner authored 3 years ago and Henrik Gramner committed 3 years ago
  
  e476d7cb
- x86: Add high bitdepth blend/blend_v/blend_h SSSE3 asm · 9f706f45
  Henrik Gramner authored 3 years ago and Henrik Gramner committed 3 years ago
  
  9f706f45
- x86: Add high bitdepth w_mask SSSE3 asm · fb037c75
  Henrik Gramner authored 3 years ago and Henrik Gramner committed 3 years ago
  
  fb037c75
- x86: Add high bitdepth avg/w_avg/mask SSSE3 asm · d0b0d587
  Henrik Gramner authored 3 years ago and Henrik Gramner committed 3 years ago
  
  d0b0d587
May 27, 2021
- Move #dav1d to Libera.chat · 41be890d
  Vibhoothi authored 3 years ago
  
  41be890d
May 25, 2021

arm64: filmgrain: Fix overflows in gen_grain · c389d895

Martin Storsjö authored 3 years ago

After multiplying two int8_t, the maximum possible output is
-128*-128 = 16384. One can't add two such values in an int16_t (even if
all the products of all other int8_t combinations can be).

Previously the summing used 16 bit intermediates for the sum of two
products and only lengthened the result to 32 bit when accumulating
three or more products.

Before:                    Cortex A53       A72       A73   Apple M1
gen_grain_y_ar1_8bpc_neon:   112598.5   71309.2   74889.8   372.2
gen_grain_y_ar2_8bpc_neon:   139932.4   91442.3   95788.4   387.3
gen_grain_y_ar3_8bpc_neon:   185607.6  115691.6  131655.8   403.0
After:
gen_grain_y_ar1_8bpc_neon:   112968.8   71897.9   76171.2   371.2
gen_grain_y_ar2_8bpc_neon:   142768.8   94517.9   97934.4   387.5
gen_grain_y_ar3_8bpc_neon:   191625.2  121083.0  135975.3   405.6

c389d895
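
The arithmetic in this commit message can be checked with a few lines of C. The helper names are hypothetical; the point is only that one product of two `int8_t` values always fits in 16 bits, while the sum of two such products may not.

```c
#include <stdint.h>

/* Sum of two int8_t products, computed in 32 bits so it cannot overflow. */
static int32_t sum_of_two_products(int8_t a, int8_t b, int8_t c, int8_t d) {
    return (int32_t)a * b + (int32_t)c * d;
}

/* Whether a value is representable as int16_t. */
static int fits_int16(int32_t v) {
    return v >= INT16_MIN && v <= INT16_MAX;
}
```

For example, `sum_of_two_products(-128, -128, -128, -128)` is 32768, one more than `INT16_MAX` (32767), so a 16-bit intermediate would wrap.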

May 18, 2021
- x86: itx: Add 10/12-bit SSE2 WHT · c54add02
  Matthias Dressel authored 3 years ago
  
  c54add02
May 16, 2021
- Update some copyright dates to 2021 · 8636b4f2
  Jean-Baptiste Kempf authored 3 years ago
  
  0.9.0
  
  8636b4f2
May 14, 2021
- arm64: filmgrain16: Simplify constructing the constant 0x0fff · 24036ca3
  Martin Storsjö authored 3 years ago
```
Use the mvni instruction instead of setting the constant in a GPR
first.
```
  24036ca3
May 13, 2021

x86: itx: Add 12-bit wht · 477cc158
Matthias Dressel authored 3 years ago

477cc158
Update NEWS with a mention to the new event flag API · a823252b
James Almer authored 3 years ago

a823252b

arm64: filmgrain16: Guard against out of range pixels in the gather function · 3aac0252

Martin Storsjö authored 3 years ago and Jean-Baptiste Kempf committed 3 years ago

In 16 bpc, the pixels are 16 bit integers, but valid pixels are
only up to 12 bits wide, and the scaling buffer only contains 4096
elements.

The src pixels are, normally, supposed to be valid pixels, but when
processing blocks of 32 pixels at a time, it can operate on
uninitialized pixels past the right edge.

Before:               Cortex A53      A72      A73  Apple M1
fgy_32x32xn_16bpc_neon:  10372.5   8194.4   8612.1  24.2
After:
fgy_32x32xn_16bpc_neon:  10837.9   8469.5   8885.1  24.6

3aac0252
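
One way to guard such a lookup, sketched in scalar C. The commit message does not say whether the NEON code masks or clamps the index; masking with 0x0fff is an assumption here, and `gather_scaling` is a hypothetical name.

```c
#include <stdint.h>

#define SCALING_SIZE 4096   /* one entry per valid 12-bit pixel value */

/* Keep only the low 12 bits of the pixel so that uninitialized data past
 * the right edge can never index outside the scaling buffer. */
static uint8_t gather_scaling(const uint8_t scaling[SCALING_SIZE], uint16_t px) {
    return scaling[px & (SCALING_SIZE - 1)];
}
```

Valid pixels (0..4095) are unaffected by the mask; garbage values yield an arbitrary but in-bounds entry, which is harmless for pixels past the edge.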

On the road to 0.9.0 · 1cf1b309
Jean-Baptiste Kempf authored 3 years ago

1cf1b309

May 12, 2021

arm64: filmgrain: Add a NEON implementation of fgy_32x32xn for 16 bpc · f75854cb

Martin Storsjö authored 3 years ago

Relative speedup over C code:

                    Cortex A53    A72    A73   Apple M1
fgy_32x32xn_16bpc_neon:   3.87   2.28   2.78   3.45

f75854cb

Only call GetModuleHandle and AddVectoredExceptionHandler when targeting Windows Desktop · 0ced2f9e

Martin Storsjö authored 3 years ago

Don't call them when targeting e.g. UWP.

This requires building with a new enough SDK that does have the
winapifamily.h header (and that it's included implicitly by regular
platform headers); it's been available since the Windows 8.0 SDK
(and since mingw-w64 v3.0.0) so it should be safe.

Also rewrite the GetProcAddress call to avoid calling it if
GetModuleHandleW(L"kernel32.dll") would return NULL for some reason.

0ced2f9e
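
A rough sketch of the guard pattern this commit describes. The `TARGET_WINDOWS_DESKTOP` macro and the `has_kernel32_symbol` helper are hypothetical names for this example; `WINAPI_FAMILY_PARTITION` and `WINAPI_PARTITION_DESKTOP` come from `winapifamily.h`. On non-Windows or non-desktop targets the helper simply reports that the symbol is unavailable.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
/* winapifamily.h is pulled in by the platform headers on recent SDKs. */
#ifdef WINAPI_FAMILY_PARTITION
#if WINAPI_FAMILY_PARTITION(WINAPI_PARTITION_DESKTOP)
#define TARGET_WINDOWS_DESKTOP 1
#endif
#endif
#endif

/* Returns 1 if kernel32.dll exports 'name', 0 otherwise. The desktop-only
 * APIs are not referenced at all when targeting e.g. UWP. */
static int has_kernel32_symbol(const char *name) {
#ifdef TARGET_WINDOWS_DESKTOP
    HMODULE kernel32 = GetModuleHandleW(L"kernel32.dll");
    if (!kernel32)
        return 0;   /* skip GetProcAddress if the handle is NULL */
    return GetProcAddress(kernel32, name) != NULL;
#else
    (void)name;
    return 0;
#endif
}
```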

May 11, 2021
- x86: add 10/12-bpc AVX2 version of mc.emu_edge · d16ddb34
  Ronald S. Bultje authored 3 years ago
  
  d16ddb34
May 10, 2021
- x86: Add high bitdepth filmgrain AVX2 asm · 3a663070
  Ronald S. Bultje authored 4 years ago and Henrik Gramner committed 3 years ago
  
  3a663070