- Jan 29, 2020
-
Henrik Gramner authored
Required for AVX-512.
-
Martin Storsjö authored
Before:                     Cortex A7      A8      A9     A53     A72     A73
ARM32:
cdef_filter_4x4_8bpc_neon:      964.6   599.5   707.9   601.2   465.1   405.2
cdef_filter_4x8_8bpc_neon:     1726.0  1066.2  1238.7  1041.7   798.6   725.3
cdef_filter_8x8_8bpc_neon:     2974.4  1671.8  1943.9  1806.1  1229.8  1242.1
ARM64:
cdef_filter_4x4_8bpc_neon:                              569.2   337.8   348.7
cdef_filter_4x8_8bpc_neon:                             1031.1   623.3   633.6
cdef_filter_8x8_8bpc_neon:                             1847.5  1097.7  1117.5
After:
ARM32:
cdef_filter_4x4_8bpc_neon:      798.4   524.2   617.3   506.8   432.4   361.1
cdef_filter_4x8_8bpc_neon:     1394.7   910.4  1054.0   863.6   730.2   632.2
cdef_filter_8x8_8bpc_neon:     2364.6  1453.8  1675.1  1466.0  1086.4  1107.7
ARM64:
cdef_filter_4x4_8bpc_neon:                              461.7   303.1   308.6
cdef_filter_4x8_8bpc_neon:                              833.0   547.5   556.0
cdef_filter_8x8_8bpc_neon:                             1459.3   934.1   967.9
-
Martin Storsjö authored
-
- Jan 28, 2020
-
- Jan 27, 2020
-
The main feature is splitting the main filter code into three different code paths depending on the strength values. Clipping is only required when both the primary and secondary strengths are non-zero, which is an uncommon case. Being able to skip that complexity in the common cases is significantly faster.
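A minimal C sketch of that dispatch, with hypothetical helper names (this is not dav1d's actual internal API):

    /* Each strength combination gets its own code path; only the case where
     * both strengths are non-zero pays for clipping. */
    static void filter_pri_sec_clip(int pri, int sec) { (void)pri; (void)sec; }
    static void filter_pri_only(int pri)              { (void)pri; }
    static void filter_sec_only(int sec)              { (void)sec; }

    static void cdef_filter_block(int pri_strength, int sec_strength)
    {
        if (pri_strength && sec_strength)
            filter_pri_sec_clip(pri_strength, sec_strength); /* uncommon case */
        else if (pri_strength)
            filter_pri_only(pri_strength);  /* no clipping needed */
        else
            filter_sec_only(sec_strength);  /* no clipping needed */
    }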
-
- Jan 25, 2020
-
If the primary strengths for both luma and chroma are zero the direction is always zero. If both the primary and secondary luma strengths are zero the entire luma filtering process is a no-op.
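Sketched in C with hypothetical names (dav1d's real code is structured differently):

    static int find_direction(void) { return 3; } /* stand-in for the real search */

    static int cdef_direction(int y_pri_strength, int uv_pri_strength)
    {
        if (!y_pri_strength && !uv_pri_strength)
            return 0;             /* direction is provably zero, skip the search */
        return find_direction();
    }

    static void cdef_filter_luma(int y_pri_strength, int y_sec_strength)
    {
        if (!y_pri_strength && !y_sec_strength)
            return;               /* the whole luma filter is a no-op */
        /* ... filter luma ... */
    }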
-
- Jan 21, 2020
-
Henrik Gramner authored
Required for the AVX-512 instructions added in Ice Lake.
-
Konstantin Pavlov authored
The image now includes nasm 2.14.02, which is needed to assemble AVX-512 code.
-
- Jan 20, 2020
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
It's a little easier for the CPU to simply overwrite a 32-bit register than to write its low 8 bits while preserving bits 8 to 31. To do the latter, it has to fetch those bits, merge them with the new low byte into a 32-bit value, and write that back to the 32-bit GPR. As those upper bits are always cleared here, perform a zero-extending mov to a dword instead.
-
Victorien Le Couviour--Tuffet authored
This allows AVX-512 code to issue vzeroupper automatically in RET when it is not specified in cglobal but later on, along with WIN64_SPILL_XMM.
-
- Jan 15, 2020
-
Martin Storsjö authored
Don't assume we can do a clipped negation in 16 bit before the multiplication (as it might affect the end result), but do the multiplication first and negate in 32 bit, just like in the reference.
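A C sketch of the corrected order of operations (a hypothetical helper, not the NEON code itself):

    #include <stdint.h>

    /* Multiply first in 32-bit precision, then negate; a saturated 16-bit
     * negation before the multiply could change the end result. */
    static int32_t mul_then_negate(int16_t a, int16_t mul)
    {
        const int32_t v = (int32_t)a * mul;
        return -v;
    }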
-
- Jan 14, 2020
-
- Jan 10, 2020
-
Ronald S. Bultje authored
gen_grain_y_ar0_8bpc_c:              84853.3
gen_grain_y_ar0_8bpc_ssse3:          23528.0
gen_grain_y_ar1_8bpc_c:             140775.5
gen_grain_y_ar1_8bpc_ssse3:          70410.2
gen_grain_y_ar2_8bpc_c:             251311.3
gen_grain_y_ar2_8bpc_ssse3:          95222.2
gen_grain_y_ar3_8bpc_c:             394763.0
gen_grain_y_ar3_8bpc_ssse3:         103541.9
gen_grain_uv_ar0_8bpc_420_c:         29773.7
gen_grain_uv_ar0_8bpc_420_ssse3:      7068.9
gen_grain_uv_ar1_8bpc_420_c:         46113.2
gen_grain_uv_ar1_8bpc_420_ssse3:     22148.1
gen_grain_uv_ar2_8bpc_420_c:         70061.4
gen_grain_uv_ar2_8bpc_420_ssse3:     25479.0
gen_grain_uv_ar3_8bpc_420_c:        113826.0
gen_grain_uv_ar3_8bpc_420_ssse3:     30004.9
fguv_32x32xn_8bpc_420_csfl0_c:        8148.9
fguv_32x32xn_8bpc_420_csfl0_ssse3:    1371.3
fguv_32x32xn_8bpc_420_csfl1_c:        6391.9
fguv_32x32xn_8bpc_420_csfl1_ssse3:    1034.8
fgy_32x32xn_8bpc_c:                  14201.3
fgy_32x32xn_8bpc_ssse3:               3443.0
-
Dale Curtis authored
dav1d_open() is part of the public API and should be sanitized; limit the sanitizer exemption to just the problematic dlsym() call.
-
CFI will SIGILL when calling a function pointer obtained through dlsym(), regardless of whether or not the signature is correct. See https://bugs.llvm.org/show_bug.cgi?id=44500
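A sketch of how such an exemption can be scoped (the symbol name is made up; the no_sanitize attribute is Clang's):

    #include <dlfcn.h>

    /* Only the indirect call through the dlsym() pointer is exempted from
     * CFI, rather than disabling the sanitizer for all of dav1d_open(). */
    __attribute__((no_sanitize("cfi")))
    static int call_via_dlsym(void *handle)
    {
        int (*fn)(void) = (int (*)(void))dlsym(handle, "some_symbol");
        return fn ? fn() : -1; /* this call would SIGILL under CFI otherwise */
    }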
-
- Jan 09, 2020
-
------------------------------------------
mct_bilinear_w4_0_8bpc_avx2:          3.8
mct_bilinear_w4_0_8bpc_avx512icl:     3.7
---------------------
mct_bilinear_w8_0_8bpc_avx2:          5.0
mct_bilinear_w8_0_8bpc_avx512icl:     4.8
---------------------
mct_bilinear_w16_0_8bpc_avx2:         8.5
mct_bilinear_w16_0_8bpc_avx512icl:    7.1
---------------------
mct_bilinear_w32_0_8bpc_avx2:        29.5
mct_bilinear_w32_0_8bpc_avx512icl:   17.1
---------------------
mct_bilinear_w64_0_8bpc_avx2:        68.1
mct_bilinear_w64_0_8bpc_avx512icl:   34.7
---------------------
mct_bilinear_w128_0_8bpc_avx2:      180.5
mct_bilinear_w128_0_8bpc_avx512icl: 138.0
------------------------------------------
mct_bilinear_w4_h_8bpc_avx2:          4.0
mct_bilinear_w4_h_8bpc_avx512icl:     3.9
---------------------
mct_bilinear_w8_h_8bpc_avx2:          5.3
mct_bilinear_w8_h_8bpc_avx512icl:     5.0
---------------------
mct_bilinear_w16_h_8bpc_avx2:        11.7
mct_bilinear_w16_h_8bpc_avx512icl:    7.5
---------------------
mct_bilinear_w32_h_8bpc_avx2:        41.8
mct_bilinear_w32_h_8bpc_avx512icl:   20.3
---------------------
mct_bilinear_w64_h_8bpc_avx2:        94.9
mct_bilinear_w64_h_8bpc_avx512icl:   35.0
---------------------
mct_bilinear_w128_h_8bpc_avx2:      240.1
mct_bilinear_w128_h_8bpc_avx512icl: 143.8
------------------------------------------
mct_bilinear_w4_v_8bpc_avx2:          4.1
mct_bilinear_w4_v_8bpc_avx512icl:     4.0
---------------------
mct_bilinear_w8_v_8bpc_avx2:          6.0
mct_bilinear_w8_v_8bpc_avx512icl:     5.4
---------------------
mct_bilinear_w16_v_8bpc_avx2:        10.3
mct_bilinear_w16_v_8bpc_avx512icl:    8.9
---------------------
mct_bilinear_w32_v_8bpc_avx2:        29.5
mct_bilinear_w32_v_8bpc_avx512icl:   25.9
---------------------
mct_bilinear_w64_v_8bpc_avx2:        64.3
mct_bilinear_w64_v_8bpc_avx512icl:   41.3
---------------------
mct_bilinear_w128_v_8bpc_avx2:      198.2
mct_bilinear_w128_v_8bpc_avx512icl: 139.6
------------------------------------------
mct_bilinear_w4_hv_8bpc_avx2:         5.6
mct_bilinear_w4_hv_8bpc_avx512icl:    5.2
---------------------
mct_bilinear_w8_hv_8bpc_avx2:         8.3
mct_bilinear_w8_hv_8bpc_avx512icl:    7.0
---------------------
mct_bilinear_w16_hv_8bpc_avx2:       19.4
mct_bilinear_w16_hv_8bpc_avx512icl:  12.1
---------------------
mct_bilinear_w32_hv_8bpc_avx2:       69.1
mct_bilinear_w32_hv_8bpc_avx512icl:  32.5
---------------------
mct_bilinear_w64_hv_8bpc_avx2:      164.4
mct_bilinear_w64_hv_8bpc_avx512icl:  71.1
---------------------
mct_bilinear_w128_hv_8bpc_avx2:     405.2
mct_bilinear_w128_hv_8bpc_avx512icl: 193.1
------------------------------------------
-
YMM and ZMM registers on x86 are turned off to save power when they haven't been used for some period of time. When they are used again, there is a "warmup" period during which performance is reduced and inconsistent, which is problematic when trying to benchmark individual functions. Periodically issue "dummy" instructions that use those registers to prevent them from being powered down. The end result is more consistent benchmark results. Credit to Henrik Gramner's commit 1878c7f2af0a9c73e291488209109782c428cfcf from x264.
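The idea, sketched with intrinsics (an illustration, not the actual checkasm change):

    #include <immintrin.h>

    /* Touching a YMM register between timed runs keeps the wide vector
     * units from powering down. Requires an AVX2 target. */
    static void simd_warmup(void)
    {
        volatile __m256i dummy = _mm256_setzero_si256();
        (void)dummy;
    }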
-
- Jan 08, 2020
-
The spec requires that the final residual values of a compliant bitstream must be representable in bitdepth+1 bits, which is not possible if intermediate values are large enough to be clipped.
-
The coefficients after the first (8-bit) 1D identity transform may require more than 16 bits of precision before downshifting in some cases, and they may also need to be clipped to int16_t after downshifting.
-
- When building dav1d as a dependency fallback, the parent project needs to know the variable name of a dependency object to use; dav1d_dep was added for that.
- SOURCE_ROOT is the root of the main project's source tree; use current_source_dir() instead.
-
- Jan 07, 2020
-
Needed to use UINT_MAX.
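For reference, UINT_MAX comes from the standard C header:

    #include <limits.h>  /* defines UINT_MAX */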
-
- Jan 05, 2020
-
Martin Storsjö authored
This gives small gains on A72 and A73, and on A53 on symbol_adapt16.

Before:                           Cortex A53    A72    A73
msac_decode_symbol_adapt4_neon:         63.2   52.8   53.3
msac_decode_symbol_adapt8_neon:         68.5   57.9   55.7
msac_decode_symbol_adapt16_neon:        92.8   59.7   62.8
After:
msac_decode_symbol_adapt4_neon:         63.3   48.3   50.0
msac_decode_symbol_adapt8_neon:         68.7   55.5   54.0
msac_decode_symbol_adapt16_neon:        88.6   58.8   60.0
-
- Jan 02, 2020
-
Martin Storsjö authored
-
Martin Storsjö authored
Make sure not to clip to a 16-bit range before the downshift is done. Add clipping to the 16-bit range in all other identity transforms, where there is no downshift.

4x4, 8x4 and 4x8 don't have any downshift, thus the existing code structure works fine. The identity transforms of size 32 are already special-cased with the downshift folded in where possible; clamping properly in them should be enough, as any out-of-range values will be clamped to pixel range in the end anyway.

Therefore we only need a special-cased identity in the first pass (to keep intermediates in 32 bit until downshifting) for 8x8 and all the size 16 variants.
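A C sketch of that first-pass handling (helper names hypothetical):

    #include <stdint.h>

    static int32_t iclip(int32_t v, int32_t lo, int32_t hi)
    {
        return v < lo ? lo : v > hi ? hi : v;
    }

    /* Keep the intermediate in 32 bit until after the downshift, then clamp
     * to the int16_t range that the second pass expects. */
    static int16_t idtx_first_pass(int32_t coef, int shift)
    {
        const int32_t t = (coef + (1 << (shift - 1))) >> shift;
        return (int16_t)iclip(t, INT16_MIN, INT16_MAX);
    }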
-
Martin Storsjö authored
Don't use the \() token concatenation operator in the .irp loops; if the function definition is enclosed in a .macro, we can't use \() in the loop as it is expanded already when the macro is expanded, before the loop is expanded.
-
Ronald S. Bultje authored
The second loop depends on the output of the first loop. Fixes #326.
-
- Jan 01, 2020
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
Fixes #327.
-
Ronald S. Bultje authored
For chroma coefficients that are masked (&= 0xfffff) to no value, the context ends up in a weird state where it has no magnitude (ctx & 0x3f == 0) but does have a sign (ctx & 0xc0 != 0x40). Our old code checked just the magnitude part of the context to set the skip context of neighbouring blocks, but libaom uses both sign and magnitude for this purpose. Therefore, adjust our code so it does the same thing. Luma code only checks magnitude for this purpose and is thus not affected by this peculiarity. Fixes #325.
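Using the masks quoted above, the adjusted check amounts to something like this (helper name hypothetical):

    /* A block only counts as all-zero for the neighbouring skip context when
     * both fields say so. Old check: (ctx & 0x3f) == 0 (magnitude only). */
    static int coef_ctx_is_zero(unsigned ctx)
    {
        return (ctx & 0x3f) == 0 &&   /* no magnitude */
               (ctx & 0xc0) == 0x40;  /* and no sign, matching libaom */
    }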
-
- Dec 31, 2019
-
- Dec 29, 2019
-
Ronald S. Bultje authored
Fixes #320. The problem here is that qidx=0 is (in libaom) a shortcut for lossless, which normally becomes WHT_WHT, but under some obscure conditions, it can also be non-lossless, in which case qidx=0 implies DCT_DCT for luma. For chroma, apparently we should use the default inference pattern, which becomes DCT_DCT for inter also, but requires the standard lookup table for intra.
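The luma rule described above, as a sketch (the enum values stand in for the real transform types):

    enum TxType { DCT_DCT, WHT_WHT };

    static enum TxType luma_txtype_at_qidx0(int lossless)
    {
        return lossless ? WHT_WHT : DCT_DCT; /* the obscure non-lossless case */
    }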
-
This is consistent with libaom's av1_superres_scaled(). Fixes #322.
-
- Dec 28, 2019
-
Ronald S. Bultje authored
Fixes the C part of #321.
-