- Apr 21, 2020
-
-
Purely a cosmetic change.
-
Testing shows that those code paths are essentially never executed with real-world bitstreams, so they just add redundant branches, increase code size, and add complexity for no actual benefit.
-
- Apr 16, 2020
-
-
Victorien Le Couviour--Tuffet authored
Those need to be aligned when w*h >= 64, as we will try to load 64 bytes at a time. (This also realigns the 4x4 masks to 16, as 32-byte alignment is unnecessary for them.)
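For illustration, a minimal sketch of the alignment rule described above, using hypothetical mask tables rather than dav1d's actual symbols: a full 64-byte (zmm-width) aligned load only stays within bounds and fault-free if the table itself starts on a 64-byte boundary, while a table that is only ever read 16 bytes at a time needs no more than 16-byte alignment.

    #include <stdalign.h>
    #include <stdint.h>

    /* Hypothetical mask tables for illustration (not dav1d's real names). */
    alignas(64) static const uint8_t wide_mask[64] = {0}; /* read with full 64-byte loads */
    alignas(16) static const uint8_t mask_4x4[16]  = {0}; /* only ever read 16 bytes at a time */

    /* With AVX-512 enabled, an aligned 64-byte load such as
     * _mm512_load_si512(wide_mask) would fault if wide_mask were not
     * 64-byte aligned, hence alignas(64) above. */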
-
- Apr 11, 2020
-
-
Matthias Dressel authored
-
- Apr 10, 2020
-
-
Janne Grunau authored
Memory sanitizer depends on compiler instrumentation which makes it inherently incompatible with asm DSP functions. Refs #336
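For context, MSan only tracks initialization through code the compiler instruments, so values produced by hand-written asm look uninitialized to it. A minimal sketch of the usual compile-time detection (the macro name is hypothetical, and dav1d's actual handling may differ, e.g. living in the build system instead):

    /* Sketch: detect MemorySanitizer at compile time and fall back to the C
     * DSP paths, since MSan cannot see stores performed by hand-written asm.
     * USE_ASM_DSP is a hypothetical name for illustration. */
    #if defined(__has_feature)
    #  if __has_feature(memory_sanitizer)
    #    define USE_ASM_DSP 0
    #  endif
    #endif
    #ifndef USE_ASM_DSP
    #  define USE_ASM_DSP 1
    #endif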
-
- Apr 09, 2020
-
-
Luc Trudeau authored
Tested in isolation, this appears to be faster, but it is hard to tell overall.
-
- Apr 07, 2020
-
-
Luc Trudeau authored
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
cdef_filter_4x8_8bpc_avx2:      54.0
cdef_filter_4x8_8bpc_avx512icl: 35.5  => +52.1%
cdef_filter_8x8_8bpc_avx2:      71.0
cdef_filter_8x8_8bpc_avx512icl: 49.0  => +44.9%
-
- Apr 05, 2020
-
-
Martin Storsjö authored
Relative speedup over C code:
                           Cortex A7    A8    A9   A53   A72   A73
emu_edge_w4_8bpc_neon:          4.23  3.39  2.55  3.58  3.11  3.57
emu_edge_w8_8bpc_neon:          4.02  3.61  2.47  3.74  3.50  3.77
emu_edge_w16_8bpc_neon:         4.56  3.63  2.93  3.97  3.44  4.11
emu_edge_w32_8bpc_neon:         3.82  3.05  2.04  3.79  2.34  3.10
emu_edge_w64_8bpc_neon:         3.27  2.97  1.84  3.70  2.39  1.97
emu_edge_w128_8bpc_neon:        2.58  2.64  1.54  3.04  1.28  1.87
-
Martin Storsjö authored
Relative speedup over C code:
                           Cortex A53   A72   A73
emu_edge_w4_16bpc_neon:         2.49  1.53  1.91
emu_edge_w8_16bpc_neon:         2.27  1.55  1.90
emu_edge_w16_16bpc_neon:        2.46  1.46  2.09
emu_edge_w32_16bpc_neon:        2.20  1.39  1.73
emu_edge_w64_16bpc_neon:        1.65  1.00  1.46
emu_edge_w128_16bpc_neon:       1.55  1.44  1.54
-
- Apr 04, 2020
-
-
Martin Storsjö authored
Relative speedups over C code:
                           Cortex A53   A72   A73
emu_edge_w4_8bpc_neon:          3.82  2.93  2.41
emu_edge_w8_8bpc_neon:          3.28  2.86  2.51
emu_edge_w16_8bpc_neon:         3.58  3.27  2.63
emu_edge_w32_8bpc_neon:         3.04  1.68  2.12
emu_edge_w64_8bpc_neon:         2.58  1.45  1.48
emu_edge_w128_8bpc_neon:        1.79  1.02  1.57
The benchmark numbers for the larger sizes on A72 fluctuate a whole lot and thus seem very unreliable.
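For context, the emu_edge functions build a padded copy of a block whose source coordinates fall partly outside the picture, clamping every coordinate into the valid range so that edge pixels get replicated. A scalar sketch of the idea (simplified signature, not dav1d's actual prototype):

    #include <stddef.h>
    #include <stdint.h>

    /* Copy a bw x bh block whose top-left is at (x, y) in a w x h picture
     * into dst, replicating edge pixels for any part of the block that lies
     * outside the picture. Simplified for illustration. */
    static void emu_edge_c(uint8_t *dst, const ptrdiff_t dst_stride,
                           const uint8_t *const src, const ptrdiff_t src_stride,
                           const int bw, const int bh,
                           const int w, const int h, const int x, const int y)
    {
        for (int j = 0; j < bh; j++) {
            /* Clamp the source row to the picture. */
            int sy = y + j;
            sy = sy < 0 ? 0 : sy > h - 1 ? h - 1 : sy;
            const uint8_t *const srow = src + sy * src_stride;
            for (int i = 0; i < bw; i++) {
                /* Clamp the source column to the picture. */
                int sx = x + i;
                sx = sx < 0 ? 0 : sx > w - 1 ? w - 1 : sx;
                dst[i] = srow[sx];
            }
            dst += dst_stride;
        }
    }

An optimized implementation would typically bulk-copy the in-picture interior and only replicate pixels along the extended borders; the sketch above does everything per pixel for clarity.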
-
- Apr 03, 2020
-
-
Matthias Dressel authored
-
-
-
Victorien Le Couviour--Tuffet authored
Explains how the clipping to the range defined in the spec works.
-
Android uses bionic, not glibc, as the C library. __linux__ is also defined for Android, so also test __GLIBC__ to avoid looking up __pthread_get_minstack in Android bionic. Also, include <dlfcn.h> only if HAVE_DLSYM is defined. In glibc, <dlfcn.h> includes <features.h>, which defines __GLIBC__.
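A sketch of the guards described above (the helper function and its fallback are hypothetical; __linux__, __GLIBC__, HAVE_DLSYM and __pthread_get_minstack are the names from the entry itself, with HAVE_DLSYM supplied by the build system):

    #define _GNU_SOURCE              /* for RTLD_DEFAULT in glibc's <dlfcn.h> */
    #include <stddef.h>
    #include <pthread.h>
    #if HAVE_DLSYM
    #include <dlfcn.h>               /* in glibc this pulls in <features.h>, which defines __GLIBC__ */
    #endif

    /* Hypothetical helper for illustration: look up glibc's private
     * __pthread_get_minstack where it exists, and skip the lookup entirely
     * on Android/bionic, which defines __linux__ but not __GLIBC__. */
    static size_t get_minstack(const pthread_attr_t *const attr) {
    #if defined(__linux__) && defined(__GLIBC__) && HAVE_DLSYM
        size_t (*const fn)(const pthread_attr_t *) =
            (size_t (*)(const pthread_attr_t *))
            dlsym(RTLD_DEFAULT, "__pthread_get_minstack");
        if (fn) return fn(attr);
    #endif
        (void)attr;
        return 0; /* caller falls back to a default stack size */
    }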
-
- Apr 02, 2020
-
-
Wan-Teh Chang authored
Also, the assertion that 'align' is a power of 2 can be used by all cases in dav1d_alloc_aligned().
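The power-of-two check itself is the classic bit trick: a power of two has exactly one bit set, so align & (align - 1) is zero only for powers of two (zero itself being excluded separately). A simplified sketch, not dav1d's actual dav1d_alloc_aligned:

    #define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
    #include <assert.h>
    #include <stdlib.h>

    /* Simplified sketch: hoisting the assertion to the top of the function
     * means every allocation backend selected below it is covered by the
     * same check. */
    static void *alloc_aligned(const size_t sz, const size_t align) {
        assert(align > 0 && !(align & (align - 1)));  /* align must be a power of 2 */
        void *ptr;
        if (posix_memalign(&ptr, align, sz)) return NULL;
        return ptr;
    }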
-
Martin Storsjö authored
If leftext/rightext are zero, we invoke a version of v_loop with the whole need_left_ext/need_right_ext parts left out altogether, so these checks seem to be redundant.
-
Ronald S. Bultje authored
Fixes crashes in dav1d_resize_{avx2,ssse3} on very small resolutions with super_res enabled but skipped because the width is too small.
-
- Apr 01, 2020
-
-
Ronald S. Bultje authored
fguv_32x32xn_8bpc_420_csfl0_c:     14568.2
fguv_32x32xn_8bpc_420_csfl0_ssse3:  1162.3
fguv_32x32xn_8bpc_420_csfl1_c:     10682.0
fguv_32x32xn_8bpc_420_csfl1_ssse3:   910.3
fguv_32x32xn_8bpc_422_csfl0_c:     16370.5
fguv_32x32xn_8bpc_422_csfl0_ssse3:  1202.6
fguv_32x32xn_8bpc_422_csfl1_c:     11333.8
fguv_32x32xn_8bpc_422_csfl1_ssse3:   958.8
fguv_32x32xn_8bpc_444_csfl0_c:     12950.1
fguv_32x32xn_8bpc_444_csfl0_ssse3:  1133.6
fguv_32x32xn_8bpc_444_csfl1_c:      8806.7
fguv_32x32xn_8bpc_444_csfl1_ssse3:   731.0
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
fguv_32x32xn_8bpc_420_csfl0_c:    14568.2
fguv_32x32xn_8bpc_420_csfl0_avx2:   940.2
fguv_32x32xn_8bpc_420_csfl1_c:    10682.0
fguv_32x32xn_8bpc_420_csfl1_avx2:   783.3
fguv_32x32xn_8bpc_422_csfl0_c:    16370.5
fguv_32x32xn_8bpc_422_csfl0_avx2:  1557.3
fguv_32x32xn_8bpc_422_csfl1_c:    11333.8
fguv_32x32xn_8bpc_422_csfl1_avx2:   902.1
fguv_32x32xn_8bpc_444_csfl0_c:    12950.1
fguv_32x32xn_8bpc_444_csfl0_avx2:   822.9
fguv_32x32xn_8bpc_444_csfl1_c:     8806.7
fguv_32x32xn_8bpc_444_csfl1_avx2:   708.2
-
Ronald S. Bultje authored
-
- Mar 31, 2020
-
-
Ensure that unaligned memory access overhead is avoided.
-
Ronald S. Bultje authored
This is the VEX (AVX) encoded variant for the SSE4 instruction ptest, so emulate it using pmovmskb in the SSSE3 version.
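For context, ptest/vptest answers "is this vector all zero?" directly, which SSSE3-level code cannot use; the same answer can be obtained by comparing against zero and collecting the per-byte results with pmovmskb. A sketch using intrinsics rather than the hand-written asm the entry refers to:

    #include <emmintrin.h>  /* SSE2 intrinsics: _mm_cmpeq_epi8, _mm_movemask_epi8 */

    /* Returns 1 if all 16 bytes of v are zero, emulating a ptest-style
     * all-zero check without SSE4.1. */
    static int vec_is_all_zero(const __m128i v) {
        const __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128()); /* 0xFF where byte == 0 */
        return _mm_movemask_epi8(eq) == 0xFFFF;                    /* pmovmskb: all 16 bytes matched */
    }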
-
Ronald S. Bultje authored
resize_8bpc_c:     1613670.2
resize_8bpc_ssse3:  110469.5
resize_8bpc_avx2:    93580.6
-
Ronald S. Bultje authored
resize_8bpc_c:    1637609.7
resize_8bpc_avx2:   95162.6
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
-
- Mar 29, 2020
-
-
Luc Trudeau authored
-
- Mar 28, 2020
-
-
Luc Trudeau authored
-
-
- Mar 27, 2020
-
-
Janne Grunau authored
Allows building with nasm < 2.14.
-
Luc Trudeau authored
-
- Mar 26, 2020
-
-
Also contains const correctness changes.
-
Martin Storsjö authored
The FILTER_PRED function is templated and has two separate instantiations, one for 10 bit and one for 12 bit. (They're switched between using a runtime check on entry to the function.)
-
Martin Storsjö authored
This allows testing cases where this function internally switches between the two implementations for 10 and 12 bit, without adding a bpc argument to the init function (which would force testing and benchmarking every 16 bpc ipred function twice).
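A sketch of the runtime dispatch described in the two entries above, with hypothetical function names standing in for the FILTER_PRED asm: the 16 bpc entry point picks the 10- or 12-bit path from bitdepth_max, so the tests only need to vary that value to cover both.

    #include <stdint.h>

    /* Hypothetical 10-bit and 12-bit code paths (stubs for illustration);
     * the real asm makes the equivalent selection on entry. */
    static void filter_pred_10bpc(uint16_t *dst, int n) { for (int i = 0; i < n; i++) dst[i] &= 0x3ff; }
    static void filter_pred_12bpc(uint16_t *dst, int n) { for (int i = 0; i < n; i++) dst[i] &= 0xfff; }

    static void filter_pred_16bpc(uint16_t *const dst, const int n,
                                  const int bitdepth_max) {
        /* bitdepth_max is 1023 for 10-bit input and 4095 for 12-bit input. */
        if (bitdepth_max == 0xfff) filter_pred_12bpc(dst, n);
        else                       filter_pred_10bpc(dst, n);
    }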
-
Martin Storsjö authored
-