- Feb 21, 2021
-
Jean-Baptiste Kempf authored
-
- Feb 19, 2021
-
Martin Storsjö authored
Relative speedup vs C for a few functions:

                                          Cortex A7    A8    A9   A53   A72   A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:         2.79  5.08  2.99  2.83  3.49  4.44
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:         5.74  9.43  5.72  7.19  6.73  6.92
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:         3.13  3.68  2.79  3.25  3.21  3.33
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:         7.09 10.41  7.00 10.55  8.06  9.02
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:       5.01  6.76  4.56  5.58  5.52  2.97
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:       8.62 12.48 13.71 11.75 15.94 16.86
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:       6.05  8.81  6.13  8.18  7.90 12.27
inv_txfm_add_32x32_dct_dct_0_10bpc_neon:       2.90  3.90  2.16  2.63  3.56  2.74
inv_txfm_add_32x32_dct_dct_1_10bpc_neon:      13.57 17.00 13.30 13.76 14.54 17.08
inv_txfm_add_32x32_dct_dct_2_10bpc_neon:       8.29 10.54  8.05 10.68 12.75 14.36
inv_txfm_add_32x32_dct_dct_3_10bpc_neon:       6.78  8.40  7.60 10.12  8.97 12.96
inv_txfm_add_32x32_dct_dct_4_10bpc_neon:       6.48  6.74  6.00  7.38  7.67  9.70
inv_txfm_add_64x64_dct_dct_0_10bpc_neon:       3.02  4.59  2.21  2.65  3.36  2.47
inv_txfm_add_64x64_dct_dct_1_10bpc_neon:       9.86 11.30  9.14 13.80 12.46 14.83
inv_txfm_add_64x64_dct_dct_2_10bpc_neon:       8.65  9.76  7.60 12.05 10.55 12.62
inv_txfm_add_64x64_dct_dct_3_10bpc_neon:       7.78  8.65  6.98 10.63  9.15 11.73
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:       6.61  7.01  5.52  8.41  8.33  9.69
-
Martin Storsjö authored
While these might not be needed in practice, add them for consistency.
-
Martin Storsjö authored
-
Martin Storsjö authored
This makes these instances consistent with the rest of similar cases.
-
Martin Storsjö authored
-
Martin Storsjö authored
In these cases, the function wrote a 64-pixel-wide output, regardless of the actual width.
-
Martin Storsjö authored
-
Nathan E. Egge authored
Relative speed-ups over C code (compared with gcc-9.3.0):

                           C      AVX2
wiener_5tap_10bpc:  194892.0   14831.9  13.14x
wiener_5tap_12bpc:  194295.4   14828.9  13.10x
wiener_7tap_10bpc:  194391.7   19461.4   9.99x
wiener_7tap_12bpc:  194136.1   19418.7  10.00x
-
Nathan E. Egge authored
-
- Feb 17, 2021
-
Jean-Baptiste Kempf authored
-
Relative speed-ups over C code (compared with gcc-9.3.0):

                                    C    ASM
cdef_dir_16bpc_avx2:            534.2   72.5  7.36x
cdef_dir_16bpc_ssse3:           534.2  104.8  5.10x
cdef_dir_16bpc_ssse3 (x86-32):  854.1  116.2  7.35x
-
- Feb 16, 2021
-
- Feb 15, 2021
-
Jean-Baptiste Kempf authored
-
It's supposed to warn about const-correctness issues, but it doesn't handle arrays of pointers correctly and will cause false positive warnings, for example when using memset() to zero such arrays.
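A minimal illustration of the kind of code that trips the warning (a hypothetical example, not an actual dav1d source file):

```c
#include <stddef.h>
#include <string.h>

/* An array of pointers to const data. Zeroing it with memset() does
 * not modify any const-qualified object, but warning passes that
 * mishandle arrays of pointers can flag this call as a
 * const-correctness violation. */
static const char *names[4] = { "a", "b", "c", "d" };

void reset_names(void) {
    memset(names, 0, sizeof(names));
}
```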
-
Not having a quantizer matrix is the most common case, so it's worth having a separate code path for it that eliminates some calculations and table lookups. Without a qm, not only can we skip calculating dq * qm, but only Exp-Golomb-coded coefficients will have the potential to overflow, so we can also skip clipping for the vast majority of coefficients.
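The idea can be sketched as follows; the function name, the scaling, and the clip range are illustrative, not dav1d's actual internals:

```c
#include <stddef.h>
#include <stdint.h>

#define COEF_MIN (-(1 << 20))      /* illustrative clip range */
#define COEF_MAX ((1 << 20) - 1)

static inline int iclip(int v, int lo, int hi) {
    return v < lo ? lo : v > hi ? hi : v;
}

/* Hypothetical dequant helper: without a qm, the matrix lookup and
 * multiply disappear, and clipping is only needed for the large
 * Exp-Golomb-coded coefficients. */
static int dequant(int coef, int dq, const uint8_t *qm, int qm_idx, int is_eg)
{
    if (qm)  /* general path: scale dq by the matrix entry, always clip */
        return iclip(coef * ((dq * qm[qm_idx] + 16) >> 5), COEF_MIN, COEF_MAX);
    const int v = coef * dq;        /* fast path: a single multiply */
    return is_eg ? iclip(v, COEF_MIN, COEF_MAX) : v;
}
```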
-
Cache indices of non-zero coefficients during the AC token decoding loop in order to speed up the sign decoding/dequant loop later.
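A sketch of the two-pass structure (names and the stand-in dequant step are hypothetical):

```c
#include <stdint.h>

/* First pass: while decoding AC tokens, remember where the non-zero
 * coefficients are. Second pass: walk only those cached indices for
 * sign decoding/dequant instead of rescanning the whole block. */
static int split_passes(int *coefs, int n, uint16_t *nz_idx)
{
    int nz = 0;
    for (int i = 0; i < n; i++)     /* AC token decoding loop */
        if (coefs[i])
            nz_idx[nz++] = (uint16_t)i;
    for (int j = 0; j < nz; j++)    /* sign decoding/dequant loop */
        coefs[nz_idx[j]] *= 2;      /* stand-in for the real dequant */
    return nz;
}
```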
-
- Feb 13, 2021
-
Henrik Gramner authored
Looprestoration SIMD code may overread the input buffers by a small amount. Pad the buffer to make sure this memory is valid to access.
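The general pattern looks like this; the padding constant and function name are illustrative, not the actual dav1d code:

```c
#include <stdlib.h>
#include <string.h>

#define SIMD_PADDING 64  /* assumed worst-case overread, in bytes */

/* Allocate a buffer with trailing padding so vector loads that read
 * a little past the logical end still touch valid memory. */
static void *alloc_padded(size_t size)
{
    void *buf = malloc(size + SIMD_PADDING);
    if (buf)
        memset((char *)buf + size, 0, SIMD_PADDING);
    return buf;
}
```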
-
- Feb 12, 2021
-
Marvin Scholz authored
-
Martin Storsjö authored
On Darwin, 32-bit parameters that are passed on the stack rather than in registers are packed tightly, instead of each occupying an 8-byte slot.
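The difference can be expressed as the offset of the i-th 32-bit stack argument; this is a simplified model that ignores alignment effects of mixed-size argument lists:

```c
#include <stddef.h>

/* AAPCS64: every stack argument is widened to an 8-byte slot.
 * Darwin arm64: 32-bit stack arguments are packed at their natural
 * 4-byte alignment instead. */
static size_t stack_arg_offset(int i, int darwin_abi)
{
    return darwin_abi ? (size_t)i * 4 : (size_t)i * 8;
}
```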
-
- Feb 11, 2021
-
Henrik Gramner authored
The previous implementation did multiple passes in the horizontal and vertical directions, with the intermediate values being stored in buffers on the stack. This caused bad cache thrashing. By interleaving all the different passes in combination with a ring buffer that stores only a few rows at a time, the performance is improved by a significant amount. Also slightly speed up neighbor calculations by packing the a and b values into a single 32-bit unsigned integer, which allows calculations on both values simultaneously.
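The packing trick can be illustrated in scalar C (a simplified model; the real code operates on SIMD vectors):

```c
#include <stdint.h>

/* Pack the a and b neighbour values into one 32-bit word. As long as
 * the partial sums in the low half stay below 2^16, a single 32-bit
 * addition updates both halves at once. */
static uint32_t pack_ab(uint16_t a, uint16_t b)
{
    return (uint32_t)a << 16 | b;
}

static uint32_t sum3(uint32_t x, uint32_t y, uint32_t z)
{
    return x + y + z;  /* adds the a halves and the b halves together */
}
```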
-
Henrik Gramner authored
Split the 5x5, 3x3, and mix cases into separate functions. Shrink some tables. Move some scalar calculations out of the DSP function. Make Wiener and SGR share the same function prototype to eliminate a branch in lr_stripe().
-
Henrik Gramner authored
Large stack allocations on Windows need to use stack probing in order to guarantee that all stack memory is committed before accessing it. This is done by ensuring that the guard page(s) at the end of the currently committed pages are touched prior to any pages beyond that.
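A sketch of what stack probing does, written as a plain loop over an ordinary buffer (the page size is an assumption; real probing is emitted by the compiler or done in asm against the actual stack pointer):

```c
#include <stddef.h>

#define PAGE_SIZE 4096  /* assumed page size */

/* Touch each page of a large region in order. Touching the guard
 * page at the end of the committed area makes the OS commit the next
 * page, so no later access lands beyond an uncommitted page. */
static void probe_region(volatile unsigned char *buf, size_t size)
{
    for (size_t off = 0; off < size; off += PAGE_SIZE)
        buf[off] = 0;
    if (size)
        buf[size - 1] = 0;
}
```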
-
Emmanuel Gil Peyrot authored
Neither --buildtype=plain nor --buildtype=debug set -ffast-math, so llround() is kept as a function call and isn’t optimised out into cvttsd2siq (on amd64), thus requiring the math lib to be linked. Note that even with -ffast-math, it isn’t guaranteed that a call to llround() will always be omitted (I have reproduced this on PowerPC), so this fix is correct even if we ever decide to enable -ffast-math in other build types.
-
- Feb 10, 2021
-
Martin Storsjö authored
Before:                                  Cortex A53    A72    A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:         40.7   23.0   24.0
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:        116.0   71.5   78.2
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:         85.7   50.7   53.8
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:        287.0  203.5  215.2
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:      255.7  129.1  140.4
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:     1401.4 1026.7 1039.2
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:     1913.2 1407.3 1479.6
After:
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:         38.7   21.5   22.2
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:        116.0   71.3   77.2
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:         76.7   44.7   43.5
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:        278.0  203.0  203.9
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:      236.9  106.2  116.2
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:     1368.7  999.7 1008.4
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:     1880.5 1381.2 1459.4
-
- Feb 09, 2021
-
Make them operate in a more cache-friendly manner, interleaving horizontal and vertical filtering (reducing the amount of stack used from 51 KB to 4 KB), similar to what was done for x86 in 78d27b7d. This also adds separate 5-tap versions of the filters and unrolls the vertical filter a bit more (which maybe could have been done without doing the rewrite). This does, however, increase the compiled code size by around 3.5 KB.

Before:                     Cortex A53      A72      A73
wiener_5tap_8bpc_neon:        136855.6  91446.2  87363.6
wiener_7tap_8bpc_neon:        136861.6  91454.9  87374.5
wiener_5tap_10bpc_neon:       167685.3 114720.3 116522.1
wiener_5tap_12bpc_neon:       167677.5 114724.7 116511.9
wiener_7tap_10bpc_neon:       167681.6 114738.5 116567.0
wiener_7tap_12bpc_neon:       167673.8 114720.8 116515.4
After:
wiener_5tap_8bpc_neon:         87102.1  60460.6  66803.8
wiener_7tap_8bpc_neon:        110831.7  78489.0  82015.9
wiener_5tap_10bpc_neon:       109999.2  90259.0  89238.0
wiener_5tap_12bpc_neon:       109978.3  90255.7  89220.7
wiener_7tap_10bpc_neon:       137877.6 107578.5 103435.6
wiener_7tap_12bpc_neon:       137868.8 107568.9 103390.4
-
Change the order of multiply-accumulates to allow in-order cores to forward the results.
-
- Feb 08, 2021
-
Additionally reschedule the load instructions to reduce stalls on in-order cores. This applies the changes from a3b8157e to the arm32 version.

Before:                 Cortex A7     A8     A9    A53    A72    A73
warp_8x8_8bpc_neon:        3659.3 1746.0 1931.9 2128.8 1173.7 1188.9
warp_8x8t_8bpc_neon:       3650.8 1724.6 1919.8 2105.0 1147.7 1206.9
warp_8x8_16bpc_neon:       4039.4 2111.9 2337.1 2462.5 1334.6 1396.5
warp_8x8t_16bpc_neon:      3973.9 2137.1 2299.6 2413.2 1282.8 1369.6
After:
warp_8x8_8bpc_neon:        2920.8 1269.8 1410.3 1767.3  860.2 1004.8
warp_8x8t_8bpc_neon:       2904.9 1283.9 1397.5 1743.7  863.6 1024.7
warp_8x8_16bpc_neon:       3895.5 2060.7 2339.8 2376.6 1331.1 1394.0
warp_8x8t_16bpc_neon:      3822.7 2026.7 2298.7 2325.4 1278.1 1360.8
-
We currently run 'git describe --match' to obtain the current version, but meson doesn't properly quote/escape the pattern string on Windows. As a result, "fatal: Not a valid object name .ninja_log" is printed when compiling on Windows systems. Compilation still works, but the warning is annoying and misleading. Currently we don't actually need the pattern matching functionality (which is why things still work), so simply remove it as a workaround.
-
Martin Storsjö authored
This silences the following warning:

tools/output/xxhash.c(127): warning C4244: '=': conversion from 'unsigned long' to 'unsigned char', possible loss of data
-
Martin Storsjö authored
This fixes bus errors due to missing alignment, when built with GCC 9 for arm32 with -mfpu=neon.
-
The required 'xxhash.h' header can either be in the system include directory or can be copied to 'tools/output'. The xxh3_128bits based muxer shows no significant slowdown compared to the null muxer.

Decoding times for Chimera-AV1-8bit-1920x1080-6736kbps.ivf with 4 frame and 4 tile threads on a Core i7-8550U (turbo boost disabled):
null:  72.5 s
md5:   99.8 s
xxh3:  73.8 s

Decoding Chimera-AV1-10bit-1920x1080-6191kbps.ivf with 6 frame and 4 tile threads on an M1 Mac mini:
null:  27.8 s
md5:  105.9 s
xxh3:  28.3 s
-
Matthias Dressel authored
Verification should not succeed if the given string is too short to be a real hash. Fixes videolan/dav1d#361
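A minimal sketch of the fix; the function name and digest length are illustrative:

```c
#include <string.h>

#define HASH_HEX_LEN 32  /* e.g. an MD5 digest in hex */

/* Reject candidate strings that are shorter than a full digest, so a
 * truncated prefix can no longer "verify" successfully. */
static int hash_matches(const char *expected, const char *computed)
{
    if (strlen(expected) < HASH_HEX_LEN)
        return 0;
    return strncmp(expected, computed, HASH_HEX_LEN) == 0;
}
```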
-
- Feb 06, 2021
-
Janne Grunau authored
-
Henrik Gramner authored
-