- Jul 04, 2020
Jean-Baptiste Kempf authored
Removes files from top-level
- Jul 02, 2020
Martin Storsjö authored
This matches what is implemented for arm64 so far. Align the dav1d_sm_weights table to allow aligned loads from it.

Relative speedups over C code (vs potentially autovectorized code, built with Clang):

                            Cortex     A7     A8     A9    A53    A72    A73
intra_pred_paeth_w4_8bpc_neon:       4.81   7.61   5.82   5.50   5.61   6.94
intra_pred_paeth_w8_8bpc_neon:       7.83  11.95   9.51  11.05   8.90  10.51
intra_pred_paeth_w16_8bpc_neon:      4.86   4.49   3.90   4.60   3.76   3.54
intra_pred_paeth_w32_8bpc_neon:      4.55   4.03   3.52   4.27   3.30   3.21
intra_pred_paeth_w64_8bpc_neon:      4.38   3.72   3.32   3.95   3.08   3.00
intra_pred_smooth_h_w4_8bpc_neon:    5.74  10.80   5.32   6.79   4.77   6.48
intra_pred_smooth_h_w8_8bpc_neon:   10.59  17.95   9.39  16.03   6.94   8.98
intra_pred_smooth_h_w16_8bpc_neon:   2.81   3.19   2.12   3.70   2.90   3.59
intra_pred_smooth_h_w32_8bpc_neon:   2.63   2.41   1.86   3.44   2.24   2.66
intra_pred_smooth_h_w64_8bpc_neon:   2.42   2.52   1.79   3.24   1.81   2.11
intra_pred_smooth_v_w4_8bpc_neon:    4.15   7.99   3.46   4.63   3.83   4.39
intra_pred_smooth_v_w8_8bpc_neon:    7.31  12.42   7.04  10.00   4.26   6.20
intra_pred_smooth_v_w16_8bpc_neon:   3.70   3.44   2.53   3.33   2.76   3.21
intra_pred_smooth_v_w32_8bpc_neon:   3.91   3.74   2.70   3.51   2.50   2.96
intra_pred_smooth_v_w64_8bpc_neon:   4.03   3.94   2.80   3.64   2.36   2.80
intra_pred_smooth_w4_8bpc_neon:      4.09   7.74   4.54   4.79   3.26   5.10
intra_pred_smooth_w8_8bpc_neon:      5.63   8.93   6.62   8.28   3.73   6.04
intra_pred_smooth_w16_8bpc_neon:     3.97   3.40   3.32   3.74   3.01   3.77
intra_pred_smooth_w32_8bpc_neon:     3.75   3.14   3.07   3.28   2.65   3.17
intra_pred_smooth_w64_8bpc_neon:     3.60   3.04   2.93   2.97   2.35   2.85
intra_pred_filter_w4_8bpc_neon:      5.54   6.43   4.90   7.26   3.44   4.61
intra_pred_filter_w8_8bpc_neon:      7.05   7.15   5.50  10.05   4.29   6.02
intra_pred_filter_w16_8bpc_neon:     7.36   6.46   5.27  11.51   4.75   6.70
intra_pred_filter_w32_8bpc_neon:     7.56   6.32   5.01  12.34   4.47   6.97
pal_pred_w4_8bpc_neon:               5.47   7.76   4.40   5.20   8.32   7.03
pal_pred_w8_8bpc_neon:              11.11  14.12   8.44  13.95  11.88  12.43
pal_pred_w16_8bpc_neon:             14.38  20.95   9.84  17.43  14.77  13.56
pal_pred_w32_8bpc_neon:             12.91  19.85  10.87  19.03  14.63  14.62
pal_pred_w64_8bpc_neon:             14.01  19.23  10.82  19.82  16.23  16.32
cfl_ac_420_w4_8bpc_neon:             8.11  13.41   7.92   9.26  10.55   9.36
cfl_ac_420_w8_8bpc_neon:             7.77  15.71   7.69   8.94   9.76   8.56
cfl_ac_420_w16_8bpc_neon:            7.72  13.71   8.30   9.05   9.81   9.02
cfl_ac_422_w4_8bpc_neon:             8.85  15.80   8.26  10.97  13.04  10.00
cfl_ac_422_w8_8bpc_neon:             8.77  16.96   7.57  10.46  12.16   9.92
cfl_ac_422_w16_8bpc_neon:            8.28  14.91   7.16   9.69  10.57   9.18
cfl_ac_444_w4_8bpc_neon:             7.47  14.13   7.50   9.76  11.11   9.39
cfl_ac_444_w8_8bpc_neon:             6.81  15.46   5.27   9.11  12.09   9.76
cfl_ac_444_w16_8bpc_neon:            6.11  13.68   4.62   8.17  10.78   8.92
cfl_ac_444_w32_8bpc_neon:            5.71  12.11   4.28   7.53   9.53   8.52
cfl_pred_cfl_128_w4_8bpc_neon:       7.46  12.63   8.48   8.03   7.64   9.29
cfl_pred_cfl_128_w8_8bpc_neon:       5.05   5.16   3.79   4.64   5.07   4.42
cfl_pred_cfl_128_w16_8bpc_neon:      4.44   5.17   3.65   4.20   4.41   4.74
cfl_pred_cfl_128_w32_8bpc_neon:      4.51   5.25   3.67   4.29   4.39   4.73
cfl_pred_cfl_left_w4_8bpc_neon:      6.60  11.74   7.75   6.91   7.44   9.14
cfl_pred_cfl_left_w8_8bpc_neon:      4.92   5.15   3.80   4.41   5.44   4.81
cfl_pred_cfl_left_w16_8bpc_neon:     4.40   5.26   3.66   4.10   4.63   4.94
cfl_pred_cfl_left_w32_8bpc_neon:     4.50   5.31   3.68   4.25   4.43   4.82
cfl_pred_cfl_top_w4_8bpc_neon:       7.00  11.88   7.88   7.50   7.43   9.68
cfl_pred_cfl_top_w8_8bpc_neon:       4.96   5.07   3.78   4.51   5.31   4.75
cfl_pred_cfl_top_w16_8bpc_neon:      4.42   5.31   3.69   4.16   4.60   4.93
cfl_pred_cfl_top_w32_8bpc_neon:      4.52   5.36   3.71   4.29   4.47   4.83
cfl_pred_cfl_w4_8bpc_neon:           5.92  10.54   7.25   6.21   6.79   8.33
cfl_pred_cfl_w8_8bpc_neon:           4.67   5.16   3.77   4.14   5.20   4.71
cfl_pred_cfl_w16_8bpc_neon:          4.29   5.29   3.70   3.97   4.53   4.86
cfl_pred_cfl_w32_8bpc_neon:          4.47   5.34   3.72   4.20   4.42   4.83
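As an aside on the alignment note above, a minimal sketch of aligning a lookup table so SIMD code can use aligned loads (illustrative GCC/Clang attribute syntax and example values, not dav1d's actual table or its portability macros):

    #include <stdint.h>

    /* A 16-byte boundary matches the 128-bit NEON register width,
     * so aligned-load instructions can be used on the table. */
    static const uint8_t sm_weights[16] __attribute__((aligned(16))) = {
        /* example values only */
        255, 240, 225, 210, 196, 182, 169, 157,
        145, 133, 122, 111, 101,  92,  83,  74,
    };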
Martin Storsjö authored
Do the horizontal summing in the same way as for the other cases of 32-pixel summing. This doesn't seem to affect the runtime significantly (checkasm benchmarks vary by a couple of cycles), but it is at least 5 instructions shorter.
Martin Storsjö authored
Martin Storsjö authored
This matches the arm64 original. The comment isn't about the condition, but about the state after the conditional branch.
Martin Storsjö authored
These came from matching some parts too closely to the arm64 version (where the summation can be done efficiently with uaddlv by zeroing the upper half of the register).

Before:               Cortex     A7     A8     A9    A53    A72    A73
intra_pred_dc_w4_8bpc_neon:   124.5   65.1   90.2  100.4   48.1   50.4
After:
intra_pred_dc_w4_8bpc_neon:   120.3   60.7   83.6   94.0   44.1   47.9
Martin Storsjö authored
This speeds things up a bit on older cores. Also do a load that duplicates the input over the whole register instead of just loading a single lane in ipred_v_w4. This can be a bit faster on Cortex A8.

Before:                     Cortex     A7     A8     A9    A53    A72    A73
intra_pred_v_w4_8bpc_neon:           54.0   38.4   46.4   47.7   20.4   18.1
intra_pred_h_w4_8bpc_neon:           66.3   43.1   55.0   57.0   27.9   22.2
intra_pred_h_w8_8bpc_neon:           81.0   60.2   76.7   66.5   31.1   30.1
intra_pred_dc_left_w4_8bpc_neon:     91.0   49.0   72.8   77.7   35.4   38.5
intra_pred_dc_left_w8_8bpc_neon:    103.8   73.5   90.2   84.7   42.8   47.1
intra_pred_dc_left_w16_8bpc_neon:   156.1  101.8  186.1  119.4   77.7   92.6
intra_pred_dc_left_w32_8bpc_neon:   270.5  200.5  381.6  191.7  152.6  170.3
intra_pred_dc_left_w64_8bpc_neon:   560.7  439.1  877.0  375.4  333.5  343.6
After:
intra_pred_v_w4_8bpc_neon:           53.9   38.0   46.4   47.7   19.8   19.2
intra_pred_h_w4_8bpc_neon:           66.5   39.2   52.6   57.0   27.7   22.2
intra_pred_h_w8_8bpc_neon:           80.5   55.8   72.9   66.5   31.4   30.1
intra_pred_dc_left_w4_8bpc_neon:     91.0   48.2   71.8   77.7   34.9   38.6
intra_pred_dc_left_w8_8bpc_neon:    103.8   69.6   89.2   84.7   43.2   47.3
intra_pred_dc_left_w16_8bpc_neon:   182.3   99.9  184.9  118.8   77.7   85.8
intra_pred_dc_left_w32_8bpc_neon:   355.4  198.9  380.1  190.6  152.9  161.0
intra_pred_dc_left_w64_8bpc_neon:   517.5  437.4  876.9  375.7  333.3  347.7
Martin Storsjö authored
Relative speedup over C code:

                     Cortex    A53    A72    A73
cfl_ac_444_w4_16bpc_neon:     8.03   9.41  10.48
cfl_ac_444_w8_16bpc_neon:    10.17  10.54  10.38
cfl_ac_444_w16_16bpc_neon:   10.73  10.38   9.73
cfl_ac_444_w32_16bpc_neon:   10.18   9.43   9.77
Martin Storsjö authored
Relative speedup over C code:

                    Cortex    A53    A72    A73
cfl_ac_444_w4_8bpc_neon:     8.72   8.75  10.50
cfl_ac_444_w8_8bpc_neon:    13.10  10.77  11.23
cfl_ac_444_w16_8bpc_neon:   13.08   9.95  10.49
cfl_ac_444_w32_8bpc_neon:   12.58   9.43  10.63
Martin Storsjö authored
The branch target is directly afterwards, so the branch isn't needed.
Martin Storsjö authored
It became unused in 38629906.
Martin Storsjö authored
Before:                    Cortex    A53    A72    A73
intra_pred_filter_w16_8bpc_neon:   540.2  573.8  580.2
intra_pred_filter_w32_8bpc_neon:  1223.1 1364.1 1292.9
After:
intra_pred_filter_w16_8bpc_neon:   531.4  559.8  565.4
intra_pred_filter_w32_8bpc_neon:  1243.0 1308.6 1270.9

This does give a minor slowdown for the w32 case on A53, but helps on w16, and quite notably in all cases on A72 and A73. Doing the same modification on ipred16.S doesn't give quite as clear gains (the gains on A72 and A73 are smaller, and the regression on A53 on w32 is a bit bigger), so the same adjustment isn't done there.
Martin Storsjö authored
Martin Storsjö authored
- Jul 01, 2020
%{:} macro operand ranges were broken in nasm 2.15, which causes errors when compiling, so avoid using those for now. Some new warnings regarding the use of empty macro parameters have also been added; adjust some x86inc code to silence those.
- Jun 29, 2020
Meson does not yet normalise arm64 to aarch64 in the reference table. To work around this, check the cpu field in addition to cpu_family.
Since 46d092ae the demuxer is no longer detected from the file extension but rather by probing.
- Jun 25, 2020
- Jun 24, 2020
- Jun 23, 2020
Broadcasting a memory operand is a binary flag: you either broadcast or you don't, and there's only a single possible element size for any given instruction. The instruction syntax, however, requires the broadcast semantics to be explicitly defined, which is an issue when using macros to template code for multiple register widths. Add some helper defines to alleviate the issue.
Ronald S. Bultje authored
The shift amount can be up to 56, and left-shifting 32-bit integers by values >= 32 is undefined behaviour. Therefore, use 64-bit integers instead. Also slightly rewrite so we only call dav1d_get_bits() once for the combined more|bits value, and mask the relevant portions out instead of reading twice. Lastly, move the overflow check out of the loop (as suggested by @wtc). Fixes #341.
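A minimal sketch of the two fixes described above (names are illustrative, not the actual dav1d code):

    #include <stdint.h>

    /* Left-shifting a 32-bit integer by >= 32 is undefined behaviour,
     * and the shift amount here can reach 56, so widen to 64 bits. */
    static uint64_t shift64(uint32_t bits, unsigned shift) {
        /* return bits << shift;  <- UB once shift >= 32 */
        return (uint64_t)bits << shift;  /* defined for shift <= 63 */
    }

    /* Reading the combined more|bits field once and masking the two
     * parts out, instead of calling the bit reader twice. */
    static void split_more_bits(uint64_t v, unsigned n,
                                uint64_t *more, uint64_t *bits) {
        *more = v >> n;                 /* top "continue" bit */
        *bits = v & ((1ull << n) - 1);  /* low n value bits   */
    }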
- Jun 21, 2020
dav1d_init_get_bits() initializes c->eof to 0, which implies c->ptr < c->ptr_end, or equivalently sz > 0.
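For illustration, a reduced sketch of that invariant, using the field names the message itself refers to (not the full dav1d GetBits state):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        const uint8_t *ptr, *ptr_end;
        int eof;
    } GetBits;

    static void init_get_bits(GetBits *c, const uint8_t *data, size_t sz) {
        c->ptr     = data;
        c->ptr_end = data + sz;
        /* eof == 0 asserts ptr < ptr_end, which only holds for sz > 0,
         * so callers may assume at least one readable byte. */
        c->eof     = 0;
    }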
- Jun 20, 2020
Jean-Baptiste Kempf authored
Luc Trudeau authored
- Jun 19, 2020
Some specific Haswell CPUs have a hardware bug where the popcnt instruction doesn't set the zero flag correctly, which causes the wrong branch to be taken. popcnt also has a 3-cycle latency on Intel CPUs, so branching on the input value instead of the output reduces the amount of time wasted going down the wrong code path in case of branch misprediction.
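The actual change is in hand-written assembly; a rough C sketch of the idea (using the GCC/Clang __builtin_popcount intrinsic):

    #include <stdint.h>

    /* Branching on the popcnt output stalls on its ~3-cycle latency
     * (and, on the affected Haswell parts, on the flags erratum): */
    static int count_then_branch(uint32_t mask) {
        int n = __builtin_popcount(mask);
        if (n == 0)          /* must wait for popcnt to finish */
            return -1;
        return n;
    }

    /* Branching on the input instead issues the test immediately,
     * independent of the popcnt result: */
    static int branch_then_count(uint32_t mask) {
        if (mask == 0)       /* no dependency on popcnt */
            return -1;
        return __builtin_popcount(mask);
    }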
Martin Storsjö authored
Only use this in the cases when NEON can be used unconditionally without runtime detection (when __ARM_NEON is defined). The speedup over the C code is very modest for the smaller functions (and the NEON version actually is a little slower than the C code on Cortex A7 for adapt4), but the speedup is around 2x for adapt16.

                           Cortex     A7     A8     A9    A53    A72    A73
msac_decode_bool_c:                  41.1   43.0   43.0   37.3   26.2   31.3
msac_decode_bool_neon:               40.2   42.0   37.2   32.8   19.9   25.5
msac_decode_bool_adapt_c:            65.1   70.4   58.5   54.3   33.2   40.8
msac_decode_bool_adapt_neon:         56.8   52.4   49.3   42.6   27.1   33.7
msac_decode_bool_equi_c:             36.9   37.2   42.8   32.6   22.7   42.3
msac_decode_bool_equi_neon:          34.9   35.1   36.4   29.7   19.5   36.4
msac_decode_symbol_adapt4_c:        114.2  139.0  111.6   99.9   65.5   83.5
msac_decode_symbol_adapt4_neon:     119.2  128.3   95.7   82.2   58.2   57.5
msac_decode_symbol_adapt8_c:        176.0  207.9  164.0  154.4   88.0  117.0
msac_decode_symbol_adapt8_neon:     128.3  130.3  110.7   85.1   59.9   61.4
msac_decode_symbol_adapt16_c:       292.1  320.5  256.4  246.4  129.1  173.3
msac_decode_symbol_adapt16_neon:    162.2  144.3  129.0  104.2   69.2   69.9

(Omitting msac_decode_hi_tok from the benchmark, as the "C" version measured there uses the NEON version of msac_decode_symbol_adapt4.)
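A minimal sketch of that compile-time gating (placeholder function names; __ARM_NEON is the standard ACLE feature macro):

    /* Stand-ins for the C and NEON implementations. */
    static unsigned decode_c(void)    { return 0; }
    static unsigned decode_neon(void) { return 1; }

    /* When the compiler defines __ARM_NEON, NEON is part of the
     * baseline target, so the NEON path can be selected at compile
     * time without any runtime CPU detection. */
    static unsigned decode(void) {
    #if defined(__ARM_NEON)
        return decode_neon();
    #else
        return decode_c();
    #endif
    }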
- Jun 18, 2020
Martin Storsjö authored
The speedup (over the normal version, which just calls the existing assembly version of symbol_adapt4) is not very impressive on bigger cores, but looks decent on small cores. It's an improvement in any case.

                   Cortex    A53    A72    A73
msac_decode_hi_tok_c:       175.7  136.2  138.1
msac_decode_hi_tok_neon:    146.8  129.4  125.9
Martin Storsjö authored
Martin Storsjö authored
Include the letter prefix when calling the macro, making it slightly less obscure.
By multiplying the performance counter value (within its own time base) by the intended target time base, and only then dividing, we reduce the available numeric range by a factor of the original time base times the new time base. On Windows 10 on ARM64, the performance counter frequency is 19200000 (on x86_64 in a virtual machine, it's 10000000), making the calculation overflow every (1 << 64) / (19200000 * 1000000000) = 960 seconds, i.e. 16 minutes, long before the actual uint64_t nanosecond return value wraps around.
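A sketch of the overflow-prone order of operations versus one safe variant (illustrative helper names, not the actual change):

    #include <stdint.h>

    #define NS_PER_S 1000000000ull

    /* Overflows once ticks * NS_PER_S exceeds UINT64_MAX; at a
     * 19.2 MHz counter that is (1 << 64) / (19200000 * 1000000000),
     * about 960 seconds of elapsed time. */
    static uint64_t ticks_to_ns_naive(uint64_t ticks, uint64_t freq) {
        return ticks * NS_PER_S / freq;
    }

    /* Splitting into whole seconds and a remainder keeps every
     * intermediate product far below the 64-bit limit. */
    static uint64_t ticks_to_ns_safe(uint64_t ticks, uint64_t freq) {
        return ticks / freq * NS_PER_S
             + ticks % freq * NS_PER_S / freq;
    }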
Martin Storsjö authored
Even if we don't want to throttle decoding to realtime, and even if the file itself didn't contain a valid fps value, we may want to call the synchronize function to fetch the current elapsed decoding time, for displaying the fps value.
Jean-Baptiste Kempf authored
Victorien Le Couviour--Tuffet authored
Bilin scaled being very rarely used, add a new table entry to mc_subpel_filters and jump to the put/prep_8tap_scaled code. AVX2 performance is obviously the same as for the 8tap code; the speedup over C is much smaller though, as the C code is a true bilinear codepath that gets auto-vectorized. Still, the AVX2 performance is always better.
Victorien Le Couviour--Tuffet authored
mct_scaled_8tap_regular_w4_8bpc_c:             872.1
mct_scaled_8tap_regular_w4_8bpc_avx2:          125.6
mct_scaled_8tap_regular_w4_dy1_8bpc_c:         886.3
mct_scaled_8tap_regular_w4_dy1_8bpc_avx2:       84.0
mct_scaled_8tap_regular_w4_dy2_8bpc_c:        1189.1
mct_scaled_8tap_regular_w4_dy2_8bpc_avx2:       84.7
mct_scaled_8tap_regular_w8_8bpc_c:            2261.0
mct_scaled_8tap_regular_w8_8bpc_avx2:          306.2
mct_scaled_8tap_regular_w8_dy1_8bpc_c:        2189.9
mct_scaled_8tap_regular_w8_dy1_8bpc_avx2:      233.8
mct_scaled_8tap_regular_w8_dy2_8bpc_c:        3060.3
mct_scaled_8tap_regular_w8_dy2_8bpc_avx2:      282.8
mct_scaled_8tap_regular_w16_8bpc_c:           4335.3
mct_scaled_8tap_regular_w16_8bpc_avx2:         680.7
mct_scaled_8tap_regular_w16_dy1_8bpc_c:       5137.2
mct_scaled_8tap_regular_w16_dy1_8bpc_avx2:     578.6
mct_scaled_8tap_regular_w16_dy2_8bpc_c:       7878.4
mct_scaled_8tap_regular_w16_dy2_8bpc_avx2:     774.6
mct_scaled_8tap_regular_w32_8bpc_c:          17871.9
mct_scaled_8tap_regular_w32_8bpc_avx2:        2954.8
mct_scaled_8tap_regular_w32_dy1_8bpc_c:      18594.7
mct_scaled_8tap_regular_w32_dy1_8bpc_avx2:    2073.9
mct_scaled_8tap_regular_w32_dy2_8bpc_c:      28696.0
mct_scaled_8tap_regular_w32_dy2_8bpc_avx2:    2852.1
mct_scaled_8tap_regular_w64_8bpc_c:          46967.5
mct_scaled_8tap_regular_w64_8bpc_avx2:        7527.5
mct_scaled_8tap_regular_w64_dy1_8bpc_c:      45564.2
mct_scaled_8tap_regular_w64_dy1_8bpc_avx2:    5262.9
mct_scaled_8tap_regular_w64_dy2_8bpc_c:      72793.3
mct_scaled_8tap_regular_w64_dy2_8bpc_avx2:    7535.9
mct_scaled_8tap_regular_w128_8bpc_c:        111190.8
mct_scaled_8tap_regular_w128_8bpc_avx2:      19386.8
mct_scaled_8tap_regular_w128_dy1_8bpc_c:    122625.0
mct_scaled_8tap_regular_w128_dy1_8bpc_avx2:  15376.1
mct_scaled_8tap_regular_w128_dy2_8bpc_c:    197120.6
mct_scaled_8tap_regular_w128_dy2_8bpc_avx2:  21871.0