- Jun 20, 2020
-
-
Jean-Baptiste Kempf authored
-
Luc Trudeau authored
-
- Jun 19, 2020
-
-
Some specific Haswell CPUs have a hardware bug where the popcnt instruction doesn't set the zero flag correctly, which causes the wrong branch to be taken. popcnt also has a 3-cycle latency on Intel CPUs, so branching on the input value instead of the output reduces the amount of time wasted going down the wrong code path in case of a branch misprediction.
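A minimal sketch of the branch reordering, in C for illustration (`__builtin_popcountll` stands in for the popcnt instruction; `do_work` is a hypothetical consumer, not dav1d's msac code):

```c
#include <stdint.h>

int do_work(unsigned n);  /* hypothetical consumer of the count */

/* Before: the branch depends on the popcnt result (and, with the Haswell
 * erratum, potentially on its buggy zero flag), so a mispredicted branch
 * also pays the 3-cycle popcnt latency before it can be resolved. */
int count_then_branch(uint64_t mask) {
    unsigned n = (unsigned)__builtin_popcountll(mask);
    if (!n)
        return 0;
    return do_work(n);
}

/* After: branch on the input value itself; the flags come from a plain
 * compare, and popcnt only executes on the taken path. */
int branch_then_count(uint64_t mask) {
    if (!mask)
        return 0;
    return do_work((unsigned)__builtin_popcountll(mask));
}
```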
-
Martin Storsjö authored
Only use this in cases where NEON can be used unconditionally, without runtime detection (i.e. when __ARM_NEON is defined). The speedup over the C code is very modest for the smaller functions (and the NEON version is actually a little slower than the C code on Cortex A7 for adapt4), but the speedup is around 2x for adapt16.

                                 Cortex A7     A8     A9    A53    A72    A73
msac_decode_bool_c:                   41.1   43.0   43.0   37.3   26.2   31.3
msac_decode_bool_neon:                40.2   42.0   37.2   32.8   19.9   25.5
msac_decode_bool_adapt_c:             65.1   70.4   58.5   54.3   33.2   40.8
msac_decode_bool_adapt_neon:          56.8   52.4   49.3   42.6   27.1   33.7
msac_decode_bool_equi_c:              36.9   37.2   42.8   32.6   22.7   42.3
msac_decode_bool_equi_neon:           34.9   35.1   36.4   29.7   19.5   36.4
msac_decode_symbol_adapt4_c:         114.2  139.0  111.6   99.9   65.5   83.5
msac_decode_symbol_adapt4_neon:      119.2  128.3   95.7   82.2   58.2   57.5
msac_decode_symbol_adapt8_c:         176.0  207.9  164.0  154.4   88.0  117.0
msac_decode_symbol_adapt8_neon:      128.3  130.3  110.7   85.1   59.9   61.4
msac_decode_symbol_adapt16_c:        292.1  320.5  256.4  246.4  129.1  173.3
msac_decode_symbol_adapt16_neon:     162.2  144.3  129.0  104.2   69.2   69.9

(Omitting msac_decode_hi_tok from the benchmark, as the "C" version measured there uses the NEON version of msac_decode_symbol_adapt4.)
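A compile-time dispatch along those lines might look like the following sketch (the context type and function names are illustrative, not dav1d's actual symbols):

```c
#include <stdint.h>

typedef struct MsacContext MsacContext;  /* opaque here, for illustration */

unsigned msac_decode_bool_adapt_c(MsacContext *s, uint16_t *cdf);

#if defined(__ARM_NEON)
/* NEON is guaranteed by the target (e.g. AArch64, or ARMv7 built with NEON),
 * so no runtime CPU-flag check is needed before using the asm version. */
unsigned msac_decode_bool_adapt_neon(MsacContext *s, uint16_t *cdf);
#define msac_decode_bool_adapt msac_decode_bool_adapt_neon
#else
#define msac_decode_bool_adapt msac_decode_bool_adapt_c
#endif
```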
-
- Jun 18, 2020
-
-
Martin Storsjö authored
The speedup (over the normal version, which just calls the existing assembly version of symbol_adapt4) is not very impressive on bigger cores, but looks decent on small cores. It's an improvement in any case.

                           Cortex A53    A72    A73
msac_decode_hi_tok_c:           175.7  136.2  138.1
msac_decode_hi_tok_neon:        146.8  129.4  125.9
-
Martin Storsjö authored
-
Martin Storsjö authored
Include the letter prefix when calling the macro, making it slightly less obscure.
-
By multiplying the performance counter value (within its own time base) by the intended target time base, and only then dividing, we reduce the available numeric range by a factor of the original time base times the new time base. On Windows 10 on ARM64, the performance counter frequency is 19200000 (on x86_64 in a virtual machine, it's 10000000), making the calculation overflow every (1 << 64) / (19200000 * 1000000000) = 960 seconds, i.e. 16 minutes - long before the actual uint64_t nanosecond return value wraps around.
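A sketch of the two orderings, with hypothetical helper names (not the actual wrapper in dav1d's tools):

```c
#include <stdint.h>

/* Overflowing order: the intermediate product ticks * 1e9 wraps 64 bits
 * after roughly (1 << 64) / (freq * 1e9) seconds of counter time
 * (~960 s for freq = 19200000). */
uint64_t ticks_to_ns_naive(uint64_t ticks, uint64_t freq) {
    return ticks * 1000000000 / freq;
}

/* Safer order: split off whole seconds first, so the only multiplication
 * left operates on the remainder (< freq) and stays below freq * 1e9; the
 * result wraps only when the uint64_t nanosecond value itself would. */
uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq) {
    return (ticks / freq) * 1000000000 +
           (ticks % freq) * 1000000000 / freq;
}
```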
-
Martin Storsjö authored
Even if we don't want to throttle decoding to realtime, and even if the file itself didn't contain a valid fps value, we may want to call the synchronize function to fetch the current elapsed decoding time, for displaying the fps value.
-
Jean-Baptiste Kempf authored
-
Victorien Le Couviour--Tuffet authored
Since scaled bilin is very rarely used, add a new table entry to mc_subpel_filters and jump to the put/prep_8tap_scaled code. AVX2 performance is obviously the same as for the 8tap code; the speedup is much smaller though, as the C code is a true bilinear codepath that gets auto-vectorized. Still, the AVX2 version is always faster.
-
Victorien Le Couviour--Tuffet authored
mct_scaled_8tap_regular_w4_8bpc_c:         872.1
mct_scaled_8tap_regular_w4_8bpc_avx2:      125.6
mct_scaled_8tap_regular_w4_dy1_8bpc_c:     886.3
mct_scaled_8tap_regular_w4_dy1_8bpc_avx2:   84.0
mct_scaled_8tap_regular_w4_dy2_8bpc_c:    1189.1
mct_scaled_8tap_regular_w4_dy2_8bpc_avx2:   84.7
mct_scaled_8tap_regular_w8_8bpc_c:        2261.0
mct_scaled_8tap_regular_w8_8bpc_avx2:      306.2
mct_scaled_8tap_regular_w8_dy1_8bpc_c:    2189.9
mct_scaled_8tap_regular_w8_dy1_8bpc_avx2:  233.8
mct_scaled_8tap_regular_w8_dy2_8bpc_c:    3060.3
mct_scaled_8tap_regular_w8_dy2_8bpc_avx2:  282.8
mct_scaled_8tap_regular_w16_8bpc_c:       4335.3
mct_scaled_8tap_regular_w16_8bpc_avx2:     680.7
mct_scaled_8tap_regular_w16_dy1_8bpc_c:   5137.2
mct_scaled_8tap_regular_w16_dy1_8bpc_avx2: 578.6
mct_scaled_8tap_regular_w16_dy2_8bpc_c:   7878.4
mct_scaled_8tap_regular_w16_dy2_8bpc_avx2: 774.6
mct_scaled_8tap_regular_w32_8bpc_c:       17871.9
mct_scaled_8tap_regular_w32_8bpc_avx2:     2954.8
mct_scaled_8tap_regular_w32_dy1_8bpc_c:   18594.7
mct_scaled_8tap_regular_w32_dy1_8bpc_avx2: 2073.9
mct_scaled_8tap_regular_w32_dy2_8bpc_c:   28696.0
mct_scaled_8tap_regular_w32_dy2_8bpc_avx2: 2852.1
mct_scaled_8tap_regular_w64_8bpc_c:       46967.5
mct_scaled_8tap_regular_w64_8bpc_avx2:     7527.5
mct_scaled_8tap_regular_w64_dy1_8bpc_c:   45564.2
mct_scaled_8tap_regular_w64_dy1_8bpc_avx2: 5262.9
mct_scaled_8tap_regular_w64_dy2_8bpc_c:   72793.3
mct_scaled_8tap_regular_w64_dy2_8bpc_avx2: 7535.9
mct_scaled_8tap_regular_w128_8bpc_c:      111190.8
mct_scaled_8tap_regular_w128_8bpc_avx2:    19386.8
mct_scaled_8tap_regular_w128_dy1_8bpc_c:  122625.0
mct_scaled_8tap_regular_w128_dy1_8bpc_avx2: 15376.1
mct_scaled_8tap_regular_w128_dy2_8bpc_c:  197120.6
mct_scaled_8tap_regular_w128_dy2_8bpc_avx2: 21871.0
-
- Jun 16, 2020
-
-
Colin Lee authored
-
Martin Storsjö authored
Add an .error case for Windows when subtracting more than 8 KB, and simplify the generic subtraction case.
-
Martin Storsjö authored
The transforms process vectors of up to 8 elements at a time, for transforms up to size 8; for larger transforms, it uses vectors of 4 elements. Overall, the speedup over C code seems to be around 8-14x for the larger transforms, and 10-19x for the smaller ones. Relative speedup over C code (built with GCC 7.5) for a few functions:

                                          Cortex A7     A8     A9    A53    A72    A73
inv_txfm_add_4x4_dct_dct_0_8bpc_neon:          3.83   3.42   2.57   3.36   2.97   7.47
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:          7.25  13.53   8.38   8.82   7.96  12.37
inv_txfm_add_8x8_dct_dct_0_8bpc_neon:          4.78   6.61   4.82   4.65   5.27   9.76
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:         10.20  19.07  13.07  14.69  11.45  15.50
inv_txfm_add_16x16_dct_dct_0_8bpc_neon:        4.26   5.06   3.00   3.74   4.05   4.49
inv_txfm_add_16x16_dct_dct_1_8bpc_neon:       10.51  16.02  13.57  14.03  12.86  18.16
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:        7.95  11.75   9.09  10.64  10.06  14.07
inv_txfm_add_32x32_dct_dct_0_8bpc_neon:        5.31   5.58   3.14   4.18   4.80   4.57
inv_txfm_add_32x32_dct_dct_1_8bpc_neon:       12.66  16.07  14.34  16.00  15.24  21.32
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:        8.25  10.69   8.90  10.59  10.41  14.39
inv_txfm_add_64x64_dct_dct_0_8bpc_neon:        4.69   5.97   3.17   3.96   4.57   4.34
inv_txfm_add_64x64_dct_dct_1_8bpc_neon:       11.47  12.68  10.18  14.73  14.20  17.95
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:        8.84  10.13   7.94  11.25  10.58  13.88
-
- Jun 11, 2020
-
-
Matthias Dressel authored
-
Henrik Gramner authored
The struct is already zero-initialized when the function is called except for the checkasm test, so move the zeroing there instead.
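For illustration, a minimal sketch of that move, with hypothetical names (not the actual dav1d struct or function):

```c
#include <string.h>

typedef struct Scratch { int buf[16]; } Scratch;  /* hypothetical */

void process(Scratch *s);  /* assumed to rely on *s starting out zeroed */

/* Real callers already hand in a zero-initialized struct, so the memset
 * is done once in the checkasm harness rather than inside process(). */
void checkasm_check_process(void) {
    Scratch s;
    memset(&s, 0, sizeof(s));
    process(&s);
}
```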
-
Matthias Dressel authored
meson.source_root() returns the root of a parent project if dav1d is embedded as a subproject.
-
Victorien Le Couviour--Tuffet authored
x86_64:
------------------------------------------
mct_8tap_regular_w4_h_8bpc_c: 302.3
mct_8tap_regular_w4_h_8bpc_sse2: 47.3
mct_8tap_regular_w4_h_8bpc_ssse3: 19.5
---------------------
mct_8tap_regular_w8_h_8bpc_c: 745.5
mct_8tap_regular_w8_h_8bpc_sse2: 235.2
mct_8tap_regular_w8_h_8bpc_ssse3: 70.4
---------------------
mct_8tap_regular_w16_h_8bpc_c: 1844.3
mct_8tap_regular_w16_h_8bpc_sse2: 755.6
mct_8tap_regular_w16_h_8bpc_ssse3: 225.9
---------------------
mct_8tap_regular_w32_h_8bpc_c: 6685.5
mct_8tap_regular_w32_h_8bpc_sse2: 2954.4
mct_8tap_regular_w32_h_8bpc_ssse3: 795.8
---------------------
mct_8tap_regular_w64_h_8bpc_c: 15633.5
mct_8tap_regular_w64_h_8bpc_sse2: 7120.4
mct_8tap_regular_w64_h_8bpc_ssse3: 1900.4
---------------------
mct_8tap_regular_w128_h_8bpc_c: 37772.1
mct_8tap_regular_w128_h_8bpc_sse2: 17698.1
mct_8tap_regular_w128_h_8bpc_ssse3: 4665.5
------------------------------------------
mct_8tap_regular_w4_v_8bpc_c: 306.5
mct_8tap_regular_w4_v_8bpc_sse2: 71.7
mct_8tap_regular_w4_v_8bpc_ssse3: 37.9
---------------------
mct_8tap_regular_w8_v_8bpc_c: 923.3
mct_8tap_regular_w8_v_8bpc_sse2: 168.7
mct_8tap_regular_w8_v_8bpc_ssse3: 71.3
---------------------
mct_8tap_regular_w16_v_8bpc_c: 3040.1
mct_8tap_regular_w16_v_8bpc_sse2: 505.1
mct_8tap_regular_w16_v_8bpc_ssse3: 199.7
---------------------
mct_8tap_regular_w32_v_8bpc_c: 12354.8
mct_8tap_regular_w32_v_8bpc_sse2: 1942.0
mct_8tap_regular_w32_v_8bpc_ssse3: 714.2
---------------------
mct_8tap_regular_w64_v_8bpc_c: 29427.9
mct_8tap_regular_w64_v_8bpc_sse2: 4637.4
mct_8tap_regular_w64_v_8bpc_ssse3: 1829.2
---------------------
mct_8tap_regular_w128_v_8bpc_c: 72756.9
mct_8tap_regular_w128_v_8bpc_sse2: 11301.0
mct_8tap_regular_w128_v_8bpc_ssse3: 5020.6
------------------------------------------
mct_8tap_regular_w4_hv_8bpc_c: 876.9
mct_8tap_regular_w4_hv_8bpc_sse2: 171.7
mct_8tap_regular_w4_hv_8bpc_ssse3: 112.2
---------------------
mct_8tap_regular_w8_hv_8bpc_c: 2215.1
mct_8tap_regular_w8_hv_8bpc_sse2: 730.2
mct_8tap_regular_w8_hv_8bpc_ssse3: 330.9
---------------------
mct_8tap_regular_w16_hv_8bpc_c: 6075.5
mct_8tap_regular_w16_hv_8bpc_sse2: 2252.1
mct_8tap_regular_w16_hv_8bpc_ssse3: 973.4
---------------------
mct_8tap_regular_w32_hv_8bpc_c: 22182.7
mct_8tap_regular_w32_hv_8bpc_sse2: 7692.6
mct_8tap_regular_w32_hv_8bpc_ssse3: 3599.8
---------------------
mct_8tap_regular_w64_hv_8bpc_c: 50876.8
mct_8tap_regular_w64_hv_8bpc_sse2: 18499.6
mct_8tap_regular_w64_hv_8bpc_ssse3: 8815.6
---------------------
mct_8tap_regular_w128_hv_8bpc_c: 122926.3
mct_8tap_regular_w128_hv_8bpc_sse2: 45120.0
mct_8tap_regular_w128_hv_8bpc_ssse3: 22085.7
-
Victorien Le Couviour--Tuffet authored
x86_64:
------------------------------------------
mct_bilinear_w4_h_8bpc_c: 98.9
mct_bilinear_w4_h_8bpc_sse2: 30.2
mct_bilinear_w4_h_8bpc_ssse3: 11.5
---------------------
mct_bilinear_w8_h_8bpc_c: 175.3
mct_bilinear_w8_h_8bpc_sse2: 57.0
mct_bilinear_w8_h_8bpc_ssse3: 19.7
---------------------
mct_bilinear_w16_h_8bpc_c: 396.2
mct_bilinear_w16_h_8bpc_sse2: 179.3
mct_bilinear_w16_h_8bpc_ssse3: 50.9
---------------------
mct_bilinear_w32_h_8bpc_c: 1311.2
mct_bilinear_w32_h_8bpc_sse2: 718.8
mct_bilinear_w32_h_8bpc_ssse3: 243.9
---------------------
mct_bilinear_w64_h_8bpc_c: 2892.7
mct_bilinear_w64_h_8bpc_sse2: 1746.0
mct_bilinear_w64_h_8bpc_ssse3: 568.0
---------------------
mct_bilinear_w128_h_8bpc_c: 7192.6
mct_bilinear_w128_h_8bpc_sse2: 4339.8
mct_bilinear_w128_h_8bpc_ssse3: 1619.2
------------------------------------------
mct_bilinear_w4_v_8bpc_c: 129.7
mct_bilinear_w4_v_8bpc_sse2: 26.6
mct_bilinear_w4_v_8bpc_ssse3: 16.7
---------------------
mct_bilinear_w8_v_8bpc_c: 233.3
mct_bilinear_w8_v_8bpc_sse2: 55.0
mct_bilinear_w8_v_8bpc_ssse3: 24.7
---------------------
mct_bilinear_w16_v_8bpc_c: 498.9
mct_bilinear_w16_v_8bpc_sse2: 146.0
mct_bilinear_w16_v_8bpc_ssse3: 54.2
---------------------
mct_bilinear_w32_v_8bpc_c: 1562.2
mct_bilinear_w32_v_8bpc_sse2: 560.6
mct_bilinear_w32_v_8bpc_ssse3: 201.0
---------------------
mct_bilinear_w64_v_8bpc_c: 3221.3
mct_bilinear_w64_v_8bpc_sse2: 1380.6
mct_bilinear_w64_v_8bpc_ssse3: 499.3
---------------------
mct_bilinear_w128_v_8bpc_c: 7357.7
mct_bilinear_w128_v_8bpc_sse2: 3439.0
mct_bilinear_w128_v_8bpc_ssse3: 1489.1
------------------------------------------
mct_bilinear_w4_hv_8bpc_c: 185.0
mct_bilinear_w4_hv_8bpc_sse2: 54.5
mct_bilinear_w4_hv_8bpc_ssse3: 22.1
---------------------
mct_bilinear_w8_hv_8bpc_c: 377.8
mct_bilinear_w8_hv_8bpc_sse2: 104.3
mct_bilinear_w8_hv_8bpc_ssse3: 35.8
---------------------
mct_bilinear_w16_hv_8bpc_c: 1159.4
mct_bilinear_w16_hv_8bpc_sse2: 311.0
mct_bilinear_w16_hv_8bpc_ssse3: 106.3
---------------------
mct_bilinear_w32_hv_8bpc_c: 4436.2
mct_bilinear_w32_hv_8bpc_sse2: 1230.7
mct_bilinear_w32_hv_8bpc_ssse3: 400.7
---------------------
mct_bilinear_w64_hv_8bpc_c: 10627.7
mct_bilinear_w64_hv_8bpc_sse2: 2934.2
mct_bilinear_w64_hv_8bpc_ssse3: 957.2
---------------------
mct_bilinear_w128_hv_8bpc_c: 26048.9
mct_bilinear_w128_hv_8bpc_sse2: 7590.3
mct_bilinear_w128_hv_8bpc_ssse3: 2947.0
-
- Jun 10, 2020
-
-
Martin Storsjö authored
This allows skipping half of the first transforms if the input coefficients lie within the upper 4x4 (but checkasm only tests in increments of 8x8 at the moment). With checkasm modified to test in smaller increments, the speedup is like this:

Before:                                  Cortex A53    A72    A73
inv_txfm_add_16x8_dct_dct_1_10bpc_neon:       874.4  709.0  707.3
After:
inv_txfm_add_16x8_dct_dct_1_10bpc_neon:       618.0  479.5  472.9
-
Martin Storsjö authored
-
- Jun 09, 2020
-
-
- Jun 07, 2020
-
-
Blacklisted some files not directly relevant to the codebase (such as tests, tools and debugging functions). The coverage HTML report gets attached as a build artifact, although unfortunately we can't link directly to the `index.html`. We also attach the coverage XML as a cobertura report, although I'm not sure if it does anything.
-
- Jun 04, 2020
-
-
Wan-Teh Chang authored
-
This is not used in dav1d yet, but it's needed by rav1e, which shares this header with dav1d.
-
- Jun 01, 2020
-
-
Victorien Le Couviour--Tuffet authored
mc_scaled_8tap_regular_w2_8bpc_c:          764.4
mc_scaled_8tap_regular_w2_8bpc_avx2:       191.3
mc_scaled_8tap_regular_w2_dy1_8bpc_c:      705.8
mc_scaled_8tap_regular_w2_dy1_8bpc_avx2:    89.5
mc_scaled_8tap_regular_w2_dy2_8bpc_c:      964.0
mc_scaled_8tap_regular_w2_dy2_8bpc_avx2:   120.3
mc_scaled_8tap_regular_w4_8bpc_c:         1355.7
mc_scaled_8tap_regular_w4_8bpc_avx2:       180.9
mc_scaled_8tap_regular_w4_dy1_8bpc_c:     1233.2
mc_scaled_8tap_regular_w4_dy1_8bpc_avx2:   115.3
mc_scaled_8tap_regular_w4_dy2_8bpc_c:     1707.6
mc_scaled_8tap_regular_w4_dy2_8bpc_avx2:   117.9
mc_scaled_8tap_regular_w8_8bpc_c:         2483.2
mc_scaled_8tap_regular_w8_8bpc_avx2:       294.8
mc_scaled_8tap_regular_w8_dy1_8bpc_c:     2166.4
mc_scaled_8tap_regular_w8_dy1_8bpc_avx2:   222.0
mc_scaled_8tap_regular_w8_dy2_8bpc_c:     3133.7
mc_scaled_8tap_regular_w8_dy2_8bpc_avx2:   292.6
mc_scaled_8tap_regular_w16_8bpc_c:        5239.2
mc_scaled_8tap_regular_w16_8bpc_avx2:      729.9
mc_scaled_8tap_regular_w16_dy1_8bpc_c:    5156.5
mc_scaled_8tap_regular_w16_dy1_8bpc_avx2:  602.2
mc_scaled_8tap_regular_w16_dy2_8bpc_c:    8018.4
mc_scaled_8tap_regular_w16_dy2_8bpc_avx2:  783.1
mc_scaled_8tap_regular_w32_8bpc_c:        14745.0
mc_scaled_8tap_regular_w32_8bpc_avx2:      2205.0
mc_scaled_8tap_regular_w32_dy1_8bpc_c:    14862.3
mc_scaled_8tap_regular_w32_dy1_8bpc_avx2:  1721.3
mc_scaled_8tap_regular_w32_dy2_8bpc_c:    23607.6
mc_scaled_8tap_regular_w32_dy2_8bpc_avx2:  2325.7
mc_scaled_8tap_regular_w64_8bpc_c:        54891.7
mc_scaled_8tap_regular_w64_8bpc_avx2:      8351.4
mc_scaled_8tap_regular_w64_dy1_8bpc_c:    50249.0
mc_scaled_8tap_regular_w64_dy1_8bpc_avx2:  5864.4
mc_scaled_8tap_regular_w64_dy2_8bpc_c:    79400.1
mc_scaled_8tap_regular_w64_dy2_8bpc_avx2:  8295.7
mc_scaled_8tap_regular_w128_8bpc_c:       121046.8
mc_scaled_8tap_regular_w128_8bpc_avx2:     21809.1
mc_scaled_8tap_regular_w128_dy1_8bpc_c:   133720.4
mc_scaled_8tap_regular_w128_dy1_8bpc_avx2: 16197.8
mc_scaled_8tap_regular_w128_dy2_8bpc_c:   218774.8
mc_scaled_8tap_regular_w128_dy2_8bpc_avx2: 22993.1
-
- May 28, 2020
-
-
Steve Lhomme authored
posix_memalign is defined as a builtin by GCC in MSYS2, but it's not available when linking against the Universal C Runtime (UCRT). _aligned_malloc is available in the UCRT. This should only affect builds targeting Windows, since _aligned_malloc is a Microsoft-specific function.
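A portable wrapper along those lines might look like the following sketch (wrapper names are illustrative, not dav1d's actual helpers):

```c
#include <stdlib.h>
#ifdef _WIN32
#include <malloc.h>   /* _aligned_malloc / _aligned_free */
#endif

void *alloc_aligned(size_t align, size_t size) {
#ifdef _WIN32
    return _aligned_malloc(size, align);
#else
    void *ptr;
    if (posix_memalign(&ptr, align, size))
        return NULL;
    return ptr;
#endif
}

void free_aligned(void *ptr) {
#ifdef _WIN32
    _aligned_free(ptr);  /* memory from _aligned_malloc needs _aligned_free */
#else
    free(ptr);
#endif
}
```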
-
- May 26, 2020
-
-
Henrik Gramner authored
Eliminate store forwarding stalls. Use shorter instruction encodings where possible. Misc. tweaks.
-
This one correctly sets the subsampling mode based on whether or not the plane is actually subsampled, and also interprets PL_CHROMA_UNKNOWN as PL_CHROMA_TOP_LEFT in such cases.
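A rough sketch of that inference (the helper and its parameters are hypothetical, and the header path is assumed; only the PL_CHROMA_* values come from libplacebo):

```c
#include <stdbool.h>
#include <libplacebo/colorspace.h>  /* enum pl_chroma_location */

/* Only treat a plane as chroma-subsampled when its dimensions are actually
 * smaller than luma, and fall back to PL_CHROMA_TOP_LEFT when the chroma
 * sample position is unsignaled. */
enum pl_chroma_location
pick_chroma_loc(int plane_w, int plane_h, int luma_w, int luma_h,
                enum pl_chroma_location signaled)
{
    bool subsampled = plane_w < luma_w || plane_h < luma_h;
    if (!subsampled)
        return signaled;  /* not subsampled, nothing to infer */
    return signaled == PL_CHROMA_UNKNOWN ? PL_CHROMA_TOP_LEFT : signaled;
}
```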
-
- May 25, 2020
-
-
Niklas Haas authored
libplacebo v66 got helper functions that make preserving the aspect ratio in this case trivial. But we still need to make sure to clear the FBO to black if the image doesn't cover it fully.
-
- May 20, 2020
-
-
Niklas Haas authored
Returning out of this function when pl_render_image() fails is the wrong thing to do, since that leaves the swapchain frame acquired but never submitted. Instead, just clear the target FBO to blank red (to make it clear that something went wrong) and continue on with presentation.
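Sketched below, assuming the libplacebo 2.x-era API in use at the time (function and struct names may differ in newer versions; the wiring of variables is illustrative, not the actual dav1d_play code):

```c
#include <libplacebo/renderer.h>
#include <libplacebo/swapchain.h>

void render_and_present(struct pl_renderer *rr, const struct pl_gpu *gpu,
                        const struct pl_swapchain *sw,
                        const struct pl_image *image,
                        const struct pl_render_target *target,
                        const struct pl_swapchain_frame *frame)
{
    if (!pl_render_image(rr, image, target, &pl_render_default_params)) {
        /* Don't return early: the swapchain frame is already acquired and
         * must still be submitted. Clear to red so the failure is visible. */
        pl_tex_clear(gpu, frame->fbo, (const float[4]){ 1.0f, 0.0f, 0.0f, 1.0f });
    }
    pl_swapchain_submit_frame(sw);
    pl_swapchain_swap_buffers(sw);
}
```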
-
- May 19, 2020
-
-
Jean-Baptiste Kempf authored
-
- May 18, 2020
-
-
Niklas Haas authored
Annoying minor differences in this struct layout mean we can't just memcpy the entire thing. Oh well. Note: technically, PL_API_VER 33 added this API, but PL_API_VER 63 is the minimum version of libplacebo that doesn't have glaring bugs when generating chroma grain, so we require that as a minimum instead. (I tested this version on some 4:2:2 and 4:2:0, 8-bit and 10-bit grain samples I had lying around and made sure the output was identical up to differences in rounding / dithering.)
-
Niklas Haas authored
Generalize the code to set the right pl_image metadata based on the values signaled in the Dav1dPictureParameters / Dav1dSequenceHeader. Some values are not mapped, in which case stdout will be spammed. Whatever. Hopefully somebody sees that error spam and opens a bug report for libplacebo to implement it.
-
Niklas Haas authored
Having the pl_image generation live in upload_planes() rather than render() will make it easier to set the correct pl_image metadata based on the Dav1dPicture headers moving forwards. Rename the function to make more sense, semantically. Reduce some code duplication by turning per-plane fields into arrays wherever appropriate. As an aside, also apply the correct chroma location rather than hard-coding it as PL_CHROMA_LEFT.
-
Niklas Haas authored
This is turned into a const array in upstream libplacebo, which generates warnings due to the implicit cast. Rewrite the code to have the mutable array live inside a separate variable `extensions` and only set `iparams.extensions` to this, rather than directly manipulating it.
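A generic sketch of that pattern (the struct here is a stand-in modeled on the message, not libplacebo's actual definition):

```c
/* Once the API declares the field as `const char * const *`, build the list
 * in a mutable array and only assign the pointer, instead of writing through
 * iparams.extensions (which would now warn or fail). */
struct inst_params {
    const char * const *extensions;  /* upstream made this const */
    int num_extensions;
};

static const char *extensions[2];    /* mutable array, outlives the setup call */

void fill_params(struct inst_params *iparams, const char *platform_ext) {
    int n = 0;
    if (platform_ext)
        extensions[n++] = platform_ext;
    iparams->extensions = extensions;  /* pointer assignment only, no implicit-cast warning */
    iparams->num_extensions = n;
}
```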
-
- May 16, 2020
-
-
Signed-off-by: Marvin Scholz <epirat07@gmail.com>
-