- Oct 07, 2019
-
Ronald S. Bultje authored
gen_grain_uv_ar0_8bpc_420_c:     30131.8
gen_grain_uv_ar0_8bpc_420_avx2:   6600.4
gen_grain_uv_ar1_8bpc_420_c:     46110.5
gen_grain_uv_ar1_8bpc_420_avx2:  17887.2
gen_grain_uv_ar2_8bpc_420_c:     73593.2
gen_grain_uv_ar2_8bpc_420_avx2:  26918.6
gen_grain_uv_ar3_8bpc_420_c:    114499.3
gen_grain_uv_ar3_8bpc_420_avx2:  29804.6
-
Martin Storsjö authored
Before:                Cortex A53     A72     A73
warp_8x8_8bpc_neon:        1997.3  1170.1  1199.9
warp_8x8t_8bpc_neon:       1982.4  1171.5  1192.6
After:
warp_8x8_8bpc_neon:        1954.6  1159.2  1153.3
warp_8x8t_8bpc_neon:       1938.5  1146.2  1136.7
-
- Oct 03, 2019
-
Prior checks were done at the sbrow level. This now allows dav1d_lr_sbrow and dav1d_lr_copy_lpf to be called only when there's something for them to do.
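A rough sketch of the resulting guard, using hypothetical context/field layouts (only the two function names above are from this change):

    /* Skip loop restoration entirely when no plane has a restoration
     * type enabled; structure and variable names are made up. */
    int restore = 0;
    for (int p = 0; p < 3; p++)
        if (frame_hdr->restoration.type[p] != DAV1D_RESTORATION_NONE)
            restore = 1;

    if (restore) {
        dav1d_lr_copy_lpf(f, src, sby);  /* only when there is work to do */
        dav1d_lr_sbrow(f, dst, sby);
    }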
-
- Oct 02, 2019
-
Martin Storsjö authored
-
Henrik Gramner authored
-
Henrik Gramner authored
The existing code was using 16-bit intermediate precision for certain calculations which is insufficient for some esoteric edge cases.
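As an illustration of the failure mode (not the actual code in question):

    #include <stdint.h>

    /* Accumulating products in int16_t silently wraps, while a 32-bit
     * intermediate handles the same inputs correctly. */
    int32_t dot(const uint8_t *px, const int8_t *coef, int n)
    {
        int32_t sum = 0;             /* 32-bit intermediate: safe */
        for (int i = 0; i < n; i++)
            sum += px[i] * coef[i];  /* up to 255*127 per tap soon exceeds
                                        INT16_MAX if summed in 16 bits */
        return sum;
    }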
-
Henrik Gramner authored
--list-functions now prints a list of all function names. Uses stdout for easy grepping/piping. Can be combined with the --test option to only list functions within a specific test. Also rename --list to --list-tests and make it print to stdout as well for consistency.
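For example (the test name "mc" here is just an assumption):

    checkasm --list-functions | grep warp    # grep for a substring
    checkasm --test=mc --list-functions      # list only one test's functions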
-
- Oct 01, 2019
-
Ronald S. Bultje authored
-
Martin Storsjö authored
Relative speedups over the C code:

                                 Cortex A53    A72    A73
intra_pred_dc_128_w4_8bpc_neon:        2.08   1.47   2.17
intra_pred_dc_128_w8_8bpc_neon:        3.33   2.49   4.03
intra_pred_dc_128_w16_8bpc_neon:       3.93   3.86   3.75
intra_pred_dc_128_w32_8bpc_neon:       3.14   3.79   2.90
intra_pred_dc_128_w64_8bpc_neon:       3.68   1.97   2.42
intra_pred_dc_left_w4_8bpc_neon:       2.41   1.70   2.23
intra_pred_dc_left_w8_8bpc_neon:       3.53   2.41   3.32
intra_pred_dc_left_w16_8bpc_neon:      3.87   3.54   3.34
intra_pred_dc_left_w32_8bpc_neon:      4.10   3.60   2.76
intra_pred_dc_left_w64_8bpc_neon:      3.72   2.00   2.39
intra_pred_dc_top_w4_8bpc_neon:        2.27   1.66   2.07
intra_pred_dc_top_w8_8bpc_neon:        3.83   2.69   3.43
intra_pred_dc_top_w16_8bpc_neon:       3.66   3.60   3.20
intra_pred_dc_top_w32_8bpc_neon:       3.92   3.54   2.66
intra_pred_dc_top_w64_8bpc_neon:       3.60   1.98   2.30
intra_pred_dc_w4_8bpc_neon:            2.29   1.42   2.16
intra_pred_dc_w8_8bpc_neon:            3.56   2.83   3.05
intra_pred_dc_w16_8bpc_neon:           3.46   3.37   3.15
intra_pred_dc_w32_8bpc_neon:           3.79   3.41   2.74
intra_pred_dc_w64_8bpc_neon:           3.52   2.01   2.41
intra_pred_h_w4_8bpc_neon:            10.34   5.74   5.94
intra_pred_h_w8_8bpc_neon:            12.13   6.33   6.43
intra_pred_h_w16_8bpc_neon:           10.66   7.31   5.85
intra_pred_h_w32_8bpc_neon:            6.28   4.18   2.88
intra_pred_h_w64_8bpc_neon:            3.96   1.85   1.75
intra_pred_v_w4_8bpc_neon:            11.44   6.12   7.57
intra_pred_v_w8_8bpc_neon:            14.76   7.58   7.95
intra_pred_v_w16_8bpc_neon:           11.34   6.28   5.88
intra_pred_v_w32_8bpc_neon:            6.56   3.33   3.34
intra_pred_v_w64_8bpc_neon:            4.57   1.24   1.97
-
- Sep 30, 2019
-
Victorien Le Couviour--Tuffet authored
------------------------------------------
x86_64: warp_8x8_8bpc_c:      1773.4
x86_32: warp_8x8_8bpc_c:      1740.4
----------
x86_64: warp_8x8_8bpc_ssse3:   317.5
x86_32: warp_8x8_8bpc_ssse3:   378.4
----------
x86_64: warp_8x8_8bpc_sse4:    303.7
x86_32: warp_8x8_8bpc_sse4:    367.7
----------
x86_64: warp_8x8_8bpc_avx2:    224.9
------------------------------------------
x86_64: warp_8x8t_8bpc_c:     1664.6
x86_32: warp_8x8t_8bpc_c:     1674.0
----------
x86_64: warp_8x8t_8bpc_ssse3:  320.7
x86_32: warp_8x8t_8bpc_ssse3:  379.5
----------
x86_64: warp_8x8t_8bpc_sse4:   304.8
x86_32: warp_8x8t_8bpc_sse4:   369.8
----------
x86_64: warp_8x8t_8bpc_avx2:   228.5
------------------------------------------
-
- Sep 29, 2019
-
Martin Storsjö authored
Don't add two 16-bit coefficients in 16 bits if the result isn't supposed to be clipped. This fixes mismatches for some samples, see issue #299.

Before:                                    Cortex A53      A72      A73
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:            93.0     52.8     49.5
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:           260.0    186.0    196.4
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:        1371.0    953.4   1028.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:        7363.2   4887.5   5135.8
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:       25029.0  17492.3  18404.5
After:
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:           105.0     58.7     55.2
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:           294.0    211.5    209.9
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:        1495.8   1050.4   1070.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:        7866.7   5197.8   5321.4
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:       25807.2  18619.3  18526.9
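The pattern being avoided, as a small hedged sketch:

    #include <stdint.h>

    /* Widen before adding so the sum of two int16_t coefficients
     * cannot wrap around when it is used unclipped. */
    static int32_t add_unclipped(int16_t a, int16_t b)
    {
        /* wrong: (int16_t)(a + b) wraps for e.g. 20000 + 20000 */
        return (int32_t)a + b;  /* right: 32-bit addition */
    }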
-
Martin Storsjö authored
The scaled form 2896>>4 shouldn't be necessary with valid bitstreams.
-
Martin Storsjö authored
Even though smull+smlal does two multiplications instead of one, the combination seems to be better handled by actual cores.

Before:                                      Cortex A53     A72     A73
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:           356.0   279.2   278.0
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:        1785.0  1329.5  1308.8
After:
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:           360.0   253.2   269.3
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:        1793.1  1300.9  1254.0

(In this particular case it seems like a minor regression on A53, which is probably due to having to change the ordering of some instructions, since smull+smlal+smull2+smlal2 overwrites the second output register sooner than an addl+addl2 would have. In general, though, smull+smlal seems to be equally good or better than addl+mul on A53 as well.)
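Expressed as NEON intrinsics rather than assembly (an illustration only, with made-up function names), the two patterns compared here compute (a + b) * c with a 32-bit result:

    #include <arm_neon.h>

    int32x4_t via_addl_mul(int16x4_t a, int16x4_t b, int16_t c)
    {
        /* addl + mul: widen first, then a single 32-bit multiply */
        return vmulq_n_s32(vaddl_s16(a, b), c);
    }

    int32x4_t via_smull_smlal(int16x4_t a, int16x4_t b, int16_t c)
    {
        /* smull + smlal: two multiplies, but the multiply-accumulate
         * chain maps better onto the cores' MAC pipelines */
        return vmlal_n_s16(vmull_n_s16(a, c), b, c);
    }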
-
- Sep 27, 2019
-
Right now this just allocates a new buffer for every frame, uses it, then discards it immediately. This is not optimal: either dav1d should start reusing buffers internally, or we need to pool them in dav1dplay. As it stands, this is not really a performance gain. I'll have to investigate why, but my suspicion is that seeing any gains might require reusing buffers somewhere.

Note: thrashing buffers is not as bad as it initially seems. Not only does libplacebo pool and reuse GPU memory and buffer state objects internally, but this also absolves us from having to do any manual polling to figure out when a buffer is reusable again. So creating, using and immediately destroying buffers isn't as bad an approach as it might otherwise seem. It's entirely possible that this is only bad because of lock contention. As said, I'll have to investigate further...
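A minimal sketch of the kind of pooling that could be tried in dav1dplay, with entirely hypothetical names (the real libplacebo integration would look different):

    #include <stdlib.h>

    typedef struct node { void *buf; size_t size; struct node *next; } node;
    static node *pool;  /* a real pool needs a lock, which is exactly
                           where the suspected contention could appear */

    static void *pool_get(size_t size)
    {
        /* reuse a returned buffer of the right size if one exists */
        for (node **p = &pool; *p; p = &(*p)->next)
            if ((*p)->size == size) {
                node *n = *p; *p = n->next;
                void *buf = n->buf;
                free(n);
                return buf;
            }
        return malloc(size);  /* miss: allocate fresh */
    }

    static void pool_put(void *buf, size_t size)
    {
        node *n = malloc(sizeof(*n));
        n->buf = buf; n->size = size; n->next = pool; pool = n;
    }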
-
Useful to test the effects of performance changes to the decoding/rendering loop as a whole.
-
Only meaningful with libplacebo. The defaults are higher quality than SDL so it's an unfair comparison and definitely too much for slow iGPUs at 4K res. Make the defaults fast/dumb processing only, and guard the debanding/dithering/upscaling/etc. behind a new --highquality flag.
-
- Sep 19, 2019
-
Victorien Le Couviour--Tuffet authored
------------------------------------------
x86_64: lpf_h_sb_uv_w4_8bpc_c:       430.6
x86_32: lpf_h_sb_uv_w4_8bpc_c:       788.6
x86_64: lpf_h_sb_uv_w4_8bpc_ssse3:   322.0
x86_32: lpf_h_sb_uv_w4_8bpc_ssse3:   302.4
---------------------
x86_64: lpf_h_sb_uv_w6_8bpc_c:       981.9
x86_32: lpf_h_sb_uv_w6_8bpc_c:      1579.6
x86_64: lpf_h_sb_uv_w6_8bpc_ssse3:   421.5
x86_32: lpf_h_sb_uv_w6_8bpc_ssse3:   431.6
---------------------
x86_64: lpf_h_sb_y_w4_8bpc_c:       3001.7
x86_32: lpf_h_sb_y_w4_8bpc_c:       7021.3
x86_64: lpf_h_sb_y_w4_8bpc_ssse3:    466.3
x86_32: lpf_h_sb_y_w4_8bpc_ssse3:    564.7
---------------------
x86_64: lpf_h_sb_y_w8_8bpc_c:       4457.7
x86_32: lpf_h_sb_y_w8_8bpc_c:       3657.8
x86_64: lpf_h_sb_y_w8_8bpc_ssse3:    818.9
x86_32: lpf_h_sb_y_w8_8bpc_ssse3:    927.9
---------------------
x86_64: lpf_h_sb_y_w16_8bpc_c:      1967.9
x86_32: lpf_h_sb_y_w16_8bpc_c:      3343.5
x86_64: lpf_h_sb_y_w16_8bpc_ssse3:  1836.7
x86_32: lpf_h_sb_y_w16_8bpc_ssse3:  1975.0
---------------------
x86_64: lpf_v_sb_uv_w4_8bpc_c:       369.4
x86_32: lpf_v_sb_uv_w4_8bpc_c:       793.6
x86_64: lpf_v_sb_uv_w4_8bpc_ssse3:   110.9
x86_32: lpf_v_sb_uv_w4_8bpc_ssse3:   133.0
---------------------
x86_64: lpf_v_sb_uv_w6_8bpc_c:       769.6
x86_32: lpf_v_sb_uv_w6_8bpc_c:      1576.7
x86_64: lpf_v_sb_uv_w6_8bpc_ssse3:   222.2
x86_32: lpf_v_sb_uv_w6_8bpc_ssse3:   232.2
---------------------
x86_64: lpf_v_sb_y_w4_8bpc_c:        772.4
x86_32: lpf_v_sb_y_w4_8bpc_c:       2596.5
x86_64: lpf_v_sb_y_w4_8bpc_ssse3:    179.8
x86_32: lpf_v_sb_y_w4_8bpc_ssse3:    234.7
---------------------
x86_64: lpf_v_sb_y_w8_8bpc_c:       1660.2
x86_32: lpf_v_sb_y_w8_8bpc_c:       3979.9
x86_64: lpf_v_sb_y_w8_8bpc_ssse3:    468.3
x86_32: lpf_v_sb_y_w8_8bpc_ssse3:    580.9
---------------------
x86_64: lpf_v_sb_y_w16_8bpc_c:      1889.6
x86_32: lpf_v_sb_y_w16_8bpc_c:      4728.7
x86_64: lpf_v_sb_y_w16_8bpc_ssse3:  1142.0
x86_32: lpf_v_sb_y_w16_8bpc_ssse3:  1174.8
------------------------------------------
-
x86_64:
------------------------------------------
lpf_h_sb_uv_w4_8bpc_c:       430.6
lpf_h_sb_uv_w4_8bpc_ssse3:   322.0
lpf_h_sb_uv_w4_8bpc_avx2:    200.4
---------------------
lpf_h_sb_uv_w6_8bpc_c:       981.9
lpf_h_sb_uv_w6_8bpc_ssse3:   421.5
lpf_h_sb_uv_w6_8bpc_avx2:    270.0
---------------------
lpf_h_sb_y_w4_8bpc_c:       3001.7
lpf_h_sb_y_w4_8bpc_ssse3:    466.3
lpf_h_sb_y_w4_8bpc_avx2:     383.1
---------------------
lpf_h_sb_y_w8_8bpc_c:       4457.7
lpf_h_sb_y_w8_8bpc_ssse3:    818.9
lpf_h_sb_y_w8_8bpc_avx2:     537.0
---------------------
lpf_h_sb_y_w16_8bpc_c:      1967.9
lpf_h_sb_y_w16_8bpc_ssse3:  1836.7
lpf_h_sb_y_w16_8bpc_avx2:   1078.2
---------------------
lpf_v_sb_uv_w4_8bpc_c:       369.4
lpf_v_sb_uv_w4_8bpc_ssse3:   110.9
lpf_v_sb_uv_w4_8bpc_avx2:     58.1
---------------------
lpf_v_sb_uv_w6_8bpc_c:       769.6
lpf_v_sb_uv_w6_8bpc_ssse3:   222.2
lpf_v_sb_uv_w6_8bpc_avx2:    117.8
---------------------
lpf_v_sb_y_w4_8bpc_c:        772.4
lpf_v_sb_y_w4_8bpc_ssse3:    179.8
lpf_v_sb_y_w4_8bpc_avx2:     173.6
---------------------
lpf_v_sb_y_w8_8bpc_c:       1660.2
lpf_v_sb_y_w8_8bpc_ssse3:    468.3
lpf_v_sb_y_w8_8bpc_avx2:     345.8
---------------------
lpf_v_sb_y_w16_8bpc_c:      1889.6
lpf_v_sb_y_w16_8bpc_ssse3:  1142.0
lpf_v_sb_y_w16_8bpc_avx2:    568.1
------------------------------------------
-
- Sep 10, 2019
-
Ronald S. Bultje authored
fguv_32x32xn_8bpc_420_csfl0_c:    8945.4
fguv_32x32xn_8bpc_420_csfl0_avx2: 1001.6
fguv_32x32xn_8bpc_420_csfl1_c:    6363.4
fguv_32x32xn_8bpc_420_csfl1_avx2: 1299.5
-
Ronald S. Bultje authored
This would affect the output in samples with an odd width and horizontal chroma subsampling. The check does not exist in libaom, and might cause mismatches. This causes issues in the sample from #210, which uses super-resolution and has odd width. To work around this, make super-resolution's resize() always write an even number of pixels. This should not interfere with SIMD in the future.
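The workaround amounts to rounding the number of output pixels up to an even count before the resize loop, roughly (hypothetical variable name):

    /* Sketch: always write an even number of pixels so the removed
     * odd-width chroma check is never needed. */
    const int dst_w_even = (dst_w + 1) & ~1;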
-
Ronald S. Bultje authored
fgy_32x32xn_8bpc_c:        16181.8
fgy_32x32xn_8bpc_avx2:      3231.4
gen_grain_y_ar0_8bpc_c:   108857.6
gen_grain_y_ar0_8bpc_avx2: 22826.7
gen_grain_y_ar1_8bpc_c:   168239.8
gen_grain_y_ar1_8bpc_avx2: 72117.2
gen_grain_y_ar2_8bpc_c:   266165.9
gen_grain_y_ar2_8bpc_avx2: 126281.8
gen_grain_y_ar3_8bpc_c:   448139.4
gen_grain_y_ar3_8bpc_avx2: 137047.1
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
-
- Sep 06, 2019
-
James Almer authored
Both values can be independently coded in the bitstream, and are not always equal to frame_width and frame_height.
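Consumed from the public API, this might look as follows; the field names are taken as an assumption about dav1d's headers:

    /* Sketch: a player picking its display size from the frame header's
     * render dimensions rather than the coded frame dimensions. */
    const Dav1dFrameHeader *hdr = pic->frame_hdr;
    const int display_w = hdr->render_width;   /* may differ from pic->p.w */
    const int display_h = hdr->render_height;  /* may differ from pic->p.h */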
-
- Sep 05, 2019
-
Henrik Gramner authored
For some reason the MSVC CRT _wassert() function is not flagged as __declspec(noreturn), so when using those headers the compiler will expect execution to continue after an assertion has been triggered and will therefore complain about the use of uninitialized variables when compiled in debug mode in certain code paths. Reorder some case statements as a workaround.
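The class of warning looks roughly like this (illustrative code, not from dav1d):

    #include <assert.h>

    int pick(int x)
    {
        int v;
        switch (x) {
        case 0: v = 1; break;
        case 1: v = 2; break;
        default: assert(0);  /* without noreturn on the assert handler,
                                the compiler assumes execution continues
                                with v still uninitialized */
        }
        return v;  /* -> bogus "uninitialized variable" warning here */
    }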
-
For w <= 32 we can't process more than two rows per loop iteration. Credit to OSS-Fuzz.
-
- Sep 04, 2019
-
16-bit precision is sufficient for the second pass, but the first pass requires 32-bit precision to correctly handle some esoteric edge cases.
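Schematically (a hedged sketch, not the actual transform code):

    #include <stdint.h>

    /* Pass 1 widens to 32 bits; pass 2 can stay 16-bit. */
    static void butterfly_pass1(int32_t *lo, int32_t *hi, int32_t a, int32_t b)
    {
        *lo = a + b;  /* may exceed the int16_t range for extreme inputs */
        *hi = a - b;
    }

    static void butterfly_pass2(int16_t *lo, int16_t *hi, int16_t a, int16_t b)
    {
        *lo = (int16_t)(a + b);  /* 16 bits suffice once pass 1 has
                                    bounded the intermediate range */
        *hi = (int16_t)(a - b);
    }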
-
Martin Storsjö authored
See issue #295; this fixes it for arm64.

Before:                                      Cortex A53     A72     A73
inv_txfm_add_4x4_adst_adst_1_8bpc_neon:           103.0    63.2    65.2
inv_txfm_add_4x8_adst_adst_1_8bpc_neon:           197.0   145.0   134.2
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:           332.0   248.0   247.1
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:        1676.8  1197.0  1186.8
After:
inv_txfm_add_4x4_adst_adst_1_8bpc_neon:           103.0    76.4    67.0
inv_txfm_add_4x8_adst_adst_1_8bpc_neon:           205.0   155.0   143.8
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:           358.0   269.0   276.2
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:        1785.2  1347.8  1312.1

This would probably only be needed for adst in the first pass, but the additional code complexity of splitting the implementations (as we currently don't have transforms differentiated between first and second pass) isn't necessarily worth it; the speedup over C code is still 8-10x.
-
Henrik Gramner authored
__assume() doesn't work correctly in clang-cl versions prior to 7.0.0 which causes bogus warnings regarding use of uninitialized variables to be printed. Avoid that by using __builtin_unreachable() instead.
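A sketch of such a portability shim, with a hypothetical macro name; clang-cl defines both __clang__ and _MSC_VER, so testing __clang__ first routes it to __builtin_unreachable():

    #if defined(__clang__) || defined(__GNUC__)
    #define unreachable() __builtin_unreachable()
    #elif defined(_MSC_VER)
    #define unreachable() __assume(0)
    #else
    #define unreachable() do {} while (0)
    #endif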
-
Henrik Gramner authored
clang-cl doesn't like function calls in __assume statements, even trivial inline ones.
-
This large constant needs a movw instruction, which newer binutils can figure out on their own, but older versions need it stated explicitly. This fixes #296.
-
- Sep 03, 2019
-
Janne Grunau authored
The chroma part of pal_idx potentially conflicts with edge_{8,16}bpc during intra reconstruction. Fixes out-of-range pixel values caused by invalid palette indices in clusterfuzz-testcase-minimized-dav1d_fuzzer_mt-5076736684851200. Fixes #294. Reported as integer overflows in boxsum5sqr with the undefined behavior sanitizer. Credits to oss-fuzz.
-
- Sep 01, 2019
-
Janne Grunau authored
-
- Aug 30, 2019
-
Ronald S. Bultje authored
Fixes libaom/dav1d mismatch in av1-1-b10-23-film_grain-50.ivf.
-
Ronald S. Bultje authored
-
- calculate chroma grain based on src (not dst) luma pixels;
- division should precede multiplication in the delta calculation.

Together, these fix differences in film grain reconstruction between libaom and dav1d for various generated samples. For the second point, see the sketch below.
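A hedged sketch of why the ordering matters (hypothetical names, not the actual grain code):

    /* Integer division truncates, so the order of operations changes
     * the result; matching libaom requires dividing first. */
    static int grain_delta(int avg, int coeff, int divisor)
    {
        /* old: avg * coeff / divisor  (multiplication first) */
        return avg / divisor * coeff;  /* fixed: division first */
    }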
-
- Aug 29, 2019
-
B Krishnan Iyer authored
-
Martin Storsjö authored
Use the so far unused lr register instead of r10.
-