Commits · 6ed5fafb42c651c24b6a65fd4f50ed426fd72d65 · Pranav Kant / dav1d

Jan 01, 2021
- Update NEWS for 0.8.1 · 6ed5fafb
  Janne Grunau authored 4 years ago
  
  0.8.1
  
  6ed5fafb
Dec 16, 2020

CI: Run multithreaded tests using thread sanitizer (tsan) · 7424f8e8
Henrik Gramner authored 4 years ago and Henrik Gramner committed 4 years ago

7424f8e8

arm32: mc: Add NEON implementation of emu_edge for 16 bpc · 38df0efa

Martin Storsjö authored 4 years ago

Checkasm benchmarks:
                         Cortex A7       A8      A53      A72      A73
emu_edge_w4_16bpc_neon:      375.0    312.6    268.3    159.3    170.0
emu_edge_w8_16bpc_neon:      619.3    425.5    435.5    249.5    291.1
emu_edge_w16_16bpc_neon:     719.1    568.3    506.9    324.2    314.4
emu_edge_w32_16bpc_neon:    2112.2   1677.7   1396.2   1050.5   1009.6
emu_edge_w64_16bpc_neon:    5046.8   4322.5   3693.7   3953.8   2682.8
emu_edge_w128_16bpc_neon:  16311.1  14341.3  12877.8  26183.5   8924.9

Corresponding numbers for arm64, for comparison:
                        Cortex A53      A72      A73
emu_edge_w4_16bpc_neon:      302.5    174.9    159.2
emu_edge_w8_16bpc_neon:      344.6    292.3    273.2
emu_edge_w16_16bpc_neon:     601.0    461.2    316.8
emu_edge_w32_16bpc_neon:     974.2   1274.7    960.5
emu_edge_w64_16bpc_neon:    2853.1   3527.6   2633.5
emu_edge_w128_16bpc_neon:  14633.5  26776.6   7236.0

38df0efa

arm32: mc: Add NEON implementations of the w_mask functions for 16 bpc · cf74bdec

Martin Storsjö authored 4 years ago

Checkasm numbers:
                           Cortex A7       A8      A53      A72      A73
w_mask_420_w4_16bpc_neon:      350.3    216.4    215.4    141.7    134.5
w_mask_420_w8_16bpc_neon:      926.7    590.9    529.1    373.8    354.5
w_mask_420_w16_16bpc_neon:    2956.7   1880.4   1654.8   1186.1   1134.1
w_mask_420_w32_16bpc_neon:   11489.3   7426.4   6314.1   4599.8   4398.6
w_mask_420_w64_16bpc_neon:   28175.9  17898.1  16002.8  11079.0  10551.8
w_mask_420_w128_16bpc_neon:  71599.4  44630.9  40696.9  28057.3  27836.5
w_mask_422_w4_16bpc_neon:      339.0    210.1    206.7    137.3    134.7
w_mask_422_w8_16bpc_neon:      887.2    573.3    499.6    361.6    353.5
w_mask_422_w16_16bpc_neon:    2918.0   1841.6   1593.0   1194.0   1157.9
w_mask_422_w32_16bpc_neon:   11313.8   7238.7   6043.4   4577.1   4469.6
w_mask_422_w64_16bpc_neon:   27746.5  17427.2  15386.9  11082.6  10693.8
w_mask_422_w128_16bpc_neon:  70521.4  43864.9  39209.3  29045.7  28305.5
w_mask_444_w4_16bpc_neon:      325.6    202.9    198.4    135.2    129.3
w_mask_444_w8_16bpc_neon:      860.7    534.9    474.8    358.0    352.2
w_mask_444_w16_16bpc_neon:    2764.3   1714.4   1517.8   1160.6   1133.1
w_mask_444_w32_16bpc_neon:   10719.8   6738.3   5746.7   4458.6   4347.1
w_mask_444_w64_16bpc_neon:   26407.9  16224.1  14783.9  10784.3  10371.4
w_mask_444_w128_16bpc_neon:  67226.1  41060.1  37823.1  41696.1  27722.2

Corresponding numbers for arm64, for comparison:
                          Cortex A53      A72      A73
w_mask_420_w4_16bpc_neon:      173.6    123.6    120.3
w_mask_420_w8_16bpc_neon:      484.0    344.0    329.4
w_mask_420_w16_16bpc_neon:    1436.3   1025.7   1028.7
w_mask_420_w32_16bpc_neon:    5597.0   3994.8   3981.2
w_mask_420_w64_16bpc_neon:   13953.4   9700.8   9579.9
w_mask_420_w128_16bpc_neon:  35833.7  25519.3  24277.8
w_mask_422_w4_16bpc_neon:      159.4    111.7    114.2
w_mask_422_w8_16bpc_neon:      453.4    326.2    326.7
w_mask_422_w16_16bpc_neon:    1398.2   1063.3   1052.6
w_mask_422_w32_16bpc_neon:    5532.7   4143.0   4026.3
w_mask_422_w64_16bpc_neon:   13885.3   9978.0   9689.8
w_mask_422_w128_16bpc_neon:  35763.3  25822.4  24610.9
w_mask_444_w4_16bpc_neon:      152.9    110.0    112.8
w_mask_444_w8_16bpc_neon:      437.2    332.0    325.8
w_mask_444_w16_16bpc_neon:    1399.3   1068.9   1041.7
w_mask_444_w32_16bpc_neon:    5410.9   4139.7   4136.9
w_mask_444_w64_16bpc_neon:   13648.7  10011.8  10004.6
w_mask_444_w128_16bpc_neon:  35639.6  26910.8  25631.0

cf74bdec

arm32: mc: Add NEON implementation of the blend functions for 16 bpc · f809edb4

Martin Storsjö authored 4 years ago

Checkasm numbers:
                        Cortex A7       A8      A53      A72      A73
blend_h_w2_16bpc_neon:      190.0    163.0    135.5     67.4     71.2
blend_h_w4_16bpc_neon:      204.4    119.1    140.3     61.2     74.9
blend_h_w8_16bpc_neon:      247.6    126.2    159.5     86.1     88.4
blend_h_w16_16bpc_neon:     391.6    186.5    230.7    134.9    149.4
blend_h_w32_16bpc_neon:     734.9    354.2    454.1    248.1    270.9
blend_h_w64_16bpc_neon:    1290.8    611.7    801.1    456.6    491.3
blend_h_w128_16bpc_neon:   2876.4   1354.2   1788.6   1083.4   1092.0
blend_v_w2_16bpc_neon:      264.4    325.2    206.8    107.6    123.0
blend_v_w4_16bpc_neon:      471.8    358.7    356.9    187.0    229.9
blend_v_w8_16bpc_neon:      616.9    365.3    445.4    218.2    248.5
blend_v_w16_16bpc_neon:     928.3    517.1    629.1    325.0    358.0
blend_v_w32_16bpc_neon:    1771.6    790.1   1106.1    631.2    584.7
blend_w4_16bpc_neon:        128.8     66.6     95.5     33.5     42.0
blend_w8_16bpc_neon:        238.7    118.0    156.8     76.5     84.5
blend_w16_16bpc_neon:       809.7    360.9    482.3    268.5    298.3
blend_w32_16bpc_neon:      2015.7    916.6   1177.0    682.1    730.9

Corresponding numbers for arm64, for comparison:
                       Cortex A53      A72      A73
blend_h_w2_16bpc_neon:      109.3     83.1     56.8
blend_h_w4_16bpc_neon:      114.1     61.1     62.3
blend_h_w8_16bpc_neon:      133.3     80.8     81.0
blend_h_w16_16bpc_neon:     215.6    132.7    149.5
blend_h_w32_16bpc_neon:     390.4    253.9    235.8
blend_h_w64_16bpc_neon:     715.8    455.8    454.0
blend_h_w128_16bpc_neon:   1649.7   1034.7   1066.2
blend_v_w2_16bpc_neon:      185.9    176.3    178.3
blend_v_w4_16bpc_neon:      338.3    184.4    234.3
blend_v_w8_16bpc_neon:      427.0    214.5    252.7
blend_v_w16_16bpc_neon:     680.4    358.1    389.2
blend_v_w32_16bpc_neon:    1100.7    615.5    690.1
blend_w4_16bpc_neon:         76.0     32.3     32.1
blend_w8_16bpc_neon:        134.4     76.3     71.5
blend_w16_16bpc_neon:       476.3    268.8    301.5
blend_w32_16bpc_neon:      1226.8    659.9    782.8

f809edb4

arm64: mc16: Get rid of one instruction in blend_v w16 · eeb03a73
Martin Storsjö authored 4 years ago

eeb03a73
arm32: mc16: Fix column alignment in the warp function · f3197c1a
Martin Storsjö authored 4 years ago

f3197c1a
arm32: mc: Improve scheduling in blend_h · 9257a961
Martin Storsjö authored 4 years ago

9257a961
arm32: mc: Use a replicating vld1 to all lanes in one place · 85de1c3b
Martin Storsjö authored 4 years ago
```
This is one cycle faster, when the other lanes don't need to be
preserved, on some (old) cores.
```
85de1c3b
arm32: mc: Use two-word replicating loads in emu_edge · 9381637a
Martin Storsjö authored 4 years ago

9381637a
arm32: mc: Back up and restore fewer registers in blend_h/blend_v · c6df7491
Martin Storsjö authored 4 years ago

c6df7491
arm32: Use ldrd for loading two parameters from the stack · b0c97120
Martin Storsjö authored 4 years ago

b0c97120

Dec 15, 2020

meson: Increase checkasm timeout · ea65e1ab
Henrik Gramner authored 4 years ago and Henrik Gramner committed 4 years ago
```
The default 30 second timeout may be insufficient when running
under certain sanitizers, especially on slower CPUs.
```
ea65e1ab
x86: Fix out-of-bounds read in AVX2 wiener_filter · 1571f65a
Henrik Gramner authored 4 years ago

1571f65a

Revert "meson: Handle the b_lto option as a string option for newer meson versions" · 5a88f60f

Martin Storsjö authored 4 years ago

This reverts commit 920079ed.

Upstream meson reverted the breaking change, see
https://github.com/mesonbuild/meson/issues/7493#issuecomment-729020325,
so currently the forward-compatibility code was producing warnings
in builds with newer meson instead.

5a88f60f

Dec 12, 2020

arm32: loopfilter: NEON implementation of loopfilter for 16 bpc · 802790f1

Martin Storsjö authored 4 years ago

This operates on 4 pixels at a time, while the arm64 version
operated on 8 pixels at a time.

As the registers only fit one single 4 pixel wide slice (with one
single set of input parameters and mask bits), the high level
logic for calculating those input parameters is done with GPRs
and scalar instructions instead of SIMD as in the other implementations.

802790f1

arm64: loopfilter16: Fix conditions for skipping parts of the filtering · 2a448fde

Martin Storsjö authored 4 years ago

As the arm64 16 bpc loopfilter operates on a 8 pixel region at a time,
inspect 2 bits (corresponding to 4 pixels each) from these registers,
as we also shift them down by 2 bits at the end of the loop.

This should allow skipping the loopfilter altogether (or using a
smaller filter) in more cases.

2a448fde

arm32: loopfilter: Fix a misindented/aligned operand · c1a5e445
Martin Storsjö authored 4 years ago

c1a5e445
arm: loopfilter: Compare L != 0 before doing a splat · b252334a
Martin Storsjö authored 4 years ago

b252334a

x86: Rewrite wiener SSE2/SSSE3/AVX2 asm · 78d27b7d

Henrik Gramner authored 4 years ago

The previous implementation did two separate passes in the horizontal
and vertical directions, with the intermediate values being stored
in a buffer on the stack. This caused bad cache thrashing.

By interleaving the horizontal and vertical passes in combination
with a ring buffer for storing only a few rows at a time the
performance is improved by a significant amount.

Also split the function into 7-tap and 5-tap versions. The latter is
faster and fairly common (always for chroma, sometimes for luma).
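
The interleaving described above can be sketched in scalar C. This is only an illustration of the ring-buffer idea, not the actual dav1d assembly: `filter_2d_ring`, `filter_h_row`, `clampi`, `TAPS` and `MAX_W` are hypothetical names, the taps are plain integers, and edges are handled by simple replication.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS  5   /* hypothetical: models the 5-tap variant */
#define MAX_W 64  /* hypothetical upper bound on the row width */

static int clampi(int v, int lo, int hi) {
    return v < lo ? lo : v > hi ? hi : v;
}

/* Horizontal pass over a single row, replicating edge pixels. */
static void filter_h_row(int *dst, const int *src, int w, const int *taps) {
    for (int x = 0; x < w; x++) {
        int sum = 0;
        for (int k = 0; k < TAPS; k++)
            sum += taps[k] * src[clampi(x + k - TAPS / 2, 0, w - 1)];
        dst[x] = sum;
    }
}

/* Separable 2D filter. Horizontally filtered rows live in a ring of TAPS
 * rows instead of a full-height intermediate buffer. */
void filter_2d_ring(int *dst, const int *src, int w, int h, const int *taps) {
    int ring[TAPS][MAX_W];
    int filled = 0; /* number of source rows already filtered horizontally */

    assert(w >= 1 && w <= MAX_W && h >= 1);
    for (int y = 0; y < h; y++) {
        /* Run the horizontal pass just far enough ahead for output row y. */
        const int last = clampi(y + TAPS / 2, 0, h - 1);
        for (; filled <= last; filled++)
            filter_h_row(ring[filled % TAPS], src + (size_t)filled * w, w, taps);

        /* Vertical pass over the rows currently held in the ring. */
        for (int x = 0; x < w; x++) {
            int sum = 0;
            for (int k = 0; k < TAPS; k++) {
                const int row = clampi(y + k - TAPS / 2, 0, h - 1);
                sum += taps[k] * ring[row % TAPS][x];
            }
            dst[(size_t)y * w + x] = sum;
        }
    }
}
```

In this sketch the intermediate storage is `TAPS` rows regardless of the image height, which is the property that keeps the working set small.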

78d27b7d

x86: Rename looprestoration_ssse3.asm to looprestoration_sse.asm · 3497c4c9
Henrik Gramner authored 4 years ago
```
It contains both SSE2 and SSSE3 code.
```
3497c4c9

Add miscellaneous minor wiener optimizations · 2737c05e

Henrik Gramner authored 4 years ago

Combine horizontal and vertical filter pointers into a single parameter
when calling the wiener DSP function.

Eliminate the +128 filter coefficient handling where possible.

2737c05e

Use smaller data types for wiener filter coefficients · fdf1570e
Henrik Gramner authored 4 years ago
```
Reduces memory usage by 96 bytes per sb.
```
fdf1570e
Simplify msac subexp decoding · 6f7e5cb3
Henrik Gramner authored 4 years ago

6f7e5cb3

Dec 11, 2020
- fuzzer: Test calling dav1d_picture_unref() after dav1d_close() · f0f73b4c
  Henrik Gramner authored 4 years ago and Henrik Gramner committed 4 years ago
```
Covers the use case of keeping a reference to a Dav1dPicture
after closing the decoder.
```
  f0f73b4c
Dec 10, 2020

Fix use of references to buffers after calling dav1d_close() · 135286f4

Henrik Gramner authored 4 years ago and Henrik Gramner committed 4 years ago

9057d286 had the side effect of causing references to buffers allocated
using memory pools to no longer be valid after closing the decoder.

Restore this functionality by making buffer pools reference counted.
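
A minimal sketch of what a reference-counted pool could look like, assuming a single-threaded caller (a real decoder would need atomic counters). The names `Pool`, `Buf`, `pool_get`, `pool_put` and `pool_close` are hypothetical and are not dav1d's API. The pool holds one reference for the decoder plus one per outstanding buffer, and is only destroyed when the last reference goes away.

```c
#include <stdlib.h>

typedef struct Pool Pool;

typedef struct Buf {
    Pool *pool;              /* owning pool, kept alive by this buffer */
    struct Buf *next;        /* free-list link */
    unsigned char data[64];  /* hypothetical fixed payload size */
} Buf;

struct Pool {
    Buf *free_list;
    int refs;                /* 1 for the decoder + 1 per buffer in use */
};

Pool *pool_create(void) {
    Pool *p = calloc(1, sizeof(*p));
    if (p) p->refs = 1;
    return p;
}

/* Drop one reference. Returns 1 if the pool was destroyed. */
static int pool_unref(Pool *p) {
    if (--p->refs > 0) return 0;
    while (p->free_list) {
        Buf *b = p->free_list;
        p->free_list = b->next;
        free(b);
    }
    free(p);
    return 1;
}

/* Take a buffer from the pool, allocating one if the free list is empty. */
Buf *pool_get(Pool *p) {
    Buf *b = p->free_list;
    if (b) p->free_list = b->next;
    else if (!(b = malloc(sizeof(*b)))) return NULL;
    b->pool = p;
    b->next = NULL;
    p->refs++;
    return b;
}

/* Return a buffer; this may destroy the pool if the decoder is closed. */
int pool_put(Buf *b) {
    Pool *p = b->pool;
    b->next = p->free_list;
    p->free_list = b;
    return pool_unref(p);
}

/* Called when the decoder closes; outstanding buffers stay valid. */
int pool_close(Pool *p) {
    return pool_unref(p);
}
```

With this layout, calling `pool_close()` while a buffer is still held leaves both the buffer and the pool intact, and the final `pool_put()` frees everything.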

135286f4

Dec 01, 2020

arm32: looprestoration: NEON implementation of SGR for 10 bpc · e705519d

Martin Storsjö authored 4 years ago

Checkasm numbers:
                            Cortex A7         A8        A53        A72        A73
selfguided_3x3_10bpc_neon:   919127.6   717942.8   565717.8   404748.0   372179.8
selfguided_5x5_10bpc_neon:   640310.8   511873.4   370653.3   273593.7   256403.2
selfguided_mix_10bpc_neon:  1533887.0  1252389.5   922111.1   659033.4   613410.6

Corresponding numbers for arm64, for comparison:
                          Cortex A53       A72       A73
selfguided_3x3_10bpc_neon:  500706.0  367199.2  345261.2
selfguided_5x5_10bpc_neon:  361403.3  270550.0  249955.3
selfguided_mix_10bpc_neon:  846172.4  623590.3  578404.8

e705519d

arm32: looprestoration: Prepare for 16 bpc by splitting code to separate files · e1be33b9

Martin Storsjö authored 5 years ago

looprestoration_common.S contains functions that can be used as is
with one single instantiation of the functions for both 8 and 16 bpc.
This file will be built once, regardless of which bitdepths are enabled.

looprestoration_tmpl.S contains functions where the source can be shared
and templated between 8 and 16 bpc. This will be included by the separate
8/16bpc implementation files.

e1be33b9

arm: looprestoration16: Fix comments referring to pixels as bytes · c58e9d57

Martin Storsjö authored 4 years ago

A number of other similar comments were updated to say pixels when
the 16 bpc code was written originally, but these were missed.

c58e9d57

arm64: looprestoration: Add a missed parameter in a comment · 25877c3b
Martin Storsjö authored 4 years ago
```
Make it consistent with the weighted1 function.
```
25877c3b
arm32: looprestoration: Remove an unnecessary stack arg load in SGR · ca9cd497
Martin Storsjö authored 4 years ago
```
For the existing 8 bpc support, there's no stack argument to load
into r8.
```
ca9cd497
arm32: looprestoration: Specify alignment in loads/stores in SGR where possible · b7c66fa6
Martin Storsjö authored 4 years ago

b7c66fa6

arm64: looprestoration16: Don't keep precalculated squares in box3/5_h · cbd4827f

Martin Storsjö authored 4 years ago

Instead of calculating squares of pixels once, and shifting and
adding the precalculated squares, just do multiply-accumulate of
the pixels that are shifted anyway for the non-squared sum. This
results in more multiplications in total, but fewer instructions,
and multiplications aren't that much more expensive than regular
arithmetic operations anyway.

On Cortex A53 and A72, this is a fairly substantial gain, on A73
it's a very marginal gain.

The runtimes for the box3/5_h functions themselves are reduced
by around 16-20%, and the overall runtime for SGR is reduced
by around 2-8%.

Before:                   Cortex A53       A72       A73
selfguided_3x3_10bpc_neon:  513086.5  385767.7  348774.3
selfguided_5x5_10bpc_neon:  378108.6  291133.5  253251.4
selfguided_mix_10bpc_neon:  876833.1  662801.0  586387.4

After:                    Cortex A53       A72       A73
selfguided_3x3_10bpc_neon:  502734.0  363754.5  343199.8
selfguided_5x5_10bpc_neon:  361696.4  265848.2  249476.8
selfguided_mix_10bpc_neon:  852683.8  615848.6  577615.0
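
The two strategies compared in this commit can be contrasted in scalar C. This is a rough model only: the real code is NEON assembly, and `box3_h_presquared`, `box3_h_mla` and `BOX_MAX_W` are hypothetical names. Both functions compute the horizontal 3-pixel sum and sum of squares for `w` outputs from `w + 2` input pixels, and produce identical results.

```c
#include <assert.h>
#include <stdint.h>

#define BOX_MAX_W 64 /* hypothetical bound for the scratch array */

/* Old approach: square every pixel once, then add neighbouring squares. */
void box3_h_presquared(uint32_t *sumsq, uint16_t *sum,
                       const uint16_t *src, int w) {
    uint32_t sq[BOX_MAX_W + 2];
    assert(w >= 1 && w <= BOX_MAX_W);
    for (int x = 0; x < w + 2; x++)
        sq[x] = (uint32_t)src[x] * src[x];
    for (int x = 0; x < w; x++) {
        sum[x]   = (uint16_t)(src[x] + src[x + 1] + src[x + 2]);
        sumsq[x] = sq[x] + sq[x + 1] + sq[x + 2];
    }
}

/* New approach: multiply-accumulate the same shifted pixels that are
 * already loaded for the plain sum; more multiplies, no squares array. */
void box3_h_mla(uint32_t *sumsq, uint16_t *sum,
                const uint16_t *src, int w) {
    for (int x = 0; x < w; x++) {
        const uint32_t a = src[x], b = src[x + 1], c = src[x + 2];
        uint32_t acc = a * a;
        acc += b * b;
        acc += c * c;
        sum[x]   = (uint16_t)(a + b + c);
        sumsq[x] = acc;
    }
}
```

In scalar code the difference is mostly cosmetic; the saving reported above comes from the SIMD version no longer having to shift and keep a second set of precalculated vectors.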

cbd4827f

Nov 28, 2020
- meson: Support running checkasm benchmarks through meson · d6beb0a0
  Henrik Gramner authored 4 years ago and Henrik Gramner committed 4 years ago
  
  d6beb0a0
- meson: Place checkasm and header tests in named suites · 7a1c1fc3
  Henrik Gramner authored 4 years ago and Henrik Gramner committed 4 years ago
  
  7a1c1fc3
Nov 23, 2020
- Update NEWS for 0.8.0 · 2ca1bfc3
  Jean-Baptiste Kempf authored 4 years ago
  
  0.8.0
  
  2ca1bfc3
Nov 22, 2020

Add more buffer pools · 236e1122

Henrik Gramner authored 4 years ago and Henrik Gramner committed 4 years ago

Add buffer pools for miscellaneous smaller buffers that are
repeatedly being freed and reallocated.

Also improve dav1d_ref_create() by consolidating two separate
memory allocations into a single one.
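
One common way to consolidate a header allocation and a data allocation is to place both in one block. The sketch below assumes C11 and uses hypothetical names (`Ref`, `ref_create`, `ref_free`); it is not the actual `dav1d_ref_create()` implementation, whose layout and alignment requirements may differ.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct Ref {
    int ref_cnt;
    size_t size;
    void *data;   /* points into the same allocation as the header */
} Ref;

/* Allocate the header and the payload with a single malloc() call. */
Ref *ref_create(size_t size) {
    /* Round the header size up so the payload keeps malloc's alignment. */
    const size_t align = _Alignof(max_align_t);
    const size_t hdr = (sizeof(Ref) + align - 1) & ~(align - 1);
    unsigned char *mem = malloc(hdr + size);
    if (!mem) return NULL;

    Ref *ref = (Ref *)mem;
    ref->ref_cnt = 1;
    ref->size = size;
    ref->data = mem + hdr;
    return ref;
}

/* A single free() releases both the header and the payload. */
void ref_free(Ref *ref) {
    free(ref);
}
```

Besides halving the number of allocator calls, this keeps the header and the start of the payload close together in memory.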

236e1122

Nov 20, 2020

arm32: mc: NEON implementation of warp8x8 for 16 bpc · dc98fff8

Martin Storsjö authored 4 years ago

Checkasm benchmarks:
                    Cortex A7      A8     A53     A72     A73
warp_8x8_16bpc_neon:   4062.6  2109.4  2462.0  1338.9  1391.1
warp_8x8t_16bpc_neon:  3996.3  2102.4  2412.0  1273.8  1368.9

Corresponding numbers for arm64, for comparison:
                                   Cortex A53     A72     A73
warp_8x8_16bpc_neon:                   2037.0  1148.8  1222.0
warp_8x8t_16bpc_neon:                  2008.0  1120.4  1200.9

dc98fff8

arm32: cdef: Add NEON implementations of CDEF for 16 bpc · 018e64e7

Martin Storsjö authored 4 years ago

Use a shared template file for assembly functions that can be
templated into 8 and 16 bpc forms, just like in the arm64 version.

Checkasm benchmarks:
                           Cortex A7       A8      A53      A72      A73
cdef_dir_16bpc_neon:           975.9    853.2    555.2    378.7    386.9
cdef_filter_4x4_16bpc_neon:    746.9    521.7    481.2    333.0    340.8
cdef_filter_4x8_16bpc_neon:   1300.0    885.5    816.3    582.7    599.5
cdef_filter_8x8_16bpc_neon:   2282.5   1415.0   1417.6   1059.0   1076.3

Corresponding numbers for arm64, for comparison:
                          Cortex A53      A72      A73
cdef_dir_16bpc_neon:           418.0    306.7    310.7
cdef_filter_4x4_16bpc_neon:    453.4    282.9    297.4
cdef_filter_4x8_16bpc_neon:    807.5    514.2    533.8
cdef_filter_8x8_16bpc_neon:   1425.2    924.4    942.0

018e64e7

arm32: cdef: Simplify some cases in the padding function · e41a2a1f
Martin Storsjö authored 4 years ago

e41a2a1f