Commits · 2ca1bfc39fe5c182d885f2750779d5b1678d8ef9 · Pranav Kant / dav1d

Nov 23, 2020
- Update NEWS for 0.8.0 · 2ca1bfc3
  Jean-Baptiste Kempf authored 4 years ago
  
  0.8.0
  
  2ca1bfc3
Nov 22, 2020

Henrik Gramner authored 4 years ago and Henrik Gramner committed 4 years ago

Add buffer pools for miscellaneous smaller buffers that are
repeatedly being freed and reallocated.

Also improve dav1d_ref_create() by consolidating two separate
memory allocations into a single one.

236e1122
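
The idea of folding two allocations into one can be sketched as follows. This is a minimal illustration only, not dav1d's actual `dav1d_ref_create()`: the `ExampleRef` type, the `example_*` names and the 16-byte header rounding are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reference-counted buffer; not dav1d's actual Dav1dRef. */
typedef struct ExampleRef {
    void *data;   /* payload, located inside the same allocation */
    size_t size;  /* payload size in bytes */
    int ref_cnt;  /* plain int here; threaded code would need atomics */
} ExampleRef;

/* Header size rounded up to 16 so the payload starts at a 16-byte offset
 * from the beginning of the allocation (an assumption of this sketch). */
#define EXAMPLE_HDR_SIZE ((sizeof(ExampleRef) + 15) & ~(size_t)15)

ExampleRef *example_ref_create(size_t size) {
    if (size > SIZE_MAX - EXAMPLE_HDR_SIZE) return NULL; /* overflow guard */
    /* One malloc() covers both the header and the payload. */
    uint8_t *mem = malloc(EXAMPLE_HDR_SIZE + size);
    if (!mem) return NULL;
    ExampleRef *ref = (ExampleRef *)mem;
    ref->data = mem + EXAMPLE_HDR_SIZE;
    ref->size = size;
    ref->ref_cnt = 1;
    return ref;
}

/* Drops one reference; a single free() releases header and payload. */
void example_ref_dec(ExampleRef **pref) {
    assert(pref && *pref && (*pref)->ref_cnt > 0);
    ExampleRef *ref = *pref;
    *pref = NULL;
    if (--ref->ref_cnt == 0) free(ref);
}
```

With this layout, creating and destroying a reference costs one `malloc()`/`free()` pair instead of two.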

Nov 20, 2020

arm32: mc: NEON implementation of warp8x8 for 16 bpc · dc98fff8

Martin Storsjö authored 4 years ago

Checkasm benchmarks:
                    Cortex A7      A8     A53     A72     A73
warp_8x8_16bpc_neon:   4062.6  2109.4  2462.0  1338.9  1391.1
warp_8x8t_16bpc_neon:  3996.3  2102.4  2412.0  1273.8  1368.9

Corresponding numbers for arm64, for comparison:
                                   Cortex A53     A72     A73
warp_8x8_16bpc_neon:                   2037.0  1148.8  1222.0
warp_8x8t_16bpc_neon:                  2008.0  1120.4  1200.9

dc98fff8

arm32: cdef: Add NEON implementations of CDEF for 16 bpc · 018e64e7

Martin Storsjö authored 4 years ago

Use a shared template file for assembly functions that can be
templated into 8 and 16 bpc forms, just like in the arm64 version.

Checkasm benchmarks:
                            Cortex A7      A8     A53     A72     A73
cdef_dir_16bpc_neon:            975.9   853.2   555.2   378.7   386.9
cdef_filter_4x4_16bpc_neon:     746.9   521.7   481.2   333.0   340.8
cdef_filter_4x8_16bpc_neon:    1300.0   885.5   816.3   582.7   599.5
cdef_filter_8x8_16bpc_neon:    2282.5  1415.0  1417.6  1059.0  1076.3

Corresponding numbers for arm64, for comparison:
                           Cortex A53     A72     A73
cdef_dir_16bpc_neon:            418.0   306.7   310.7
cdef_filter_4x4_16bpc_neon:     453.4   282.9   297.4
cdef_filter_4x8_16bpc_neon:     807.5   514.2   533.8
cdef_filter_8x8_16bpc_neon:    1425.2   924.4   942.0

018e64e7

arm32: cdef: Simplify some cases in the padding function · e41a2a1f
Martin Storsjö authored 4 years ago

e41a2a1f
arm64: cdef: Fix a comment typo · c48ea15f
Martin Storsjö authored 4 years ago

c48ea15f
Update THANKS.md · ba875b96
Matthias Dressel authored 4 years ago

ba875b96

Nov 18, 2020
- Check frame IDs for reference and "show_existing" frames · 3fdf468e
  Dmitriy Sychov authored 4 years ago and Ronald S. Bultje committed 4 years ago
  
  3fdf468e
Nov 17, 2020
- Combine boxsum and boxsumsqr in SGR C code · e413c8ed
  Luc Trudeau authored 4 years ago
```
Makes the C code more similar to the asm.
```
  e413c8ed
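
As a rough illustration of what combining the two passes means, the sketch below computes a 3-tap horizontal sum and the matching sum of squares in one loop. It is a simplified 1-D stand-in, not the actual SGR code; the function name, signature and the 3-tap window are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-pass helper: for each interior pixel, produce both
 * the sum and the sum of squares of its 3-pixel horizontal neighbourhood.
 * Entries 0 and w - 1 of the outputs are left untouched. */
void example_boxsum3_row(const uint8_t *src, int w, int *sum, int *sumsq) {
    assert(src && sum && sumsq && w >= 3);
    for (int x = 1; x < w - 1; x++) {
        const int a = src[x - 1], b = src[x], c = src[x + 1];
        sum[x]   = a + b + c;             /* what a separate boxsum pass yields */
        sumsq[x] = a * a + b * b + c * c; /* what a separate boxsumsqr pass yields */
    }
}
```

Each source pixel is loaded once for both outputs instead of once per pass.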
Nov 16, 2020

Add a picture buffer pool · 9057d286

Henrik Gramner authored 4 years ago and Jean-Baptiste Kempf committed 4 years ago

Reuse buffers allocated for picture data instead of constantly
freeing and allocating new ones.

The impact of this can vary significantly between different systems.
In particular, it's highly beneficial on Windows, where it can result
in an overall performance increase of up to 10% in some cases.

9057d286
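
The reuse pattern described in this commit can be sketched with a simple free list. This is an illustrative, single-threaded sketch under assumed names (`ExamplePool`, `ExampleBuf`, `example_pool_*`); it is not dav1d's picture pool, which would also need locking and its own sizing policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pooled buffer: a small header followed by the payload. */
typedef struct ExampleBuf {
    struct ExampleBuf *next; /* link, meaningful only while in the pool */
    size_t size;             /* usable payload size in bytes */
    max_align_t payload[];   /* flexible array member keeps payload aligned */
} ExampleBuf;

typedef struct ExamplePool {
    ExampleBuf *head;        /* LIFO list of idle buffers */
} ExamplePool;

/* Returns an idle buffer if one is large enough, otherwise allocates. */
ExampleBuf *example_pool_get(ExamplePool *pool, size_t size) {
    assert(pool);
    ExampleBuf *buf = pool->head;
    if (buf) {
        pool->head = buf->next;
        buf->next = NULL;
        if (buf->size >= size) return buf; /* reuse, no malloc() */
        free(buf);                         /* too small for this request */
    }
    if (size > SIZE_MAX - sizeof(ExampleBuf)) return NULL;
    buf = malloc(sizeof(ExampleBuf) + size);
    if (!buf) return NULL;
    buf->next = NULL;
    buf->size = size;
    return buf;
}

/* Hands a buffer back to the pool instead of freeing it. */
void example_pool_put(ExamplePool *pool, ExampleBuf *buf) {
    assert(pool && buf);
    buf->next = pool->head;
    pool->head = buf;
}

/* Frees every idle buffer, e.g. when the decoder is closed. */
void example_pool_drain(ExamplePool *pool) {
    assert(pool);
    while (pool->head) {
        ExampleBuf *next = pool->head->next;
        free(pool->head);
        pool->head = next;
    }
}
```

Once the pool is warm, a steady stream of same-sized requests no longer touches the system allocator, which is where the savings described above would come from.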

meson: Handle the b_lto option as a string option for newer meson versions · 920079ed

Martin Storsjö authored 4 years ago

Since 3e6fbde94c1cb8d4e01b7daf0282c315ff0e6c7d in meson (past the
0.56 release), the b_lto option was changed from a bool to a
tristate option (false/true/thin).

One could just compare the b_lto option against 'false', but that
causes warnings on older meson versions (on all existing releases).

920079ed

Use less memory in SGR C code · bcebc7bd
Luc Trudeau authored 4 years ago

bcebc7bd

Nov 07, 2020

Fix variable name · ffd052bd

oddstone authored 4 years ago

The first index to task_idx_to_sby_and_tile_idx is task_idx, not tile_idx.

ffd052bd

Oct 21, 2020
- Abort frame decoding properly on reference error · a40d3b5f
  Victorien Le Couviour--Tuffet authored 4 years ago
```
This could cause a frame waiting on the current one to not be notified
on error.

Fixes #351.
```
  a40d3b5f
Oct 02, 2020

Avoid using %ld for debugging in obu.c · 901704e8

Luc Trudeau authored 4 years ago

long is 32 bits wide on Win64, so %ld is replaced with %td. For
SEQHDR, %ld was used but the actual value is a 32-bit unsigned
integer, so %u is enough.

901704e8
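
A small sketch of the portable form: `%td` is the standard C99 conversion for `ptrdiff_t`, and `%u` matches `unsigned`. The helper name and the message layout are made up for illustration and do not come from obu.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical debug helper: formats a byte offset (a pointer difference,
 * hence ptrdiff_t) and a 32-bit unsigned value without relying on the
 * width of long, which is 32 bits on Win64 and 64 bits on LP64 systems. */
int example_format_debug(char *out, size_t out_size,
                         const uint8_t *start, const uint8_t *pos,
                         unsigned value)
{
    assert(out && start && pos && pos >= start);
    const ptrdiff_t off = pos - start;
    /* %td matches ptrdiff_t on every ABI; %u matches unsigned. */
    return snprintf(out, out_size, "off=%td value=%u", off, value);
}
```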

Oct 01, 2020

Add debug code for HDR metadata · a902d6e3

Luc Trudeau authored 4 years ago

Prints out values and offsets for content light level and
mastering display color volume

a902d6e3

Sep 27, 2020
- CI/test-debian-asan: run address sanitizer tests both with and without asm · 0243c3ff
  Janne Grunau authored 4 years ago and Jean-Baptiste Kempf committed 4 years ago
  
  0243c3ff
- fuzzer: parse '--cpumask X' command line argument · ac1cb28d
  Janne Grunau authored 4 years ago and Jean-Baptiste Kempf committed 4 years ago
  
  ac1cb28d
Sep 24, 2020

arm32: looprestoration: NEON implementation of wiener filter for 16 bpc · 2c09aaa4

Martin Storsjö authored 4 years ago

Checkasm benchmarks:
                          Cortex A7        A8       A53       A72       A73
wiener_chroma_10bpc_neon:  385312.5  165772.7  184308.2  122311.2  126050.2
wiener_chroma_12bpc_neon:  385296.7  165538.0  184438.2  122290.5  126205.3
wiener_luma_10bpc_neon:    385318.5  165985.3  184147.4  122311.1  126168.4
wiener_luma_12bpc_neon:    385316.3  165819.1  184484.7  122304.4  125982.4

The corresponding numbers for arm64 for comparison:
                         Cortex A53       A72       A73
wiener_chroma_10bpc_neon:  176319.7  125992.1  128162.4
wiener_chroma_12bpc_neon:  176386.2  125986.4  128343.8
wiener_luma_10bpc_neon:    176174.0  126001.7  128227.8
wiener_luma_12bpc_neon:    176176.5  125992.1  128204.8

The arm32 version actually seems to run marginally faster than the arm64
one on A72 and A73. I believe this is because the arm64 code is tuned
for A53 (which makes it a bit slower on other cores), but the arm32 code
can't be tuned exactly the same way due to fewer registers being available.

2c09aaa4

arm64: looprestoration16: Reorder instructions to avoid close data dependencies · 7ebcb777

Martin Storsjö authored 4 years ago

Before:                  Cortex A53       A72       A73
wiener_chroma_10bpc_neon:  177063.6  129197.3  127987.9
wiener_chroma_12bpc_neon:  177034.4  129206.8  128409.5
wiener_luma_10bpc_neon:    177072.6  129198.1  127931.8
wiener_luma_12bpc_neon:    177052.4  129196.0  127955.2
After:
wiener_chroma_10bpc_neon:  176319.7  125992.1  128162.4
wiener_chroma_12bpc_neon:  176386.2  125986.4  128343.8
wiener_luma_10bpc_neon:    176174.0  126001.7  128227.8
wiener_luma_12bpc_neon:    176176.5  125992.1  128204.8

This gives a small speedup on A53, a bit larger one on A72 and little
change (mostly noise?) on A73.

7ebcb777

arm64: looprestoration16: Use narrower operations where possible when filtering one pixel · 911942ca
Martin Storsjö authored 4 years ago

911942ca

arm32: looprestoration: Optimize the 4-pixel wide horizontal wiener filter · 41f59b02

Martin Storsjö authored 4 years ago

The vext.8 instructions only need to produce a single d register each,
which makes more registers available as scratch space. That allows
hiding latencies better and grouping the vmul/vmla instructions in the
form that is beneficial for in-order cores (which have a special
forwarding path for such patterns).

41f59b02

arm32: looprestoration: Remove an unused macro that is used only once · 8486bffe
Martin Storsjö authored 4 years ago

8486bffe
arm32: looprestoration: Specify alignment for more loads/stores · c3c4e3ab
Martin Storsjö authored 4 years ago

c3c4e3ab
arm32: looprestoration: Fix missed vertical alignment · 77b3b25c
Martin Storsjö authored 4 years ago

77b3b25c

Sep 20, 2020

tests: avoid using sed in header test · f90ada0d
Janne Grunau authored 4 years ago and Jean-Baptiste Kempf committed 4 years ago
```
Makes !1078 redundant.
```
f90ada0d
meson: Set msvc like warning options for clang-cl · a5e45517
Martin Storsjö authored 4 years ago and Jean-Baptiste Kempf committed 4 years ago
```
This avoids lots of warnings about unsupported warning options.
```
a5e45517

meson: Use gas-preprocessor as generator, for targets that need it · d68a2fc1

Martin Storsjö authored 4 years ago

Don't pass the .S assembly sources as C source files in this case,
as e.g. MSVC doesn't support them (and meson knows it doesn't, so
it refuses to proceed with an MSVC/gas-preprocessor wrapper script, as
meson detects it as MSVC - unless meson is hacked to allow passing .S
files to MSVC).

This allows building dav1d with MSVC for ARM targets without
hacks to meson. (Building in a pure MSVC setup with no other
compilers available does require a few new patches to gas-preprocessor
though.)

This has been postponed for quite some time because compiling with
MSVC for non-x86 targets in meson has been problematic: meson used
to require a working compiler for the build machine as well, the MSVC
compilers for all targets are named cl.exe, and you can't have both
the one for the cross target and the one for the build machine first
in the path at the same time. This was recently fixed though, see
https://github.com/mesonbuild/meson/issues/4402 and
https://github.com/mesonbuild/meson/pull/6512.

This matches how gas-preprocessor is hooked up for e.g. OpenH264 in
https://github.com/cisco/openh264/commit/013c4566a219a1f0fd50a8186f2b11fd8c3efcfb.

d68a2fc1

build: increase minimal meson to 0.49 · d85fdf52
Janne Grunau authored 4 years ago
```
Fixes #350.
```
d85fdf52

Sep 17, 2020
- x86: Add misc mc asm tweaks · 5173de30
  Henrik Gramner authored 4 years ago and Victorien Le Couviour--Tuffet committed 4 years ago
  
  5173de30
Sep 15, 2020

Ban op->idc that may drop all layer-specific OBUs · 50e876c6

Wan-Teh Chang authored 4 years ago

If c->operating_point_idc is nonzero and either bits 0-7 or bits 8-11 in
it are all 0s, it will cause dav1d_parse_obus() to drop all
layer-specific OBUs. Prohibit any op->idc with such properties because
it could be selected as c->operating_point_idc.

50e876c6
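
The condition described above can be written as a small predicate. This is a sketch derived only from the commit text (a nonzero idc whose bits 0-7 or bits 8-11 are all zero is rejected); the function name is hypothetical and the real check in dav1d may be structured differently.

```c
#include <assert.h>

/* Hypothetical validity check for a 12-bit operating point idc.
 * Returns 1 if the value is acceptable, 0 if it should be rejected. */
int example_op_idc_is_valid(unsigned idc) {
    assert(idc <= 0xfffu);              /* the field is 12 bits wide */
    if (idc == 0) return 1;             /* zero does not filter any OBUs */
    const unsigned low  = idc & 0xffu;        /* bits 0-7  */
    const unsigned high = (idc >> 8) & 0xfu;  /* bits 8-11 */
    /* Both groups need at least one bit set; otherwise every
     * layer-specific OBU would be dropped. */
    return low != 0 && high != 0;
}
```

For example, `0x101` passes, while `0x100` and `0x001` are rejected because one of the two bit groups is empty.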

Sep 06, 2020
- cli: Add support for Unicode and long paths on Windows 10 · 8c2a8976
  Henrik Gramner authored 4 years ago and Henrik Gramner committed 4 years ago
  
  8c2a8976
Sep 03, 2020

arm32: mc: NEON implementation of put/prep 8tap/bilin for 16 bpc · 856662b4

Martin Storsjö authored 4 years ago

Examples of checkasm benchmarks:
                                   Cortex A7      A8      A9     A53     A72     A73
mc_8tap_regular_w8_0_16bpc_neon:       158.7   106.2   167.0   127.9    55.0    77.2
mc_8tap_regular_w8_h_16bpc_neon:      1000.8   557.5   749.2   609.2   401.4   485.4
mc_8tap_regular_w8_hv_16bpc_neon:     2278.9  1255.4  1352.5  1277.2   867.8   915.9
mc_8tap_regular_w8_v_16bpc_neon:      1060.0   393.6   485.5   448.3   298.0   298.2
mc_bilinear_w8_0_16bpc_neon:           159.7    96.6   161.1   123.7    55.4    74.7
mc_bilinear_w8_h_16bpc_neon:           342.3   250.8   352.9   239.0   158.4   165.1
mc_bilinear_w8_hv_16bpc_neon:          587.7   373.8   469.0   339.8   244.4   247.5
mc_bilinear_w8_v_16bpc_neon:           285.8   189.3   284.9   180.4   103.4   100.9
mct_8tap_regular_w8_0_16bpc_neon:      233.0   136.6   229.3   169.3    86.2    98.3
mct_8tap_regular_w8_h_16bpc_neon:     1106.8   588.3   817.9   654.1   406.4   489.8
mct_8tap_regular_w8_hv_16bpc_neon:    2473.3  1326.3  1428.2  1373.7   903.3   951.1
mct_8tap_regular_w8_v_16bpc_neon:     1266.0   474.1   581.3   505.9   382.0   373.4
mct_bilinear_w8_0_16bpc_neon:          232.9   126.2   225.0   166.3    86.2    91.7
mct_bilinear_w8_h_16bpc_neon:          380.6   270.6   386.0   259.7   154.1   151.9
mct_bilinear_w8_hv_16bpc_neon:         631.4   409.2   509.4   372.1   243.1   244.1
mct_bilinear_w8_v_16bpc_neon:          349.5   233.5   347.9   212.4   138.7   138.4

For comparison, the corresponding numbers for the existing arm64
implementation:

                                  Cortex A53     A72     A73
mc_8tap_regular_w8_0_16bpc_neon:        94.1    48.9    62.3
mc_8tap_regular_w8_h_16bpc_neon:       570.4   388.1   467.3
mc_8tap_regular_w8_hv_16bpc_neon:     1035.8   775.0   891.2
mc_8tap_regular_w8_v_16bpc_neon:       399.8   284.5   278.2
mc_bilinear_w8_0_16bpc_neon:            90.0    44.3    57.4
mc_bilinear_w8_h_16bpc_neon:           191.7   158.7   156.4
mc_bilinear_w8_hv_16bpc_neon:          295.6   235.0   244.9
mc_bilinear_w8_v_16bpc_neon:           147.2    99.0    88.8
mct_8tap_regular_w8_0_16bpc_neon:      139.4    78.4    84.9
mct_8tap_regular_w8_h_16bpc_neon:      612.3   395.9   478.6
mct_8tap_regular_w8_hv_16bpc_neon:    1113.0   804.3   963.5
mct_8tap_regular_w8_v_16bpc_neon:      462.1   370.8   353.3
mct_bilinear_w8_0_16bpc_neon:          135.6    77.0    80.5
mct_bilinear_w8_h_16bpc_neon:          210.8   159.2   141.7
mct_bilinear_w8_hv_16bpc_neon:         325.7   238.4   227.3
mct_bilinear_w8_v_16bpc_neon:          180.7   136.7   129.5

856662b4

arm64: mc: Apply tuning from w4/w8 case to w2 case in 16 bpc 8tap_hv · 4ae3f5f7

Martin Storsjö authored 4 years ago

Narrowing the intermediates from the horizontal pass is beneficial
(on most cores, but a small slowdown on A53) here as well. This
increases consistency in the code between the cases.

(The corresponding change in the upcoming arm32 version is beneficial
on all tested cores except A53: it helps, on some cores a lot, on A7,
A8, A9, A72 and A73, and only makes it marginally slower on A53.)

Before:                        Cortex A53     A72     A73
mc_8tap_regular_w2_hv_16bpc_neon:   457.7   301.0   317.1
After:
mc_8tap_regular_w2_hv_16bpc_neon:   472.0   276.0   284.3

4ae3f5f7

arm: mc: Avoid an unnecessary mov in 8tap_hv w2 · 65a1aafd
Martin Storsjö authored 4 years ago
```
This matches how the same logic is written for w4 and above.
```
65a1aafd
arm32: mc: Load 8tap filter coefficients with alignment where possible · 458273ed
Martin Storsjö authored 4 years ago

458273ed
arm32: mc: Use narrower vext.8 in 8tap_w4_h · ea7e13e7
Martin Storsjö authored 4 years ago
```
The previous form was a leftover from how it had to be written on
aarch64.
```
ea7e13e7
arm64: mc: Use more descriptive element specifiers for loads/stores in 16 bpc put_neon · 13fad75d
Martin Storsjö authored 4 years ago
```
For loads of a half/full register, the actual size of the elements
doesn't matter, but it makes the code more readable and understandable.
```
13fad75d

Sep 01, 2020

cli: Use proper integer math in Y4M PAR calculations · 3bfe8c7c

Henrik Gramner authored 4 years ago and Henrik Gramner committed 4 years ago

The previous floating-point implementation produced results that were
sometimes slightly off due to rounding errors.

For example, a frame size of 432x240 with a render size of 176x240
previously resulted in a PAR of 98:240 instead of the correct 11:27.

Also reduce fractions to produce more readable numbers.

3bfe8c7c
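
The integer approach can be sketched as below. The formula used here (PAR = render_w * frame_h : render_h * frame_w, reduced by the greatest common divisor) is an assumption chosen because it reproduces the 11:27 result quoted in the commit message; the function names are hypothetical and this is not the CLI's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Euclid's algorithm on 64-bit values. */
static uint64_t example_gcd(uint64_t a, uint64_t b) {
    while (b) {
        const uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Computes the pixel aspect ratio as an exact, reduced fraction.
 * 64-bit products avoid overflow for any positive int dimensions. */
void example_par(int frame_w, int frame_h, int render_w, int render_h,
                 uint64_t *num, uint64_t *den)
{
    assert(frame_w > 0 && frame_h > 0 && render_w > 0 && render_h > 0);
    assert(num && den);
    const uint64_t n = (uint64_t)render_w * (uint64_t)frame_h;
    const uint64_t d = (uint64_t)render_h * (uint64_t)frame_w;
    const uint64_t g = example_gcd(n, d);
    *num = n / g;
    *den = d / g;
}
```

For a 432x240 frame with a 176x240 render size, this gives 42240:103680, which reduces by 3840 to 11:27. No rounding step is involved, so the ratio is exact.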

Aug 30, 2020
- Output render size to Y4M · 484d6595
  Raphaël Zumer authored 4 years ago and Jean-Baptiste Kempf committed 4 years ago
```
This adds A<W>:<H> to the Y4M header, to
preserve the intended aspect ratio for
anamorphic video.
```
  484d6595
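
For illustration, a Y4M stream header carrying the aspect-ratio tag could be formatted as in the sketch below. The helper name, the chosen set of tags and the fixed `Ip` (progressive) interlacing tag are assumptions for this example; only the `A<W>:<H>` tag itself comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper that writes a YUV4MPEG2 stream header line,
 * including the A<num>:<den> pixel aspect ratio tag. Returns the
 * value of snprintf(), i.e. the length of the full header. */
int example_y4m_header(char *out, size_t out_size,
                       int w, int h, int fps_num, int fps_den,
                       int par_num, int par_den)
{
    assert(out && w > 0 && h > 0 && fps_num > 0 && fps_den > 0);
    assert(par_num >= 0 && par_den >= 0);
    return snprintf(out, out_size, "YUV4MPEG2 W%d H%d F%d:%d Ip A%d:%d\n",
                    w, h, fps_num, fps_den, par_num, par_den);
}
```

A player reading `A11:27` alongside `W432 H240` can then scale the picture to the intended anamorphic display shape.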