- Feb 18, 2020
-
-
Janne Grunau authored
-
- Feb 17, 2020
-
-
Martin Storsjö authored
This increases the code size by around 3 KB on arm64.

Before:
ARM32:                     Cortex A7     A8      A9    A53    A72    A73
cdef_filter_4x4_8bpc_neon:     807.1  517.0   617.7  506.6  429.9  357.8
cdef_filter_4x8_8bpc_neon:    1407.9  899.3  1054.6  862.3  726.5  628.1
cdef_filter_8x8_8bpc_neon:    2394.9 1456.8  1676.8 1461.2 1084.4 1101.2
ARM64:
cdef_filter_4x4_8bpc_neon:                           460.7  301.8  308.0
cdef_filter_4x8_8bpc_neon:                           831.6  547.0  555.2
cdef_filter_8x8_8bpc_neon:                          1454.6  935.6  960.4

After:
ARM32:
cdef_filter_4x4_8bpc_neon:     669.3  541.3   524.4  424.9  322.7  298.1
cdef_filter_4x8_8bpc_neon:    1159.1  922.9   881.1  709.2  538.3  514.1
cdef_filter_8x8_8bpc_neon:    1888.8 1285.4  1358.5 1152.9  839.3  871.2
ARM64:
cdef_filter_4x4_8bpc_neon:                           383.6  262.1  259.9
cdef_filter_4x8_8bpc_neon:                           684.9  472.2  464.7
cdef_filter_8x8_8bpc_neon:                          1160.0  756.8  788.0

(The checkasm benchmark averages three different cases; the fully edged case is one of those three, and it is the most common case in actual video. The difference is much bigger when benchmarking only that particular case.)

This apparently makes the code slightly slower for the w=4 cases on Cortex A8, while being a significant speedup on all other cores.
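The gist of the change, sketched below in C (the real code is NEON assembly; the CDEF_HAVE_* flags match dav1d's src/cdef.h, while the filter functions are hypothetical stand-ins): when all four edge flags are set, no neighboring pixels are missing, so the filter can skip the edge-padding logic and run unconditional loads.

    #include <stddef.h>
    #include <stdint.h>

    /* Edge flags, as in dav1d's src/cdef.h. */
    enum CdefEdgeFlags {
        CDEF_HAVE_LEFT   = 1 << 0,
        CDEF_HAVE_RIGHT  = 1 << 1,
        CDEF_HAVE_TOP    = 1 << 2,
        CDEF_HAVE_BOTTOM = 1 << 3,
    };

    #define CDEF_HAVE_ALL \
        (CDEF_HAVE_LEFT | CDEF_HAVE_RIGHT | CDEF_HAVE_TOP | CDEF_HAVE_BOTTOM)

    /* Hypothetical stand-in: pad missing edges with a sentinel value,
     * then filter with per-pixel availability handling. */
    static void filter_with_padding(uint8_t *dst, ptrdiff_t stride,
                                    int w, int h, enum CdefEdgeFlags edges)
    {
        (void)dst; (void)stride; (void)w; (void)h; (void)edges;
    }

    /* Hypothetical stand-in: every neighbor exists, so loads can run
     * unconditionally and no padding pass is needed. */
    static void filter_fully_edged(uint8_t *dst, ptrdiff_t stride,
                                   int w, int h)
    {
        (void)dst; (void)stride; (void)w; (void)h;
    }

    void cdef_filter_block(uint8_t *dst, ptrdiff_t stride, int w, int h,
                           enum CdefEdgeFlags edges)
    {
        if (edges == CDEF_HAVE_ALL)
            filter_fully_edged(dst, stride, w, h); /* common case */
        else
            filter_with_padding(dst, stride, w, h, edges);
    }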
-
Martin Storsjö authored
The signedness of elements doesn't matter for vsub; match the vsub.i16 next to it.
-
- Feb 16, 2020
-
-
Henrik Gramner authored
Console output is incredibly slow on Windows, which is aggravated by the lack of line buffering. As a result, a significant percentage of overall runtime is actually spent displaying the decoding progress. Doing the line buffering manually alleviates most of the issue.
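A minimal sketch of the idea in C (the function and format string are illustrative, not dav1d's actual code): format the whole status line into a local buffer and hand it to the console in a single write per update, rather than one write per fragment.

    #include <stdio.h>

    /* Hypothetical progress printer: one write per update instead of many
     * small unbuffered writes, which is what makes the console slow. */
    static void print_progress(const unsigned decoded, const unsigned total,
                               const double fps)
    {
        char buf[128];
        const int n = snprintf(buf, sizeof(buf),
                               "\rDecoded %u/%u frames (%.1f fps)",
                               decoded, total, fps);
        if (n > 0) {
            const size_t len = (size_t)n < sizeof(buf) ? (size_t)n
                                                       : sizeof(buf) - 1;
            fwrite(buf, 1, len, stderr);
        }
    }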
-
- Feb 13, 2020
-
-
Martin Storsjö authored
These were missed in 361a3c8e.
-
- Feb 11, 2020
-
-
Martin Storsjö authored
This only supports 10 bpc, not 12 bpc, as the sum and tmp buffers can be int16_t for 10 bpc, but need to be int32_t for 12 bpc.

Make actual templates out of the functions in looprestoration_tmpl.S, and add box3/5_h to looprestoration16.S.

Extend dav1d_sgr_calc_abX_neon with a mandatory bitdepth_max parameter (which is passed even in 8bpc mode), add a define to bitdepth.h for passing such a parameter in all modes. This makes this function a few instructions slower in 8bpc mode than it was before (overall impact seems to be around 1% of the total runtime of SGR), but allows using the same actual function instantiation for all modes, saving a bit of code size.

Examples of checkasm runtimes:
                           Cortex A53      A72      A73
selfguided_3x3_10bpc_neon:   516755.8 389412.7 349058.7
selfguided_5x5_10bpc_neon:   380699.9 293486.6 254591.6
selfguided_mix_10bpc_neon:   878142.3 667495.9 587844.6

Corresponding 8 bpc numbers for comparison:
selfguided_3x3_8bpc_neon:    491058.1 361473.4 347705.9
selfguided_5x5_8bpc_neon:    352655.0 266423.7 248192.2
selfguided_mix_8bpc_neon:    826094.1 612372.2 581943.1
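The suffix-macro pattern looks roughly like the sketch below — a simplified reconstruction, not copied from bitdepth.h (dav1d does use names like HIGHBD_DECL_SUFFIX, but the macro bodies and the helper functions here are illustrative):

    #include <stdint.h>

    #define BITDEPTH 16            /* set to 8 or 16 per template build */

    #if BITDEPTH == 8
    typedef uint8_t pixel;
    #define HIGHBD_DECL_SUFFIX     /* no trailing parameter in 8bpc mode */
    #define HIGHBD_TAIL_SUFFIX     /* no trailing argument in 8bpc mode */
    #define BITDEPTH_MAX 0xff      /* compile-time constant for 8bpc */
    #else
    typedef uint16_t pixel;
    #define HIGHBD_DECL_SUFFIX , const int bitdepth_max
    #define HIGHBD_TAIL_SUFFIX , bitdepth_max
    #define BITDEPTH_MAX bitdepth_max  /* runtime value for 10/12 bpc */
    #endif

    /* Most DSP functions only take the parameter in high bit depth: */
    void some_filter_neon(pixel *dst, int w, int h HIGHBD_DECL_SUFFIX);

    /* A single-instantiation function shared by all modes takes it
     * unconditionally; callers pass BITDEPTH_MAX, so the 8bpc template
     * hands over the constant 0xff: */
    static void shared_calc(int16_t *buf, int w, int h, int bitdepth_max)
    {
        (void)buf; (void)w; (void)h; (void)bitdepth_max;
    }

    static inline void call_shared(int16_t *buf, int w, int h
                                   HIGHBD_DECL_SUFFIX)
    {
        shared_calc(buf, w, h, BITDEPTH_MAX);
    }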
-
Martin Storsjö authored
looprestoration_common.S contains functions that can be used as is, with a single instantiation for both 8 and 16 bpc. This file will be built once, regardless of which bitdepths are enabled.

looprestoration_tmpl.S contains functions where the source can be shared and templated between 8 and 16 bpc. It will be included by the separate 8/16 bpc implementation files.
-
Martin Storsjö authored
Don't add it to dav1d_sgr_calc_ab1/2_neon and box3/5_v, as those functions can share the same concrete implementations between 8 and 16 bpc.
-
Martin Storsjö authored
This allows using completely different codepaths for 10 and 12 bpc, or just adding SIMD functions for either of them.
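In DSP init terms, this means the runtime bpc value (not just the compile-time BITDEPTH template) selects the function pointers. A hedged sketch with hypothetical function and struct names:

    #include <stdint.h>

    typedef struct LoopRestorationDSP {
        void (*wiener)(uint16_t *dst, int w, int h, int bitdepth_max);
    } LoopRestorationDSP;

    /* Generic C fallback (stub here). */
    static void wiener_c(uint16_t *dst, int w, int h, int bitdepth_max)
    { (void)dst; (void)w; (void)h; (void)bitdepth_max; }

    /* 10 bpc-only NEON implementation (stub here). */
    static void wiener_10bpc_neon(uint16_t *dst, int w, int h,
                                  int bitdepth_max)
    { (void)dst; (void)w; (void)h; (void)bitdepth_max; }

    static void loop_restoration_dsp_init_arm(LoopRestorationDSP *const c,
                                              const int bpc)
    {
        c->wiener = wiener_c;              /* default: C for any bpc */
        if (bpc == 10)
            c->wiener = wiener_10bpc_neon; /* 12 bpc keeps the C path */
    }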
-
Martin Storsjö authored
Set flags further from the branch instructions that use them.
-
Martin Storsjö authored
For 8bpc and 10bpc, int16_t is enough here; for 12bpc, the other intermediate int16_t buffers would also need to be made coef-sized anyway.
-
Martin Storsjö authored
-
- Feb 10, 2020
-
-
Jean-Baptiste Kempf authored
-
Only copy as much as is actually needed/used.
-
It was already done this way for w32/64. Not doing it for w16, as it didn't help there (and instead gave a small slowdown due to the two setup instructions). This gives a small speedup on in-order cores like A53.

Before:            Cortex A53   A72   A73
avg_w4_8bpc_neon:        60.9  25.6  29.0
avg_w8_8bpc_neon:       143.6  52.8  64.0
After:
avg_w4_8bpc_neon:        56.7  26.7  28.5
avg_w8_8bpc_neon:       137.2  54.5  64.4
-
This shortens the source by 40 lines, and gives a significant speedup on A53, a small speedup on A72 and a very minor slowdown for avg/w_avg on A73.

Before:              Cortex A53    A72    A73
avg_w4_8bpc_neon:          67.4   26.1   25.4
avg_w8_8bpc_neon:         158.7   56.3   59.1
avg_w16_8bpc_neon:        382.9  154.1  160.7
w_avg_w4_8bpc_neon:        99.9   43.6   39.4
w_avg_w8_8bpc_neon:       253.2   98.3   99.0
w_avg_w16_8bpc_neon:      543.1  285.0  301.8
mask_w4_8bpc_neon:        110.6   51.4   45.1
mask_w8_8bpc_neon:        295.0  129.9  114.0
mask_w16_8bpc_neon:       654.6  365.8  369.7
After:
avg_w4_8bpc_neon:          60.8   26.3   29.0
avg_w8_8bpc_neon:         142.8   52.9   64.1
avg_w16_8bpc_neon:        378.2  153.4  160.8
w_avg_w4_8bpc_neon:        78.7   41.0   40.9
w_avg_w8_8bpc_neon:       190.6   90.1  105.1
w_avg_w16_8bpc_neon:      531.1  279.3  301.4
mask_w4_8bpc_neon:         86.6   47.2   44.9
mask_w8_8bpc_neon:        222.0  114.3  114.9
mask_w16_8bpc_neon:       639.5  356.0  369.8
-
Jean-Baptiste Kempf authored
-
- Feb 08, 2020
-
-
Martin Storsjö authored
Checkasm benchmark numbers:
                      Cortex A53     A72     A73
warp_8x8_16bpc_neon:      2029.9  1150.5  1225.2
warp_8x8t_16bpc_neon:     2007.6  1129.0  1192.3

Corresponding numbers for 8bpc for comparison:
warp_8x8_8bpc_neon:       1863.8  1052.8  1106.2
warp_8x8t_8bpc_neon:      1847.4  1048.3  1099.8
-
- Feb 07, 2020
-
-
Martin Storsjö authored
As some functions are made for both 8bpc and 16bpc from a shared template, those functions are moved to a separate assembly file which is included. That assembly file (cdef_tmpl.S) isn't intended to be assembled on its own (just like utils.S), but if it is assembled, it should produce an empty object file.

Checkasm benchmarks:
                            Cortex A53     A72     A73
cdef_dir_16bpc_neon:             422.7   305.5   314.0
cdef_filter_4x4_16bpc_neon:      452.9   282.7   296.6
cdef_filter_4x8_16bpc_neon:      800.9   515.3   534.1
cdef_filter_8x8_16bpc_neon:     1417.1   922.7   942.6

Corresponding numbers for 8bpc for comparison:
cdef_dir_8bpc_neon:              394.7   268.8   281.8
cdef_filter_4x4_8bpc_neon:       461.5   300.9   307.7
cdef_filter_4x8_8bpc_neon:       831.6   546.1   555.6
cdef_filter_8x8_8bpc_neon:      1454.6   934.0   960.0
-
Martin Storsjö authored
-
- Feb 06, 2020
-
-
Henrik Gramner authored
-
Henrik Gramner authored
The ar_coeff_shift element needs to have a 16-byte alignment on x86.
-
Martin Storsjö authored
Checkasm benchmarks:
                           Cortex A53      A72      A73
wiener_chroma_16bpc_neon:    190288.4 129369.5 127284.1
wiener_luma_16bpc_neon:      195108.4 129387.8 127042.7

The corresponding numbers for 8 bpc for comparison:
wiener_chroma_8bpc_neon:     150586.9 101647.1  97709.9
wiener_luma_8bpc_neon:       146297.4 101593.2  97670.5
-
Martin Storsjö authored
Use HIGHBD_DECL_SUFFIX and HIGHBD_TAIL_SUFFIX where necessary, add a missing sizeof(pixel).
-
Martin Storsjö authored
-
Martin Storsjö authored
-
Martin Storsjö authored
Examples of checkasm benchmarks:
                                   Cortex A53    A72    A73
mc_8tap_regular_w8_0_16bpc_neon:         96.8   49.6   62.5
mc_8tap_regular_w8_h_16bpc_neon:        570.3  388.0  467.2
mc_8tap_regular_w8_hv_16bpc_neon:      1035.8  776.7  891.3
mc_8tap_regular_w8_v_16bpc_neon:        400.6  285.0  278.3
mc_bilinear_w8_0_16bpc_neon:             90.0   44.8   57.8
mc_bilinear_w8_h_16bpc_neon:            191.2  158.7  156.4
mc_bilinear_w8_hv_16bpc_neon:           295.9  234.6  244.9
mc_bilinear_w8_v_16bpc_neon:            147.2   98.7   89.2
mct_8tap_regular_w8_0_16bpc_neon:       139.4   78.7   84.9
mct_8tap_regular_w8_h_16bpc_neon:       612.5  396.8  479.1
mct_8tap_regular_w8_hv_16bpc_neon:     1112.4  814.6  963.2
mct_8tap_regular_w8_v_16bpc_neon:       461.8  370.8  353.4
mct_bilinear_w8_0_16bpc_neon:           135.6   76.2   80.5
mct_bilinear_w8_h_16bpc_neon:           211.3  159.4  141.7
mct_bilinear_w8_hv_16bpc_neon:          325.7  237.2  227.2
mct_bilinear_w8_v_16bpc_neon:           180.7  135.9  129.5

For comparison, the corresponding numbers for 8 bpc:
mc_8tap_regular_w8_0_8bpc_neon:          78.6   41.0   39.5
mc_8tap_regular_w8_h_8bpc_neon:         371.2  299.6  348.3
mc_8tap_regular_w8_hv_8bpc_neon:        817.1  675.0  726.5
mc_8tap_regular_w8_v_8bpc_neon:         243.7  260.4  253.0
mc_bilinear_w8_0_8bpc_neon:              74.8   35.4   36.1
mc_bilinear_w8_h_8bpc_neon:             179.9   69.9   79.2
mc_bilinear_w8_hv_8bpc_neon:            210.8  132.4  144.8
mc_bilinear_w8_v_8bpc_neon:             141.6   64.9   65.4
mct_8tap_regular_w8_0_8bpc_neon:        101.7   54.4   59.5
mct_8tap_regular_w8_h_8bpc_neon:        391.3  329.1  358.3
mct_8tap_regular_w8_hv_8bpc_neon:       880.4  754.9  829.4
mct_8tap_regular_w8_v_8bpc_neon:        270.8  300.8  277.4
mct_bilinear_w8_0_8bpc_neon:             97.6   54.0   55.4
mct_bilinear_w8_h_8bpc_neon:            173.3   73.5   79.5
mct_bilinear_w8_hv_8bpc_neon:           228.3  163.0  174.0
mct_bilinear_w8_v_8bpc_neon:            128.9   72.5   63.3
-
- Feb 05, 2020
- Feb 04, 2020
-
-
Martin Storsjö authored
                       Cortex A53     A72     A73
avg_w4_16bpc_neon:           78.2    43.2    48.9
avg_w8_16bpc_neon:          199.1   108.7   123.1
avg_w16_16bpc_neon:         615.6   339.9   373.9
avg_w32_16bpc_neon:        2313.0  1390.6  1490.6
avg_w64_16bpc_neon:        5783.6  3119.5  3653.0
avg_w128_16bpc_neon:      15444.6  8168.7  8907.9
w_avg_w4_16bpc_neon:        120.1    87.8    92.4
w_avg_w8_16bpc_neon:        321.6   252.4   263.1
w_avg_w16_16bpc_neon:      1017.5   794.5   831.2
w_avg_w32_16bpc_neon:      3911.4  3154.7  3306.5
w_avg_w64_16bpc_neon:      9977.9  7794.9  8022.3
w_avg_w128_16bpc_neon:    25721.5 19274.6 20041.7
mask_w4_16bpc_neon:         139.5    96.5   104.3
mask_w8_16bpc_neon:         376.0   283.9   300.1
mask_w16_16bpc_neon:       1217.2   906.7   950.0
mask_w32_16bpc_neon:       4811.1  3669.0  3901.3
mask_w64_16bpc_neon:      12036.4  8918.4  9244.8
mask_w128_16bpc_neon:     30888.8 21999.0 23206.7

For comparison, these are the corresponding numbers for 8bpc:
avg_w4_8bpc_neon:            56.7    26.2    28.5
avg_w8_8bpc_neon:           137.2    52.8    64.3
avg_w16_8bpc_neon:          377.9   151.5   161.6
avg_w32_8bpc_neon:         1528.9   614.5   633.9
avg_w64_8bpc_neon:         3792.5  1814.3  1518.3
avg_w128_8bpc_neon:       10685.3  5220.4  3879.9
w_avg_w4_8bpc_neon:          75.2    53.0    41.1
w_avg_w8_8bpc_neon:         186.7   120.1   105.2
w_avg_w16_8bpc_neon:        531.6   314.1   302.1
w_avg_w32_8bpc_neon:       2138.4  1120.4  1171.5
w_avg_w64_8bpc_neon:       5151.9  2910.5  2857.1
w_avg_w128_8bpc_neon:     13945.0  7330.5  7389.1
mask_w4_8bpc_neon:           82.0    47.2    45.1
mask_w8_8bpc_neon:          213.5   115.4   115.8
mask_w16_8bpc_neon:         639.8   356.2   370.1
mask_w32_8bpc_neon:        2566.9  1489.8  1435.0
mask_w64_8bpc_neon:        6727.6  3822.8  3425.2
mask_w128_8bpc_neon:      17893.0  9622.6  9161.3
-
Martin Storsjö authored
-
Martin Storsjö authored
-
- Feb 03, 2020
-
-
On many AMD CPUs, cmov instructions that depend on multiple flags require an additional µop, so prefer cmov variants that only depend on a single flag where possible (e.g. cmovz, which reads only ZF, rather than cmovbe, which reads both CF and ZF).
-
Shave off a few instructions, or save a few bytes, in various places. Also change some instructions to use appropriately sized registers.
-
Using signed comparisons could theoretically cause the wrong result in some obscure corner cases on x86-32 with PAE. x86-64 should be fine with either, but unsigned is technically more correct.
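The failure mode in miniature, with made-up address values (the real code compares registers in assembly, where the distinction is jb/jae versus jl/jge):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Two hypothetical addresses straddling the 2 GB boundary; on
         * x86-32 with PAE-era address-space layouts, both can be valid
         * user-space pointers. */
        const uint32_t lo = 0x7ffffff0u, hi = 0x80000010u;

        /* Unsigned comparison orders them correctly... */
        printf("unsigned: lo < hi -> %d\n", lo < hi);
        /* ...but reinterpreted as signed (on the usual two's-complement
         * targets), hi wraps negative and the comparison flips. */
        printf("signed:   lo < hi -> %d\n", (int32_t)lo < (int32_t)hi);
        return 0;
    }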
-
- Feb 01, 2020
-
-
Henrik Gramner authored
Avoids some pointer chasing and simplifies the DSP code, at the cost of making the initialization a little bit more complicated. Also reduces memory usage by a small amount due to properly sizing the buffers instead of always allocating enough space for 4:4:4.
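The buffer sizing boils down to deriving the chroma subsampling from the pixel layout; a standalone sketch with simplified names (dav1d's enum is Dav1dPixelLayout, and this helper is hypothetical):

    #include <stddef.h>

    enum PixelLayout { LAYOUT_I400, LAYOUT_I420, LAYOUT_I422, LAYOUT_I444 };

    /* Hypothetical sizing helper: allocate each chroma plane according to
     * the actual subsampling instead of reserving the 4:4:4 worst case. */
    static size_t chroma_plane_size(const enum PixelLayout layout,
                                    const int w, const int h,
                                    const size_t bytes_per_pixel)
    {
        if (layout == LAYOUT_I400) return 0;       /* no chroma planes */
        const int ss_hor = layout != LAYOUT_I444;  /* 4:2:0 and 4:2:2 */
        const int ss_ver = layout == LAYOUT_I420;  /* 4:2:0 only */
        return (size_t)((w + ss_hor) >> ss_hor) *  /* round odd dims up */
               (size_t)((h + ss_ver) >> ss_ver) * bytes_per_pixel;
    }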
-
Henrik Gramner authored
-
Henrik Gramner authored
We specify most strides in bytes, but since C defines offsets in multiples of sizeof(type), we use the PXSTRIDE() macro to downshift the strides by one in high bit depth templated files. This, however, means that the compiler is required to mask away the least significant bit, because it could in theory be non-zero. Avoid that by telling the compiler (when compiled in release mode) that the lsb is in fact guaranteed to always be zero.
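A sketch of the technique in C, assuming an NDEBUG-gated release build and a GCC/Clang-style __builtin_unreachable() (the actual macro may differ in detail): in debug builds the invariant is checked, while in release builds it becomes a pure optimizer hint that lets the compiler drop the masking.

    #include <assert.h>
    #include <stddef.h>

    /* Convert a byte stride to a pixel stride for 16 bpc code. The stride
     * is always even here, and this tells the compiler so. */
    static inline ptrdiff_t PXSTRIDE(const ptrdiff_t x)
    {
    #ifdef NDEBUG
        /* Release: if the lsb were set this point would be unreachable,
         * so the compiler may assume (x & 1) == 0 and skip the masking. */
        if (x & 1) __builtin_unreachable();
    #else
        /* Debug: actually verify the invariant. */
        assert(!(x & 1));
    #endif
        return x >> 1;
    }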
-
- Jan 29, 2020
-
-
Henrik Gramner authored
Required for AVX-512.
-