- Feb 11, 2020
  - David Smith authored
  - David Smith authored
  - David Smith authored
  - David Smith authored

- Feb 10, 2020
  - Jean-Baptiste Kempf authored
  - Only copy as much as is actually needed/used.
  - It was already done this way for w32/w64. It is not done for w16, as it didn't help there (and instead gave a small slowdown due to the two setup instructions). This gives a small speedup on in-order cores like the A53.
    Before:                Cortex A53     A72     A73
    avg_w4_8bpc_neon:            60.9    25.6    29.0
    avg_w8_8bpc_neon:           143.6    52.8    64.0
    After:
    avg_w4_8bpc_neon:            56.7    26.7    28.5
    avg_w8_8bpc_neon:           137.2    54.5    64.4
  - This shortens the source by 40 lines and gives a significant speedup on the A53, a small speedup on the A72, and a very minor slowdown for avg/w_avg on the A73.
    Before:                Cortex A53     A72     A73
    avg_w4_8bpc_neon:            67.4    26.1    25.4
    avg_w8_8bpc_neon:           158.7    56.3    59.1
    avg_w16_8bpc_neon:          382.9   154.1   160.7
    w_avg_w4_8bpc_neon:          99.9    43.6    39.4
    w_avg_w8_8bpc_neon:         253.2    98.3    99.0
    w_avg_w16_8bpc_neon:        543.1   285.0   301.8
    mask_w4_8bpc_neon:          110.6    51.4    45.1
    mask_w8_8bpc_neon:          295.0   129.9   114.0
    mask_w16_8bpc_neon:         654.6   365.8   369.7
    After:
    avg_w4_8bpc_neon:            60.8    26.3    29.0
    avg_w8_8bpc_neon:           142.8    52.9    64.1
    avg_w16_8bpc_neon:          378.2   153.4   160.8
    w_avg_w4_8bpc_neon:          78.7    41.0    40.9
    w_avg_w8_8bpc_neon:         190.6    90.1   105.1
    w_avg_w16_8bpc_neon:        531.1   279.3   301.4
    mask_w4_8bpc_neon:           86.6    47.2    44.9
    mask_w8_8bpc_neon:          222.0   114.3   114.9
    mask_w16_8bpc_neon:         639.5   356.0   369.8
  - Jean-Baptiste Kempf authored

- Feb 08, 2020
  - Martin Storsjö authored
    Checkasm benchmark numbers:
                             Cortex A53     A72     A73
    warp_8x8_16bpc_neon:         2029.9  1150.5  1225.2
    warp_8x8t_16bpc_neon:        2007.6  1129.0  1192.3
    Corresponding numbers for 8bpc for comparison:
    warp_8x8_8bpc_neon:          1863.8  1052.8  1106.2
    warp_8x8t_8bpc_neon:         1847.4  1048.3  1099.8

- Feb 07, 2020
  - Martin Storsjö authored
    As some functions are made for both 8bpc and 16bpc from a shared template, those functions are moved to a separate assembly file which is included. That assembly file (cdef_tmpl.S) isn't intended to be assembled on its own (just like utils.S), but if it is assembled, it should produce an empty object file.
    Checkasm benchmarks:
                                    Cortex A53     A72     A73
    cdef_dir_16bpc_neon:                 422.7   305.5   314.0
    cdef_filter_4x4_16bpc_neon:          452.9   282.7   296.6
    cdef_filter_4x8_16bpc_neon:          800.9   515.3   534.1
    cdef_filter_8x8_16bpc_neon:         1417.1   922.7   942.6
    Corresponding numbers for 8bpc for comparison:
    cdef_dir_8bpc_neon:                  394.7   268.8   281.8
    cdef_filter_4x4_8bpc_neon:           461.5   300.9   307.7
    cdef_filter_4x8_8bpc_neon:           831.6   546.1   555.6
    cdef_filter_8x8_8bpc_neon:          1454.6   934.0   960.0
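    The same one-source-two-bitdepths trick can be sketched in C (the macro and function names here are illustrative, not dav1d's): the template only defines a generator macro, so compiling it standalone yields an empty object, and each includer instantiates it for its bitdepth.

    ```c
    #include <stdint.h>

    /* Template analogue: this defines code only when instantiated, so
     * compiling it on its own produces an empty object file, just like
     * cdef_tmpl.S. Macro and function names are made up. */
    #define DECL_CDEF_FILTER(bd, pixel_t)                          \
        void sketch_cdef_filter_##bd##bpc(pixel_t *dst, int n)     \
        {                                                          \
            for (int i = 0; i < n; i++)                            \
                dst[i] = (pixel_t)0; /* stand-in for the filter */ \
        }

    DECL_CDEF_FILTER(8,  uint8_t)   /* pulled in by the 8 bpc file  */
    DECL_CDEF_FILTER(16, uint16_t)  /* pulled in by the 16 bpc file */
    ```
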
  - Martin Storsjö authored

- Feb 06, 2020
  - Henrik Gramner authored
  - Henrik Gramner authored
    The ar_coeff_shift element needs to have a 16-byte alignment on x86.
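    A minimal sketch of what such a requirement looks like in C (the struct and macro are hypothetical, not dav1d's actual definitions): aligned 16-byte SSE loads such as movdqa fault on unaligned addresses, so the member has to sit on a 16-byte boundary.

    ```c
    #include <stdint.h>

    /* Hypothetical illustration: if asm reads ar_coeff_shift with an
     * aligned 16-byte load (e.g. movdqa), the member must be 16-byte
     * aligned. ALIGN_16 mimics the kind of macro a codebase would use. */
    #if defined(_MSC_VER)
    #define ALIGN_16 __declspec(align(16))
    #else
    #define ALIGN_16 __attribute__((aligned(16)))
    #endif

    typedef struct {
        int8_t ar_coeffs_y[24];
        ALIGN_16 uint64_t ar_coeff_shift;
    } FilmGrainDataSketch;
    ```
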
  - Martin Storsjö authored
    Checkasm benchmarks:
                               Cortex A53       A72       A73
    wiener_chroma_16bpc_neon:    190288.4  129369.5  127284.1
    wiener_luma_16bpc_neon:      195108.4  129387.8  127042.7
    The corresponding numbers for 8 bpc for comparison:
    wiener_chroma_8bpc_neon:     150586.9  101647.1   97709.9
    wiener_luma_8bpc_neon:       146297.4  101593.2   97670.5
  - Martin Storsjö authored
    Use HIGHBD_DECL_SUFFIX and HIGHBD_TAIL_SUFFIX where necessary, and add a missing sizeof(pixel).
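    For context, the suffix macros follow this pattern (a sketch assuming the usual bitdepth-templating convention; the prototype, call site, and BITDEPTH pinning are made up for illustration): 16 bpc functions carry an extra bitdepth_max argument that 8 bpc builds simply omit.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    #define BITDEPTH 16 /* pinned here only so the sketch compiles */
    #if BITDEPTH == 16
    typedef uint16_t pixel;
    #define HIGHBD_DECL_SUFFIX , const int bitdepth_max
    #define HIGHBD_TAIL_SUFFIX , bitdepth_max
    #else
    typedef uint8_t pixel;
    #define HIGHBD_DECL_SUFFIX
    #define HIGHBD_TAIL_SUFFIX
    #endif

    /* Made-up prototype and call site showing both macros in use. */
    void blend_sketch(pixel *dst, ptrdiff_t stride HIGHBD_DECL_SUFFIX);

    void call_blend(pixel *dst, ptrdiff_t stride, const int bitdepth_max)
    {
        blend_sketch(dst, stride HIGHBD_TAIL_SUFFIX);
    }
    ```
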
  - Martin Storsjö authored
  - Martin Storsjö authored
  - Martin Storsjö authored
    Examples of checkasm benchmarks:
                                          Cortex A53     A72     A73
    mc_8tap_regular_w8_0_16bpc_neon:            96.8    49.6    62.5
    mc_8tap_regular_w8_h_16bpc_neon:           570.3   388.0   467.2
    mc_8tap_regular_w8_hv_16bpc_neon:         1035.8   776.7   891.3
    mc_8tap_regular_w8_v_16bpc_neon:           400.6   285.0   278.3
    mc_bilinear_w8_0_16bpc_neon:                90.0    44.8    57.8
    mc_bilinear_w8_h_16bpc_neon:               191.2   158.7   156.4
    mc_bilinear_w8_hv_16bpc_neon:              295.9   234.6   244.9
    mc_bilinear_w8_v_16bpc_neon:               147.2    98.7    89.2
    mct_8tap_regular_w8_0_16bpc_neon:          139.4    78.7    84.9
    mct_8tap_regular_w8_h_16bpc_neon:          612.5   396.8   479.1
    mct_8tap_regular_w8_hv_16bpc_neon:        1112.4   814.6   963.2
    mct_8tap_regular_w8_v_16bpc_neon:          461.8   370.8   353.4
    mct_bilinear_w8_0_16bpc_neon:              135.6    76.2    80.5
    mct_bilinear_w8_h_16bpc_neon:              211.3   159.4   141.7
    mct_bilinear_w8_hv_16bpc_neon:             325.7   237.2   227.2
    mct_bilinear_w8_v_16bpc_neon:              180.7   135.9   129.5
    For comparison, the corresponding numbers for 8 bpc:
    mc_8tap_regular_w8_0_8bpc_neon:             78.6    41.0    39.5
    mc_8tap_regular_w8_h_8bpc_neon:            371.2   299.6   348.3
    mc_8tap_regular_w8_hv_8bpc_neon:           817.1   675.0   726.5
    mc_8tap_regular_w8_v_8bpc_neon:            243.7   260.4   253.0
    mc_bilinear_w8_0_8bpc_neon:                 74.8    35.4    36.1
    mc_bilinear_w8_h_8bpc_neon:                179.9    69.9    79.2
    mc_bilinear_w8_hv_8bpc_neon:               210.8   132.4   144.8
    mc_bilinear_w8_v_8bpc_neon:                141.6    64.9    65.4
    mct_8tap_regular_w8_0_8bpc_neon:           101.7    54.4    59.5
    mct_8tap_regular_w8_h_8bpc_neon:           391.3   329.1   358.3
    mct_8tap_regular_w8_hv_8bpc_neon:          880.4   754.9   829.4
    mct_8tap_regular_w8_v_8bpc_neon:           270.8   300.8   277.4
    mct_bilinear_w8_0_8bpc_neon:                97.6    54.0    55.4
    mct_bilinear_w8_h_8bpc_neon:               173.3    73.5    79.5
    mct_bilinear_w8_hv_8bpc_neon:              228.3   163.0   174.0
    mct_bilinear_w8_v_8bpc_neon:               128.9    72.5    63.3

- Feb 05, 2020

- Feb 04, 2020
  - Martin Storsjö authored
                              Cortex A53     A72     A73
    avg_w4_16bpc_neon:              78.2    43.2    48.9
    avg_w8_16bpc_neon:             199.1   108.7   123.1
    avg_w16_16bpc_neon:            615.6   339.9   373.9
    avg_w32_16bpc_neon:           2313.0  1390.6  1490.6
    avg_w64_16bpc_neon:           5783.6  3119.5  3653.0
    avg_w128_16bpc_neon:         15444.6  8168.7  8907.9
    w_avg_w4_16bpc_neon:           120.1    87.8    92.4
    w_avg_w8_16bpc_neon:           321.6   252.4   263.1
    w_avg_w16_16bpc_neon:         1017.5   794.5   831.2
    w_avg_w32_16bpc_neon:         3911.4  3154.7  3306.5
    w_avg_w64_16bpc_neon:         9977.9  7794.9  8022.3
    w_avg_w128_16bpc_neon:       25721.5 19274.6 20041.7
    mask_w4_16bpc_neon:            139.5    96.5   104.3
    mask_w8_16bpc_neon:            376.0   283.9   300.1
    mask_w16_16bpc_neon:          1217.2   906.7   950.0
    mask_w32_16bpc_neon:          4811.1  3669.0  3901.3
    mask_w64_16bpc_neon:         12036.4  8918.4  9244.8
    mask_w128_16bpc_neon:        30888.8 21999.0 23206.7
    For comparison, these are the corresponding numbers for 8bpc:
    avg_w4_8bpc_neon:               56.7    26.2    28.5
    avg_w8_8bpc_neon:              137.2    52.8    64.3
    avg_w16_8bpc_neon:             377.9   151.5   161.6
    avg_w32_8bpc_neon:            1528.9   614.5   633.9
    avg_w64_8bpc_neon:            3792.5  1814.3  1518.3
    avg_w128_8bpc_neon:          10685.3  5220.4  3879.9
    w_avg_w4_8bpc_neon:             75.2    53.0    41.1
    w_avg_w8_8bpc_neon:            186.7   120.1   105.2
    w_avg_w16_8bpc_neon:           531.6   314.1   302.1
    w_avg_w32_8bpc_neon:          2138.4  1120.4  1171.5
    w_avg_w64_8bpc_neon:          5151.9  2910.5  2857.1
    w_avg_w128_8bpc_neon:        13945.0  7330.5  7389.1
    mask_w4_8bpc_neon:              82.0    47.2    45.1
    mask_w8_8bpc_neon:             213.5   115.4   115.8
    mask_w16_8bpc_neon:            639.8   356.2   370.1
    mask_w32_8bpc_neon:           2566.9  1489.8  1435.0
    mask_w64_8bpc_neon:           6727.6  3822.8  3425.2
    mask_w128_8bpc_neon:         17893.0  9622.6  9161.3
  - Martin Storsjö authored
  - Martin Storsjö authored

- Feb 03, 2020
  - On many AMD CPUs, cmov instructions that depend on multiple flags require an additional µop, so prefer cmov variants that depend on only a single flag where possible.
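    An illustrative C-level view of the distinction (whether a compiler emits these exact instructions varies; the functions below are not from the commit): an unsigned x <= limit maps to cmovbe, which reads both CF and ZF, while x < limit + 1 maps to cmovb, which reads CF alone.

    ```c
    #include <stdint.h>

    /* Illustrative: cmovbe/cmova read CF and ZF (an extra µop on many
     * AMD cores); cmovb/cmovae read only CF. When limit + 1 cannot
     * wrap, the two forms are equivalent, and the second permits the
     * single-flag cmovb. */
    uint32_t sel_two_flags(uint32_t x, uint32_t limit, uint32_t other)
    {
        return x <= limit ? x : other;     /* cmovbe-style comparison */
    }

    uint32_t sel_one_flag(uint32_t x, uint32_t limit, uint32_t other)
    {
        return x < limit + 1 ? x : other;  /* cmovb-style comparison */
    }
    ```
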
  - Shave off a few instructions, or save a few bytes, in various places. Also change some instructions to use appropriately sized registers.
  - Using signed comparisons could theoretically cause the wrong result in some obscure corner cases on x86-32 with PAE. x86-64 should be fine with either, but unsigned is technically more correct.
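    A small illustration of the difference (not from the commit): user pointers above 2 GiB are negative when viewed as signed integers, so ordering them must go through an unsigned type.

    ```c
    #include <stdint.h>

    /* Illustrative: on x86-32 with a large user address space, a
     * pointer above 2 GiB is negative as intptr_t, so a signed compare
     * (jl/jge) can order two addresses incorrectly. An unsigned
     * compare (jb/jae) is always correct. */
    int addr_before(const void *a, const void *b)
    {
        return (uintptr_t)a < (uintptr_t)b;
    }
    ```
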
- Feb 01, 2020
  - Henrik Gramner authored
    Avoids some pointer chasing and simplifies the DSP code, at the cost of making the initialization a little bit more complicated. Also reduces memory usage by a small amount due to properly sizing the buffers instead of always allocating enough space for 4:4:4.
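    A sketch of the sizing idea (the helper is hypothetical): chroma plane dimensions follow the layout's subsampling, so 4:2:0 needs only a quarter of the 4:4:4 chroma allocation.

    ```c
    #include <stddef.h>

    /* Hypothetical helper: compute a chroma plane's size from the
     * subsampling factors instead of reserving the 4:4:4 worst case.
     * 4:2:0 -> ss_hor=1, ss_ver=1; 4:2:2 -> 1, 0; 4:4:4 -> 0, 0. */
    size_t chroma_plane_size(size_t width, size_t height,
                             int ss_hor, int ss_ver)
    {
        const size_t cw = (width  + ss_hor) >> ss_hor;
        const size_t ch = (height + ss_ver) >> ss_ver;
        return cw * ch;
    }
    ```
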
  - Henrik Gramner authored
  - Henrik Gramner authored
    We specify most strides in bytes, but since C defines offsets in multiples of sizeof(type), we use the PXSTRIDE() macro to shift the strides right by one in high bit depth templated files. This, however, means that the compiler is required to mask away the least significant bit, because it could in theory be non-zero. Avoid that by telling the compiler (when compiled in release mode) that the lsb is in fact guaranteed to always be zero.
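    A minimal sketch of the idea, assuming GCC/Clang's __builtin_unreachable (dav1d's actual macro may differ): debug builds assert the evenness, release builds promise it to the compiler so the shift needs no masking.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Sketch: 16 bpc byte strides are always even. Debug builds check
     * it; release builds promise it, letting the compiler treat
     * PXSTRIDE(x) as a plain shift with a known-zero low bit. */
    #ifdef NDEBUG
    #define PXSTRIDE(x) \
        (((x) & 1) ? (__builtin_unreachable(), (ptrdiff_t)0) : ((x) >> 1))
    #else
    #define PXSTRIDE(x) (assert(!((x) & 1)), (x) >> 1)
    #endif
    ```
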
- Jan 29, 2020
  - Henrik Gramner authored
    Required for AVX-512.
  - Martin Storsjö authored
    Before:
    ARM32:                      Cortex A7      A8      A9     A53     A72     A73
    cdef_filter_4x4_8bpc_neon:      964.6   599.5   707.9   601.2   465.1   405.2
    cdef_filter_4x8_8bpc_neon:     1726.0  1066.2  1238.7  1041.7   798.6   725.3
    cdef_filter_8x8_8bpc_neon:     2974.4  1671.8  1943.9  1806.1  1229.8  1242.1
    ARM64:                     Cortex A53     A72     A73
    cdef_filter_4x4_8bpc_neon:      569.2   337.8   348.7
    cdef_filter_4x8_8bpc_neon:     1031.1   623.3   633.6
    cdef_filter_8x8_8bpc_neon:     1847.5  1097.7  1117.5
    After:
    ARM32:                      Cortex A7      A8      A9     A53     A72     A73
    cdef_filter_4x4_8bpc_neon:      798.4   524.2   617.3   506.8   432.4   361.1
    cdef_filter_4x8_8bpc_neon:     1394.7   910.4  1054.0   863.6   730.2   632.2
    cdef_filter_8x8_8bpc_neon:     2364.6  1453.8  1675.1  1466.0  1086.4  1107.7
    ARM64:                     Cortex A53     A72     A73
    cdef_filter_4x4_8bpc_neon:      461.7   303.1   308.6
    cdef_filter_4x8_8bpc_neon:      833.0   547.5   556.0
    cdef_filter_8x8_8bpc_neon:     1459.3   934.1   967.9
  - Martin Storsjö authored

- Jan 28, 2020

- Jan 27, 2020
  - The main change is splitting the filter code into three different code paths depending on the strength values. Clipping is only required when both the primary and secondary strengths are non-zero, which is an uncommon case; being able to skip that complexity in the common cases is significantly faster.
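    A sketch of the resulting dispatch (the helper names are made up): the clipped path only runs when both strengths are non-zero.

    ```c
    #include <stdio.h>

    /* Illustrative stubs standing in for the three asm code paths. */
    static void filter_pri_sec_clip(void) { puts("pri+sec, clipped"); }
    static void filter_pri_only(void)     { puts("pri only, no clip"); }
    static void filter_sec_only(void)     { puts("sec only, no clip"); }

    /* Clipping is needed only when both strengths are non-zero, so
     * the common single-strength cases take the cheaper paths. */
    void cdef_filter_block(int pri_strength, int sec_strength)
    {
        if (pri_strength && sec_strength)
            filter_pri_sec_clip();
        else if (pri_strength)
            filter_pri_only();
        else
            filter_sec_only();
    }
    ```
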
- Jan 25, 2020
  - If the primary strengths for both luma and chroma are zero, the direction is always zero. If both the primary and secondary luma strengths are zero, the entire luma filtering process is a no-op.
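    The two early exits can be sketched as predicates (names are illustrative):

    ```c
    #include <stdbool.h>

    /* If no plane has a primary strength, the direction is always
     * zero, so the direction search can be skipped. */
    bool need_dir_search(int y_pri_strength, int uv_pri_strength)
    {
        return y_pri_strength || uv_pri_strength;
    }

    /* With zero primary and secondary luma strengths, luma filtering
     * is a no-op and can be skipped entirely. */
    bool need_luma_filter(int y_pri_strength, int y_sec_strength)
    {
        return y_pri_strength || y_sec_strength;
    }
    ```
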
- Jan 21, 2020
  - Henrik Gramner authored
    Required for the AVX-512 instructions added in Ice Lake.
  - Konstantin Pavlov authored
    The image now includes nasm 2.14.02, which is needed to assemble AVX-512 code.