- Mar 05, 2020
-
-
Jean-Baptiste Kempf authored
-
Checkasm numbers:            Cortex A53      A72      A73
w_mask_420_w4_16bpc_neon:         173.6    123.5    120.3
w_mask_420_w8_16bpc_neon:         484.2    344.1    329.5
w_mask_420_w16_16bpc_neon:       1411.2   1027.4   1035.1
w_mask_420_w32_16bpc_neon:       5561.5   4093.2   3980.1
w_mask_420_w64_16bpc_neon:      13809.6   9856.5   9581.0
w_mask_420_w128_16bpc_neon:     35614.7  25553.8  24284.4
w_mask_422_w4_16bpc_neon:         159.4    112.2    114.2
w_mask_422_w8_16bpc_neon:         453.4    326.1    326.7
w_mask_422_w16_16bpc_neon:       1394.6   1062.3   1050.2
w_mask_422_w32_16bpc_neon:       5485.8   4219.6   4027.3
w_mask_422_w64_16bpc_neon:      13701.2  10079.6   9692.6
w_mask_422_w128_16bpc_neon:     35455.3  25892.5  24625.9
w_mask_444_w4_16bpc_neon:         153.0    112.3    112.7
w_mask_444_w8_16bpc_neon:         437.2    331.8    325.8
w_mask_444_w16_16bpc_neon:       1395.1   1069.1   1041.7
w_mask_444_w32_16bpc_neon:       5370.1   4213.5   4138.1
w_mask_444_w64_16bpc_neon:      13482.6  10190.5  10004.6
w_mask_444_w128_16bpc_neon:     35583.7  26911.2  25638.8

Corresponding numbers for 8 bpc for comparison:

w_mask_420_w4_8bpc_neon:          126.6     79.1     87.7
w_mask_420_w8_8bpc_neon:          343.9    195.0    211.5
w_mask_420_w16_8bpc_neon:         886.3    540.3    577.7
w_mask_420_w32_8bpc_neon:        3558.6   2152.4   2216.7
w_mask_420_w64_8bpc_neon:        8894.9   5161.2   5297.0
w_mask_420_w128_8bpc_neon:      22520.1  13514.5  13887.2
w_mask_422_w4_8bpc_neon:          112.9     68.2     77.0
w_mask_422_w8_8bpc_neon:          314.4    175.5    208.7
w_mask_422_w16_8bpc_neon:         835.5    565.0    608.3
w_mask_422_w32_8bpc_neon:        3381.3   2231.8   2287.6
w_mask_422_w64_8bpc_neon:        8499.4   5343.6   5460.8
w_mask_422_w128_8bpc_neon:      21823.3  14206.5  14249.1
w_mask_444_w4_8bpc_neon:          104.6     65.8     72.7
w_mask_444_w8_8bpc_neon:          290.4    173.7    196.6
w_mask_444_w16_8bpc_neon:         831.4    586.7    591.7
w_mask_444_w32_8bpc_neon:        3320.8   2300.6   2251.0
w_mask_444_w64_8bpc_neon:        8300.0   5480.5   5346.8
w_mask_444_w128_8bpc_neon:      21633.8  15981.3  14384.8
-
Janne Grunau authored
Switches build-debian (for AVX2 checkasm coverage), test-win64, and test-debian-unaligned-stack (for testing asm '%if's). Refs #330, #333
-
- Mar 04, 2020
-
-
-
Martin Storsjö authored
Checkasm numbers:          Cortex A53     A72     A73
blend_h_w2_16bpc_neon:          109.3    83.1    56.7
blend_h_w4_16bpc_neon:          114.1    61.4    62.3
blend_h_w8_16bpc_neon:          133.3    80.8    81.1
blend_h_w16_16bpc_neon:         215.6   132.7   149.5
blend_h_w32_16bpc_neon:         390.4   254.2   235.8
blend_h_w64_16bpc_neon:         719.1   456.3   453.8
blend_h_w128_16bpc_neon:       1646.1  1112.3  1065.9
blend_v_w2_16bpc_neon:          185.9   175.9   180.0
blend_v_w4_16bpc_neon:          338.0   183.4   232.1
blend_v_w8_16bpc_neon:          426.5   213.8   250.6
blend_v_w16_16bpc_neon:         678.2   357.8   382.6
blend_v_w32_16bpc_neon:        1098.3   686.2   695.6
blend_w4_16bpc_neon:             75.7    31.5    32.0
blend_w8_16bpc_neon:            134.0    75.0    75.8
blend_w16_16bpc_neon:           467.9   267.3   310.0
blend_w32_16bpc_neon:          1201.9   658.7   779.7

Corresponding numbers for 8 bpc for comparison:

blend_h_w2_8bpc_neon:           104.1    55.9    60.8
blend_h_w4_8bpc_neon:           108.9    58.7    48.2
blend_h_w8_8bpc_neon:            99.3    64.4    67.4
blend_h_w16_8bpc_neon:          145.2    93.4    85.1
blend_h_w32_8bpc_neon:          262.2   157.5   148.6
blend_h_w64_8bpc_neon:          466.7   278.9   256.6
blend_h_w128_8bpc_neon:        1054.2   624.7   571.0
blend_v_w2_8bpc_neon:           170.5   106.6   113.4
blend_v_w4_8bpc_neon:           333.0   189.9   225.9
blend_v_w8_8bpc_neon:           314.9   199.0   203.5
blend_v_w16_8bpc_neon:          476.9   300.8   241.1
blend_v_w32_8bpc_neon:          766.9   430.4   415.1
blend_w4_8bpc_neon:              66.7    35.4    26.0
blend_w8_8bpc_neon:             110.7    47.9    48.1
blend_w16_8bpc_neon:            299.4   161.8   162.3
blend_w32_8bpc_neon:            725.8   417.0   432.8
-
Martin Storsjö authored
Use a post-increment with a register on the last increment, avoiding a separate increment. Avoid processing the last 8 pixels in the w32 case when we only output 24 pixels.

Before:
ARM32:                 Cortex A7      A8      A9     A53     A72     A73
blend_v_w4_8bpc_neon:      450.4   574.7   538.7   374.6   199.3   260.5
blend_v_w8_8bpc_neon:      559.6   351.3   552.5   357.6   214.8   204.3
blend_v_w16_8bpc_neon:     926.3   511.6   787.9   593.0   271.0   246.8
blend_v_w32_8bpc_neon:    1482.5   917.0  1149.5   991.9   354.0   368.9
ARM64:
blend_v_w4_8bpc_neon:                              351.1   200.0   224.1
blend_v_w8_8bpc_neon:                              333.0   212.4   203.8
blend_v_w16_8bpc_neon:                             495.2   302.0   247.0
blend_v_w32_8bpc_neon:                             840.0   557.8   514.0

After:
ARM32:
blend_v_w4_8bpc_neon:      435.5   575.8   537.6   356.2   198.3   259.5
blend_v_w8_8bpc_neon:      545.2   347.9   553.5   339.1   207.8   204.2
blend_v_w16_8bpc_neon:     913.7   511.0   788.1   573.7   275.4   243.3
blend_v_w32_8bpc_neon:    1445.3   951.2  1079.1   920.4   352.2   361.6
ARM64:
blend_v_w4_8bpc_neon:                              333.0   191.3   225.9
blend_v_w8_8bpc_neon:                              314.9   199.3   203.5
blend_v_w16_8bpc_neon:                             476.9   301.3   241.1
blend_v_w32_8bpc_neon:                             766.9   432.8   416.9
-
Martin Storsjö authored
-
Martin Storsjö authored
-
Martin Storsjö authored
For loads and stores of a full or half register (as opposed to lanewise loads/stores), the lane specification itself doesn't matter, only its size. This doesn't change the generated code, but makes it more readable.
-
- Mar 03, 2020
-
-
Jean-Baptiste Kempf authored
-
Janne Grunau authored
-
- Mar 02, 2020
-
-
Checkasm runtimes:            Cortex A53     A72     A73
lpf_h_sb_uv_w4_16bpc_neon:         919.0   795.0   714.9
lpf_h_sb_uv_w6_16bpc_neon:        1267.7  1116.2  1081.9
lpf_h_sb_y_w4_16bpc_neon:         1500.2  1543.9  1778.5
lpf_h_sb_y_w8_16bpc_neon:         2216.1  2183.0  2568.1
lpf_h_sb_y_w16_16bpc_neon:        2641.8  2630.4  2639.4
lpf_v_sb_uv_w4_16bpc_neon:         836.5   572.7   667.3
lpf_v_sb_uv_w6_16bpc_neon:        1130.8   709.1   955.5
lpf_v_sb_y_w4_16bpc_neon:         1271.6  1434.4  1272.1
lpf_v_sb_y_w8_16bpc_neon:         1818.0  1759.1  1664.6
lpf_v_sb_y_w16_16bpc_neon:        1998.6  2115.8  1586.6

Corresponding numbers for 8 bpc for comparison:

lpf_h_sb_uv_w4_8bpc_neon:          799.4   632.8   695.4
lpf_h_sb_uv_w6_8bpc_neon:         1067.3   613.6   767.5
lpf_h_sb_y_w4_8bpc_neon:          1490.5  1179.1  1018.9
lpf_h_sb_y_w8_8bpc_neon:          1892.9  1382.0  1172.0
lpf_h_sb_y_w16_8bpc_neon:         2117.4  1625.4  1739.0
lpf_v_sb_uv_w4_8bpc_neon:          447.1   447.7   446.0
lpf_v_sb_uv_w6_8bpc_neon:          522.1   529.0   513.1
lpf_v_sb_y_w4_8bpc_neon:          1043.7   785.0   775.9
lpf_v_sb_y_w8_8bpc_neon:          1500.4  1115.9   881.2
lpf_v_sb_y_w16_8bpc_neon:         1493.5  1371.4  1248.5
-
-
-
- Feb 25, 2020
-
-
Requires meson 0.51 for oss-fuzz and 0.49 for the fuzzing binaries in general due to the use of the 'kwargs' keyword argument.
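The 'kwargs' mechanism referred to above might look roughly like this in a fuzzer meson.build (a hedged sketch only — the variable names and targets here are illustrative, not the project's actual build file):

```meson
# Hypothetical sketch: one dictionary of shared keyword arguments, passed to
# several fuzzer targets via the 'kwargs' keyword (meson >= 0.49 for 'kwargs';
# dictionary literals themselves need >= 0.47).
fuzzer_kwargs = {
    'include_directories': dav1d_inc_dirs,
    'link_with': libdav1d,
}

executable('dav1d_fuzzer', 'dav1d_fuzzer.c', kwargs: fuzzer_kwargs)
executable('dav1d_fuzzer_mt', 'dav1d_fuzzer.c', kwargs: fuzzer_kwargs)
```

This avoids repeating the same set of keyword arguments on every fuzzer target.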
-
-
- Feb 24, 2020
-
-
Henrik Gramner authored
-
Henrik Gramner authored
-
Henrik Gramner authored
-
Henrik Gramner authored
-
Victorien Le Couviour--Tuffet authored
Add 2 separate code paths for pri/sec strengths equal to 0. Having both strengths not equal to 0 is uncommon; branching to skip unnecessary computations is therefore beneficial.

before: cdef_filter_4x4_8bpc_avx2:  93.8
after:  cdef_filter_4x4_8bpc_avx2:  71.7

before: cdef_filter_4x8_8bpc_avx2: 161.5
after:  cdef_filter_4x8_8bpc_avx2: 116.3

before: cdef_filter_8x8_8bpc_avx2: 221.8
after:  cdef_filter_8x8_8bpc_avx2: 156.4
-
Victorien Le Couviour--Tuffet authored
Fully edged blocks perf:

before: cdef_filter_4x4_8bpc_avx2:  91.0
after:  cdef_filter_4x4_8bpc_avx2:  75.7

before: cdef_filter_4x8_8bpc_avx2: 154.6
after:  cdef_filter_4x8_8bpc_avx2: 131.8

before: cdef_filter_8x8_8bpc_avx2: 214.1
after:  cdef_filter_8x8_8bpc_avx2: 195.9
-
Change the input buffer randomization algorithm to more readily trigger issues with both under- and overflows in cdef_filter.
-
- Feb 21, 2020
-
-
Luc Trudeau authored
-
Luc Trudeau authored
Muxer and demuxer arrays are now statically initialized
-
- Feb 20, 2020
-
-
Luc Trudeau authored
-
- Feb 18, 2020
-
-
Janne Grunau authored
-
- Feb 17, 2020
-
-
Martin Storsjö authored
This increases the code size by around 3 KB on arm64.

Before:
ARM32:                    Cortex A7      A8      A9     A53     A72     A73
cdef_filter_4x4_8bpc_neon:    807.1   517.0   617.7   506.6   429.9   357.8
cdef_filter_4x8_8bpc_neon:   1407.9   899.3  1054.6   862.3   726.5   628.1
cdef_filter_8x8_8bpc_neon:   2394.9  1456.8  1676.8  1461.2  1084.4  1101.2
ARM64:
cdef_filter_4x4_8bpc_neon:                            460.7   301.8   308.0
cdef_filter_4x8_8bpc_neon:                            831.6   547.0   555.2
cdef_filter_8x8_8bpc_neon:                           1454.6   935.6   960.4

After:
ARM32:
cdef_filter_4x4_8bpc_neon:    669.3   541.3   524.4   424.9   322.7   298.1
cdef_filter_4x8_8bpc_neon:   1159.1   922.9   881.1   709.2   538.3   514.1
cdef_filter_8x8_8bpc_neon:   1888.8  1285.4  1358.5  1152.9   839.3   871.2
ARM64:
cdef_filter_4x4_8bpc_neon:                            383.6   262.1   259.9
cdef_filter_4x8_8bpc_neon:                            684.9   472.2   464.7
cdef_filter_8x8_8bpc_neon:                           1160.0   756.8   788.0

(The checkasm benchmark averages three different cases; the fully edged case is one of those three, while it's the most common case in actual video. The difference is much bigger if only benchmarking that particular case.)

This actually apparently makes the code a little bit slower for the w=4 cases on Cortex A8, while it's a significant speedup on all other cores.
-
Martin Storsjö authored
The signedness of elements doesn't matter for vsub; match the vsub.i16 next to it.
-
- Feb 16, 2020
-
-
Henrik Gramner authored
Console output is incredibly slow on Windows, which is aggravated by the lack of line buffering. As a result, a significant percentage of overall runtime is actually spent displaying the decoding progress. Doing the line buffering manually alleviates most of the issue.
-
- Feb 13, 2020
-
-
Martin Storsjö authored
These were missed in 361a3c8e.
-
- Feb 11, 2020
-
-
Martin Storsjö authored
This only supports 10 bpc, not 12 bpc, as the sum and tmp buffers can be int16_t for 10 bpc, but need to be int32_t for 12 bpc.

Make actual templates out of the functions in looprestoration_tmpl.S, and add box3/5_h to looprestoration16.S.

Extend dav1d_sgr_calc_abX_neon with a mandatory bitdepth_max parameter (which is passed even in 8 bpc mode), and add a define to bitdepth.h for passing such a parameter in all modes. This makes this function a few instructions slower in 8 bpc mode than it was before (overall impact seems to be around 1% of the total runtime of SGR), but allows using the same actual function instantiation for all modes, saving a bit of code size.

Examples of checkasm runtimes:
                            Cortex A53       A72       A73
selfguided_3x3_10bpc_neon:    516755.8  389412.7  349058.7
selfguided_5x5_10bpc_neon:    380699.9  293486.6  254591.6
selfguided_mix_10bpc_neon:    878142.3  667495.9  587844.6

Corresponding 8 bpc numbers for comparison:

selfguided_3x3_8bpc_neon:     491058.1  361473.4  347705.9
selfguided_5x5_8bpc_neon:     352655.0  266423.7  248192.2
selfguided_mix_8bpc_neon:     826094.1  612372.2  581943.1
-
Martin Storsjö authored
looprestoration_common.S contains functions that can be used as is, with a single instantiation of each function serving both 8 and 16 bpc. This file is built once, regardless of which bitdepths are enabled. looprestoration_tmpl.S contains functions where the source can be shared and templated between 8 and 16 bpc. It is included by the separate 8/16 bpc implementation files.
-
Martin Storsjö authored
Don't add it to dav1d_sgr_calc_ab1/2_neon and box3/5_v, as the same concrete function implementations can be shared for both 8 and 16 bpc for those functions.
-
Martin Storsjö authored
This allows using completely different codepaths for 10 and 12 bpc, or just adding SIMD functions for either of them.
-
Martin Storsjö authored
Set flags further from the branch instructions that use them.
-
Martin Storsjö authored
For 8bpc and 10bpc, int16_t is enough here, and for 12bpc, other intermediate int16_t buffers also need to be made of size coef anyway.
-
Martin Storsjö authored
-
- Feb 10, 2020
-
-
Jean-Baptiste Kempf authored
-
Only copy as much as really is needed/used.
-