Commits · d91252f4db4d57176cbf40c746392ed08782a904 · MARBEAN / dav1d

Sep 19, 2024
- Merge remote-tracking branch 'code.videolan.org/master' · d91252f4
  MARBEAN authored 7 months ago
  
  d91252f4
Sep 18, 2024
- Define __ARM_ARCH with older compilers · a7a40a3f
  Cameron Cawley authored 7 months ago and Martin Storsjö committed 7 months ago
```
This is needed for GCC 4.7 and earlier, as well as Visual Studio 2022 version 17.9 and earlier.
```
  a7a40a3f
- Support older ARM versions with checkasm · 8e993f4d
  Cameron Cawley authored 7 months ago and Martin Storsjö committed 7 months ago
  
  8e993f4d
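The `__ARM_ARCH` fallback described in a7a40a3f can be pictured with the sketch below. It is an illustration only, not the code from the commit: the `EXAMPLE_ARM_ARCH` macro, the `example_arm_arch()` helper, and the particular set of legacy macros checked are assumptions made for this example.

```c
#include <assert.h>

/* Prefer the compiler's own __ARM_ARCH when it exists. */
#if defined(__ARM_ARCH)
# define EXAMPLE_ARM_ARCH __ARM_ARCH
/* 64-bit Arm targets are at least Armv8. */
#elif defined(__aarch64__) || defined(_M_ARM64)
# define EXAMPLE_ARM_ARCH 8
/* MSVC for 32-bit Arm defines _M_ARM to the architecture number. */
#elif defined(_M_ARM)
# define EXAMPLE_ARM_ARCH _M_ARM
/* Older GCC releases only expose one macro per architecture variant. */
#elif defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__) || \
      defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__)
# define EXAMPLE_ARM_ARCH 7
#elif defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6K__) || \
      defined(__ARM_ARCH_6T2__) || defined(__ARM_ARCH_6Z__) || \
      defined(__ARM_ARCH_6ZK__)
# define EXAMPLE_ARM_ARCH 6
/* Some other 32-bit Arm target: assume a conservative baseline. */
#elif defined(__arm__)
# define EXAMPLE_ARM_ARCH 4
/* Not an Arm target at all. */
#else
# define EXAMPLE_ARM_ARCH 0
#endif

/* Compile-time sanity check on the derived value. */
static_assert(EXAMPLE_ARM_ARCH >= 0, "architecture level must not be negative");

/* Returns the derived architecture level, or 0 on non-Arm targets. */
int example_arm_arch(void) {
    return EXAMPLE_ARM_ARCH;
}
```

Code that previously tested `__ARM_ARCH` directly could then test the derived value instead, for example `#if EXAMPLE_ARM_ARCH >= 7`.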
Sep 17, 2024
- ppc: Factor out dc_only itx · 8d9b1e26
  Luca Barbato authored 7 months ago
  
  8d9b1e26
- ppc: itx 16x4 pwr9 · 75d3ad14
  Luca Barbato authored 9 months ago
  
  75d3ad14
- ppc: itx 4x16 pwr9 · 0bf331a1
  Luca Barbato authored 11 months ago
```
Initial i32x4 version, can be used as base for high bitdepth.
```
  0bf331a1
- ppc: Remove high bitdepth macros from the 8bit-only code · 19e122ee
  Luca Barbato authored 11 months ago
  
  19e122ee
- ppc: itx 8x8 pwr9 · b1d847be
  Luca Barbato authored 11 months ago
  
  b1d847be
- ppc: itx 4x8 and 8x4 pwr9 · da51b123
  Luca Barbato authored 1 year ago
  
  da51b123
- ppc: itx 4x4 pwr9 · 33b9d514
  Luca Barbato authored 1 year ago
  
  33b9d514
- NEWS: get ready for 1.5.0 · 21235966
  Jean-Baptiste Kempf authored 7 months ago
  
  21235966
- Update NEWS for 1.4.3 · bd875480
  Jean-Baptiste Kempf authored 10 months ago
  
  bd875480
Sep 12, 2024
- Use #if HAVE_* instead of #ifdef HAVE_* · dd32cd50
  Michael Bradshaw authored 7 months ago and Ronald S. Bultje committed 7 months ago
  
  dd32cd50
- AArch64: Trim Armv8.0 Neon path of 6-tap and 8-tap MC functions · 82e9155c
  Arpad Panyik authored 7 months ago and Martin Storsjö committed 7 months ago
```
There are some instruction sequences we could merge after the lane
load/store patch (ec5c3052).

This change will simplify the loading of filter weights to save 288
bytes in the Armv8.0 Neon path of 6-tap and 8-tap MC functions.
```
  82e9155c
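The difference that dd32cd50 relies on is easy to show in isolation. A minimal sketch, assuming a build system that always defines each feature macro to either 0 or 1; `HAVE_EXAMPLE_FEATURE` and the helper functions are hypothetical names, not dav1d's.

```c
#include <assert.h>

/* Hypothetical feature macro: the build system found that the feature is
 * missing, so it defines the macro to 0 instead of leaving it undefined. */
#define HAVE_EXAMPLE_FEATURE 0

/* #ifdef only asks whether the macro is defined, so a value of 0 passes. */
#ifdef HAVE_EXAMPLE_FEATURE
# define SEEN_BY_IFDEF 1
#else
# define SEEN_BY_IFDEF 0
#endif

/* #if evaluates the value, so a macro defined to 0 counts as disabled. */
#if HAVE_EXAMPLE_FEATURE
# define SEEN_BY_IF 1
#else
# define SEEN_BY_IF 0
#endif

/* Small accessors so the two results can be inspected at run time. */
int feature_seen_by_ifdef(void) { return SEEN_BY_IFDEF; }
int feature_seen_by_if(void)    { return SEEN_BY_IF; }
```

With the macro defined to 0, the `#ifdef` branch is taken even though the feature is absent, while the `#if` branch is not.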
Sep 11, 2024

Remove dav1d/ prefix from dav1d.h · f4a0d7cb

This is possible because we no longer generate version.h at compile
time.

Reverts the header change from 7629402b to preserve the same behaviour
as before.

f4a0d7cb

Sep 10, 2024
- meson: don't generate version.h · 74ccc936
  Kacper Michajłow authored 7 months ago
```
Instead of generating version.h, move the so version there and parse it
in meson.
```
  74ccc936
Sep 06, 2024

Improve density of group context setting macros · 4385e7e1

Kyle Siefring authored 8 months ago and Ronald S. Bultje committed 7 months ago

Shared object binary size reduction:
x86_64           : 16112 bytes
ARM64            : 16008 bytes
ARM64(+Os)       : 21592 bytes
ARMv7(+Os+mthumb): 18480 bytes

Size reduction of symbols:
x86_64           : 15712 bytes
ARM64            : 18688 bytes
ARM64(+Os)       : 18404 bytes
ARMv7(+Os+mthumb): 17322 bytes

Compiles were done with clang version 18.1.8 and symbol sizes were
obtained using nm on the shared object.

Provides speedups on older ARM64 CPUs with very little impact on other
CPUs.

Speedup:

c7i (skylake)
 Nature1080p      : x0.999
 Chimera          : x0.998

odroid C4
 Nature1080p      : x1.007
 Chimera          : x1.016
 Models1080p      : x1.005
 MountainBike1080p: x1.009
 Balloons1080p    : x1.008

Raspberry Pi 4
 Nature1080p      : x1.005
 Chimera          : x0.999
 Models1080p      : x0.999
 MountainBike1080p: x1.004
 Balloons1080p    : x1.003

Raspberry Pi 2 (Cortex-A7):
 (using size optimized build)
 Nature1080p      : x1.003
 Models1080p      : x0.997

4385e7e1

tests: Add an option to dav1d_argon.bash for using a wrapper tool · 166e1df5
Martin Storsjö authored 7 months ago
```
This allows executing all the tools within e.g. valgrind.

This matches the "meson test --wrap <tool>" feature.
```
166e1df5

AArch64: New method for calculating sgr table · 79db1624

Kyle Siefring authored 8 months ago and Martin Storsjö committed 7 months ago

For the 3x3 part, double the width of the vertical loop. This is done to
provide more latency in the new sgr calculation.

Initial (master):  Cortex A53        A55        A72        A73        A76   Apple M1
sgr_3x3_8bpc_neon:   387702.8   383154.2   295742.4   302100.1   185420.7      472.2
sgr_5x5_8bpc_neon:   261725.1   256919.8   194205.1   197585.6   128311.3      332.9
sgr_mix_8bpc_neon:   628085.0   593664.2   453551.8   450553.8   281956.0      711.2

Current:
sgr_3x3_8bpc_neon:   368331.4   363949.7   275499.0   272056.3   169614.4      432.7
sgr_5x5_8bpc_neon:   257866.7   255265.5   195962.5   199557.8   120481.3      319.2
sgr_mix_8bpc_neon:   598234.1   572896.4   418500.4   438910.7   258977.7      659.3

Include a minor improvement that gets rid of a dup instruction.

79db1624

AArch64: Optimize lane load/store in MC functions · ec5c3052

Arpad Panyik authored 7 months ago and Martin Storsjö committed 7 months ago

Partial register writes can create long dependency chains, which can
reduce performance on out-of-order CPUs. This patch removes most of
these kinds of problems in MC functions by filling the full register
before other lane loading instructions.

Most lane extracting stores can also be optimized using FP scalar
stores when the 0th lane would be extracted.

Relative runtime of micro benchmarks after this patch on some Neoverse
and Cortex CPU cores:

8bpc neon            V2      V1      X3      X1    A715     A78     A76
 avg w8:         0.942x  1.030x  0.936x  0.935x  1.000x  0.877x  0.976x
 w_avg w8:       0.908x  0.913x  0.919x  0.914x  0.999x  0.905x  0.910x
 mask w8:        0.937x  0.905x  0.929x  0.907x  1.009x  0.921x  0.868x
 w_mask 420 w4:  0.969x  0.968x  0.951x  0.962x  0.995x  0.976x  0.958x
 w_mask 420 w8:  0.979x  0.935x  0.936x  0.935x  0.996x  0.948x  0.959x
 blend w4:       0.721x  0.841x  0.764x  0.822x  0.772x  0.826x  0.883x
 blend w8:       0.692x  0.733x  0.686x  0.730x  0.828x  0.723x  0.762x
 blend h w2:     0.738x  0.776x  0.746x  0.775x  0.683x  0.827x  0.851x
 blend h w4:     0.858x  0.942x  0.880x  0.933x  0.784x  0.924x  0.965x
 blend h w8:     0.804x  0.807x  0.806x  0.805x  0.814x  0.810x  0.748x
 blend v w2:     0.898x  0.931x  0.903x  0.949x  0.784x  0.867x  0.875x
 blend v w4:     0.935x  0.905x  0.933x  0.922x  0.763x  0.777x  0.807x
 blend v w8:     0.803x  0.802x  0.804x  0.815x  0.674x  0.677x  0.678x

16bpc neon           V2      V1      X3      X1    A715     A78     A76
 avg w4:         0.899x  0.967x  0.897x  0.948x  1.002x  0.901x  0.884x
 w_avg w4:       0.952x  0.951x  0.936x  0.946x  0.997x  0.937x  0.925x
 mask w4:        0.893x  0.958x  0.887x  0.948x  1.003x  0.938x  0.934x
 w_mask 420 w4:  0.933x  0.932x  0.932x  0.939x  1.000x  0.910x  0.955x
 w_mask 420 w8:  0.966x  0.962x  0.967x  0.961x  1.000x  0.990x  1.010x
 blend w4:       0.367x  0.361x  0.370x  0.352x  0.418x  0.394x  0.476x
 blend h w2:     0.365x  0.445x  0.369x  0.437x  0.416x  0.576x  0.699x
 blend h w4:     0.343x  0.402x  0.342x  0.398x  0.418x  0.525x  0.603x
 blend v w2:     0.464x  0.460x  0.460x  0.447x  0.494x  0.446x  0.503x
 blend v w4:     0.432x  0.424x  0.437x  0.416x  0.433x  0.427x  0.534x
 blend v w8:     0.936x  0.847x  0.949x  0.848x  1.007x  0.811x  0.785x

bilinear 8bpc neon       V2      V1      X3      X1    A715     A78     A76
 mct w4 0:           0.982x  0.983x  0.955x  1.029x  0.784x  0.817x  0.814x
 mc w2 h:            0.277x  0.333x  0.275x  0.325x  0.299x  0.435x  0.518x
 mct w4 h:           0.835x  0.862x  0.814x  0.887x  1.074x  0.899x  0.884x
 mc w2 v:            0.887x  0.966x  0.894x  0.945x  0.808x  0.953x  0.997x
 mc w4 v:            0.762x  0.899x  0.766x  0.867x  0.695x  0.915x  1.017x
 mct w4 v:           0.700x  0.812x  0.740x  0.777x  0.777x  0.824x  0.853x
 mc w2 hv:           0.928x  0.985x  0.929x  0.978x  0.789x  0.969x  1.010x
 mct w4 hv:          0.887x  0.913x  0.912x  0.920x  1.001x  0.922x  0.937x

bilinear 16bpc neon       V2      V1      X3      X1    A715     A78     A76
 mc w2 0:             0.991x  1.032x  0.993x  0.970x  0.878x  0.925x  0.999x
 mct w4 0:            0.811x  0.730x  0.797x  0.680x  0.808x  0.711x  0.805x
 mc w4 h:             0.885x  0.901x  0.895x  0.905x  1.003x  0.909x  0.910x
 mct w4 h:            0.902x  0.914x  0.898x  0.896x  1.000x  0.897x  0.934x
 mc w2 v:             0.888x  0.966x  0.913x  0.955x  0.824x  0.958x  1.005x
 mc w4 v:             0.897x  0.894x  0.903x  0.902x  1.001x  0.895x  0.895x
 mct w4 v:            0.924x  0.908x  0.921x  0.901x  1.001x  0.904x  0.918x
 mc w4 hv:            0.927x  0.925x  0.924x  0.933x  1.000x  0.936x  0.959x
 mct w4 hv:           0.923x  0.944x  0.923x  0.944x  0.999x  0.931x  0.956x

8tap 8bpc neon           V2      V1      X3      X1    A715     A78     A76
 mct regular w4 0:   0.829x  0.854x  0.735x  0.861x  0.769x  0.766x  0.840x
 mc regular w2 h:    0.984x  1.008x  0.983x  1.012x  0.986x  0.989x  0.995x
 mc sharp w2 h:      0.987x  1.008x  0.986x  1.011x  0.985x  0.989x  0.995x
 mc regular w4 h:    0.907x  0.911x  0.916x  0.908x  0.997x  0.936x  0.932x
 mc sharp w4 h:      0.916x  0.914x  0.918x  0.913x  0.999x  0.939x  0.905x
 mct regular w4 h:   0.992x  0.979x  0.993x  0.971x  1.000x  0.986x  0.976x
 mct sharp w4 h:     0.991x  0.979x  0.989x  0.984x  1.001x  0.979x  0.983x
 mc regular w2 v:    1.002x  1.001x  1.005x  1.000x  1.000x  0.998x  0.983x
 mc sharp w2 v:      1.005x  1.001x  1.009x  0.998x  0.994x  0.997x  0.989x
 mc regular w4 v:    0.985x  0.998x  0.991x  0.998x  1.000x  1.000x  0.983x
 mc sharp w4 v:      1.005x  1.002x  1.006x  1.002x  0.998x  0.991x  0.999x
 mct regular w4 v:   0.966x  0.967x  0.961x  0.974x  0.996x  0.954x  0.982x
 mct sharp w4 v:     0.970x  0.944x  0.967x  0.944x  0.997x  0.951x  0.966x
 mc regular w2 hv:   0.993x  0.993x  0.994x  0.987x  0.993x  0.985x  0.999x
 mc sharp w2 hv:     0.994x  0.996x  0.992x  0.998x  0.997x  0.999x  0.999x
 mc regular w4 hv:   0.964x  0.958x  0.964x  0.960x  0.982x  0.938x  0.958x
 mc sharp w4 hv:     0.982x  0.981x  0.980x  0.982x  0.995x  0.986x  0.941x
 mct regular w4 hv:  0.993x  0.994x  0.992x  0.994x  0.996x  0.992x  0.988x
 mct sharp w4 hv:    0.993x  0.996x  0.991x  0.996x  0.954x  0.992x  1.011x

8tap 16bpc neon          V2      V1      X3      X1    A715     A78     A76
 mc regular w2 0:    0.869x  1.059x  0.874x  0.956x  0.883x  0.932x  1.000x
 mct regular w4 0:   0.348x  0.369x  0.354x  0.377x  0.560x  0.409x  0.648x
 mc regular w2 h:    0.996x  0.988x  0.992x  0.985x  0.989x  0.991x  1.006x
 mc sharp w2 h:      0.996x  0.989x  0.979x  0.991x  0.987x  0.988x  0.997x
 mc regular w4 h:    0.957x  0.937x  0.957x  0.948x  0.961x  0.927x  0.994x
 mc sharp w4 h:      0.966x  0.940x  0.962x  0.954x  0.985x  0.929x  0.970x
 mct regular w4 h:   0.922x  0.942x  0.932x  0.933x  1.007x  0.938x  0.905x
 mct sharp w4 h:     0.919x  0.943x  0.919x  0.931x  0.971x  0.943x  0.929x
 mc regular w2 v:    1.000x  0.997x  1.001x  1.003x  1.001x  0.999x  0.984x
 mc sharp w2 v:      1.000x  0.999x  1.000x  0.999x  1.000x  1.000x  0.993x
 mc regular w4 v:    0.936x  0.941x  0.936x  0.939x  0.999x  0.928x  0.981x
 mc sharp w4 v:      0.955x  0.961x  0.949x  0.956x  0.999x  0.947x  0.953x
 mct regular w4 v:   0.977x  0.966x  0.979x  0.968x  0.990x  0.972x  0.972x
 mct sharp w4 v:     0.973x  0.965x  0.981x  0.963x  0.994x  0.977x  0.974x
 mc regular w2 hv:   0.995x  1.001x  0.995x  0.995x  0.995x  1.000x  0.981x
 mc sharp w2 hv:     0.993x  1.012x  0.993x  0.988x  0.996x  0.992x  1.008x
 mc regular w4 hv:   0.938x  0.943x  0.939x  0.943x  0.986x  0.943x  0.997x
 mc sharp w4 hv:     0.969x  0.959x  0.970x  0.974x  0.986x  0.993x  0.997x
 mct regular w4 hv:  0.942x  0.970x  0.951x  0.960x  0.977x  0.958x  1.018x
 mct sharp w4 hv:    0.923x  0.958x  0.934x  0.955x  0.973x  0.946x  0.986x

ec5c3052

AArch64: Optimize Armv8.0 Neon path of SBD H/HV 6-tap filters · a992a9be

Arpad Panyik authored 7 months ago and Martin Storsjö committed 7 months ago

The 6-tap horizontal and the horizontal parts of 6-tap HV subpel
filters can be further improved with some pointer arithmetic and by
saving some instructions (EXTs) in their data rearrangement code.

Relative runtime of micro benchmarks after this patch on Cortex CPU
cores:

SBD mct h         X1     A78     A76     A72     A55
 regular  w8:  0.878x  0.894x  0.990x  0.923x  0.944x
 regular w16:  0.962x  0.931x  0.943x  0.949x  0.949x
 regular w32:  0.937x  0.937x  0.972x  0.938x  0.947x
 regular w64:  0.920x  0.965x  0.992x  0.936x  0.944x

SBD mct hv        X1     A78     A76     A72     A55
 regular  w8:  0.931x  0.970x  0.951x  0.950x  0.971x
 regular w16:  0.940x  0.971x  0.941x  0.952x  0.967x
 regular w32:  0.943x  0.972x  0.946x  0.961x  0.974x
 regular w64:  0.943x  0.973x  0.952x  0.944x  0.975x

a992a9be

AArch64: Optimize Armv8.0 Neon path of HBD HV 6-tap filters · 2d808de1

Arpad Panyik authored 7 months ago and Martin Storsjö committed 7 months ago

The horizontal parts of 6-tap HV subpel filters can be further
improved with some pointer arithmetic and by saving some instructions
(EXTs) in their data rearrangement code.

Relative runtime of micro benchmarks after this patch on Cortex CPU
cores:

HBD mct hv        X1     A78     A76     A72     A55
 regular  w8:  0.952x  0.989x  0.924x  0.973x  0.976x
 regular w16:  0.961x  0.993x  0.928x  0.952x  0.971x
 regular w32:  0.964x  0.996x  0.930x  0.973x  0.972x
 regular w64:  0.963x  0.997x  0.930x  0.969x  0.974x

2d808de1

AArch64: Optimize Armv8.0 Neon path of HBD horizontal 6-tap filters · 93339ce8

Arpad Panyik authored 7 months ago and Martin Storsjö committed 7 months ago

The 6-tap horizontal subpel filters can be further improved with some
pointer arithmetic and by saving some instructions (EXTs) in their data
rearrangement code.

Relative runtime of micro benchmarks after this patch on some Cortex
CPU cores:

regular:     X1      A78      A76      A55
 mc  w8:  0.915x   0.937x   0.900x   0.982x
 mc w16:  0.917x   0.947x   0.911x   0.971x
 mc w32:  0.914x   0.938x   0.873x   0.961x
 mc w64:  0.918x   0.932x   0.882x   0.964x

93339ce8

AArch64: Optimize Armv8.0 Neon path of HBD horizontal filters · 109b2427

Arpad Panyik authored 7 months ago and Martin Storsjö committed 7 months ago

The reduction parts of the horizontal HBD MC filters use SRSHL+SQXTUN+
SRSHL instruction sequences. In the horizontal case this can be
rewritten using a single SQSHRUN instruction with an additional
rounding value (34 for 10-bit and 40 for 12-bit).

Relative runtime of micro benchmarks after this patch on some Cortex
CPU cores:

regular:     X1      A78      A76      A55
 mc  w2:  0.847x   0.864x   0.822x   0.859x
 mc  w4:  0.889x   0.994x   0.868x   0.917x
 mc  w8:  0.857x   0.911x   0.915x   0.978x
 mc w16:  0.890x   0.982x   0.868x   0.974x
 mc w32:  0.904x   0.991x   0.873x   0.967x
 mc w64:  0.919x   1.003x   0.860x   0.970x

109b2427
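The rounding values quoted in 109b2427 (34 for 10-bit, 40 for 12-bit) follow from folding two rounding shifts into one. The scalar model below is a sketch of that arithmetic, not the Neon code itself. It assumes the two-step reduction shifts by 2 and then 4 bits for 10-bit content, and by 4 and then 2 bits for 12-bit content; those splits are an assumption that happens to reproduce the constants above. All function names are hypothetical, and upper saturation and the final pixel clamp are left out for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Floor division by 2^sh without relying on the implementation-defined
 * behaviour of right-shifting negative values. */
static int32_t floor_shift(int32_t v, int sh) {
    return v >= 0 ? v >> sh : -((-v + (1 << sh) - 1) >> sh);
}

/* Two-step model: rounding shift, clamp at zero (the unsigned narrowing),
 * then a second rounding shift. */
static int32_t reduce_two_step(int32_t sum, int sh1, int sh2) {
    int32_t mid = floor_shift(sum + ((1 << sh1) >> 1), sh1);
    if (mid < 0) mid = 0;
    return floor_shift(mid + ((1 << sh2) >> 1), sh2);
}

/* One-step model: add a combined rounding value, apply one truncating
 * shift by the total amount, then clamp at zero. */
static int32_t reduce_one_step(int32_t sum, int32_t rnd, int sh) {
    const int32_t out = floor_shift(sum + rnd, sh);
    return out < 0 ? 0 : out;
}

/* Returns 1 if both models agree for every sum in a wide test range. */
int rounding_merge_holds(int sh1, int sh2, int32_t rnd) {
    for (int32_t sum = -(1 << 20); sum <= (1 << 20); sum++) {
        if (reduce_two_step(sum, sh1, sh2) != reduce_one_step(sum, rnd, sh1 + sh2))
            return 0;
    }
    return 1;
}
```

Under these assumptions the combined constant is the first rounding term plus the second one scaled by the first shift: 2 + (8 << 2) = 34 and 8 + (2 << 4) = 40.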

Sep 05, 2024

Support using C11 aligned_alloc for dav1d_alloc_aligned · d2687884
Cameron Cawley authored 7 months ago and Ronald S. Bultje committed 7 months ago

d2687884
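A wrapper around C11 `aligned_alloc`, as mentioned in d2687884, might look like the sketch below. It is not dav1d's `dav1d_alloc_aligned`; the `example_*` names are hypothetical, and the code assumes a C library that provides `aligned_alloc` and lets the result be released with `free`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocates `size` bytes aligned to `align`, which
 * must be a power of two supported by the implementation. */
void *example_alloc_aligned(size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    /* C11 expects the size passed to aligned_alloc() to be a multiple of
     * the alignment, so round the request up first. */
    const size_t padded = (size + align - 1) & ~(align - 1);
    if (padded < size)
        return NULL; /* the rounding overflowed */
    return aligned_alloc(align, padded);
}

/* Memory obtained from aligned_alloc() is released with plain free(). */
void example_free_aligned(void *ptr) {
    free(ptr);
}
```

Rounding the size up keeps the call valid on C libraries that enforce the C11 wording strictly, at the cost of a few extra bytes per allocation.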

meson: fix include directories when building as subproject · 7629402b

Kacper Michajłow authored 7 months ago and Ronald S. Bultje committed 7 months ago

This makes `#include <dav1d/dav1d.h>` work correctly as we point to the
parent include directory, same as in the normal installation.

Also fixes a conflict when including "version.h", which may already
exist in the parent project or another subproject, by being more
specific about the headers. Normally this works, but when building as a
subproject, version.h is generated in the build directory, so it is no
longer prioritized when included from dav1d.h, and another header with
the same name may be included instead.

7629402b

Sep 04, 2024
- Allow software renderers with placebo-gl · 507b697e
  Cameron Cawley authored 7 months ago and Ronald S. Bultje committed 7 months ago
  
  507b697e
- Disable the mouse cursor in dav1dplay · 312972d6
  Cameron Cawley authored 7 months ago and Ronald S. Bultje committed 7 months ago
  
  312972d6
- Allow quitting dav1dplay with the escape key · b9cc27d5
  Cameron Cawley authored 7 months ago and Ronald S. Bultje committed 7 months ago
  
  b9cc27d5
- Allow playing videos in full-screen mode · 2f9fc727
  Cameron Cawley authored 7 months ago and Ronald S. Bultje committed 7 months ago
  
  2f9fc727
- dav1dplay: Ensure that SDL is shut down when the application quits · 4e1a8b45
  Cameron Cawley authored 1 year ago and Ronald S. Bultje committed 7 months ago
  
  4e1a8b45
Sep 01, 2024
- Allow getopt fallback to compile on non-Windows platforms · cc6eb3d5
  Cameron Cawley authored 7 months ago
  
  cc6eb3d5
Aug 30, 2024
- picture: copy HDR10+ and T35 metadata only to visible frames · bdef2997
  Cosmin Stejerean authored 9 months ago and James Almer committed 7 months ago
  
  bdef2997
Aug 29, 2024

Check for sys/types.h before using it · 6b3c489a
Cameron Cawley authored 7 months ago and Martin Storsjö committed 7 months ago

6b3c489a

Only include unistd.h and pthread.h when necessary · 7490d986
Cameron Cawley authored 7 months ago and Martin Storsjö committed 7 months ago

7490d986

Remove unused sys/stat.h includes · a796f66e
Cameron Cawley authored 7 months ago and Martin Storsjö committed 7 months ago

a796f66e

Allow compile time CPU detection to be used when trim_dsp is disabled · 41040189
Cameron Cawley authored 7 months ago and Martin Storsjö committed 7 months ago

41040189

aarch64: Split the jump tables to a separate const section · 41511bf1

Martin Storsjö authored 10 months ago

This should allow executing in environments where the executable
memory isn't readable.

Use 4 byte entries instead of 2; most object file formats support
relocations for a 4 byte symbol difference across sections, which
allows keeping the rest of the table lookup code similar to what
it was before.

Referencing a symbol in an arbitrary location in the executable
requires a two instruction sequence (adrp+add, via the movrel
macro).

Thus, the cost of this rewrite is doubling the size of the jump
tables (which were quite small so far), and adding one instruction
in each jump table setup prologue. On an ELF build, the .text section
shrinks by 1176 bytes, and the .rodata section grows by 3136 bytes,
i.e. a 1960 byte increase.

While refactoring, prefer doing sign extension during the load
(using ldrsw rather than ldr, to avoid using the "sxtw" modifier on
the add instruction), as extending ALU operations have a higher
latency.

MS armasm64 doesn't seem to support calculating symbol differences
across sections (see [1]), so keep the jump tables in the text
section there, to let the assembler calculate it at assembly time
instead. (Keeping the condition as _WIN32 for simplicity, as we don't
interact directly with armasm64, but it is wrapped in gas-preprocessor.)

[1] https://developercommunity.visualstudio.com/t/armasm64-unable-to-create-cross-section/10722340

41511bf1

Fix the macro parameter name for the CHECK_SIZE macro · 0d8abee5
Martin Storsjö authored 7 months ago

0d8abee5

Ensure that the refmvs_refpair union is packed · 0255c2b2
Cameron Cawley authored 7 months ago and Ronald S. Bultje committed 7 months ago

0255c2b2
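What "packed" means for a small union like the one named in 0255c2b2 can be sketched as follows. The `example_refpair` type and the `EXAMPLE_PACKED` macro are hypothetical stand-ins, not dav1d's definitions; the member layout (two 8-bit references overlaid with one 16-bit value) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portability macro for requesting a packed layout. */
#if defined(__GNUC__)
# define EXAMPLE_PACKED(...) __VA_ARGS__ __attribute__((__packed__))
#elif defined(_MSC_VER)
# define EXAMPLE_PACKED(...) __pragma(pack(push, 1)) __VA_ARGS__ __pragma(pack(pop))
#else
# define EXAMPLE_PACKED(...) __VA_ARGS__
#endif

/* Two 8-bit references that can also be accessed as one 16-bit value. */
EXAMPLE_PACKED(typedef union example_refpair {
    int8_t   ref[2];
    uint16_t pair;
}) example_refpair;

/* Code that stores these unions in tightly sized arrays depends on the
 * union being exactly two bytes, so check that at compile time. */
static_assert(sizeof(example_refpair) == 2, "example_refpair must be 2 bytes");

/* Compares both references at once through the 16-bit view. */
int example_refpair_equal(example_refpair a, example_refpair b) {
    return a.pair == b.pair;
}
```

The compile-time check turns a silent layout difference on an unusual ABI into a build failure.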