- Sep 19, 2024
-
-
MARBEAN authored
-
- Sep 18, 2024
-
-
This is needed for GCC 4.7 and earlier, as well as Visual Studio 2022 version 17.9 and earlier.
-
-
- Sep 17, 2024
-
-
Luca Barbato authored
-
Luca Barbato authored
-
Luca Barbato authored
Initial i32x4 version, can be used as a base for high bitdepth.
-
Luca Barbato authored
-
Luca Barbato authored
-
Luca Barbato authored
-
Luca Barbato authored
-
Jean-Baptiste Kempf authored
-
Jean-Baptiste Kempf authored
-
- Sep 12, 2024
-
-
-
There are some instruction sequences we could merge after the lane load/store patch (ec5c3052). This change will simplify the loading of filter weights to save 288 bytes in the Armv8.0 Neon path of 6-tap and 8-tap MC functions.
-
- Sep 11, 2024
-
-
Kacper Michajłow authored
This is possible because we no longer generate version.h at compile time. Reverts the header change from 7629402b to preserve the same behaviour as before.
-
- Sep 10, 2024
-
-
Kacper Michajłow authored
Instead of generating version.h, move the so version there and parse it in meson.
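A minimal sketch of the idea, written in Python purely for illustration (the real parsing is done in meson, and that code is not shown here). The macro prefix `DAV1D_API_VERSION_` and the sample values in `HEADER_TEXT` are assumptions, not taken from the commit:

```python
import re

# Hypothetical header contents; the macro names and values are assumptions.
HEADER_TEXT = """\
#define DAV1D_API_VERSION_MAJOR 7
#define DAV1D_API_VERSION_MINOR 0
#define DAV1D_API_VERSION_PATCH 0
"""


def parse_soversion(header_text, prefix="DAV1D_API_VERSION_"):
    """Collect '<prefix>MAJOR/MINOR/PATCH' defines and join them as 'x.y.z'."""
    fields = {}
    pattern = r"^#define[ \t]+(\w+)[ \t]+(\d+)[ \t]*$"
    for name, value in re.findall(pattern, header_text, re.MULTILINE):
        if name.startswith(prefix):
            fields[name[len(prefix):]] = int(value)
    return "{MAJOR}.{MINOR}.{PATCH}".format(**fields)


print(parse_soversion(HEADER_TEXT))
```

Keeping the numbers in a checked-in header means the header and the build system read the same values, so nothing has to be generated at configure time.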
-
- Sep 06, 2024
-
-
Shared object binary size reduction:
  x86_64            : 16112 bytes
  ARM64             : 16008 bytes
  ARM64(+Os)        : 21592 bytes
  ARMv7(+Os+mthumb) : 18480 bytes

Size reduction of symbols:
  x86_64            : 15712 bytes
  ARM64             : 18688 bytes
  ARM64(+Os)        : 18404 bytes
  ARMv7(+Os+mthumb) : 17322 bytes

Compiles were done with clang version 18.1.8 and symbol sizes were obtained using nm on the shared object.

Provides speedups on older ARM64 CPUs with very little impact on other CPUs.

Speedup:
c7i (skylake)
  Nature1080p       : x0.999
  Chimera           : x0.998
odroid C4
  Nature1080p       : x1.007
  Chimera           : x1.016
  Models1080p       : x1.005
  MountainBike1080p : x1.009
  Balloons1080p     : x1.008
Raspberry Pi 4
  Nature1080p       : x1.005
  Chimera           : x0.999
  Models1080p       : x0.999
  MountainBike1080p : x1.004
  Balloons1080p     : x1.003
Raspberry Pi 2 (Cortex-A7): (using size optimized build)
  Nature1080p       : x1.003
  Models1080p       : x0.997
-
Martin Storsjö authored
This allows executing all the tools within e.g. valgrind. This matches the "meson test --wrap <tool>" feature.
-
For the 3x3 part, double the width of the vertical loop. This is done to provide more latency in the new sgr calculation.

Initial (master):
                    Cortex A53       A55       A72       A73       A76  Apple M1
sgr_3x3_8bpc_neon:    387702.8  383154.2  295742.4  302100.1  185420.7     472.2
sgr_5x5_8bpc_neon:    261725.1  256919.8  194205.1  197585.6  128311.3     332.9
sgr_mix_8bpc_neon:    628085.0  593664.2  453551.8  450553.8  281956.0     711.2

Current:
sgr_3x3_8bpc_neon:    368331.4  363949.7  275499.0  272056.3  169614.4     432.7
sgr_5x5_8bpc_neon:    257866.7  255265.5  195962.5  199557.8  120481.3     319.2
sgr_mix_8bpc_neon:    598234.1  572896.4  418500.4  438910.7  258977.7     659.3

Include a minor improvement that gets rid of a dup instruction.
-
Partial register writes can create long dependency chains, which can reduce performance on out-of-order CPUs. This patch removes most of these kinds of problems in MC functions by filling the full register before other lane loading instructions. Most lane extracting stores can also be optimized using FP scalar stores when the 0th lane would be extracted.

Relative runtime of micro benchmarks after this patch on some Neoverse and Cortex CPU cores:

8bpc neon           V2      V1      X3      X1      A715    A78     A76
avg w8:             0.942x  1.030x  0.936x  0.935x  1.000x  0.877x  0.976x
w_avg w8:           0.908x  0.913x  0.919x  0.914x  0.999x  0.905x  0.910x
mask w8:            0.937x  0.905x  0.929x  0.907x  1.009x  0.921x  0.868x
w_mask 420 w4:      0.969x  0.968x  0.951x  0.962x  0.995x  0.976x  0.958x
w_mask 420 w8:      0.979x  0.935x  0.936x  0.935x  0.996x  0.948x  0.959x
blend w4:           0.721x  0.841x  0.764x  0.822x  0.772x  0.826x  0.883x
blend w8:           0.692x  0.733x  0.686x  0.730x  0.828x  0.723x  0.762x
blend h w2:         0.738x  0.776x  0.746x  0.775x  0.683x  0.827x  0.851x
blend h w4:         0.858x  0.942x  0.880x  0.933x  0.784x  0.924x  0.965x
blend h w8:         0.804x  0.807x  0.806x  0.805x  0.814x  0.810x  0.748x
blend v w2:         0.898x  0.931x  0.903x  0.949x  0.784x  0.867x  0.875x
blend v w4:         0.935x  0.905x  0.933x  0.922x  0.763x  0.777x  0.807x
blend v w8:         0.803x  0.802x  0.804x  0.815x  0.674x  0.677x  0.678x

16bpc neon          V2      V1      X3      X1      A715    A78     A76
avg w4:             0.899x  0.967x  0.897x  0.948x  1.002x  0.901x  0.884x
w_avg w4:           0.952x  0.951x  0.936x  0.946x  0.997x  0.937x  0.925x
mask w4:            0.893x  0.958x  0.887x  0.948x  1.003x  0.938x  0.934x
w_mask 420 w4:      0.933x  0.932x  0.932x  0.939x  1.000x  0.910x  0.955x
w_mask 420 w8:      0.966x  0.962x  0.967x  0.961x  1.000x  0.990x  1.010x
blend w4:           0.367x  0.361x  0.370x  0.352x  0.418x  0.394x  0.476x
blend h w2:         0.365x  0.445x  0.369x  0.437x  0.416x  0.576x  0.699x
blend h w4:         0.343x  0.402x  0.342x  0.398x  0.418x  0.525x  0.603x
blend v w2:         0.464x  0.460x  0.460x  0.447x  0.494x  0.446x  0.503x
blend v w4:         0.432x  0.424x  0.437x  0.416x  0.433x  0.427x  0.534x
blend v w8:         0.936x  0.847x  0.949x  0.848x  1.007x  0.811x  0.785x

bilinear 8bpc neon  V2      V1      X3      X1      A715    A78     A76
mct w4 0:           0.982x  0.983x  0.955x  1.029x  0.784x  0.817x  0.814x
mc w2 h:            0.277x  0.333x  0.275x  0.325x  0.299x  0.435x  0.518x
mct w4 h:           0.835x  0.862x  0.814x  0.887x  1.074x  0.899x  0.884x
mc w2 v:            0.887x  0.966x  0.894x  0.945x  0.808x  0.953x  0.997x
mc w4 v:            0.762x  0.899x  0.766x  0.867x  0.695x  0.915x  1.017x
mct w4 v:           0.700x  0.812x  0.740x  0.777x  0.777x  0.824x  0.853x
mc w2 hv:           0.928x  0.985x  0.929x  0.978x  0.789x  0.969x  1.010x
mct w4 hv:          0.887x  0.913x  0.912x  0.920x  1.001x  0.922x  0.937x

bilinear 16bpc neon V2      V1      X3      X1      A715    A78     A76
mc w2 0:            0.991x  1.032x  0.993x  0.970x  0.878x  0.925x  0.999x
mct w4 0:           0.811x  0.730x  0.797x  0.680x  0.808x  0.711x  0.805x
mc w4 h:            0.885x  0.901x  0.895x  0.905x  1.003x  0.909x  0.910x
mct w4 h:           0.902x  0.914x  0.898x  0.896x  1.000x  0.897x  0.934x
mc w2 v:            0.888x  0.966x  0.913x  0.955x  0.824x  0.958x  1.005x
mc w4 v:            0.897x  0.894x  0.903x  0.902x  1.001x  0.895x  0.895x
mct w4 v:           0.924x  0.908x  0.921x  0.901x  1.001x  0.904x  0.918x
mc w4 hv:           0.927x  0.925x  0.924x  0.933x  1.000x  0.936x  0.959x
mct w4 hv:          0.923x  0.944x  0.923x  0.944x  0.999x  0.931x  0.956x

8tap 8bpc neon      V2      V1      X3      X1      A715    A78     A76
mct regular w4 0:   0.829x  0.854x  0.735x  0.861x  0.769x  0.766x  0.840x
mc regular w2 h:    0.984x  1.008x  0.983x  1.012x  0.986x  0.989x  0.995x
mc sharp w2 h:      0.987x  1.008x  0.986x  1.011x  0.985x  0.989x  0.995x
mc regular w4 h:    0.907x  0.911x  0.916x  0.908x  0.997x  0.936x  0.932x
mc sharp w4 h:      0.916x  0.914x  0.918x  0.913x  0.999x  0.939x  0.905x
mct regular w4 h:   0.992x  0.979x  0.993x  0.971x  1.000x  0.986x  0.976x
mct sharp w4 h:     0.991x  0.979x  0.989x  0.984x  1.001x  0.979x  0.983x
mc regular w2 v:    1.002x  1.001x  1.005x  1.000x  1.000x  0.998x  0.983x
mc sharp w2 v:      1.005x  1.001x  1.009x  0.998x  0.994x  0.997x  0.989x
mc regular w4 v:    0.985x  0.998x  0.991x  0.998x  1.000x  1.000x  0.983x
mc sharp w4 v:      1.005x  1.002x  1.006x  1.002x  0.998x  0.991x  0.999x
mct regular w4 v:   0.966x  0.967x  0.961x  0.974x  0.996x  0.954x  0.982x
mct sharp w4 v:     0.970x  0.944x  0.967x  0.944x  0.997x  0.951x  0.966x
mc regular w2 hv:   0.993x  0.993x  0.994x  0.987x  0.993x  0.985x  0.999x
mc sharp w2 hv:     0.994x  0.996x  0.992x  0.998x  0.997x  0.999x  0.999x
mc regular w4 hv:   0.964x  0.958x  0.964x  0.960x  0.982x  0.938x  0.958x
mc sharp w4 hv:     0.982x  0.981x  0.980x  0.982x  0.995x  0.986x  0.941x
mct regular w4 hv:  0.993x  0.994x  0.992x  0.994x  0.996x  0.992x  0.988x
mct sharp w4 hv:    0.993x  0.996x  0.991x  0.996x  0.954x  0.992x  1.011x

8tap 16bpc neon     V2      V1      X3      X1      A715    A78     A76
mc regular w2 0:    0.869x  1.059x  0.874x  0.956x  0.883x  0.932x  1.000x
mct regular w4 0:   0.348x  0.369x  0.354x  0.377x  0.560x  0.409x  0.648x
mc regular w2 h:    0.996x  0.988x  0.992x  0.985x  0.989x  0.991x  1.006x
mc sharp w2 h:      0.996x  0.989x  0.979x  0.991x  0.987x  0.988x  0.997x
mc regular w4 h:    0.957x  0.937x  0.957x  0.948x  0.961x  0.927x  0.994x
mc sharp w4 h:      0.966x  0.940x  0.962x  0.954x  0.985x  0.929x  0.970x
mct regular w4 h:   0.922x  0.942x  0.932x  0.933x  1.007x  0.938x  0.905x
mct sharp w4 h:     0.919x  0.943x  0.919x  0.931x  0.971x  0.943x  0.929x
mc regular w2 v:    1.000x  0.997x  1.001x  1.003x  1.001x  0.999x  0.984x
mc sharp w2 v:      1.000x  0.999x  1.000x  0.999x  1.000x  1.000x  0.993x
mc regular w4 v:    0.936x  0.941x  0.936x  0.939x  0.999x  0.928x  0.981x
mc sharp w4 v:      0.955x  0.961x  0.949x  0.956x  0.999x  0.947x  0.953x
mct regular w4 v:   0.977x  0.966x  0.979x  0.968x  0.990x  0.972x  0.972x
mct sharp w4 v:     0.973x  0.965x  0.981x  0.963x  0.994x  0.977x  0.974x
mc regular w2 hv:   0.995x  1.001x  0.995x  0.995x  0.995x  1.000x  0.981x
mc sharp w2 hv:     0.993x  1.012x  0.993x  0.988x  0.996x  0.992x  1.008x
mc regular w4 hv:   0.938x  0.943x  0.939x  0.943x  0.986x  0.943x  0.997x
mc sharp w4 hv:     0.969x  0.959x  0.970x  0.974x  0.986x  0.993x  0.997x
mct regular w4 hv:  0.942x  0.970x  0.951x  0.960x  0.977x  0.958x  1.018x
mct sharp w4 hv:    0.923x  0.958x  0.934x  0.955x  0.973x  0.946x  0.986x
-
The 6-tap horizontal and the horizontal parts of 6-tap HV subpel filters can be further improved by some pointer arithmetic and saving some instructions (EXTs) in their data rearrangement codes.

Relative runtime of micro benchmarks after this patch on Cortex CPU cores:

SBD mct h      X1      A78     A76     A72     A55
regular w8:    0.878x  0.894x  0.990x  0.923x  0.944x
regular w16:   0.962x  0.931x  0.943x  0.949x  0.949x
regular w32:   0.937x  0.937x  0.972x  0.938x  0.947x
regular w64:   0.920x  0.965x  0.992x  0.936x  0.944x

SBD mct hv     X1      A78     A76     A72     A55
regular w8:    0.931x  0.970x  0.951x  0.950x  0.971x
regular w16:   0.940x  0.971x  0.941x  0.952x  0.967x
regular w32:   0.943x  0.972x  0.946x  0.961x  0.974x
regular w64:   0.943x  0.973x  0.952x  0.944x  0.975x
-
The horizontal parts of 6-tap HV subpel filters can be further improved by some pointer arithmetic and saving some instructions (EXTs) in their data rearrangement codes.

Relative runtime of micro benchmarks after this patch on Cortex CPU cores:

HBD mct hv     X1      A78     A76     A72     A55
regular w8:    0.952x  0.989x  0.924x  0.973x  0.976x
regular w16:   0.961x  0.993x  0.928x  0.952x  0.971x
regular w32:   0.964x  0.996x  0.930x  0.973x  0.972x
regular w64:   0.963x  0.997x  0.930x  0.969x  0.974x
-
The 6-tap horizontal subpel filters can be further improved by some pointer arithmetic and saving some instructions (EXTs) in their data rearrangement codes.

Relative runtime of micro benchmarks after this patch on some Cortex CPU cores:

regular:   X1      A78     A76     A55
mc w8:     0.915x  0.937x  0.900x  0.982x
mc w16:    0.917x  0.947x  0.911x  0.971x
mc w32:    0.914x  0.938x  0.873x  0.961x
mc w64:    0.918x  0.932x  0.882x  0.964x
-
The reduction parts of the horizontal HBD MC filters use SRSHL+SQXTUN+SRSHL instruction sequences. In the horizontal case this can be rewritten using a single SQSHRUN instruction with an additional rounding value (34 for 10-bit and 40 for 12-bit).

Relative runtime of micro benchmarks after this patch on some Cortex CPU cores:

regular:   X1      A78     A76     A55
mc w2:     0.847x  0.864x  0.822x  0.859x
mc w4:     0.889x  0.994x  0.868x  0.917x
mc w8:     0.857x  0.911x  0.915x  0.978x
mc w16:    0.890x  0.982x  0.868x  0.974x
mc w32:    0.904x  0.991x  0.873x  0.967x
mc w64:    0.919x  1.003x  0.860x  0.970x
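The equivalence between two rounding shifts and one biased shift can be checked numerically. The Python sketch below models only the shifts; saturation and the final pixel clip are left out. The relation `intermediate_bits = 14 - bitdepth` and the total shift of 6 are assumptions about the filter's fixed-point layout, not something stated in the commit message:

```python
def round_shift(value, shift):
    """Arithmetic right shift that adds half of the divisor before shifting."""
    return (value + (1 << (shift - 1))) >> shift


def shifts_for(bitdepth):
    """Return (first_shift, second_shift); assumes intermediate_bits = 14 - bitdepth."""
    intermediate_bits = 14 - bitdepth  # assumption, see lead-in
    return 6 - intermediate_bits, intermediate_bits


def two_stage(total, bitdepth):
    """Two separate rounding shifts, as in the older instruction sequence."""
    first, second = shifts_for(bitdepth)
    return round_shift(round_shift(total, first), second)


def rounding_bias(bitdepth):
    """Bias that folds both rounding constants into one addition."""
    first, second = shifts_for(bitdepth)
    return (1 << (first - 1)) + ((1 << (second - 1)) << first)


def single_stage(total, bitdepth):
    """One truncating shift by the full amount after adding the combined bias."""
    first, second = shifts_for(bitdepth)
    return (total + rounding_bias(bitdepth)) >> (first + second)


for bitdepth in (10, 12):
    print(bitdepth, rounding_bias(bitdepth))
```

Under these assumptions the bias works out to 34 for 10-bit and 40 for 12-bit, matching the values quoted in the commit message, and the two functions agree for every input because nested floor divisions by powers of two collapse into a single one.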
-
- Sep 05, 2024
-
-
-
This makes `#include <dav1d/dav1d.h>` work correctly, as we point to the parent include directory, the same as in a normal installation. It also fixes a conflict when including "version.h", which may already exist in the parent project or another subproject. Be more specific about the headers. Normally it works, but when building as a subproject, version.h is generated in the build directory, so it is no longer prioritized when included from dav1d.h, and another header with the same name may be included instead.
-
- Sep 04, 2024
-
-
- Sep 01, 2024
-
-
Cameron Cawley authored
-
- Aug 30, 2024
-
-
- Aug 29, 2024
-
-
-
-
-
-
Martin Storsjö authored
This should allow executing in environments where the executable memory isn't readable.

Use 4 byte entries instead of 2; most object file formats support relocations for a 4 byte symbol difference across sections, which allows keeping the rest of the table lookup code similar to what it was before.

Referencing a symbol in an arbitrary location in the executable requires a two instruction sequence (adrp+add, via the movrel macro). Thus, the cost of this rewrite is doubling the size of the jump tables (which were quite small so far), and adding one instruction in each jump table setup prologue.

On an ELF build, the .text section shrinks by 1176 bytes, and the .rodata section grows by 3136 bytes, i.e. a 1960 byte increase.

While refactoring, prefer doing sign extension during the load (using ldrsw rather than ldr, to avoid using the "sxtw" modifier on the add instruction), as extending ALU arithmetics have a higher latency.

MS armasm64 doesn't seem to support calculating symbol differences across sections (see [1]), so keep the jump tables in the text section there, to let the assembler calculate it at assembly time instead. (Keeping the condition as _WIN32 for simplicity, as we don't interact directly with armasm64, but it is wrapped in gas-preprocessor.)

[1] https://developercommunity.visualstudio.com/t/armasm64-unable-to-create-cross-section/10722340
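The table layout described above can be modelled in a few lines. This is a hedged Python sketch, not the assembly from the commit: the addresses are made up, the helper names are hypothetical, and using the table's own address as the base of each 4 byte difference is an assumption.

```python
import struct

ENTRY_SIZE = 4  # bytes per entry, as described in the commit message


def build_table(table_addr, targets):
    """Store each target as a signed 32-bit difference from the table address."""
    return b"".join(struct.pack("<i", t - table_addr) for t in targets)


def lookup(table_addr, table, index):
    """Load one entry with sign extension, then add it to the table address."""
    (offset,) = struct.unpack_from("<i", table, index * ENTRY_SIZE)
    return table_addr + offset


# Made-up addresses: the table sits in a data section above the code,
# so every stored difference is negative and sign extension matters.
table_addr = 0x9000
targets = [0x1000, 0x1040, 0x10C0]
table = build_table(table_addr, targets)
print([hex(lookup(table_addr, table, i)) for i in range(len(targets))])
```

Decoding with a signed format plays the role of the sign-extending load: without it, a table placed after the code would yield large positive offsets instead of negative ones.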
-
Martin Storsjö authored
-
-