  1. Sep 19, 2024
  2. Sep 18, 2024
  3. Sep 17, 2024
  4. Sep 12, 2024
  5. Sep 11, 2024
  6. Sep 10, 2024
  7. Sep 06, 2024
    • Improve density of group context setting macros · 4385e7e1
      Kyle Siefring authored and Ronald S. Bultje committed
      Shared object binary size reduction:
      x86_64           : 16112 bytes
      ARM64            : 16008 bytes
      ARM64(+Os)       : 21592 bytes
      ARMv7(+Os+mthumb): 18480 bytes
      
      Size reduction of symbols:
      x86_64           : 15712 bytes
      ARM64            : 18688 bytes
      ARM64(+Os)       : 18404 bytes
      ARMv7(+Os+mthumb): 17322 bytes
      
      Builds used clang version 18.1.8; symbol sizes were obtained by
      running nm on the shared object.
      
      Provides speedups on older ARM64 CPUs with very little impact on other
      CPUs.
      
      Speedup:
      
      c7i (skylake)
       Nature1080p      : x0.999
       Chimera          : x0.998
      
      odroid C4
       Nature1080p      : x1.007
       Chimera          : x1.016
       Models1080p      : x1.005
       MountainBike1080p: x1.009
       Balloons1080p    : x1.008
      
      Raspberry Pi 4
       Nature1080p      : x1.005
       Chimera          : x0.999
       Models1080p      : x0.999
       MountainBike1080p: x1.004
       Balloons1080p    : x1.003
      
      Raspberry Pi 2 (Cortex-A7):
       (using size optimized build)
       Nature1080p      : x1.003
       Models1080p      : x0.997
    • tests: Add an option to dav1d_argon.bash for using a wrapper tool · 166e1df5
      Martin Storsjö authored
      This allows executing all the tools within e.g. valgrind.
      
      This matches the "meson test --wrap <tool>" feature.
    • AArch64: New method for calculating sgr table · 79db1624
      Kyle Siefring authored and Martin Storsjö committed
      For the 3x3 part, double the width of the vertical loop. This gives
      the new sgr calculation more independent work with which to hide
      instruction latency.
      
      Initial (master):  Cortex A53        A55        A72        A73       A76   Apple M1
      sgr_3x3_8bpc_neon:   387702.8   383154.2   295742.4   302100.1  185420.7   472.2
      sgr_5x5_8bpc_neon:   261725.1   256919.8   194205.1   197585.6  128311.3   332.9
      sgr_mix_8bpc_neon:   628085.0   593664.2   453551.8   450553.8  281956.0   711.2
      
      Current:
      sgr_3x3_8bpc_neon:   368331.4   363949.7   275499.0   272056.3  169614.4   432.7
      sgr_5x5_8bpc_neon:   257866.7   255265.5   195962.5   199557.8  120481.3   319.2
      sgr_mix_8bpc_neon:   598234.1   572896.4   418500.4   438910.7  258977.7   659.3
      
      Also includes a minor improvement that removes a dup instruction.
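
The other commits in this log report relative runtimes rather than raw timings. To read the sgr figures the same way, each "Current" timing can be divided by its "Initial (master)" counterpart. A minimal Python sketch using the sgr_3x3_8bpc_neon rows above (the list layout and the function name are illustrative and not part of dav1d):

```python
# Timings copied from the sgr_3x3_8bpc_neon rows of the commit message.
CORES = ["A53", "A55", "A72", "A73", "A76", "M1"]
INITIAL = [387702.8, 383154.2, 295742.4, 302100.1, 185420.7, 472.2]
CURRENT = [368331.4, 363949.7, 275499.0, 272056.3, 169614.4, 432.7]


def relative_runtime(before, after):
    """Return after/before; values below 1.0 mean the new code is faster."""
    return after / before


# One ratio per core, in the same column order as the table.
ratios = {
    core: relative_runtime(before, after)
    for core, before, after in zip(CORES, INITIAL, CURRENT)
}

for core, ratio in ratios.items():
    print(f"{core}: {ratio:.3f}x")
```

The same division applies to the sgr_5x5 and sgr_mix rows.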
    • AArch64: Optimize lane load/store in MC functions · ec5c3052
      Arpad Panyik authored and Martin Storsjö committed
      Partial register writes can create long dependency chains, which can
      reduce performance on out-of-order CPUs. This patch removes most of
      these problems in MC functions by writing the full register before
      any subsequent lane-loading instructions.
      
      Most lane-extracting stores can also be replaced with FP scalar
      stores when lane 0 is the one being extracted.
      
      Relative runtime of micro benchmarks after this patch on some Neoverse
      and Cortex CPU cores:
      
      8bpc neon                V2      V1      X3      X1    A715     A78     A76
       avg        w8:       0.942x  1.030x  0.936x  0.935x  1.000x  0.877x  0.976x
       w_avg      w8:       0.908x  0.913x  0.919x  0.914x  0.999x  0.905x  0.910x
       mask       w8:       0.937x  0.905x  0.929x  0.907x  1.009x  0.921x  0.868x
       w_mask 420 w4:       0.969x  0.968x  0.951x  0.962x  0.995x  0.976x  0.958x
       w_mask 420 w8:       0.979x  0.935x  0.936x  0.935x  0.996x  0.948x  0.959x
       blend      w4:       0.721x  0.841x  0.764x  0.822x  0.772x  0.826x  0.883x
       blend      w8:       0.692x  0.733x  0.686x  0.730x  0.828x  0.723x  0.762x
       blend    h w2:       0.738x  0.776x  0.746x  0.775x  0.683x  0.827x  0.851x
       blend    h w4:       0.858x  0.942x  0.880x  0.933x  0.784x  0.924x  0.965x
       blend    h w8:       0.804x  0.807x  0.806x  0.805x  0.814x  0.810x  0.748x
       blend    v w2:       0.898x  0.931x  0.903x  0.949x  0.784x  0.867x  0.875x
       blend    v w4:       0.935x  0.905x  0.933x  0.922x  0.763x  0.777x  0.807x
       blend    v w8:       0.803x  0.802x  0.804x  0.815x  0.674x  0.677x  0.678x
      
      16bpc neon               V2      V1      X3      X1    A715     A78     A76
       avg        w4:       0.899x  0.967x  0.897x  0.948x  1.002x  0.901x  0.884x
       w_avg      w4:       0.952x  0.951x  0.936x  0.946x  0.997x  0.937x  0.925x
       mask       w4:       0.893x  0.958x  0.887x  0.948x  1.003x  0.938x  0.934x
       w_mask 420 w4:       0.933x  0.932x  0.932x  0.939x  1.000x  0.910x  0.955x
       w_mask 420 w8:       0.966x  0.962x  0.967x  0.961x  1.000x  0.990x  1.010x
       blend      w4:       0.367x  0.361x  0.370x  0.352x  0.418x  0.394x  0.476x
       blend    h w2:       0.365x  0.445x  0.369x  0.437x  0.416x  0.576x  0.699x
       blend    h w4:       0.343x  0.402x  0.342x  0.398x  0.418x  0.525x  0.603x
       blend    v w2:       0.464x  0.460x  0.460x  0.447x  0.494x  0.446x  0.503x
       blend    v w4:       0.432x  0.424x  0.437x  0.416x  0.433x  0.427x  0.534x
       blend    v w8:       0.936x  0.847x  0.949x  0.848x  1.007x  0.811x  0.785x
      
      bilinear 8bpc neon       V2      V1      X3      X1    A715     A78     A76
       mct     w4  0:       0.982x  0.983x  0.955x  1.029x  0.784x  0.817x  0.814x
       mc      w2  h:       0.277x  0.333x  0.275x  0.325x  0.299x  0.435x  0.518x
       mct     w4  h:       0.835x  0.862x  0.814x  0.887x  1.074x  0.899x  0.884x
       mc      w2  v:       0.887x  0.966x  0.894x  0.945x  0.808x  0.953x  0.997x
       mc      w4  v:       0.762x  0.899x  0.766x  0.867x  0.695x  0.915x  1.017x
       mct     w4  v:       0.700x  0.812x  0.740x  0.777x  0.777x  0.824x  0.853x
       mc      w2 hv:       0.928x  0.985x  0.929x  0.978x  0.789x  0.969x  1.010x
       mct     w4 hv:       0.887x  0.913x  0.912x  0.920x  1.001x  0.922x  0.937x
      
      bilinear 16bpc neon      V2      V1      X3      X1    A715     A78     A76
       mc      w2  0:       0.991x  1.032x  0.993x  0.970x  0.878x  0.925x  0.999x
       mct     w4  0:       0.811x  0.730x  0.797x  0.680x  0.808x  0.711x  0.805x
       mc      w4  h:       0.885x  0.901x  0.895x  0.905x  1.003x  0.909x  0.910x
       mct     w4  h:       0.902x  0.914x  0.898x  0.896x  1.000x  0.897x  0.934x
       mc      w2  v:       0.888x  0.966x  0.913x  0.955x  0.824x  0.958x  1.005x
       mc      w4  v:       0.897x  0.894x  0.903x  0.902x  1.001x  0.895x  0.895x
       mct     w4  v:       0.924x  0.908x  0.921x  0.901x  1.001x  0.904x  0.918x
       mc      w4 hv:       0.927x  0.925x  0.924x  0.933x  1.000x  0.936x  0.959x
       mct     w4 hv:       0.923x  0.944x  0.923x  0.944x  0.999x  0.931x  0.956x
      
      8tap 8bpc neon           V2      V1      X3      X1    A715     A78     A76
       mct regular w4  0:   0.829x  0.854x  0.735x  0.861x  0.769x  0.766x  0.840x
       mc  regular w2  h:   0.984x  1.008x  0.983x  1.012x  0.986x  0.989x  0.995x
       mc  sharp   w2  h:   0.987x  1.008x  0.986x  1.011x  0.985x  0.989x  0.995x
       mc  regular w4  h:   0.907x  0.911x  0.916x  0.908x  0.997x  0.936x  0.932x
       mc  sharp   w4  h:   0.916x  0.914x  0.918x  0.913x  0.999x  0.939x  0.905x
       mct regular w4  h:   0.992x  0.979x  0.993x  0.971x  1.000x  0.986x  0.976x
       mct sharp   w4  h:   0.991x  0.979x  0.989x  0.984x  1.001x  0.979x  0.983x
       mc  regular w2  v:   1.002x  1.001x  1.005x  1.000x  1.000x  0.998x  0.983x
       mc  sharp   w2  v:   1.005x  1.001x  1.009x  0.998x  0.994x  0.997x  0.989x
       mc  regular w4  v:   0.985x  0.998x  0.991x  0.998x  1.000x  1.000x  0.983x
       mc  sharp   w4  v:   1.005x  1.002x  1.006x  1.002x  0.998x  0.991x  0.999x
       mct regular w4  v:   0.966x  0.967x  0.961x  0.974x  0.996x  0.954x  0.982x
       mct sharp   w4  v:   0.970x  0.944x  0.967x  0.944x  0.997x  0.951x  0.966x
       mc  regular w2 hv:   0.993x  0.993x  0.994x  0.987x  0.993x  0.985x  0.999x
       mc  sharp   w2 hv:   0.994x  0.996x  0.992x  0.998x  0.997x  0.999x  0.999x
       mc  regular w4 hv:   0.964x  0.958x  0.964x  0.960x  0.982x  0.938x  0.958x
       mc  sharp   w4 hv:   0.982x  0.981x  0.980x  0.982x  0.995x  0.986x  0.941x
       mct regular w4 hv:   0.993x  0.994x  0.992x  0.994x  0.996x  0.992x  0.988x
       mct sharp   w4 hv:   0.993x  0.996x  0.991x  0.996x  0.954x  0.992x  1.011x
      
      8tap 16bpc neon          V2      V1      X3      X1    A715     A78     A76
       mc  regular w2  0:   0.869x  1.059x  0.874x  0.956x  0.883x  0.932x  1.000x
       mct regular w4  0:   0.348x  0.369x  0.354x  0.377x  0.560x  0.409x  0.648x
       mc  regular w2  h:   0.996x  0.988x  0.992x  0.985x  0.989x  0.991x  1.006x
       mc  sharp   w2  h:   0.996x  0.989x  0.979x  0.991x  0.987x  0.988x  0.997x
       mc  regular w4  h:   0.957x  0.937x  0.957x  0.948x  0.961x  0.927x  0.994x
       mc  sharp   w4  h:   0.966x  0.940x  0.962x  0.954x  0.985x  0.929x  0.970x
       mct regular w4  h:   0.922x  0.942x  0.932x  0.933x  1.007x  0.938x  0.905x
       mct sharp   w4  h:   0.919x  0.943x  0.919x  0.931x  0.971x  0.943x  0.929x
       mc  regular w2  v:   1.000x  0.997x  1.001x  1.003x  1.001x  0.999x  0.984x
       mc  sharp   w2  v:   1.000x  0.999x  1.000x  0.999x  1.000x  1.000x  0.993x
       mc  regular w4  v:   0.936x  0.941x  0.936x  0.939x  0.999x  0.928x  0.981x
       mc  sharp   w4  v:   0.955x  0.961x  0.949x  0.956x  0.999x  0.947x  0.953x
       mct regular w4  v:   0.977x  0.966x  0.979x  0.968x  0.990x  0.972x  0.972x
       mct sharp   w4  v:   0.973x  0.965x  0.981x  0.963x  0.994x  0.977x  0.974x
       mc  regular w2 hv:   0.995x  1.001x  0.995x  0.995x  0.995x  1.000x  0.981x
       mc  sharp   w2 hv:   0.993x  1.012x  0.993x  0.988x  0.996x  0.992x  1.008x
       mc  regular w4 hv:   0.938x  0.943x  0.939x  0.943x  0.986x  0.943x  0.997x
       mc  sharp   w4 hv:   0.969x  0.959x  0.970x  0.974x  0.986x  0.993x  0.997x
       mct regular w4 hv:   0.942x  0.970x  0.951x  0.960x  0.977x  0.958x  1.018x
       mct sharp   w4 hv:   0.923x  0.958x  0.934x  0.955x  0.973x  0.946x  0.986x
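
The tables above give runtime after the patch relative to runtime before it, so smaller values are better. A small sketch, with hypothetical helper names, that converts one of those entries into the equivalent speedup factor and percentage of time saved:

```python
def speedup(relative_runtime):
    """Convert a relative runtime (new/old) into a speedup factor (old/new)."""
    if relative_runtime <= 0:
        raise ValueError("relative runtime must be positive")
    return 1.0 / relative_runtime


def percent_time_saved(relative_runtime):
    """Percentage of the original runtime that the patch removes."""
    return (1.0 - relative_runtime) * 100.0


# Example entry: 16bpc blend w4 on the V2 column, listed as 0.367x.
entry = 0.367
print(f"{speedup(entry):.2f}x faster, {percent_time_saved(entry):.1f}% less time")
```

Entries at or near 1.000x, such as most of the A715 column in the 16bpc tables, correspond to no measurable change.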
    • AArch64: Optimize Armv8.0 Neon path of SBD H/HV 6-tap filters · a992a9be
      Arpad Panyik authored and Martin Storsjö committed
      The 6-tap horizontal filters and the horizontal parts of the 6-tap HV
      subpel filters can be further improved with some pointer arithmetic
      and by saving some instructions (EXTs) in their data rearrangement code.
      
      Relative runtime of micro benchmarks after this patch on Cortex CPU
      cores:
      
      SBD mct h         X1     A78     A76     A72     A55
       regular  w8:  0.878x  0.894x  0.990x  0.923x  0.944x
       regular w16:  0.962x  0.931x  0.943x  0.949x  0.949x
       regular w32:  0.937x  0.937x  0.972x  0.938x  0.947x
       regular w64:  0.920x  0.965x  0.992x  0.936x  0.944x
      
      SBD mct hv        X1     A78     A76     A72     A55
       regular  w8:  0.931x  0.970x  0.951x  0.950x  0.971x
       regular w16:  0.940x  0.971x  0.941x  0.952x  0.967x
       regular w32:  0.943x  0.972x  0.946x  0.961x  0.974x
       regular w64:  0.943x  0.973x  0.952x  0.944x  0.975x
    • AArch64: Optimize Armv8.0 Neon path of HBD HV 6-tap filters · 2d808de1
      Arpad Panyik authored and Martin Storsjö committed
      The horizontal parts of the 6-tap HV subpel filters can be further
      improved with some pointer arithmetic and by saving some instructions
      (EXTs) in their data rearrangement code.
      
      Relative runtime of micro benchmarks after this patch on Cortex CPU
      cores:
      
      HBD mct hv        X1     A78     A76     A72     A55
       regular  w8:  0.952x  0.989x  0.924x  0.973x  0.976x
       regular w16:  0.961x  0.993x  0.928x  0.952x  0.971x
       regular w32:  0.964x  0.996x  0.930x  0.973x  0.972x
       regular w64:  0.963x  0.997x  0.930x  0.969x  0.974x
    • AArch64: Optimize Armv8.0 Neon path of HBD horizontal 6-tap filters · 93339ce8
      Arpad Panyik authored and Martin Storsjö committed
      The 6-tap horizontal subpel filters can be further improved with some
      pointer arithmetic and by saving some instructions (EXTs) in their
      data rearrangement code.
      
      Relative runtime of micro benchmarks after this patch on some Cortex
      CPU cores:
      
      regular:     X1      A78      A76      A55
       mc  w8:  0.915x   0.937x   0.900x   0.982x
       mc w16:  0.917x   0.947x   0.911x   0.971x
       mc w32:  0.914x   0.938x   0.873x   0.961x
       mc w64:  0.918x   0.932x   0.882x   0.964x
    • AArch64: Optimize Armv8.0 Neon path of HBD horizontal filters · 109b2427
      Arpad Panyik authored and Martin Storsjö committed
      The reduction parts of the horizontal HBD MC filters use SRSHL+SQXTUN+
      SRSHL instruction sequences. In the horizontal case this can be
      rewritten using a single SQSHRUN instruction with an additional
      rounding value (34 for 10-bit and 40 for 12-bit).
      
      Relative runtime of micro benchmarks after this patch on some Cortex
      CPU cores:
      
      regular:     X1      A78      A76      A55
       mc  w2:  0.847x   0.864x   0.822x   0.859x
       mc  w4:  0.889x   0.994x   0.868x   0.917x
       mc  w8:  0.857x   0.911x   0.915x   0.978x
       mc w16:  0.890x   0.982x   0.868x   0.974x
       mc w32:  0.904x   0.991x   0.873x   0.967x
       mc w64:  0.919x   1.003x   0.860x   0.970x
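
The rounding values 34 and 40 quoted above can be checked with integer arithmetic. The sketch below models the three-instruction sequence as two rounding right shifts with an unsigned clamp in between, and the replacement as a single biased shift by 6. The split of that shift into 2+4 for 10-bit and 4+2 for 12-bit is an assumption inferred from the quoted constants, not something the commit message states. The upper saturation bound is also left out, on the assumption that in-range filter sums never reach it.

```python
def rounding_shift(x, shift):
    """Arithmetic right shift that rounds half up (SRSHL with a negative shift)."""
    return (x + (1 << (shift - 1))) >> shift


def two_step(x, s1, s2):
    """Model of SRSHL + SQXTUN + SRSHL."""
    y = rounding_shift(x, s1)     # first rounding shift
    y = max(y, 0)                 # unsigned saturation, lower bound only
    return rounding_shift(y, s2)  # second rounding shift


def one_step(x, bias, shift=6):
    """Model of adding a rounding bias, then SQSHRUN (truncating shift + clamp)."""
    return max((x + bias) >> shift, 0)


# bitdepth -> (s1, s2, bias); the (s1, s2) split is assumed, see the lead-in.
CASES = {10: (2, 4, 34), 12: (4, 2, 40)}


def check(lo=-(1 << 16), hi=1 << 18):
    for s1, s2, bias in CASES.values():
        # The combined bias is the first rounding term plus the second one
        # scaled up by the first shift.
        assert bias == (1 << (s1 - 1)) + (1 << (s2 - 1 + s1))
        for x in range(lo, hi):
            assert two_step(x, s1, s2) == one_step(x, bias)
    return True


print(check())
```

The identity behind it is that nested floor divisions collapse into one: shifting `x + a` right by `s1`, adding `b`, then shifting right by `s2` equals shifting `x + a + (b << s1)` right by `s1 + s2`.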
  8. Sep 05, 2024
  9. Sep 04, 2024
  10. Sep 01, 2024
  11. Aug 30, 2024
  12. Aug 29, 2024