  1. Aug 24, 2024
    • aarch64: Fix a label typo · 27491dd9
      Martin Storsjö authored
      Apparently, this case isn't actually ever executed, at least in most
      checkasm runs, but some tools could complain about the relocation
      against 160b, which pointed elsewhere than intended.
  2. Aug 23, 2024
  3. Aug 22, 2024
    • AArch64: SVE MS armasm64 fix of HBD subpel filters · 472b31f8
      Arpad Panyik authored and Martin Storsjö committed
      MS armasm64 cannot compile some SVE instructions with immediate
      operands, e.g.:
        sub  z0.h, z0.h, #8192
      
      The proper form is:
        sub  z0.h, z0.h, #32, lsl #8
      
      This patch contains the needed fixes.
    • aarch64: mc16: Optimize the BTI landing pads in put/prep_neon · 3329f8d1
      Martin Storsjö authored
      Don't include the BTI landing pad instruction in the loops.
      
      If built with BTI enabled, AARCH64_VALID_JUMP_TARGET expands to
      a no-op instruction that indicates that indirect jumps can land
      there. But there's no need for the loops to include that instruction.
    • AArch64: Add HBD subpel filters using 128-bit SVE2 · 01558f3f
      Arpad Panyik authored and Martin Storsjö committed
      Add an Armv9.0-A SVE2 code path for high bitdepth convolutions. Only
      2D convolutions have 6-tap specialisations of their vertical passes.
      All other convolutions are 4- or 8-tap filters which fit well with
      the 4-element 16-bit SDOT instruction of SVE2.
      
      This patch renames HBD prep/put_neon to prep/put_16bpc_neon and
      exports put_16bpc_neon.
      
      Benchmarks show up to a 17% FPS increase depending on the input
      video and the CPU used.

      This patch increases the .text size by around 8 KiB.
      
      Relative performance to the C reference on some Cortex-A/X CPUs:
      
          regular     A715    A720      X3      X4    A510    A520
       w4 hv neon:    3.93x   4.10x   5.21x   5.17x   3.57x   5.27x
       w4 hv sve2:    4.99x   5.14x   6.00x   6.05x   4.33x   3.99x
       w8 hv neon:    1.72x   1.67x   1.98x   2.18x   2.95x   2.94x
       w8 hv sve2:    2.12x   2.29x   2.52x   2.62x   2.60x   2.60x
      w16 hv neon:    1.59x   1.53x   1.83x   1.89x   2.35x   2.24x
      w16 hv sve2:    1.94x   2.12x   2.33x   2.18x   2.06x   2.06x
      w32 hv neon:    1.49x   1.50x   1.66x   1.76x   2.10x   2.16x
      w32 hv sve2:    1.81x   2.09x   2.11x   2.09x   1.84x   1.87x
      w64 hv neon:    1.52x   1.50x   1.55x   1.71x   1.95x   2.05x
      w64 hv sve2:    1.84x   2.08x   1.97x   1.98x   1.74x   1.77x
      
       w4 h neon:     5.35x   5.47x   7.39x   5.78x   3.92x   5.19x
       w4 h sve2:     7.91x   8.35x  11.95x  10.33x   5.81x   5.42x
       w8 h neon:     4.49x   4.43x   6.50x   4.87x   7.18x   6.17x
       w8 h sve2:     6.09x   6.22x   9.59x   7.70x   7.89x   6.83x
      w16 h neon:     2.53x   2.52x   2.34x   1.86x   2.71x   2.75x
      w16 h sve2:     3.41x   3.47x   3.53x   3.25x   2.89x   2.96x
      w32 h neon:     2.07x   2.08x   1.97x   1.56x   2.17x   2.21x
      w32 h sve2:     2.76x   2.84x   2.94x   2.75x   2.24x   2.29x
      w64 h neon:     1.86x   1.86x   1.76x   1.41x   1.87x   1.88x
      w64 h sve2:     2.47x   2.54x   2.65x   2.46x   1.94x   1.94x
      
       w4 v neon:     5.22x   5.17x   6.36x   5.60x   4.23x   7.30x
       w4 v sve2:     5.86x   5.90x   7.81x   7.16x   4.86x   4.15x
       w8 v neon:     4.83x   4.79x   6.96x   6.45x   4.74x   8.40x
       w8 v sve2:     5.25x   5.23x   7.76x   6.79x   4.84x   4.13x
      w16 v neon:     2.59x   2.60x   2.93x   2.47x   1.80x   4.16x
      w16 v sve2:     2.85x   2.88x   3.36x   2.73x   1.86x   2.00x
      w32 v neon:     2.12x   2.13x   2.33x   2.03x   1.34x   3.11x
      w32 v sve2:     2.36x   2.40x   2.73x   2.32x   1.41x   1.48x
      w64 v neon:     1.94x   1.92x   2.02x   1.78x   1.12x   2.59x
      w64 v sve2:     2.16x   2.15x   2.37x   2.03x   1.17x   1.22x
      
       w4 0 neon:     1.75x   1.71x   1.44x   1.56x   3.18x   2.87x
       w4 0 sve2:     4.28x   4.39x   5.72x   6.42x   5.50x   4.68x
       w8 0 neon:     3.05x   3.04x   4.44x   4.64x   3.84x   3.52x
       w8 0 sve2:     3.85x   3.80x   5.45x   6.01x   4.92x   4.26x
      w16 0 neon:     2.92x   2.93x   3.82x   3.23x   4.58x   4.44x
      w16 0 sve2:     4.29x   4.27x   4.25x   4.15x   5.58x   5.29x
      w32 0 neon:     2.73x   2.76x   3.50x   2.67x   4.44x   4.26x
      w32 0 sve2:     4.09x   4.10x   3.75x   3.39x   5.67x   5.22x
      w64 0 neon:     2.73x   2.70x   3.27x   3.14x   4.57x   4.68x
      w64 0 sve2:     4.06x   3.97x   3.54x   3.18x   6.36x   6.25x
      
            sharp     A715    A720      X3      X4    A510    A520
       w4 hv neon:    3.54x   3.64x   4.43x   4.45x   3.03x   4.72x
       w4 hv sve2:    4.30x   4.55x   5.38x   5.26x   4.04x   3.76x
       w8 hv neon:    1.30x   1.25x   1.51x   1.60x   2.44x   2.43x
       w8 hv sve2:    1.86x   2.06x   2.09x   2.18x   2.37x   2.39x
      w16 hv neon:    1.19x   1.16x   1.43x   1.36x   1.95x   1.98x
      w16 hv sve2:    1.68x   1.91x   1.94x   1.84x   1.89x   1.94x
      w32 hv neon:    1.13x   1.12x   1.30x   1.29x   1.75x   1.81x
      w32 hv sve2:    1.58x   1.84x   1.75x   1.74x   1.70x   1.76x
      w64 hv neon:    1.13x   1.13x   1.21x   1.25x   1.65x   1.69x
      w64 hv sve2:    1.57x   1.84x   1.62x   1.67x   1.62x   1.65x
      
       w4 h neon:     5.38x   5.49x   7.46x   5.74x   3.93x   5.23x
       w4 h sve2:     7.86x   8.37x  11.99x  10.38x   5.81x   5.40x
       w8 h neon:     3.46x   3.49x   5.36x   4.64x   6.40x   5.62x
       w8 h sve2:     5.95x   6.23x   9.61x   7.76x   7.86x   6.89x
      w16 h neon:     1.99x   1.97x   2.07x   1.91x   2.43x   2.51x
      w16 h sve2:     3.42x   3.46x   3.75x   3.23x   2.89x   2.98x
      w32 h neon:     1.67x   1.62x   1.66x   1.63x   1.95x   2.01x
      w32 h sve2:     2.86x   2.84x   2.94x   2.72x   2.21x   2.29x
      w64 h neon:     1.45x   1.45x   1.51x   1.48x   1.69x   1.70x
      w64 h sve2:     2.47x   2.54x   2.64x   2.46x   1.93x   1.95x
      
       w4 v neon:     4.07x   4.01x   5.15x   4.74x   3.38x   6.56x
       w4 v sve2:     5.88x   5.86x   7.81x   7.15x   4.85x   4.39x
       w8 v neon:     3.64x   3.59x   5.38x   4.92x   3.59x   7.23x
       w8 v sve2:     5.23x   5.19x   7.77x   6.66x   4.81x   4.13x
      w16 v neon:     1.93x   1.95x   2.25x   1.92x   1.35x   3.46x
      w16 v sve2:     2.85x   2.88x   3.36x   2.71x   1.86x   1.94x
      w32 v neon:     1.57x   1.58x   1.78x   1.60x   1.01x   2.67x
      w32 v sve2:     2.36x   2.39x   2.73x   2.35x   1.41x   1.50x
      w64 v neon:     1.44x   1.42x   1.54x   1.43x   0.85x   2.19x
      w64 v sve2:     2.17x   2.15x   2.37x   2.06x   1.18x   1.25x
  4. Aug 21, 2024
    • AArch64: Add USMMLA impl. for SBD 6-tap H/HV filters · 713c076d
      Arpad Panyik authored
      Add 6-tap variant of standard bit-depth horizontal subpel filters
      using the Armv8.6 I8MM USMMLA matrix multiply instruction. This patch
      also extends the HV filter with 6-tap horizontal pass using USMMLA.
      
      Benchmarks show up to a 6-7% FPS increase depending on the input
      video and the CPU used.

      This patch increases the .text size by around 1.2 KiB.
      
      Relative runtime of micro benchmarks after this patch on Neoverse
      and Cortex CPU cores:
      
      regular      V2      V1      X3    A720    A715    A520    A510
        w8 hv:  0.860x  0.895x  0.870x  0.896x  0.896x  0.938x  0.936x
       w16 hv:  0.829x  0.886x  0.865x  0.908x  0.906x  0.946x  0.944x
       w32 hv:  0.837x  0.883x  0.862x  0.914x  0.915x  0.953x  0.949x
       w64 hv:  0.840x  0.883x  0.862x  0.914x  0.914x  0.955x  0.952x
      
        w8 h:   0.746x  0.754x  0.747x  0.723x  0.724x  0.874x  0.866x
       w16 h:   0.749x  0.764x  0.745x  0.731x  0.731x  0.858x  0.852x
       w32 h:   0.739x  0.754x  0.738x  0.729x  0.729x  0.839x  0.837x
       w64 h:   0.736x  0.749x  0.733x  0.725x  0.726x  0.847x  0.836x
  5. Aug 12, 2024
  6. Aug 04, 2024
  7. Jun 26, 2024
    • AArch64: Move constants of DotProd subpel filters to .rodata · 2355eeb8
      Arpad Panyik authored
      The constants used for the subpel filters were placed in the .text
      section for simplicity and peak performance, but this does not work on
      systems with execute only .text sections (e.g.: OpenBSD).
      
      The performance cost of moving the constants to the .rodata section
      is small and mostly within the measurable noise.
  8. Jun 25, 2024
    • aarch64: Explicitly use the ldur instruction where relevant in mc_dotprod.S · 7fbcdc6d
      Martin Storsjö authored
      The ldr instruction can only handle immediate offsets that are a
      multiple of the element size; most assemblers implicitly produce
      the ldur instruction when an unaligned offset is provided.
      
      Older versions of MS armasm64, however, error out on this. Since
      MSVC 2022 17.8, armasm64 can implicitly produce ldur, but 2022 17.7
      and earlier require explicitly writing the instruction as ldur.
      
      Despite this, even older versions still fail to build the mc_dotprod.S
      sources, with errors like this:
      
          src\libdav1d.a.p\mc_dotprod.obj.asm(556) : error A2513: operand 2: Constant value out of range
              mov             x10, (((0*15-1)<<7)|(3*15-1))
      
      This happens on MSVC 2022 17.1 and older, while 17.2 and newer
      accept the negative value expression here.
      
      In practice, HAVE_DOTPROD doesn't get enabled by the Meson configure
      script at the moment, as it uses inline assembly to test for external
      assembler features.
    • Add Arm OpenBSD run-time CPU feature detection support · 431f4fb2
      Brad Smith authored and Martin Storsjö committed
      Add run-time CPU feature detection for DotProd and i8mm on AArch64.
  9. Jun 17, 2024
  10. Jun 10, 2024
  11. Jun 05, 2024
  12. May 27, 2024
  13. May 25, 2024
  14. May 20, 2024
  15. May 19, 2024
    • arm64: msac: Explicitly use the ldur instruction · 9469e184
      Martin Storsjö authored
      The ldr instruction can take an immediate offset which is a multiple
      of the loaded element size. If the ldr instruction is given an
      immediate offset which isn't a multiple of the element size,
      most assemblers implicitly generate a "ldur" instruction instead.
      
      Older versions of MS armasm64.exe don't do this, but instead error
      out with "error A2518: operand 2: Memory offset must be aligned".
      (Current versions don't do this but correctly generate "ldur"
      implicitly.)
      
      Switch this instruction to an explicit "ldur", like we do elsewhere,
      to fix building with these older tools.
  16. May 18, 2024
  17. May 14, 2024
    • ARM64: Various optimizations for symbol decode · 7f68f23c
      Kyle Siefring authored
      Changes stem from redesigning the reduction stage of the multisymbol
      decode function.
      * No longer use adapt4 for 5 possible symbol values
      * Specialize reduction for 4/8/16 decode functions
      * Modify control flow
      
      +------------------------+--------------+--------------+---------------+
      |                        |  Neoverse V1 |  Neoverse N1 |   Cortex A72  |
      |                        | (Graviton 3) | (Graviton 2) |  (Graviton 1) |
      +------------------------+-------+------+-------+------+-------+-------+
      |                        |  Old  |  New |  Old  |  New |  Old  |  New  |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_bool_neon       |  13.0 | 12.9 |  14.9 | 14.0 |  39.3 |  29.0 |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_bool_adapt_neon |  15.4 | 15.6 |  17.5 | 16.8 |  41.6 |  33.5 |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_bool_equi_neon  |  11.3 | 12.0 |  14.0 | 12.2 |  35.0 |  26.3 |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_hi_tok_c        |  73.7 | 57.8 |  73.4 | 60.5 | 130.1 | 103.9 |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_hi_tok_neon     |  63.3 | 48.2 |  65.2 | 51.2 | 119.0 | 105.3 |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_symbol_\        |  28.6 | 22.5 |  28.4 | 23.5 |  67.8 |  55.1 |
      | adapt4_neon            |       |      |       |      |       |       |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_symbol_\        |  29.5 | 26.6 |  29.0 | 28.8 |  76.6 |  74.0 |
      | adapt8_neon            |       |      |       |      |       |       |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_symbol_\        |  31.6 | 31.2 |  33.3 | 33.0 |  77.5 |  68.1 |
      | adapt16_neon           |       |      |       |      |       |       |
      +------------------------+-------+------+-------+------+-------+-------+
    • AArch64: Optimize prep_neon function · d835c6bf
      Arpad Panyik authored and Martin Storsjö committed
      Optimize the widening copy part of subpel filters (the prep_neon
      function). In this patch we combine widening shifts with widening
      multiplications in the inner loops to get maximum throughput.
      
      The change increases the .text size by 36 bytes.
      
      Relative performance of micro benchmarks (lower is better):
      
      Cortex-A55:
        mct_w4:   0.795x
        mct_w8:   0.913x
        mct_w16:  0.912x
        mct_w32:  0.838x
        mct_w64:  1.025x
        mct_w128: 1.002x
      
      Cortex-A510:
        mct_w4:   0.760x
        mct_w8:   0.636x
        mct_w16:  0.640x
        mct_w32:  0.854x
        mct_w64:  0.864x
        mct_w128: 0.995x
      
      Cortex-A72:
        mct_w4:   0.616x
        mct_w8:   0.854x
        mct_w16:  0.756x
        mct_w32:  1.052x
        mct_w64:  1.044x
        mct_w128: 0.702x
      
      Cortex-A76:
        mct_w4:   0.837x
        mct_w8:   0.797x
        mct_w16:  0.841x
        mct_w32:  0.804x
        mct_w64:  0.948x
        mct_w128: 0.904x
      
      Cortex-A78:
        mct_w16:  0.542x
        mct_w32:  0.725x
        mct_w64:  0.741x
        mct_w128: 0.745x
      
      Cortex-A715:
        mct_w16:  0.561x
        mct_w32:  0.720x
        mct_w64:  0.740x
        mct_w128: 0.748x
      
      Cortex-X1:
        mct_w32:  0.886x
        mct_w64:  0.882x
        mct_w128: 0.917x
      
      Cortex-X3:
        mct_w32:  0.835x
        mct_w64:  0.803x
        mct_w128: 0.808x
    • AArch64: Optimize jump table calculation of prep_neon · f0e779bc
      Arpad Panyik authored and Martin Storsjö committed
      Save a complex arithmetic instruction in the jump table address
      calculation of the prep_neon function.
    • AArch64: Optimize BTI landing pads of prep_neon · 1790e132
      Arpad Panyik authored and Martin Storsjö committed
      Move the BTI landing pads out of the inner loops of the prep_neon
      function. Only the width=4 and width=8 cases are affected.

      If BTI is enabled, moving the AARCH64_VALID_JUMP_TARGET out of the
      inner loops gives better execution speed on Cortex-A510 relative
      to the original (lower is better):
        w4: 0.969x
        w8: 0.722x
      
      Out-of-order cores are not affected.
  18. May 13, 2024
    • AArch64: Optimize put_neon function · 8141546d
      Arpad Panyik authored
      Optimize the copy part of subpel filters (the put_neon function).
      For small block sizes (<16), using general purpose registers is
      usually the fastest way to do the copy.
      
      Relative performance of micro benchmarks (lower is better):
      
      Cortex-A55:
        w2:   0.991x
        w4:   0.992x
        w8:   0.999x
        w16:  0.875x
        w32:  0.775x
        w64:  0.914x
        w128: 0.998x
      
      Cortex-A510:
        w2:   0.159x
        w4:   0.080x
        w8:   0.583x
        w16:  0.588x
        w32:  0.966x
        w64:  1.111x
        w128: 0.957x
      
      Cortex-A76:
        w2:   0.903x
        w4:   0.683x
        w8:   0.944x
        w16:  0.948x
        w32:  0.919x
        w64:  0.855x
        w128: 0.991x
      
      Cortex-A78:
        w32:  0.867x
        w64:  0.820x
        w128: 1.011x
      
      Cortex-A715:
        w32:  0.834x
        w64:  0.778x
        w128: 1.000x
      
      Cortex-X1:
        w32:  0.809x
        w64:  0.762x
        w128: 1.000x
      
      Cortex-X3:
        w32:  0.733x
        w64:  0.720x
        w128: 0.999x
    • AArch64: Optimize jump table calculation of put_neon · 645d1f9f
      Arpad Panyik authored
      Save a complex arithmetic instruction in the jump table address
      calculation of the put_neon function.
    • AArch64: Optimize BTI landing pads of put_neon · 83452c6e
      Arpad Panyik authored
      Move the BTI landing pads out of the inner loops of the put_neon
      function; the only exception is the width=16 case, where it is
      already outside of the loops.
      
      When BTI is enabled, the relative performance of omitting the
      AARCH64_VALID_JUMP_TARGET from the inner loops on Cortex-A510 (lower
      is better):
        w2:   0.981x
        w4:   0.991x
        w8:   0.612x
        w32:  0.687x
        w64:  0.813x
        w128: 0.892x
      
      Out-of-order CPUs are mostly unaffected.
    • checkasm: Avoid UB in setjmp() invocations · 471549f2
      Henrik Gramner authored
      Both POSIX and the C standard place several environmental limits
      on setjmp() invocations; essentially anything beyond comparing
      the return value against a constant as a simple branch condition
      is UB.
      
      We were previously performing a function call using the setjmp()
      return value as an argument, which is technically not allowed
      even though it happened to work correctly in practice.
      
      Some systems may loosen those restrictions and allow for more
      flexible usage, but we shouldn't be relying on that.
  19. May 12, 2024
    • AArch64: Optimize the init of DotProd+ 2D subpel filters · a6d57b11
      Arpad Panyik authored and Jean-Baptiste Kempf committed
      Remove some unnecessary vector register copies from the initial
      horizontal filter parts of the HV subpel filters. The performance
      improvements are larger for the smaller filter block sizes.

      The narrowing shifts at the end of the *filter8* functions were
      also rewritten, because the old form was only beneficial on the
      Cortex-A55 among the DotProd-capable CPU cores. On other
      out-of-order or newer CPUs the UZP1+SHRN instruction combination
      is better.
      
      Relative performance of micro benchmarks (lower is better):
      
      Cortex-A55:
        mct regular w4:  0.980x
        mct regular w8:  1.007x
        mct regular w16: 1.007x
      
        mct sharp w4:    0.983x
        mct sharp w8:    1.012x
        mct sharp w16:   1.005x
      
      Cortex-A510:
        mct regular w4:  0.935x
        mct regular w8:  0.984x
        mct regular w16: 0.986x
      
        mct sharp w4:    0.927x
        mct sharp w8:    0.983x
        mct sharp w16:   0.987x
      
      Cortex-A78:
        mct regular w4:  0.974x
        mct regular w8:  0.988x
        mct regular w16: 0.991x
      
        mct sharp w4:    0.971x
        mct sharp w8:    0.987x
        mct sharp w16:   0.979x
      
      Cortex-A715:
        mct regular w4:  0.958x
        mct regular w8:  0.993x
        mct regular w16: 0.998x
      
        mct sharp w4:    0.974x
        mct sharp w8:    0.991x
        mct sharp w16:   0.997x
      
      Cortex-X1:
        mct regular w4:  0.983x
        mct regular w8:  0.993x
        mct regular w16: 0.996x
      
        mct sharp w4:    0.974x
        mct sharp w8:    0.990x
        mct sharp w16:   0.995x
      
      Cortex-X3:
        mct regular w4:  0.953x
        mct regular w8:  0.993x
        mct regular w16: 0.997x
      
        mct sharp w4:    0.981x
        mct sharp w8:    0.993x
        mct sharp w16:   0.995x