  1. Mar 22, 2024
  2. Mar 21, 2024
  3. Mar 18, 2024
  4. Mar 15, 2024
  5. Mar 09, 2024
  6. Mar 08, 2024
  7. Mar 07, 2024
  8. Mar 05, 2024
    • AArch64: Specialise HBD Neon convolutions for 6-tap filters · 932b323c
      Arpad Panyik authored and Martin Storsjö committed
      The 8-tap sub-pel filters used for motion vector interpolation are:
      regular, smooth, sharp. The regular and smooth filter kernels are
      zero-padded, so they are effectively 6-tap filters (some of them are
      5-tap or even 4-tap).
      
      This patch specialises the high bit-depth versions of put_8tap_neon
      and prep_8tap_neon functions for 6-tap filters, avoiding a lot of
      redundant work to multiply by and add zero. Wherever sharp filtering
      is used, the 8-tap path is always selected.
      
      Benchmarks show a 0.5-10.8% FPS uplift, depending heavily on the
      input video source. Binary size increases by ~8.5 KiB.
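The zero-padding argument in the commit message above can be illustrated with a minimal scalar sketch in C. This is not the Neon implementation from the patch; the helper names `filter_8tap`, `filter_6tap` and `is_6tap` are hypothetical, and the kernel values used in the usage note are illustrative only, not copied from the codec's filter tables. The sketch assumes the taps are centred so that they cover `src[-3]` through `src[4]`.

```c
#include <stdint.h>

/* Hypothetical reference: full 8-tap filter around src[0].
 * Taps f[0..7] are applied to src[-3..4]. */
static int filter_8tap(const uint16_t *src, const int8_t f[8])
{
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += f[i] * src[i - 3];
    return sum;
}

/* Hypothetical 6-tap variant: skips the two outer taps.
 * Only equivalent to filter_8tap when f[0] == 0 and f[7] == 0. */
static int filter_6tap(const uint16_t *src, const int8_t f[8])
{
    int sum = 0;
    for (int i = 1; i < 7; i++)
        sum += f[i] * src[i - 3];
    return sum;
}

/* Returns nonzero when the kernel is zero-padded at both ends,
 * i.e. when the cheaper 6-tap path may be used. */
static int is_6tap(const int8_t f[8])
{
    return f[0] == 0 && f[7] == 0;
}
```

With an illustrative zero-padded kernel such as `{0, 2, -6, 36, 36, -6, 2, 0}`, both functions return the same sum while the 6-tap version performs two fewer multiply-accumulates per output sample. A kernel with nonzero outer taps, as the sharp filter has, fails `is_6tap` and must take the 8-tap path.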
    • AArch64: Optimize 6-tap SBD HV Neon convolution · b0a329d6
      Arpad Panyik authored and committed
      Optimize the 6-tap standard bit-depth horizontal-vertical combined
      convolution to avoid unnecessary reads and horizontal convolution
      steps at the beginning and end of the algorithm. This also saves some
      instructions in the final binary.
      
      Performance of this function increases by up to 5.5% depending on
      block size.
  9. Mar 04, 2024
    • aarch64: Check for assembler support for various aarch64 extensions · e1f80dec
      Martin Storsjö authored
      First check if the assembler supports the ".arch" directive, and
      what architecture levels are supported.
      
      In principle, we'd only need to check for support for ".arch armv8.2-a",
      since that's enough for enabling the i8mm and sve2 extensions.
      
      However, recent Clang versions (before version 17) weren't able to
      enable the dotprod and i8mm extensions via the ".arch_extension"
      directives, so check for support for armv8.4-a and armv8.6-a as well,
      which enable dotprod and i8mm implicitly.
      
      This allows assembling these instructions on most commonly available
      GCC and Clang based toolchains, while still allowing toggling support
      for the instruction sets on and off within the source files.
      
      Within assembly, we disable these extensions by default, so that
      instructions enabled within these extension sets can't be used
      by accident in unintended functions. Code meaning to use these
      extensions can be assembled like this:
      
          #if HAVE_SVE
          ENABLE_SVE
          // code
          DISABLE_SVE
          #endif
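The architecture-level probing described in the commit message above might be sketched as follows. This is a hedged illustration in C rather than the build-system logic of the actual patch: the `struct arch_probe` type and `pick_arch_level` function are hypothetical, and choosing the highest level the assembler accepts is an assumption made for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one assembler probe: whether a test file
 * containing ".arch <level>" assembled successfully. */
struct arch_probe {
    const char *level;  /* argument passed to the ".arch" directive */
    bool supported;     /* result of the probe */
};

/* Assumption for this sketch: probes are ordered from lowest to highest
 * level, and the highest supported level is the one to use. Returns NULL
 * when no level is supported. */
static const char *pick_arch_level(const struct arch_probe *probes, size_t n)
{
    const char *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (probes[i].supported)
            best = probes[i].level;
    return best;
}
```

For example, if the probes for `armv8.2-a` and `armv8.4-a` succeed and the probe for `armv8.6-a` fails, the function returns `"armv8.4-a"`. Per the commit message, that level enables dotprod implicitly even on a toolchain where the ".arch_extension" directive cannot.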
  10. Feb 29, 2024
    • checkasm: Add --list-cpuflags option · 85a10359
      Henrik Gramner authored
      Prints a list of cpuflags available for the current architecture.
      
      Flags which are supported on the current system will be printed in
      green, and flags which are unsupported in red with a ~ prefix.
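The output format described in the commit message above (green for supported flags, red with a ~ prefix for unsupported ones) could look roughly like the sketch below. The function name `format_cpuflag` is hypothetical, and the use of ANSI escape sequences is an assumption for illustration; the actual checkasm code may colour its output differently, for instance on terminals without ANSI support.

```c
#include <stdio.h>
#include <stddef.h>

/* Assumed ANSI colour sequences; not taken from the checkasm source. */
#define ANSI_GREEN "\033[32m"
#define ANSI_RED   "\033[31m"
#define ANSI_RESET "\033[0m"

/* Hypothetical helper: writes one flag name into buf, green if the flag
 * is supported on the current system, red with a '~' prefix otherwise.
 * Returns the value of snprintf. */
static int format_cpuflag(char *buf, size_t size,
                          const char *name, int supported)
{
    if (supported)
        return snprintf(buf, size, ANSI_GREEN "%s" ANSI_RESET, name);
    return snprintf(buf, size, ANSI_RED "~%s" ANSI_RESET, name);
}
```

The ~ prefix keeps unsupported flags distinguishable even when the output is piped to a file or viewed on a terminal that does not render colours.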
  11. Feb 28, 2024
  12. Feb 27, 2024
    • riscv64/itx: Add 16x16 8bpc eob test · b7963a73
      Nathan E. Egge authored
      Kendryte K230                                         Before          After
      
      inv_txfm_add_16x16_adst_adst_0_8bpc_rvv:          1804.9 (8.45x)  1374.3 (11.18x)
      inv_txfm_add_16x16_adst_adst_1_8bpc_rvv:          1805.2 (8.45x)  1374.3 (11.17x)
      inv_txfm_add_16x16_adst_dct_0_8bpc_rvv:           1626.6 (8.92x)  1185.8 (12.22x)
      inv_txfm_add_16x16_adst_dct_1_8bpc_rvv:           1626.5 (8.91x)  1185.9 (12.22x)
      inv_txfm_add_16x16_adst_flipadst_0_8bpc_rvv:      1824.2 (8.38x)  1372.1 (11.22x)
      inv_txfm_add_16x16_adst_flipadst_1_8bpc_rvv:      1824.2 (8.37x)  1372.2 (11.21x)
      inv_txfm_add_16x16_dct_adst_0_8bpc_rvv:           1627.3 (8.94x)  1283.5 (11.29x)
      inv_txfm_add_16x16_dct_adst_1_8bpc_rvv:           1627.2 (8.95x)  1283.2 (11.29x)
      inv_txfm_add_16x16_dct_dct_0_8bpc_rvv:            1449.3 (1.08x)  1095.2 ( 1.44x)
      inv_txfm_add_16x16_dct_dct_1_8bpc_rvv:            1449.1 (9.52x)  1095.1 (12.45x)
      inv_txfm_add_16x16_dct_flipadst_0_8bpc_rvv:       1643.0 (8.87x)  1283.5 (11.29x)
      inv_txfm_add_16x16_dct_flipadst_1_8bpc_rvv:       1643.3 (8.87x)  1283.3 (11.30x)
      inv_txfm_add_16x16_dct_identity_0_8bpc_rvv:       1155.4 (9.23x)   805.9 (13.17x)
      inv_txfm_add_16x16_dct_identity_1_8bpc_rvv:       1155.4 (9.24x)   805.9 (13.17x)
      inv_txfm_add_16x16_flipadst_adst_0_8bpc_rvv:      1812.2 (8.43x)  1370.9 (11.23x)
      inv_txfm_add_16x16_flipadst_adst_1_8bpc_rvv:      1811.7 (8.44x)  1370.8 (11.24x)
      inv_txfm_add_16x16_flipadst_dct_0_8bpc_rvv:       1637.2 (8.88x)  1190.8 (12.19x)
      inv_txfm_add_16x16_flipadst_dct_1_8bpc_rvv:       1637.6 (8.87x)  1190.9 (12.19x)
      inv_txfm_add_16x16_flipadst_flipadst_0_8bpc_rvv:  1831.1 (8.34x)  1374.7 (11.21x)
      inv_txfm_add_16x16_flipadst_flipadst_1_8bpc_rvv:  1830.8 (8.35x)  1374.5 (11.22x)
      inv_txfm_add_16x16_identity_dct_0_8bpc_rvv:       1156.2 (8.67x)   948.6 (10.49x)
      inv_txfm_add_16x16_identity_dct_1_8bpc_rvv:       1156.3 (8.68x)   948.6 (10.49x)
      inv_txfm_add_16x16_identity_identity_0_8bpc_rvv:   879.3 (7.81x)   673.5 (10.28x)
      inv_txfm_add_16x16_identity_identity_1_8bpc_rvv:   879.3 (7.81x)   673.5 (10.28x)
    • riscv64/itx: Add 8x16 8bpc eob test · 70122512
      Nathan E. Egge authored
      Kendryte K230                                        Before          After
      
      inv_txfm_add_8x16_adst_adst_0_8bpc_rvv:           853.9 ( 9.00x)  698.3 (11.03x)
      inv_txfm_add_8x16_adst_adst_1_8bpc_rvv:           853.8 ( 9.00x)  698.3 (11.03x)
      inv_txfm_add_8x16_adst_dct_0_8bpc_rvv:            763.0 ( 9.55x)  609.2 (12.00x)
      inv_txfm_add_8x16_adst_dct_1_8bpc_rvv:            763.1 ( 9.55x)  609.3 (11.94x)
      inv_txfm_add_8x16_adst_flipadst_0_8bpc_rvv:       857.1 ( 8.99x)  701.6 (11.00x)
      inv_txfm_add_8x16_adst_flipadst_1_8bpc_rvv:       856.8 ( 8.98x)  701.3 (10.97x)
      inv_txfm_add_8x16_adst_identity_0_8bpc_rvv:       622.9 ( 9.22x)  468.5 (12.36x)
      inv_txfm_add_8x16_adst_identity_1_8bpc_rvv:       622.9 ( 9.23x)  468.6 (12.37x)
      inv_txfm_add_8x16_dct_adst_0_8bpc_rvv:            770.1 ( 9.32x)  655.1 (10.93x)
      inv_txfm_add_8x16_dct_adst_1_8bpc_rvv:            770.1 ( 9.34x)  655.4 (10.93x)
      inv_txfm_add_8x16_dct_dct_0_8bpc_rvv:             679.8 ( 1.23x)  566.1 ( 1.48x)
      inv_txfm_add_8x16_dct_dct_1_8bpc_rvv:             679.8 ( 9.98x)  566.5 (11.89x)
      inv_txfm_add_8x16_dct_flipadst_0_8bpc_rvv:        771.1 ( 9.34x)  667.4 (10.75x)
      inv_txfm_add_8x16_dct_flipadst_1_8bpc_rvv:        771.1 ( 9.34x)  667.3 (10.76x)
      inv_txfm_add_8x16_dct_identity_0_8bpc_rvv:        532.3 ( 9.84x)  422.1 (12.42x)
      inv_txfm_add_8x16_dct_identity_1_8bpc_rvv:        532.4 ( 9.85x)  422.2 (12.40x)
      inv_txfm_add_8x16_flipadst_adst_0_8bpc_rvv:       858.4 ( 8.98x)  699.2 (11.03x)
      inv_txfm_add_8x16_flipadst_adst_1_8bpc_rvv:       858.5 ( 8.98x)  699.3 (11.03x)
      inv_txfm_add_8x16_flipadst_dct_0_8bpc_rvv:        768.6 ( 9.52x)  609.7 (11.97x)
      inv_txfm_add_8x16_flipadst_dct_1_8bpc_rvv:        768.4 ( 9.52x)  609.6 (11.97x)
      inv_txfm_add_8x16_flipadst_flipadst_0_8bpc_rvv:   866.5 ( 8.91x)  706.5 (10.92x)
      inv_txfm_add_8x16_flipadst_flipadst_1_8bpc_rvv:   866.4 ( 8.92x)  706.6 (10.95x)
      inv_txfm_add_8x16_flipadst_identity_0_8bpc_rvv:   621.9 ( 9.28x)  464.6 (12.46x)
      inv_txfm_add_8x16_flipadst_identity_1_8bpc_rvv:   621.8 ( 9.28x)  464.6 (12.46x)
      inv_txfm_add_8x16_identity_adst_0_8bpc_rvv:       584.9 ( 9.78x)  564.1 (10.12x)
      inv_txfm_add_8x16_identity_adst_1_8bpc_rvv:       584.8 ( 9.78x)  563.9 (10.12x)
      inv_txfm_add_8x16_identity_dct_0_8bpc_rvv:        495.0 (10.75x)  474.6 (11.13x)
      inv_txfm_add_8x16_identity_dct_1_8bpc_rvv:        494.3 (10.75x)  474.7 (11.12x)
      inv_txfm_add_8x16_identity_flipadst_0_8bpc_rvv:   588.1 ( 9.76x)  568.1 (10.07x)
      inv_txfm_add_8x16_identity_flipadst_1_8bpc_rvv:   588.7 ( 9.74x)  568.0 (10.07x)
      inv_txfm_add_8x16_identity_identity_0_8bpc_rvv:   349.5 (10.78x)  328.8 (11.46x)
      inv_txfm_add_8x16_identity_identity_1_8bpc_rvv:   349.4 (10.79x)  328.7 (11.46x)
    • riscv64/itx: Add 4x16 8bpc eob test · afeeb3cc
      Nathan E. Egge authored
      Kendryte K230                                        Before         After
      
      inv_txfm_add_4x16_adst_adst_0_8bpc_rvv:           429.9 (7.45x)  381.3 (8.58x)
      inv_txfm_add_4x16_adst_adst_1_8bpc_rvv:           430.0 (7.45x)  381.3 (8.57x)
      inv_txfm_add_4x16_adst_dct_0_8bpc_rvv:            381.0 (8.01x)  332.5 (9.19x)
      inv_txfm_add_4x16_adst_dct_1_8bpc_rvv:            381.0 (8.00x)  332.5 (9.19x)
      inv_txfm_add_4x16_adst_flipadst_0_8bpc_rvv:       432.8 (7.42x)  384.5 (8.52x)
      inv_txfm_add_4x16_adst_flipadst_1_8bpc_rvv:       432.8 (7.42x)  384.4 (8.52x)
      inv_txfm_add_4x16_adst_identity_0_8bpc_rvv:       304.6 (7.32x)  249.8 (9.18x)
      inv_txfm_add_4x16_adst_identity_1_8bpc_rvv:       304.5 (7.32x)  249.8 (9.18x)
      inv_txfm_add_4x16_dct_adst_0_8bpc_rvv:            407.2 (7.68x)  371.4 (8.57x)
      inv_txfm_add_4x16_dct_adst_1_8bpc_rvv:            407.1 (7.68x)  371.5 (8.58x)
      inv_txfm_add_4x16_dct_dct_0_8bpc_rvv:             357.9 (1.27x)  323.1 (1.41x)
      inv_txfm_add_4x16_dct_dct_1_8bpc_rvv:             357.9 (8.29x)  322.9 (9.16x)
      inv_txfm_add_4x16_dct_flipadst_0_8bpc_rvv:        410.0 (7.62x)  376.6 (8.45x)
      inv_txfm_add_4x16_dct_flipadst_1_8bpc_rvv:        410.0 (7.62x)  376.5 (8.47x)
      inv_txfm_add_4x16_dct_identity_0_8bpc_rvv:        275.2 (7.79x)  240.5 (9.21x)
      inv_txfm_add_4x16_dct_identity_1_8bpc_rvv:        275.3 (7.78x)  240.6 (9.19x)
      inv_txfm_add_4x16_flipadst_adst_0_8bpc_rvv:       430.5 (7.51x)  382.6 (8.60x)
      inv_txfm_add_4x16_flipadst_adst_1_8bpc_rvv:       430.1 (7.52x)  382.8 (8.60x)
      inv_txfm_add_4x16_flipadst_dct_0_8bpc_rvv:        381.1 (8.09x)  333.8 (9.21x)
      inv_txfm_add_4x16_flipadst_dct_1_8bpc_rvv:        381.0 (8.08x)  333.7 (9.21x)
      inv_txfm_add_4x16_flipadst_flipadst_0_8bpc_rvv:   433.0 (7.48x)  385.7 (8.55x)
      inv_txfm_add_4x16_flipadst_flipadst_1_8bpc_rvv:   433.0 (7.48x)  385.7 (8.55x)
      inv_txfm_add_4x16_flipadst_identity_0_8bpc_rvv:   298.6 (7.57x)  250.8 (9.28x)
      inv_txfm_add_4x16_flipadst_identity_1_8bpc_rvv:   298.6 (7.57x)  250.9 (9.27x)
      inv_txfm_add_4x16_identity_adst_0_8bpc_rvv:       361.5 (7.93x)  347.3 (8.35x)
      inv_txfm_add_4x16_identity_adst_1_8bpc_rvv:       361.4 (7.93x)  347.4 (8.35x)
      inv_txfm_add_4x16_identity_dct_0_8bpc_rvv:        310.9 (8.69x)  297.8 (9.02x)
      inv_txfm_add_4x16_identity_dct_1_8bpc_rvv:        311.0 (8.69x)  297.8 (9.02x)
      inv_txfm_add_4x16_identity_flipadst_0_8bpc_rvv:   364.1 (7.88x)  350.5 (8.29x)
      inv_txfm_add_4x16_identity_flipadst_1_8bpc_rvv:   364.2 (7.88x)  350.4 (8.31x)
      inv_txfm_add_4x16_identity_identity_0_8bpc_rvv:   229.7 (8.22x)  211.4 (9.11x)
      inv_txfm_add_4x16_identity_identity_1_8bpc_rvv:   229.7 (8.21x)  211.2 (9.12x)
  13. Feb 26, 2024
  14. Feb 22, 2024