  1. May 16, 2020
  2. May 15, 2020
  3. May 14, 2020
  4. May 13, 2020
  5. May 12, 2020
  6. May 11, 2020
  7. May 10, 2020
    • x86: Use 'test' instead of 'or' to compare with zero · 4d97f5a9
      Henrik Gramner authored and committed
      Allows for macro-op fusion.
    • x86: Unconditionally compile msac_init.c · 28d33357
      Henrik Gramner authored and committed
      Eliminates the x86-64 check from the meson configuration file to be
      consistent with how other x86-64-exclusive code is handled.
    • x86-64: Do msac refill before calling dav1d_msac_init_x86() · 6a6c3528
      Henrik Gramner authored and committed
      Allows for constant propagation and tail call elimination in the
      msac initialization, which is performed in each tile.
    • msac: Avoid attempting to refill after eob has already been reached · 631d7720
      Henrik Gramner authored and committed
      Utilize the unsigned representation of a signed integer to skip
      the refill code if the count was already negative to begin with,
      which saves a few clock cycles at the end of each tile.
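The unsigned comparison described above can be sketched in C as follows. This is a minimal illustration, not the actual dav1d code: the SketchCtx struct, its fields, and the sketch_refill, consume_signed, and consume_unsigned helpers are all hypothetical names invented for the example.

```c
#include <assert.h>

/* Hypothetical, simplified decoder state (not dav1d's MsacContext). */
typedef struct {
    int cnt;         /* bits available; stays negative once input is exhausted */
    int bytes_left;  /* unread input bytes */
    int refills;     /* number of refill calls, kept only for illustration */
} SketchCtx;

/* Hypothetical refill: loads whole bytes while any input remains. */
static void sketch_refill(SketchCtx *s) {
    s->refills++;
    while (s->cnt < 0 && s->bytes_left > 0) {
        s->cnt += 8;
        s->bytes_left--;
    }
}

/* Signed check: refills whenever the count is negative after consuming
 * d bits, even if it was already negative and nothing is left to read. */
static void consume_signed(SketchCtx *s, int d) {
    assert(d >= 0);
    s->cnt -= d;
    if (s->cnt < 0)
        sketch_refill(s);
}

/* Unsigned check: a negative cnt converts to a very large unsigned value,
 * so the comparison is false and the pointless refill is skipped. For a
 * non-negative cnt the test is equivalent to (cnt - d < 0). */
static void consume_unsigned(SketchCtx *s, int d) {
    assert(d >= 0);
    const int cnt = s->cnt;
    s->cnt = cnt - d;
    if ((unsigned)cnt < (unsigned)d)
        sketch_refill(s);
}
```

With a count that is already negative, consume_signed still calls the refill helper, while consume_unsigned does not. Both behave the same when the count only drops below zero as a result of the current subtraction.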
    • arm64: itx: Add NEON implementation of itx for 10 bpc · eaedb95d
      Martin Storsjö authored
      Add an element size specifier to the existing individual transform
      functions for 8 bpc, naming them e.g. inv_dct_8h_x8_neon, to clarify
      that they operate on input vectors of 8h, and make the symbols
      public, to let the 10 bpc case call them from a different object file.
      The same convention is used in the new itx16.S, like inv_dct_4s_x8_neon.
      
      Compile the existing itx.S regardless of whether 8 bpc support is
      enabled. For builds with 8 bpc support disabled, this does include
      the unused frontend functions, but that is hopefully a tolerable
      cost for not having to split the file into a shareable file for
      transforms and a separate one for frontends.
      
      This only implements the 10 bpc case, as that case can use transforms
      operating on 16 bit coefficients in the second pass.
      
      Relative speedup vs C for a few functions:
      
                                           Cortex A53    A72    A73
      inv_txfm_add_4x4_dct_dct_0_10bpc_neon:     4.14   4.06   4.49
      inv_txfm_add_4x4_dct_dct_1_10bpc_neon:     6.51   6.49   6.42
      inv_txfm_add_8x8_dct_dct_0_10bpc_neon:     5.02   4.63   6.23
      inv_txfm_add_8x8_dct_dct_1_10bpc_neon:     8.54   7.13  11.96
      inv_txfm_add_16x16_dct_dct_0_10bpc_neon:   5.52   6.60   8.03
      inv_txfm_add_16x16_dct_dct_1_10bpc_neon:  11.27   9.62  12.22
      inv_txfm_add_16x16_dct_dct_2_10bpc_neon:   9.60   6.97   8.59
      inv_txfm_add_32x32_dct_dct_0_10bpc_neon:   2.60   3.48   3.19
      inv_txfm_add_32x32_dct_dct_1_10bpc_neon:  14.65  12.64  16.86
      inv_txfm_add_32x32_dct_dct_2_10bpc_neon:  11.57   8.80  12.68
      inv_txfm_add_32x32_dct_dct_3_10bpc_neon:   8.79   8.00   9.21
      inv_txfm_add_32x32_dct_dct_4_10bpc_neon:   7.58   6.21   7.80
      inv_txfm_add_64x64_dct_dct_0_10bpc_neon:   2.41   2.85   2.75
      inv_txfm_add_64x64_dct_dct_1_10bpc_neon:  12.91  10.27  12.24
      inv_txfm_add_64x64_dct_dct_2_10bpc_neon:  10.96   7.97  10.31
      inv_txfm_add_64x64_dct_dct_3_10bpc_neon:   8.95   7.42   9.55
      inv_txfm_add_64x64_dct_dct_4_10bpc_neon:   7.97   6.12   7.82
    • arm: Mark global symbols hidden · ff3054fe
      Martin Storsjö authored
      This matches what is done in C by -fvisibility=hidden.
      
      This avoids issues with relocations against other symbols exported
      from another assembly file.
    • arm64: itx: Prepare for other bitdepths · d4002c88
      Martin Storsjö authored
    • arm64: itx: Fix the eob checking for dct_dct_64x16 · a6711a5c
      Martin Storsjö authored
      Before this, we never did the early exit from the first pass.
      
      Before:                               Cortex A53      A72      A73
      inv_txfm_add_64x16_dct_dct_1_8bpc_neon:   7275.7   5198.3   5250.9
      inv_txfm_add_64x16_dct_dct_2_8bpc_neon:   7276.1   5197.0   5251.3
      inv_txfm_add_64x16_dct_dct_3_8bpc_neon:   7275.8   5196.2   5254.5
      inv_txfm_add_64x16_dct_dct_4_8bpc_neon:   7273.6   5198.8   5254.2
      After:
      inv_txfm_add_64x16_dct_dct_1_8bpc_neon:   5187.8   3763.8   3735.0
      inv_txfm_add_64x16_dct_dct_2_8bpc_neon:   7280.6   5185.6   5256.3
      inv_txfm_add_64x16_dct_dct_3_8bpc_neon:   7270.7   5179.8   5250.3
      inv_txfm_add_64x16_dct_dct_4_8bpc_neon:   7271.7   5212.4   5256.4
      
      The other related variants didn't have this bug and properly exited
      early when possible.
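The kind of early exit this fix restores can be sketched in C as follows. This is only an illustration under stated assumptions: the first_pass_slices function, the slice layout, and the threshold values are hypothetical and are not taken from the assembly.

```c
#include <assert.h>

/* Hypothetical first-pass driver. thresholds[i] is an assumed minimum eob
 * at which slice i + 1 can still contain nonzero coefficients. Returns the
 * number of slices that were actually transformed. */
static int first_pass_slices(int eob, const int *thresholds, int n_slices) {
    int done = 0;
    assert(n_slices > 0);
    for (int i = 0; i < n_slices; i++) {
        /* ... transform slice i here ... */
        done++;
        if (i + 1 < n_slices && eob < thresholds[i])
            break;  /* the remaining slices are all zero: skip them */
    }
    return done;
}
```

With two slices and a made-up threshold of 36, an eob of 1 transforms only the first slice, while a larger eob transforms both. That would be consistent with the timings above, where only the lowest eob variant became faster.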
    • arm64: itx: Simplify inv_txfm_horz_dct_32x8 · 39d6c599
      Martin Storsjö authored
      Unify some loads and stores, avoiding some extra pointer moving.
    • arm64: itx: Minor optimizations for the 8x32 functions · b6b1394b
      Martin Storsjö authored
      This gives a speedup of a couple of cycles.
    • arm64: itx: Cosmetic fix up · 208a2abd
      Martin Storsjö authored
    • arm64: itx: Remove an unused constant · 92669a3e
      Martin Storsjö authored
      This isn't used for a sqrdmulh in its current form here.
      
      The one left in idct_coeffs[1] isn't used within the idct itself,
      but inv_txfm_horz_scale_dct_32x8 relies on it being left there for
      use with sqrdmulh scaling later.
    • arm64: itx: Remove a todo comment about more special cased functions · b4f1c1c6
      Martin Storsjö authored
      These cases were removed from x86 to save space and simplify the code
      in e0b88bd2, as those cases were essentially unused in real-world
      bitstreams.