  1. Feb 11, 2020
  2. Feb 10, 2020
    • Jean-Baptiste Kempf's avatar
      e4208e85
    • Martin Storsjö's avatar
      arm64: mc: Reduce the width of a register copy · d4c5ad49
      Martin Storsjö authored and Janne Grunau committed
      Only copy as much as really is needed/used.
      d4c5ad49
    • Martin Storsjö's avatar
      arm64: mc: Use two regs for alternating output rows for w4/8 in avg/w_avg/mask · b1167ce1
      Martin Storsjö authored and Janne Grunau committed
      It was already done this way for w32/64. Not doing it for w16 as it
      didn't help there (and instead gave a small slowdown due to the two
      setup instructions).
      
      This gives a small speedup on in-order cores like A53.
      
      Before:         Cortex A53     A72     A73
      avg_w4_8bpc_neon:     60.9    25.6    29.0
      avg_w8_8bpc_neon:    143.6    52.8    64.0
      After:
      avg_w4_8bpc_neon:     56.7    26.7    28.5
      avg_w8_8bpc_neon:    137.2    54.5    64.4
      b1167ce1
    • Martin Storsjö's avatar
      arm64: mc: Simplify avg/w_avg/mask by always using the w16 macro · 0bad117e
      Martin Storsjö authored and Janne Grunau committed
      This shortens the source by 40 lines, and gives a significant
      speedup on A53, a small speedup on A72 and a very minor slowdown
      for avg/w_avg on A73.
      
      Before:           Cortex A53     A72     A73
      avg_w4_8bpc_neon:       67.4    26.1    25.4
      avg_w8_8bpc_neon:      158.7    56.3    59.1
      avg_w16_8bpc_neon:     382.9   154.1   160.7
      w_avg_w4_8bpc_neon:     99.9    43.6    39.4
      w_avg_w8_8bpc_neon:    253.2    98.3    99.0
      w_avg_w16_8bpc_neon:   543.1   285.0   301.8
      mask_w4_8bpc_neon:     110.6    51.4    45.1
      mask_w8_8bpc_neon:     295.0   129.9   114.0
      mask_w16_8bpc_neon:    654.6   365.8   369.7
      After:
      avg_w4_8bpc_neon:       60.8    26.3    29.0
      avg_w8_8bpc_neon:      142.8    52.9    64.1
      avg_w16_8bpc_neon:     378.2   153.4   160.8
      w_avg_w4_8bpc_neon:     78.7    41.0    40.9
      w_avg_w8_8bpc_neon:    190.6    90.1   105.1
      w_avg_w16_8bpc_neon:   531.1   279.3   301.4
      mask_w4_8bpc_neon:      86.6    47.2    44.9
      mask_w8_8bpc_neon:     222.0   114.3   114.9
      mask_w16_8bpc_neon:    639.5   356.0   369.8
      0bad117e
    • Jean-Baptiste Kempf's avatar
      Update NEWS for 0.6.0 · 2e68c1f3
      Jean-Baptiste Kempf authored
      2e68c1f3
  3. Feb 08, 2020
    • Martin Storsjö's avatar
      arm64: mc: NEON implementation of warp for 16 bpc · 8974c155
      Martin Storsjö authored
      Checkasm benchmark numbers:
                         Cortex A53     A72     A73
      warp_8x8_16bpc_neon:   2029.9  1150.5  1225.2
      warp_8x8t_16bpc_neon:  2007.6  1129.0  1192.3
      
      Corresponding numbers for 8bpc for comparison:
      
      warp_8x8_8bpc_neon:    1863.8  1052.8  1106.2
      warp_8x8t_8bpc_neon:   1847.4  1048.3  1099.8
      8974c155
  4. Feb 07, 2020
    • Martin Storsjö's avatar
      arm64: cdef: Add NEON implementations of CDEF for 16 bpc · e6cebeb7
      Martin Storsjö authored
      As some functions are made for both 8bpc and 16bpc from a shared
      template, those functions are moved to a separate assembly file
      which is included. That assembly file (cdef_tmpl.S) isn't intended
      to be assembled on its own (just like utils.S), but if it is
      assembled, it should produce an empty object file.
      
      Checkasm benchmarks:
                               Cortex A53     A72     A73
      cdef_dir_16bpc_neon:          422.7   305.5   314.0
      cdef_filter_4x4_16bpc_neon:   452.9   282.7   296.6
      cdef_filter_4x8_16bpc_neon:   800.9   515.3   534.1
      cdef_filter_8x8_16bpc_neon:  1417.1   922.7   942.6
      
      Corresponding numbers for 8bpc for comparison:
      
      cdef_dir_8bpc_neon:          394.7   268.8   281.8
      cdef_filter_4x4_8bpc_neon:   461.5   300.9   307.7
      cdef_filter_4x8_8bpc_neon:   831.6   546.1   555.6
      cdef_filter_8x8_8bpc_neon:  1454.6   934.0   960.0
      e6cebeb7
    • Martin Storsjö's avatar
      arm: cdef: Prepare for 16bpc · 1d5ef8df
      Martin Storsjö authored
      1d5ef8df
  5. Feb 06, 2020
    • Henrik Gramner's avatar
      19ce77e0
    • Henrik Gramner's avatar
      Reorder the Dav1dFrameHeader struct to fix alignment issues · 58a4ba07
      Henrik Gramner authored
      The ar_coeff_shift element needs to have a 16-byte alignment on x86.
      58a4ba07
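The effect of reordering on member alignment can be illustrated with a minimal C11 sketch. The struct and field names below are hypothetical and are not the actual Dav1dFrameHeader layout; the sketch only assumes that the struct itself starts on a 16-byte boundary and shows how moving a member changes its offset.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout before reordering: two 4-byte fields come first,
 * so `shift` lands at offset 8 and is not 16-byte aligned. */
typedef struct ExampleBefore {
    alignas(16) int32_t first; /* forces 16-byte alignment of the struct */
    int32_t second;
    uint64_t shift;            /* offset 8 */
} ExampleBefore;

/* Hypothetical layout after reordering: `shift` is moved to the front,
 * so it sits at offset 0 of a 16-byte aligned struct. */
typedef struct ExampleAfter {
    alignas(16) uint64_t shift; /* offset 0 */
    int32_t first;
    int32_t second;
} ExampleAfter;

/* Compile-time check that the reordered member is 16-byte aligned. */
_Static_assert(offsetof(ExampleAfter, shift) % 16 == 0,
               "shift must start on a 16-byte boundary");

/* Runtime helper: returns 1 if the pointer is 16-byte aligned. */
static int example_is_aligned16(const void *p) {
    return ((uintptr_t)p & 15) == 0;
}
```

A static assertion of this kind makes a later reordering that breaks the alignment fail at compile time instead of faulting in an aligned load.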
    • Martin Storsjö's avatar
      arm64: looprestoration: NEON implementation of wiener filter for 16 bpc · c89eb564
      Martin Storsjö authored
      Checkasm benchmarks:     Cortex A53       A72       A73
      wiener_chroma_16bpc_neon:  190288.4  129369.5  127284.1
      wiener_luma_16bpc_neon:    195108.4  129387.8  127042.7
      
      The corresponding numbers for 8 bpc for comparison:
      wiener_chroma_8bpc_neon:   150586.9  101647.1   97709.9
      wiener_luma_8bpc_neon:     146297.4  101593.2   97670.5
      c89eb564
    • Martin Storsjö's avatar
      arm: looprestoration: Fix the wiener C wrapper function for high bitdepths · c2a2e6ee
      Martin Storsjö authored
      Use HIGHBD_DECL_SUFFIX and HIGHBD_TAIL_SUFFIX where necessary, add
      a missing sizeof(pixel).
      c2a2e6ee
    • Martin Storsjö's avatar
      arm: looprestoration: Clarify a comment · 2653292c
      Martin Storsjö authored
      2653292c
    • Martin Storsjö's avatar
      arm64: mc: NEON implementation of put/prep 8tap/bilin for 16 bpc · fe44861b
      Martin Storsjö authored
      Examples of checkasm benchmarks:
                                         Cortex A53       A72       A73
      mc_8tap_regular_w8_0_16bpc_neon:         96.8      49.6      62.5
      mc_8tap_regular_w8_h_16bpc_neon:        570.3     388.0     467.2
      mc_8tap_regular_w8_hv_16bpc_neon:      1035.8     776.7     891.3
      mc_8tap_regular_w8_v_16bpc_neon:        400.6     285.0     278.3
      mc_bilinear_w8_0_16bpc_neon:             90.0      44.8      57.8
      mc_bilinear_w8_h_16bpc_neon:            191.2     158.7     156.4
      mc_bilinear_w8_hv_16bpc_neon:           295.9     234.6     244.9
      mc_bilinear_w8_v_16bpc_neon:            147.2      98.7      89.2
      mct_8tap_regular_w8_0_16bpc_neon:       139.4      78.7      84.9
      mct_8tap_regular_w8_h_16bpc_neon:       612.5     396.8     479.1
      mct_8tap_regular_w8_hv_16bpc_neon:     1112.4     814.6     963.2
      mct_8tap_regular_w8_v_16bpc_neon:       461.8     370.8     353.4
      mct_bilinear_w8_0_16bpc_neon:           135.6      76.2      80.5
      mct_bilinear_w8_h_16bpc_neon:           211.3     159.4     141.7
      mct_bilinear_w8_hv_16bpc_neon:          325.7     237.2     227.2
      mct_bilinear_w8_v_16bpc_neon:           180.7     135.9     129.5
      
      For comparison, the corresponding numbers for 8 bpc:
      
      mc_8tap_regular_w8_0_8bpc_neon:          78.6      41.0      39.5
      mc_8tap_regular_w8_h_8bpc_neon:         371.2     299.6     348.3
      mc_8tap_regular_w8_hv_8bpc_neon:        817.1     675.0     726.5
      mc_8tap_regular_w8_v_8bpc_neon:         243.7     260.4     253.0
      mc_bilinear_w8_0_8bpc_neon:              74.8      35.4      36.1
      mc_bilinear_w8_h_8bpc_neon:             179.9      69.9      79.2
      mc_bilinear_w8_hv_8bpc_neon:            210.8     132.4     144.8
      mc_bilinear_w8_v_8bpc_neon:             141.6      64.9      65.4
      mct_8tap_regular_w8_0_8bpc_neon:        101.7      54.4      59.5
      mct_8tap_regular_w8_h_8bpc_neon:        391.3     329.1     358.3
      mct_8tap_regular_w8_hv_8bpc_neon:       880.4     754.9     829.4
      mct_8tap_regular_w8_v_8bpc_neon:        270.8     300.8     277.4
      mct_bilinear_w8_0_8bpc_neon:             97.6      54.0      55.4
      mct_bilinear_w8_h_8bpc_neon:            173.3      73.5      79.5
      mct_bilinear_w8_hv_8bpc_neon:           228.3     163.0     174.0
      mct_bilinear_w8_v_8bpc_neon:            128.9      72.5      63.3
      fe44861b
  6. Feb 05, 2020
  7. Feb 04, 2020
    • Martin Storsjö's avatar
      arm64: mc: NEON implementation of avg/mask/w_avg for 16 bpc · 03511f8c
      Martin Storsjö authored
      avg_w4_16bpc_neon:         78.2     43.2     48.9
      avg_w8_16bpc_neon:        199.1    108.7    123.1
      avg_w16_16bpc_neon:       615.6    339.9    373.9
      avg_w32_16bpc_neon:      2313.0   1390.6   1490.6
      avg_w64_16bpc_neon:      5783.6   3119.5   3653.0
      avg_w128_16bpc_neon:    15444.6   8168.7   8907.9
      w_avg_w4_16bpc_neon:      120.1     87.8     92.4
      w_avg_w8_16bpc_neon:      321.6    252.4    263.1
      w_avg_w16_16bpc_neon:    1017.5    794.5    831.2
      w_avg_w32_16bpc_neon:    3911.4   3154.7   3306.5
      w_avg_w64_16bpc_neon:    9977.9   7794.9   8022.3
      w_avg_w128_16bpc_neon:  25721.5  19274.6  20041.7
      mask_w4_16bpc_neon:       139.5     96.5    104.3
      mask_w8_16bpc_neon:       376.0    283.9    300.1
      mask_w16_16bpc_neon:     1217.2    906.7    950.0
      mask_w32_16bpc_neon:     4811.1   3669.0   3901.3
      mask_w64_16bpc_neon:    12036.4   8918.4   9244.8
      mask_w128_16bpc_neon:   30888.8  21999.0  23206.7
      
      For comparison, these are the corresponding numbers for 8bpc:
      
      avg_w4_8bpc_neon:          56.7     26.2     28.5
      avg_w8_8bpc_neon:         137.2     52.8     64.3
      avg_w16_8bpc_neon:        377.9    151.5    161.6
      avg_w32_8bpc_neon:       1528.9    614.5    633.9
      avg_w64_8bpc_neon:       3792.5   1814.3   1518.3
      avg_w128_8bpc_neon:     10685.3   5220.4   3879.9
      w_avg_w4_8bpc_neon:        75.2     53.0     41.1
      w_avg_w8_8bpc_neon:       186.7    120.1    105.2
      w_avg_w16_8bpc_neon:      531.6    314.1    302.1
      w_avg_w32_8bpc_neon:     2138.4   1120.4   1171.5
      w_avg_w64_8bpc_neon:     5151.9   2910.5   2857.1
      w_avg_w128_8bpc_neon:   13945.0   7330.5   7389.1
      mask_w4_8bpc_neon:         82.0     47.2     45.1
      mask_w8_8bpc_neon:        213.5    115.4    115.8
      mask_w16_8bpc_neon:       639.8    356.2    370.1
      mask_w32_8bpc_neon:      2566.9   1489.8   1435.0
      mask_w64_8bpc_neon:      6727.6   3822.8   3425.2
      mask_w128_8bpc_neon:    17893.0   9622.6   9161.3
      03511f8c
  8. Feb 03, 2020
  9. Feb 01, 2020
    • Henrik Gramner's avatar
      Rework the CDEF top edge handling · bb178db0
      Henrik Gramner authored
      Avoids some pointer chasing and simplifies the DSP code, at the cost
      of making the initialization a little bit more complicated.
      
      Also reduces memory usage by a small amount due to properly sizing
      the buffers instead of always allocating enough space for 4:4:4.
      bb178db0
    • Henrik Gramner's avatar
      Avoid masking the lsb in high bit-depth stride calculations · fbc1b420
      Henrik Gramner authored
      We specify most strides in bytes, but since C defines pointer offsets
      in multiples of sizeof(type), we use the PXSTRIDE() macro to shift the
      strides right by one bit in high bit-depth templated files.
      
      This however means that the compiler is required to mask away
      the least significant bit, because it could in theory be non-zero.
      
      Avoid that by telling the compiler (when compiled in release mode)
      that the lsb is in fact guaranteed to always be zero.
      fbc1b420
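The idea described in this commit message can be sketched as follows. This is a hedged illustration, not the project's code: `EXAMPLE_ASSUME`, `example_pxstride` and `example_next_row` are hypothetical names, and the release-mode hint assumes a GCC or Clang compatible compiler that provides `__builtin_unreachable()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: check the condition in debug builds, and in
 * release builds (NDEBUG) promise the compiler that it always holds. */
#ifdef NDEBUG
#if defined(__GNUC__) || defined(__clang__)
#define EXAMPLE_ASSUME(cond) \
    do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#define EXAMPLE_ASSUME(cond) ((void)0)
#endif
#else
#define EXAMPLE_ASSUME(cond) assert(cond)
#endif

/* Convert a stride in bytes to a stride in 16-bit pixels. */
static inline ptrdiff_t example_pxstride(const ptrdiff_t byte_stride) {
    /* Indexing a uint16_t pointer by (byte_stride >> 1) scales the index
     * by 2 again, which equals byte_stride with its lowest bit cleared.
     * Knowing that bit is zero lets the compiler use byte_stride as-is. */
    EXAMPLE_ASSUME(!(byte_stride & 1));
    return byte_stride >> 1;
}

/* Advance a 16-bit pixel pointer by one row, given a stride in bytes. */
static inline uint16_t *example_next_row(uint16_t *row,
                                         const ptrdiff_t byte_stride) {
    return row + example_pxstride(byte_stride);
}
```

Passing an odd stride to this sketch is undefined behavior in release builds, so the hint is only safe when every caller really does guarantee even strides.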
  10. Jan 29, 2020
    • Henrik Gramner's avatar
      checkasm: Increase buffer alignment to 64-byte on x86-64 · 9c29f229
      Henrik Gramner authored
      Required for AVX-512.
      9c29f229
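A 512-bit vector register holds 64 bytes, so aligned full-width loads and stores need buffers that start on a 64-byte boundary. The following C11 sketch shows one way to request and verify that alignment; the macro and function names are hypothetical and are not taken from checkasm.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical alignment constant: 64 bytes matches one 512-bit vector. */
#define EXAMPLE_BUF_ALIGN 64

/* Returns 1 if p is aligned to `align`, which must be a power of two. */
static int example_is_aligned(const void *p, const uintptr_t align) {
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Declares a test buffer with the requested alignment and reports
 * whether the compiler honored it. */
static int example_buffer_ok(void) {
    alignas(EXAMPLE_BUF_ALIGN) uint8_t buf[128];
    buf[0] = 0;
    return example_is_aligned(buf, EXAMPLE_BUF_ALIGN);
}
```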
    • Martin Storsjö's avatar
      arm: cdef: Add special cased versions for pri_strength/sec_strength being zero · 361a3c8e
      Martin Storsjö authored
      Before:
      ARM32:                    Cortex A7      A8      A9     A53     A72     A73
      cdef_filter_4x4_8bpc_neon:    964.6   599.5   707.9   601.2   465.1   405.2
      cdef_filter_4x8_8bpc_neon:   1726.0  1066.2  1238.7  1041.7   798.6   725.3
      cdef_filter_8x8_8bpc_neon:   2974.4  1671.8  1943.9  1806.1  1229.8  1242.1
      ARM64:
      cdef_filter_4x4_8bpc_neon:                            569.2   337.8   348.7
      cdef_filter_4x8_8bpc_neon:                           1031.1   623.3   633.6
      cdef_filter_8x8_8bpc_neon:                           1847.5  1097.7  1117.5
      
      After:
      ARM32:                    Cortex A7      A8      A9     A53     A72     A73
      cdef_filter_4x4_8bpc_neon:    798.4   524.2   617.3   506.8   432.4   361.1
      cdef_filter_4x8_8bpc_neon:   1394.7   910.4  1054.0   863.6   730.2   632.2
      cdef_filter_8x8_8bpc_neon:   2364.6  1453.8  1675.1  1466.0  1086.4  1107.7
      ARM64:
      cdef_filter_4x4_8bpc_neon:                            461.7   303.1   308.6
      cdef_filter_4x8_8bpc_neon:                            833.0   547.5   556.0
      cdef_filter_8x8_8bpc_neon:                           1459.3   934.1   967.9
      361a3c8e
    • Martin Storsjö's avatar
      arm: cdef: Fix some comment typos · 6ad9bd5f
      Martin Storsjö authored
      6ad9bd5f
  11. Jan 28, 2020
  12. Jan 27, 2020
  13. Jan 25, 2020
  14. Jan 21, 2020