  1. Feb 18, 2020
  2. Feb 17, 2020
    • arm: cdef: Do an 8 bit implementation for cases with all edges present · b33f46e8
      Martin Storsjö authored
      This increases the code size by around 3 KB on arm64.
      
      Before:
      ARM32:                    Cortex A7      A8      A9     A53     A72     A73
      cdef_filter_4x4_8bpc_neon:    807.1   517.0   617.7   506.6   429.9   357.8
      cdef_filter_4x8_8bpc_neon:   1407.9   899.3  1054.6   862.3   726.5   628.1
      cdef_filter_8x8_8bpc_neon:   2394.9  1456.8  1676.8  1461.2  1084.4  1101.2
      ARM64:
      cdef_filter_4x4_8bpc_neon:                            460.7   301.8   308.0
      cdef_filter_4x8_8bpc_neon:                            831.6   547.0   555.2
      cdef_filter_8x8_8bpc_neon:                           1454.6   935.6   960.4
      
      After:
      ARM32:
      cdef_filter_4x4_8bpc_neon:    669.3   541.3   524.4   424.9   322.7   298.1
      cdef_filter_4x8_8bpc_neon:   1159.1   922.9   881.1   709.2   538.3   514.1
      cdef_filter_8x8_8bpc_neon:   1888.8  1285.4  1358.5  1152.9   839.3   871.2
      ARM64:
      cdef_filter_4x4_8bpc_neon:                            383.6   262.1   259.9
      cdef_filter_4x8_8bpc_neon:                            684.9   472.2   464.7
      cdef_filter_8x8_8bpc_neon:                           1160.0   756.8   788.0
      
      (The checkasm benchmark averages three different cases; the fully
      edged case is one of those three, while it's the most common case
      in actual video. The difference is much bigger if only benchmarking
      that particular case.)
      
      This appears to make the code slightly slower for the w=4 cases
      on Cortex A8, while giving a significant speedup on all other cores.
      b33f46e8
    • arm32: cdef: Fix a typo for consistency · aff9a210
      Martin Storsjö authored
      The signedness of elements doesn't matter for vsub; match the vsub.i16
      next to it.
      aff9a210
  3. Feb 16, 2020
    • cli: Implement line buffering in print_stats() · 09d90658
      Henrik Gramner authored
      Console output is incredibly slow on Windows, which is aggravated by
      the lack of line buffering. As a result, a significant percentage of
      overall runtime is actually spent displaying the decoding progress.
      
      Doing the line buffering manually alleviates most of the issue.
      09d90658
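      The manual line-buffering idea can be sketched roughly like this
      (function shape and format string are illustrative, not dav1d's
      actual print_stats()):

```c
#include <stdio.h>
#include <string.h>

/* Hedged sketch, not dav1d's actual code: format the whole progress
 * line into a local buffer first, then emit it with a single fwrite(),
 * so each update reaches the (slow) Windows console in one write
 * instead of many small ones. */
static int format_progress(char *const buf, const size_t size,
                           const unsigned cur, const unsigned total,
                           const double fps) {
    return snprintf(buf, size, "\rDecoded %u/%u frames (%.2f fps)",
                    cur, total, fps);
}

static void print_stats(const unsigned cur, const unsigned total,
                        const double fps) {
    char buf[256];
    const int len = format_progress(buf, sizeof(buf), cur, total, fps);
    if (len > 0)
        fwrite(buf, 1, (size_t)len, stderr); /* one write per update */
}
```

      The key point is that the console never sees partial lines, so the
      number of slow console writes drops to one per progress update.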
  4. Feb 13, 2020
  5. Feb 11, 2020
    • arm64: looprestoration: NEON implementation of SGR for 10 bpc · e3dbf926
      Martin Storsjö authored
      This only supports 10 bpc, not 12 bpc, as the sum and tmp buffers can
      be int16_t for 10 bpc, but need to be int32_t for 12 bpc.
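      The int16_t-vs-int32_t constraint follows from simple worst-case
      arithmetic; a hedged illustration (assuming a 5x5 box sum of up to
      25 max-valued pixels):

```c
#include <stdint.h>

/* Hedged illustration of the constraint described above: a 5x5 box sum
 * accumulates up to 25 max-valued pixels, so check whether the
 * worst-case sum still fits in int16_t. */
static int box5_sum_fits_int16(const int bpc) {
    const long max_pixel = (1L << bpc) - 1;
    return 25 * max_pixel <= INT16_MAX; /* 10 bpc: 25575; 12 bpc: 102375 */
}
```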
      
      Make actual templates out of the functions in looprestoration_tmpl.S,
      and add box3/5_h to looprestoration16.S.
      
      Extend dav1d_sgr_calc_abX_neon with a mandatory bitdepth_max parameter
      (passed even in 8bpc mode), and add a define to bitdepth.h for
      passing such a parameter in all modes. This makes the function
      a few instructions slower in 8bpc mode than it was before (overall
      impact around 1% of the total runtime of SGR), but allows using the
      same function instantiation for all modes, saving a bit of code
      size.
      
      Examples of checkasm runtimes:
                                 Cortex A53        A72        A73
      selfguided_3x3_10bpc_neon:   516755.8   389412.7   349058.7
      selfguided_5x5_10bpc_neon:   380699.9   293486.6   254591.6
      selfguided_mix_10bpc_neon:   878142.3   667495.9   587844.6
      
      Corresponding 8 bpc numbers for comparison:
      selfguided_3x3_8bpc_neon:    491058.1   361473.4   347705.9
      selfguided_5x5_8bpc_neon:    352655.0   266423.7   248192.2
      selfguided_mix_8bpc_neon:    826094.1   612372.2   581943.1
      e3dbf926
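      The define for passing bitdepth_max in all modes presumably follows
      the existing HIGHBD_*_SUFFIX pattern; a hedged sketch (the name
      BITDEPTH_MAX_SUFFIX and the function bodies are illustrative, not
      necessarily dav1d's):

```c
#define BITDEPTH 8 /* pretend we're compiling the 8 bpc template */

#if BITDEPTH == 8
#define HIGHBD_DECL_SUFFIX         /* no extra parameter in 8 bpc mode */
#define BITDEPTH_MAX_SUFFIX , 0xff /* ...but still pass 255 here */
#else
#define HIGHBD_DECL_SUFFIX , const int bitdepth_max
#define BITDEPTH_MAX_SUFFIX , bitdepth_max
#endif

/* Stand-in for dav1d_sgr_calc_abX_neon's extended signature, which now
 * takes bitdepth_max unconditionally. */
static int calc_ab(const int w, const int bitdepth_max) {
    return w + bitdepth_max; /* dummy body for illustration */
}

static int caller(const int w HIGHBD_DECL_SUFFIX) {
    return calc_ab(w BITDEPTH_MAX_SUFFIX); /* same call site in all modes */
}
```

      Because the call site is identical in every mode, one concrete
      instantiation of the callee can serve all of them.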
    • arm64: looprestoration: Prepare for 16 bpc by splitting code to separate files · 7cf5d753
      Martin Storsjö authored
      looprestoration_common.S contains functions that can be used as-is,
      with a single instantiation serving both 8 and 16 bpc.
      This file will be built once, regardless of which bitdepths are enabled.
      
      looprestoration_tmpl.S contains functions where the source can be shared
      and templated between 8 and 16 bpc. This will be included by the separate
      8/16bpc implementation files.
      7cf5d753
    • arm: looprestoration: Add 8bpc to existing function names, add HIGHBD_*_SUFFIX · 32e265a8
      Martin Storsjö authored
      Don't add it to dav1d_sgr_calc_ab1/2_neon and box3/5_v, as the same
      concrete function implementations can be shared for both 8 and 16 bpc
      for those functions.
      32e265a8
    • looprestoration: Add a bpc parameter to the init func · 96da9cc2
      Martin Storsjö authored
      This allows using completely different codepaths for 10 and 12 bpc,
      or just adding SIMD functions for either of them.
      96da9cc2
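      A hedged sketch of what such an init function enables (struct and
      function names are simplified placeholders, not dav1d's actual DSP
      context):

```c
#include <stddef.h>

typedef struct {
    void (*wiener)(void);
    void (*sgr)(void);
} LoopRestorationDSPContextSketch;

static void sgr_c(void) {}          /* stand-in C fallback */
static void sgr_10bpc_neon(void) {} /* stand-in 10 bpc SIMD */

/* With bpc available at init time, SIMD can be wired up for 10 bpc only
 * while 12 bpc keeps the C fallback. */
static void loop_restoration_dsp_init(LoopRestorationDSPContextSketch *const c,
                                      const int bpc) {
    c->wiener = NULL; /* omitted in this sketch */
    c->sgr = sgr_c;
    if (bpc == 10)
        c->sgr = sgr_10bpc_neon;
}
```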
    • arm: looprestoration: Improve scheduling in box3/5_h slightly · 8fb30657
      Martin Storsjö authored
      Set flags further from the branch instructions that use them.
      8fb30657
    • arm: Use int16_t for the tmp intermediate buffer · 8e8fb84d
      Martin Storsjö authored
      For 8bpc and 10bpc, int16_t is enough here; for 12bpc, the other
      intermediate buffers also need to be sized for the coef type anyway.
      8e8fb84d
    • arm: looprestoration: Fix a comment · feeaf785
      Martin Storsjö authored
      feeaf785
  6. Feb 10, 2020
    • e4208e85 (Jean-Baptiste Kempf)
    • arm64: mc: Reduce the width of a register copy · d4c5ad49
      Martin Storsjö authored and Janne Grunau committed
      Only copy as much as really is needed/used.
      d4c5ad49
    • arm64: mc: Use two regs for alternating output rows for w4/8 in avg/w_avg/mask · b1167ce1
      Martin Storsjö authored and Janne Grunau committed
      It was already done this way for w32/64. Not doing it for w16 as it
      didn't help there (and instead gave a small slowdown due to the two
      setup instructions).
      
      This gives a small speedup on in-order cores like A53.
      
      Before:         Cortex A53     A72     A73
      avg_w4_8bpc_neon:     60.9    25.6    29.0
      avg_w8_8bpc_neon:    143.6    52.8    64.0
      After:
      avg_w4_8bpc_neon:     56.7    26.7    28.5
      avg_w8_8bpc_neon:    137.2    54.5    64.4
      b1167ce1
    • arm64: mc: Simplify avg/w_avg/mask by always using the w16 macro · 0bad117e
      Martin Storsjö authored and Janne Grunau committed
      This shortens the source by 40 lines, and gives a significant
      speedup on A53, a small speedup on A72 and a very minor slowdown
      for avg/w_avg on A73.
      
      Before:           Cortex A53     A72     A73
      avg_w4_8bpc_neon:       67.4    26.1    25.4
      avg_w8_8bpc_neon:      158.7    56.3    59.1
      avg_w16_8bpc_neon:     382.9   154.1   160.7
      w_avg_w4_8bpc_neon:     99.9    43.6    39.4
      w_avg_w8_8bpc_neon:    253.2    98.3    99.0
      w_avg_w16_8bpc_neon:   543.1   285.0   301.8
      mask_w4_8bpc_neon:     110.6    51.4    45.1
      mask_w8_8bpc_neon:     295.0   129.9   114.0
      mask_w16_8bpc_neon:    654.6   365.8   369.7
      After:
      avg_w4_8bpc_neon:       60.8    26.3    29.0
      avg_w8_8bpc_neon:      142.8    52.9    64.1
      avg_w16_8bpc_neon:     378.2   153.4   160.8
      w_avg_w4_8bpc_neon:     78.7    41.0    40.9
      w_avg_w8_8bpc_neon:    190.6    90.1   105.1
      w_avg_w16_8bpc_neon:   531.1   279.3   301.4
      mask_w4_8bpc_neon:      86.6    47.2    44.9
      mask_w8_8bpc_neon:     222.0   114.3   114.9
      mask_w16_8bpc_neon:    639.5   356.0   369.8
      0bad117e
    • Update NEWS for 0.6.0 · 2e68c1f3
      Jean-Baptiste Kempf authored
      2e68c1f3
  7. Feb 08, 2020
    • arm64: mc: NEON implementation of warp for 16 bpc · 8974c155
      Martin Storsjö authored
      Checkasm benchmark numbers:
                         Cortex A53     A72     A73
      warp_8x8_16bpc_neon:   2029.9  1150.5  1225.2
      warp_8x8t_16bpc_neon:  2007.6  1129.0  1192.3
      
      Corresponding numbers for 8bpc for comparison:
      
      warp_8x8_8bpc_neon:    1863.8  1052.8  1106.2
      warp_8x8t_8bpc_neon:   1847.4  1048.3  1099.8
      8974c155
  8. Feb 07, 2020
    • arm64: cdef: Add NEON implementations of CDEF for 16 bpc · e6cebeb7
      Martin Storsjö authored
      As some functions are made for both 8bpc and 16bpc from a shared
      template, those functions are moved to a separate assembly file
      which is included. That assembly file (cdef_tmpl.S) isn't intended
      to be assembled on its own (just like utils.S), but if it is
      assembled, it should produce an empty object file.
      
      Checkasm benchmarks:
                               Cortex A53     A72     A73
      cdef_dir_16bpc_neon:          422.7   305.5   314.0
      cdef_filter_4x4_16bpc_neon:   452.9   282.7   296.6
      cdef_filter_4x8_16bpc_neon:   800.9   515.3   534.1
      cdef_filter_8x8_16bpc_neon:  1417.1   922.7   942.6
      
      Corresponding numbers for 8bpc for comparison:
      
      cdef_dir_8bpc_neon:          394.7   268.8   281.8
      cdef_filter_4x4_8bpc_neon:   461.5   300.9   307.7
      cdef_filter_4x8_8bpc_neon:   831.6   546.1   555.6
      cdef_filter_8x8_8bpc_neon:  1454.6   934.0   960.0
      e6cebeb7
    • arm: cdef: Prepare for 16bpc · 1d5ef8df
      Martin Storsjö authored
      1d5ef8df
  9. Feb 06, 2020
    • 19ce77e0 (Henrik Gramner)
    • Reorder the Dav1dFrameHeader struct to fix alignment issues · 58a4ba07
      Henrik Gramner authored
      The ar_coeff_shift element needs to have a 16-byte alignment on x86.
      58a4ba07
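      The issue can be illustrated with a made-up struct (field names and
      types here are hypothetical; only ar_coeff_shift's 16-byte
      alignment requirement comes from the commit):

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout sketch: an over-aligned member must land at an
 * offset that is a multiple of its alignment, so placing it first (or
 * after a 16-byte multiple of other fields) satisfies the requirement
 * without wasted padding. */
typedef struct {
    alignas(16) uint64_t ar_coeff_shift[2]; /* needs offset % 16 == 0 */
    int32_t width, height;
    int32_t flags;
} FrameHeaderSketch;
```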
    • arm64: looprestoration: NEON implementation of wiener filter for 16 bpc · c89eb564
      Martin Storsjö authored
      Checkasm benchmarks:     Cortex A53       A72       A73
      wiener_chroma_16bpc_neon:  190288.4  129369.5  127284.1
      wiener_luma_16bpc_neon:    195108.4  129387.8  127042.7
      
      The corresponding numbers for 8 bpc for comparison:
      wiener_chroma_8bpc_neon:   150586.9  101647.1   97709.9
      wiener_luma_8bpc_neon:     146297.4  101593.2   97670.5
      c89eb564
    • arm: looprestoration: Fix the wiener C wrapper function for high bitdepths · c2a2e6ee
      Martin Storsjö authored
      Use HIGHBD_DECL_SUFFIX and HIGHBD_TAIL_SUFFIX where necessary, add
      a missing sizeof(pixel).
      c2a2e6ee
    • arm: looprestoration: Clarify a comment · 2653292c
      Martin Storsjö authored
      2653292c
    • arm64: mc: NEON implementation of put/prep 8tap/bilin for 16 bpc · fe44861b
      Martin Storsjö authored
      Examples of checkasm benchmarks:
                                         Cortex A53       A72       A73
      mc_8tap_regular_w8_0_16bpc_neon:         96.8      49.6      62.5
      mc_8tap_regular_w8_h_16bpc_neon:        570.3     388.0     467.2
      mc_8tap_regular_w8_hv_16bpc_neon:      1035.8     776.7     891.3
      mc_8tap_regular_w8_v_16bpc_neon:        400.6     285.0     278.3
      mc_bilinear_w8_0_16bpc_neon:             90.0      44.8      57.8
      mc_bilinear_w8_h_16bpc_neon:            191.2     158.7     156.4
      mc_bilinear_w8_hv_16bpc_neon:           295.9     234.6     244.9
      mc_bilinear_w8_v_16bpc_neon:            147.2      98.7      89.2
      mct_8tap_regular_w8_0_16bpc_neon:       139.4      78.7      84.9
      mct_8tap_regular_w8_h_16bpc_neon:       612.5     396.8     479.1
      mct_8tap_regular_w8_hv_16bpc_neon:     1112.4     814.6     963.2
      mct_8tap_regular_w8_v_16bpc_neon:       461.8     370.8     353.4
      mct_bilinear_w8_0_16bpc_neon:           135.6      76.2      80.5
      mct_bilinear_w8_h_16bpc_neon:           211.3     159.4     141.7
      mct_bilinear_w8_hv_16bpc_neon:          325.7     237.2     227.2
      mct_bilinear_w8_v_16bpc_neon:           180.7     135.9     129.5
      
      For comparison, the corresponding numbers for 8 bpc:
      
      mc_8tap_regular_w8_0_8bpc_neon:          78.6      41.0      39.5
      mc_8tap_regular_w8_h_8bpc_neon:         371.2     299.6     348.3
      mc_8tap_regular_w8_hv_8bpc_neon:        817.1     675.0     726.5
      mc_8tap_regular_w8_v_8bpc_neon:         243.7     260.4     253.0
      mc_bilinear_w8_0_8bpc_neon:              74.8      35.4      36.1
      mc_bilinear_w8_h_8bpc_neon:             179.9      69.9      79.2
      mc_bilinear_w8_hv_8bpc_neon:            210.8     132.4     144.8
      mc_bilinear_w8_v_8bpc_neon:             141.6      64.9      65.4
      mct_8tap_regular_w8_0_8bpc_neon:        101.7      54.4      59.5
      mct_8tap_regular_w8_h_8bpc_neon:        391.3     329.1     358.3
      mct_8tap_regular_w8_hv_8bpc_neon:       880.4     754.9     829.4
      mct_8tap_regular_w8_v_8bpc_neon:        270.8     300.8     277.4
      mct_bilinear_w8_0_8bpc_neon:             97.6      54.0      55.4
      mct_bilinear_w8_h_8bpc_neon:            173.3      73.5      79.5
      mct_bilinear_w8_hv_8bpc_neon:           228.3     163.0     174.0
      mct_bilinear_w8_v_8bpc_neon:            128.9      72.5      63.3
      fe44861b
  10. Feb 05, 2020
  11. Feb 04, 2020
    • arm64: mc: NEON implementation of avg/mask/w_avg for 16 bpc · 03511f8c
      Martin Storsjö authored
                           Cortex A53      A72      A73
      avg_w4_16bpc_neon:         78.2     43.2     48.9
      avg_w8_16bpc_neon:        199.1    108.7    123.1
      avg_w16_16bpc_neon:       615.6    339.9    373.9
      avg_w32_16bpc_neon:      2313.0   1390.6   1490.6
      avg_w64_16bpc_neon:      5783.6   3119.5   3653.0
      avg_w128_16bpc_neon:    15444.6   8168.7   8907.9
      w_avg_w4_16bpc_neon:      120.1     87.8     92.4
      w_avg_w8_16bpc_neon:      321.6    252.4    263.1
      w_avg_w16_16bpc_neon:    1017.5    794.5    831.2
      w_avg_w32_16bpc_neon:    3911.4   3154.7   3306.5
      w_avg_w64_16bpc_neon:    9977.9   7794.9   8022.3
      w_avg_w128_16bpc_neon:  25721.5  19274.6  20041.7
      mask_w4_16bpc_neon:       139.5     96.5    104.3
      mask_w8_16bpc_neon:       376.0    283.9    300.1
      mask_w16_16bpc_neon:     1217.2    906.7    950.0
      mask_w32_16bpc_neon:     4811.1   3669.0   3901.3
      mask_w64_16bpc_neon:    12036.4   8918.4   9244.8
      mask_w128_16bpc_neon:   30888.8  21999.0  23206.7
      
      For comparison, these are the corresponding numbers for 8bpc:
      
      avg_w4_8bpc_neon:          56.7     26.2     28.5
      avg_w8_8bpc_neon:         137.2     52.8     64.3
      avg_w16_8bpc_neon:        377.9    151.5    161.6
      avg_w32_8bpc_neon:       1528.9    614.5    633.9
      avg_w64_8bpc_neon:       3792.5   1814.3   1518.3
      avg_w128_8bpc_neon:     10685.3   5220.4   3879.9
      w_avg_w4_8bpc_neon:        75.2     53.0     41.1
      w_avg_w8_8bpc_neon:       186.7    120.1    105.2
      w_avg_w16_8bpc_neon:      531.6    314.1    302.1
      w_avg_w32_8bpc_neon:     2138.4   1120.4   1171.5
      w_avg_w64_8bpc_neon:     5151.9   2910.5   2857.1
      w_avg_w128_8bpc_neon:   13945.0   7330.5   7389.1
      mask_w4_8bpc_neon:         82.0     47.2     45.1
      mask_w8_8bpc_neon:        213.5    115.4    115.8
      mask_w16_8bpc_neon:       639.8    356.2    370.1
      mask_w32_8bpc_neon:      2566.9   1489.8   1435.0
      mask_w64_8bpc_neon:      6727.6   3822.8   3425.2
      mask_w128_8bpc_neon:    17893.0   9622.6   9161.3
      03511f8c
  12. Feb 03, 2020
  13. Feb 01, 2020
    • Rework the CDEF top edge handling · bb178db0
      Henrik Gramner authored
      Avoids some pointer chasing and simplifies the DSP code, at the cost
      of making the initialization a little bit more complicated.
      
      Also reduces memory usage by a small amount due to properly sizing
      the buffers instead of always allocating enough space for 4:4:4.
      bb178db0
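      The buffer-sizing point can be illustrated with subsampling
      arithmetic (a hedged sketch; the function and its exact formula
      are illustrative, not dav1d's actual allocation code):

```c
#include <stddef.h>

/* Hedged sketch: bytes needed for one line of CDEF top-edge backup for
 * one luma plane plus two chroma planes, given horizontal subsampling
 * ss_hor (0 for 4:4:4, 1 for 4:2:2/4:2:0). Sizing per layout instead of
 * always assuming 4:4:4 shrinks the chroma part for subsampled video. */
static size_t cdef_line_bytes(const int luma_w, const int ss_hor,
                              const size_t bytes_per_pixel) {
    const int chroma_w = (luma_w + ss_hor) >> ss_hor;
    return (size_t)(luma_w + 2 * chroma_w) * bytes_per_pixel;
}
```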
    • Avoid masking the lsb in high bit-depth stride calculations · fbc1b420
      Henrik Gramner authored
      We specify most strides in bytes, but since C defines offsets
      in multiples of sizeof(type) we use the PXSTRIDE() macro to
      downshift the strides by one in high-bit depth templated files.
      
      This however means that the compiler is required to mask away
      the least significant bit, because it could in theory be non-zero.
      
      Avoid that by telling the compiler (when compiled in release mode)
      that the lsb is in fact guaranteed to always be zero.
      fbc1b420
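      The mechanism can be sketched like this (a hedged approximation of
      the PXSTRIDE() idea; __builtin_unreachable() is one GCC/Clang way
      to state the guarantee, not necessarily what dav1d uses):

```c
#include <assert.h>
#include <stddef.h>

/* Byte stride -> pixel stride for 16-bit pixels. In debug builds the
 * evenness is checked; in release builds the compiler is told the low
 * bit is zero, so `PXSTRIDE(x) * sizeof(pixel)` can fold back into
 * plain `x` without an `& ~1` mask. */
static inline ptrdiff_t PXSTRIDE(const ptrdiff_t x) {
#ifndef NDEBUG
    assert(!(x & 1));
#else
    if (x & 1) __builtin_unreachable(); /* GCC/Clang-specific hint */
#endif
    return x >> 1;
}
```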
  14. Jan 29, 2020