  1. Nov 23, 2020
  2. Nov 22, 2020
    • Henrik Gramner's avatar
      Add more buffer pools · 236e1122
      Henrik Gramner authored and committed
      Add buffer pools for miscellaneous smaller buffers that are
      repeatedly being freed and reallocated.
      
      Also improve dav1d_ref_create() by consolidating two separate
      memory allocations into a single one.
      236e1122
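The consolidation described in this commit can be sketched as follows. This is a minimal illustration, not dav1d's actual `dav1d_ref_create()`: the `ExampleRef` type, the `example_*` functions and `EXAMPLE_ALIGN` are hypothetical names, and the reference count is a plain `int` for brevity where a multi-threaded decoder would need an atomic counter. The idea is to reserve space for the bookkeeping header and the payload in one `malloc()` call, so creation costs one allocation and release costs one `free()`.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical payload offset granularity (assumption, not taken from dav1d). */
#define EXAMPLE_ALIGN 16

/* Hypothetical refcounted buffer; not the real Dav1dRef layout. */
typedef struct ExampleRef {
    int      ref_cnt; /* plain int for brevity; real code would use an atomic */
    size_t   size;    /* payload size in bytes */
    uint8_t *data;    /* points into the same allocation, after the header */
} ExampleRef;

/* Header size rounded up to a multiple of EXAMPLE_ALIGN. */
static size_t example_header_size(void) {
    return (sizeof(ExampleRef) + EXAMPLE_ALIGN - 1) & ~(size_t)(EXAMPLE_ALIGN - 1);
}

/* One malloc() covers both the header and the payload. */
static ExampleRef *example_ref_create(size_t size) {
    const size_t hdr = example_header_size();
    if (size > SIZE_MAX - hdr) return NULL; /* avoid size overflow */
    uint8_t *base = malloc(hdr + size);
    if (!base) return NULL;
    ExampleRef *ref = (ExampleRef *)base;
    ref->ref_cnt = 1;
    ref->size = size;
    ref->data = base + hdr;
    return ref;
}

static void example_ref_inc(ExampleRef *ref) {
    ref->ref_cnt++;
}

/* Drops one reference; returns 1 if the buffer was freed, 0 otherwise. */
static int example_ref_dec(ExampleRef *ref) {
    if (--ref->ref_cnt > 0) return 0;
    free(ref); /* a single free() releases header and payload together */
    return 1;
}
```

Compared with allocating the header and the payload separately, this layout halves the number of allocator calls per buffer and keeps the bookkeeping next to the data it describes.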
  3. Nov 20, 2020
    • Martin Storsjö's avatar
      arm32: mc: NEON implementation of warp8x8 for 16 bpc · dc98fff8
      Martin Storsjö authored
      Checkasm benchmarks:
                          Cortex A7      A8     A53     A72     A73
      warp_8x8_16bpc_neon:   4062.6  2109.4  2462.0  1338.9  1391.1
      warp_8x8t_16bpc_neon:  3996.3  2102.4  2412.0  1273.8  1368.9
      
      Corresponding numbers for arm64, for comparison:
                                         Cortex A53     A72     A73
      warp_8x8_16bpc_neon:                   2037.0  1148.8  1222.0
      warp_8x8t_16bpc_neon:                  2008.0  1120.4  1200.9
      dc98fff8
    • Martin Storsjö's avatar
      arm32: cdef: Add NEON implementations of CDEF for 16 bpc · 018e64e7
      Martin Storsjö authored
      Use a shared template file for assembly functions that can be
      templated into 8 and 16 bpc forms, just like in the arm64 version.
      
      Checkasm benchmarks:
                                Cortex A7      A8     A53     A72     A73
      cdef_dir_16bpc_neon:          975.9   853.2   555.2   378.7   386.9
      cdef_filter_4x4_16bpc_neon:   746.9   521.7   481.2   333.0   340.8
      cdef_filter_4x8_16bpc_neon:  1300.0   885.5   816.3   582.7   599.5
      cdef_filter_8x8_16bpc_neon:  2282.5  1415.0  1417.6  1059.0  1076.3
      
      Corresponding numbers for arm64, for comparison:
                                               Cortex A53     A72     A73
      cdef_dir_16bpc_neon:                          418.0   306.7   310.7
      cdef_filter_4x4_16bpc_neon:                   453.4   282.9   297.4
      cdef_filter_4x8_16bpc_neon:                   807.5   514.2   533.8
      cdef_filter_8x8_16bpc_neon:                  1425.2   924.4   942.0
      018e64e7
    • Martin Storsjö's avatar
      arm64: cdef: Fix a comment typo · c48ea15f
      Martin Storsjö authored
      c48ea15f
    • Matthias Dressel's avatar
      Update THANKS.md · ba875b96
      Matthias Dressel authored
      ba875b96
  4. Nov 18, 2020
  5. Nov 17, 2020
  6. Nov 16, 2020
  7. Nov 07, 2020
    • oddstone's avatar
      Fix variable name · ffd052bd
      oddstone authored
      The first index to task_idx_to_sby_and_tile_idx is task_idx not tile_idx
      ffd052bd
  8. Oct 21, 2020
  9. Oct 02, 2020
  10. Oct 01, 2020
  11. Sep 27, 2020
  12. Sep 24, 2020
  13. Sep 20, 2020
  14. Sep 17, 2020
  15. Sep 15, 2020
    • Wan-Teh Chang's avatar
      Ban op->idc that may drop all layer-specific OBUs · 50e876c6
      Wan-Teh Chang authored
      If c->operating_point_idc is nonzero and either bits 0-7 or bits 8-11 in
      it are all 0s, it will cause dav1d_parse_obus() to drop all
      layer-specific OBUs. Prohibit any op->idc with such properties because
      it could be selected as c->operating_point_idc.
      50e876c6
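The condition described in this commit can be written as a small predicate. This is a sketch under the assumption that the low 8 bits of the value select temporal layers and bits 8-11 select spatial layers, as the message implies; `example_idc_drops_all_layers` is a hypothetical name and not a function in dav1d.

```c
/* Returns 1 if a nonzero idc has no temporal layer bit (bits 0-7) or no
 * spatial layer bit (bits 8-11) set, which would cause every
 * layer-specific OBU to be dropped. Returns 0 otherwise.
 * Hypothetical helper for illustration only. */
static int example_idc_drops_all_layers(unsigned idc) {
    if (idc == 0) return 0;                       /* zero idc is not affected */
    const unsigned temporal_bits = idc & 0xffu;       /* bits 0-7  */
    const unsigned spatial_bits  = (idc >> 8) & 0xfu; /* bits 8-11 */
    return temporal_bits == 0 || spatial_bits == 0;
}
```

For instance, `0x101` selects one temporal and one spatial layer and passes, while `0x001` and `0x100` each leave one of the two bit groups empty and would be rejected.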
  16. Sep 06, 2020
  17. Sep 03, 2020
    • Martin Storsjö's avatar
      arm32: mc: NEON implementation of put/prep 8tap/bilin for 16 bpc · 856662b4
      Martin Storsjö authored
      Examples of checkasm benchmarks:
                                        Cortex A7      A8      A9     A53     A72     A73
      mc_8tap_regular_w8_0_16bpc_neon:      158.7   106.2   167.0   127.9    55.0    77.2
      mc_8tap_regular_w8_h_16bpc_neon:     1000.8   557.5   749.2   609.2   401.4   485.4
      mc_8tap_regular_w8_hv_16bpc_neon:    2278.9  1255.4  1352.5  1277.2   867.8   915.9
      mc_8tap_regular_w8_v_16bpc_neon:     1060.0   393.6   485.5   448.3   298.0   298.2
      mc_bilinear_w8_0_16bpc_neon:          159.7    96.6   161.1   123.7    55.4    74.7
      mc_bilinear_w8_h_16bpc_neon:          342.3   250.8   352.9   239.0   158.4   165.1
      mc_bilinear_w8_hv_16bpc_neon:         587.7   373.8   469.0   339.8   244.4   247.5
      mc_bilinear_w8_v_16bpc_neon:          285.8   189.3   284.9   180.4   103.4   100.9
      mct_8tap_regular_w8_0_16bpc_neon:     233.0   136.6   229.3   169.3    86.2    98.3
      mct_8tap_regular_w8_h_16bpc_neon:    1106.8   588.3   817.9   654.1   406.4   489.8
      mct_8tap_regular_w8_hv_16bpc_neon:   2473.3  1326.3  1428.2  1373.7   903.3   951.1
      mct_8tap_regular_w8_v_16bpc_neon:    1266.0   474.1   581.3   505.9   382.0   373.4
      mct_bilinear_w8_0_16bpc_neon:         232.9   126.2   225.0   166.3    86.2    91.7
      mct_bilinear_w8_h_16bpc_neon:         380.6   270.6   386.0   259.7   154.1   151.9
      mct_bilinear_w8_hv_16bpc_neon:        631.4   409.2   509.4   372.1   243.1   244.1
      mct_bilinear_w8_v_16bpc_neon:         349.5   233.5   347.9   212.4   138.7   138.4
      
      For comparison, the corresponding numbers for the existing arm64
      implementation:
      
                                                               Cortex A53     A72     A73
      mc_8tap_regular_w8_0_16bpc_neon:                               94.1    48.9    62.3
      mc_8tap_regular_w8_h_16bpc_neon:                              570.4   388.1   467.3
      mc_8tap_regular_w8_hv_16bpc_neon:                            1035.8   775.0   891.2
      mc_8tap_regular_w8_v_16bpc_neon:                              399.8   284.5   278.2
      mc_bilinear_w8_0_16bpc_neon:                                   90.0    44.3    57.4
      mc_bilinear_w8_h_16bpc_neon:                                  191.7   158.7   156.4
      mc_bilinear_w8_hv_16bpc_neon:                                 295.6   235.0   244.9
      mc_bilinear_w8_v_16bpc_neon:                                  147.2    99.0    88.8
      mct_8tap_regular_w8_0_16bpc_neon:                             139.4    78.4    84.9
      mct_8tap_regular_w8_h_16bpc_neon:                             612.3   395.9   478.6
      mct_8tap_regular_w8_hv_16bpc_neon:                           1113.0   804.3   963.5
      mct_8tap_regular_w8_v_16bpc_neon:                             462.1   370.8   353.3
      mct_bilinear_w8_0_16bpc_neon:                                 135.6    77.0    80.5
      mct_bilinear_w8_h_16bpc_neon:                                 210.8   159.2   141.7
      mct_bilinear_w8_hv_16bpc_neon:                                325.7   238.4   227.3
      mct_bilinear_w8_v_16bpc_neon:                                 180.7   136.7   129.5
      856662b4
    • Martin Storsjö's avatar
      arm64: mc: Apply tuning from w4/w8 case to w2 case in 16 bpc 8tap_hv · 4ae3f5f7
      Martin Storsjö authored
      Narrowing the intermediates from the horizontal pass is beneficial
      (on most cores, but a small slowdown on A53) here as well. This
      increases consistency in the code between the cases.
      
      (The corresponding change in the upcoming arm32 version helps on
      A7, A8, A9, A72 and A73, on some of them a lot, and is only
      marginally slower on A53.)
      
      Before:                        Cortex A53     A72     A73
      mc_8tap_regular_w2_hv_16bpc_neon:   457.7   301.0   317.1
      After:
      mc_8tap_regular_w2_hv_16bpc_neon:   472.0   276.0   284.3
      4ae3f5f7
    • Martin Storsjö's avatar
      arm: mc: Avoid an unnecessary mov in 8tap_hv w2 · 65a1aafd
      Martin Storsjö authored
      This matches how the same logic is written for w4 and above.
      65a1aafd
    • Martin Storsjö's avatar
      arm32: mc: Use narrower vext.8 in 8tap_w4_h · ea7e13e7
      Martin Storsjö authored
      The previous form was a leftover from how it had to be written on
      aarch64.
      ea7e13e7
    • Martin Storsjö's avatar
      arm64: mc: Use more descriptive element specifiers for loads/stores in 16 bpc put_neon · 13fad75d
      Martin Storsjö authored
      For loads of a half/full register, the actual size of the elements
      doesn't matter, but it makes the code more readable and understandable.
      13fad75d
  18. Sep 01, 2020
    • Henrik Gramner's avatar
      cli: Use proper integer math in Y4M PAR calculations · 3bfe8c7c
      Henrik Gramner authored and committed
      The previous floating-point implementation produced results that were
      sometimes slightly off due to rounding errors.
      
      For example, a frame size of 432x240 with a render size of 176x240
      previously resulted in a PAR of 98:240 instead of the correct 11:27.
      
      Also reduce fractions to produce more readable numbers.
      3bfe8c7c
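The integer approach can be sketched as below. This is an illustration rather than the code from the dav1d CLI: the `example_*` names are hypothetical, and the formula (render width times frame height over render height times frame width) is an assumption chosen because it reproduces the 11:27 result quoted in the commit message. The fraction is reduced with Euclid's algorithm, so no floating-point rounding is involved.

```c
#include <stdint.h>

/* Greatest common divisor via Euclid's algorithm. */
static uint64_t example_gcd(uint64_t a, uint64_t b) {
    while (b != 0) {
        const uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Computes the pixel aspect ratio as a reduced fraction num:den.
 * Assumed formula: PAR = (render_w * frame_h) / (render_h * frame_w).
 * Hypothetical helper for illustration only. */
static void example_par(uint32_t frame_w, uint32_t frame_h,
                        uint32_t render_w, uint32_t render_h,
                        uint64_t *num, uint64_t *den) {
    uint64_t n = (uint64_t)render_w * frame_h; /* 64-bit products cannot overflow */
    uint64_t d = (uint64_t)render_h * frame_w;
    const uint64_t g = example_gcd(n, d);
    if (g != 0) { /* g is 0 only when both products are 0 */
        n /= g;
        d /= g;
    }
    *num = n;
    *den = d;
}
```

With the values from the commit message, a 432x240 frame and a 176x240 render size give 42240:103680, which reduces by a common divisor of 3840 to 11:27.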
  19. Aug 30, 2020