  1. Oct 07, 2019
  2. Oct 03, 2019
  3. Oct 02, 2019
  4. Oct 01, 2019
    • Simplify README build instructions · 16e0741a
      Henrik Gramner authored and committed
    • Minor cleanup · f6a8cc0c
      Ronald S. Bultje authored
    • arm64: ipred: NEON implementation of dc/h/v prediction modes · f7743da1
      Martin Storsjö authored
      Relative speedups over the C code:
                                    Cortex A53    A72    A73
      intra_pred_dc_128_w4_8bpc_neon:     2.08   1.47   2.17
      intra_pred_dc_128_w8_8bpc_neon:     3.33   2.49   4.03
      intra_pred_dc_128_w16_8bpc_neon:    3.93   3.86   3.75
      intra_pred_dc_128_w32_8bpc_neon:    3.14   3.79   2.90
      intra_pred_dc_128_w64_8bpc_neon:    3.68   1.97   2.42
      intra_pred_dc_left_w4_8bpc_neon:    2.41   1.70   2.23
      intra_pred_dc_left_w8_8bpc_neon:    3.53   2.41   3.32
      intra_pred_dc_left_w16_8bpc_neon:   3.87   3.54   3.34
      intra_pred_dc_left_w32_8bpc_neon:   4.10   3.60   2.76
      intra_pred_dc_left_w64_8bpc_neon:   3.72   2.00   2.39
      intra_pred_dc_top_w4_8bpc_neon:     2.27   1.66   2.07
      intra_pred_dc_top_w8_8bpc_neon:     3.83   2.69   3.43
      intra_pred_dc_top_w16_8bpc_neon:    3.66   3.60   3.20
      intra_pred_dc_top_w32_8bpc_neon:    3.92   3.54   2.66
      intra_pred_dc_top_w64_8bpc_neon:    3.60   1.98   2.30
      intra_pred_dc_w4_8bpc_neon:         2.29   1.42   2.16
      intra_pred_dc_w8_8bpc_neon:         3.56   2.83   3.05
      intra_pred_dc_w16_8bpc_neon:        3.46   3.37   3.15
      intra_pred_dc_w32_8bpc_neon:        3.79   3.41   2.74
      intra_pred_dc_w64_8bpc_neon:        3.52   2.01   2.41
      intra_pred_h_w4_8bpc_neon:         10.34   5.74   5.94
      intra_pred_h_w8_8bpc_neon:         12.13   6.33   6.43
      intra_pred_h_w16_8bpc_neon:        10.66   7.31   5.85
      intra_pred_h_w32_8bpc_neon:         6.28   4.18   2.88
      intra_pred_h_w64_8bpc_neon:         3.96   1.85   1.75
      intra_pred_v_w4_8bpc_neon:         11.44   6.12   7.57
      intra_pred_v_w8_8bpc_neon:         14.76   7.58   7.95
      intra_pred_v_w16_8bpc_neon:        11.34   6.28   5.88
      intra_pred_v_w32_8bpc_neon:         6.56   3.33   3.34
      intra_pred_v_w64_8bpc_neon:         4.57   1.24   1.97
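For readers unfamiliar with the three modes benchmarked above, the following scalar C sketch shows what V, H and DC intra prediction compute for an 8 bpc block. It is an illustration only, not dav1d's C reference code: the function names and the use of separate `top` and `left` arrays are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* V prediction: every output row is a copy of the pixel row above the block. */
static void pred_v(uint8_t *dst, ptrdiff_t stride, const uint8_t *top,
                   int w, int h)
{
    for (int y = 0; y < h; y++)
        memcpy(dst + y * stride, top, (size_t)w);
}

/* H prediction: every output row is filled with the pixel to its left. */
static void pred_h(uint8_t *dst, ptrdiff_t stride, const uint8_t *left,
                   int w, int h)
{
    for (int y = 0; y < h; y++)
        memset(dst + y * stride, left[y], (size_t)w);
}

/* DC prediction: fill the block with the rounded average of the top and
 * left edge pixels. The dc_top and dc_left variants average one edge only,
 * and dc_128 fills with the constant 128 at 8 bpc. */
static void pred_dc(uint8_t *dst, ptrdiff_t stride, const uint8_t *top,
                    const uint8_t *left, int w, int h)
{
    assert(w > 0 && h > 0);
    unsigned sum = (unsigned)(w + h) >> 1; /* rounding bias */
    for (int x = 0; x < w; x++) sum += top[x];
    for (int y = 0; y < h; y++) sum += left[y];
    const uint8_t dc = (uint8_t)(sum / (unsigned)(w + h));
    for (int y = 0; y < h; y++)
        memset(dst + y * stride, dc, (size_t)w);
}
```

All three are simple fills, which is consistent with the h and v rows in the table showing the largest speedups at small block widths.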
  5. Sep 30, 2019
    • x86: add warp_affine SSE4 and SSSE3 asm · a91a03b0
      Victorien Le Couviour--Tuffet authored
      ------------------------------------------
      x86_64: warp_8x8_8bpc_c: 1773.4
      x86_32: warp_8x8_8bpc_c: 1740.4
      ----------
      x86_64: warp_8x8_8bpc_ssse3: 317.5
      x86_32: warp_8x8_8bpc_ssse3: 378.4
      ----------
      x86_64: warp_8x8_8bpc_sse4: 303.7
      x86_32: warp_8x8_8bpc_sse4: 367.7
      ----------
      x86_64: warp_8x8_8bpc_avx2: 224.9
      ---------------------
      ---------------------
      x86_64: warp_8x8t_8bpc_c: 1664.6
      x86_32: warp_8x8t_8bpc_c: 1674.0
      ----------
      x86_64: warp_8x8t_8bpc_ssse3: 320.7
      x86_32: warp_8x8t_8bpc_ssse3: 379.5
      ----------
      x86_64: warp_8x8t_8bpc_sse4: 304.8
      x86_32: warp_8x8t_8bpc_sse4: 369.8
      ----------
      x86_64: warp_8x8t_8bpc_avx2: 228.5
      ------------------------------------------
  6. Sep 29, 2019
    • arm64: itx: Fix overflows in idct · 713aa34c
      Martin Storsjö authored
      Don't add two 16 bit coefficients in 16 bit, if the result isn't supposed
      to be clipped.
      
      This fixes mismatches for some samples, see issue #299.
      
      Before:                                Cortex A53       A72       A73
      inv_txfm_add_4x4_dct_dct_1_8bpc_neon:        93.0      52.8      49.5
      inv_txfm_add_8x8_dct_dct_1_8bpc_neon:       260.0     186.0     196.4
      inv_txfm_add_16x16_dct_dct_2_8bpc_neon:    1371.0     953.4    1028.6
      inv_txfm_add_32x32_dct_dct_4_8bpc_neon:    7363.2    4887.5    5135.8
      inv_txfm_add_64x64_dct_dct_4_8bpc_neon:   25029.0   17492.3   18404.5
      After:
      inv_txfm_add_4x4_dct_dct_1_8bpc_neon:       105.0      58.7      55.2
      inv_txfm_add_8x8_dct_dct_1_8bpc_neon:       294.0     211.5     209.9
      inv_txfm_add_16x16_dct_dct_2_8bpc_neon:    1495.8    1050.4    1070.6
      inv_txfm_add_32x32_dct_dct_4_8bpc_neon:    7866.7    5197.8    5321.4
      inv_txfm_add_64x64_dct_dct_4_8bpc_neon:   25807.2   18619.3   18526.9
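The kind of overflow described in this commit message can be sketched in scalar C. This is a hedged illustration, not the NEON code from the commit: the helper names are hypothetical, and the comparison to the AArch64 `add`, `sqadd` and `saddl` instructions is only an analogy for wrapping, saturating and widening addition.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit add that wraps around, like a plain lane-wise add on 16-bit lanes. */
static int16_t add_wrap16(int16_t a, int16_t b)
{
    const uint16_t u = (uint16_t)((uint16_t)a + (uint16_t)b);
    return u >= 0x8000u ? (int16_t)((int32_t)u - 0x10000) : (int16_t)u;
}

/* 16-bit add that clips to the int16_t range (saturating add). Only correct
 * when the algorithm actually wants the result clipped at this point. */
static int16_t add_sat16(int16_t a, int16_t b)
{
    const int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Widening add: the exact sum is kept in 32 bits for later stages. */
static int32_t add_widen32(int16_t a, int16_t b)
{
    return (int32_t)a + (int32_t)b;
}

/* One input pair whose exact sum does not fit in 16 bits. */
void demo_add_overflow(void)
{
    assert(add_wrap16(30000, 10000) == -25536); /* wrong sign */
    assert(add_sat16(30000, 10000) == 32767);   /* clipped too early */
    assert(add_widen32(30000, 10000) == 40000); /* exact */
}
```

The widened form needs extra instructions and registers, which matches the small slowdown visible in the "After" numbers above.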
    • arm64: itx: Consistently use the factor 2896 in adst · 0ed3ad19
      Martin Storsjö authored
      The scaled form 2896>>4 shouldn't be necessary with valid bitstreams.
    • arm64: itx: Use smull+smlal instead of addl+mul · a4950bce
      Martin Storsjö authored
      Even though smull+smlal does two multiplications instead of one,
      the combination seems to be better handled by actual cores.
      
      Before:                                 Cortex A53      A72      A73
      inv_txfm_add_8x8_adst_adst_1_8bpc_neon:      356.0    279.2    278.0
      inv_txfm_add_16x16_adst_adst_2_8bpc_neon:   1785.0   1329.5   1308.8
      After:
      inv_txfm_add_8x8_adst_adst_1_8bpc_neon:      360.0    253.2    269.3
      inv_txfm_add_16x16_adst_adst_2_8bpc_neon:   1793.1   1300.9   1254.0
      
      (In this particular case, there seems to be a minor regression on A53,
      which is probably due more to having to change the ordering of some
      instructions, since smull+smlal+smull2+smlal2 overwrites the second
      output register sooner than an addl+addl2 would have. In general,
      though, smull+smlal seems to be equally good or better than addl+mul
      on A53 as well.)
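The two instruction sequences compared in this commit compute the same value by distributivity, (a + b) * c = a * c + b * c. A scalar C sketch of both forms follows. The function names are hypothetical and the per-line instruction comments are analogies, not the actual NEON code.

```c
#include <assert.h>
#include <stdint.h>

/* addl + mul form: widen the sum to 32 bits, then do a single multiply. */
static int32_t addl_mul(int16_t a, int16_t b, int16_t c)
{
    assert(c > INT16_MIN);                     /* keeps the product in int32_t */
    const int32_t sum = (int32_t)a + (int32_t)b; /* like a widening add */
    return sum * (int32_t)c;                     /* like a 32-bit multiply */
}

/* smull + smlal form: two widening multiplies, the second accumulating. */
static int32_t smull_smlal(int16_t a, int16_t b, int16_t c)
{
    assert(c > INT16_MIN);                     /* keeps the sum in int32_t */
    int32_t acc = (int32_t)a * (int32_t)c;       /* like smull */
    acc += (int32_t)b * (int32_t)c;              /* like smlal */
    return acc;
}
```

For example, with the coefficient 2896 both `addl_mul(1234, -567, 2896)` and `smull_smlal(1234, -567, 2896)` return 1931632. The second form performs one more multiplication, which is the trade-off the commit message discusses.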
  7. Sep 27, 2019
    • dav1dplay: initial support for --zerocopy · 490a1420
      Niklas Haas authored and Jean-Baptiste Kempf committed
      Right now this just allocates a new buffer for every frame, uses it,
      then discards it immediately. This is not optimal; either dav1d should
      start reusing buffers internally or we need to pool them in dav1dplay.
      
      As it stands, this is not really a performance gain. I'll have to
      investigate why, but my suspicion is that seeing any gains might require
      reusing buffers somewhere.
      
      Note: Thrashing buffers is not as bad as it initially seems. Not only
      does libplacebo pool and reuse GPU memory and buffer state objects
      internally, but this also absolves us from having to do any manual
      polling to figure out when the buffer is reusable again.
      
      It's entirely possible that this is only bad because of lock contention.
      As said, I'll have to investigate further...
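The pooling alternative mentioned in this message could look roughly like the following C sketch. It is a generic free-list pool built on `malloc`/`free`, not dav1dplay or libplacebo code: `BufPool`, `POOL_SLOTS` and all function names are hypothetical, and a real implementation would also have to know when the GPU is done with a buffer before releasing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_SLOTS 4 /* hypothetical upper bound on idle buffers kept around */

typedef struct {
    void  *idle[POOL_SLOTS]; /* buffers released by the user, ready for reuse */
    int    n_idle;
    size_t size;             /* every buffer in the pool has this size */
} BufPool;

static void pool_init(BufPool *p, size_t size)
{
    assert(size > 0);
    p->n_idle = 0;
    p->size = size;
}

/* Reuse an idle buffer if one exists, otherwise allocate a new one. */
static void *pool_acquire(BufPool *p)
{
    if (p->n_idle > 0)
        return p->idle[--p->n_idle];
    return malloc(p->size);
}

/* Return a buffer to the pool; free it if the pool is already full. */
static void pool_release(BufPool *p, void *buf)
{
    if (p->n_idle < POOL_SLOTS)
        p->idle[p->n_idle++] = buf;
    else
        free(buf);
}

static void pool_destroy(BufPool *p)
{
    while (p->n_idle > 0)
        free(p->idle[--p->n_idle]);
}
```

With such a pool, a release followed by an acquire hands back the same allocation instead of creating a new one per frame.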
    • dav1dplay: add --untimed for benchmarking purposes · 3f35ef1f
      Niklas Haas authored and Jean-Baptiste Kempf committed
      Useful to test the effects of performance changes to the
      decoding/rendering loop as a whole.
    • dav1dplay: add --highquality to toggle render quality · f6ae8c9c
      Niklas Haas authored and Jean-Baptiste Kempf committed
      Only meaningful with libplacebo. The defaults are higher quality than
      SDL so it's an unfair comparison and definitely too much for slow iGPUs
      at 4K res. Make the defaults fast/dumb processing only, and guard the
      debanding/dithering/upscaling/etc. behind a new --highquality flag.
  8. Sep 19, 2019
    • x86: add 32-bit support to SSSE3 deblock lpf · c0865f35
      Victorien Le Couviour--Tuffet authored
      ------------------------------------------
      x86_64: lpf_h_sb_uv_w4_8bpc_c: 430.6
      x86_32: lpf_h_sb_uv_w4_8bpc_c: 788.6
      x86_64: lpf_h_sb_uv_w4_8bpc_ssse3: 322.0
      x86_32: lpf_h_sb_uv_w4_8bpc_ssse3: 302.4
      ---------------------
      x86_64: lpf_h_sb_uv_w6_8bpc_c: 981.9
      x86_32: lpf_h_sb_uv_w6_8bpc_c: 1579.6
      x86_64: lpf_h_sb_uv_w6_8bpc_ssse3: 421.5
      x86_32: lpf_h_sb_uv_w6_8bpc_ssse3: 431.6
      ---------------------
      x86_64: lpf_h_sb_y_w4_8bpc_c: 3001.7
      x86_32: lpf_h_sb_y_w4_8bpc_c: 7021.3
      x86_64: lpf_h_sb_y_w4_8bpc_ssse3: 466.3
      x86_32: lpf_h_sb_y_w4_8bpc_ssse3: 564.7
      ---------------------
      x86_64: lpf_h_sb_y_w8_8bpc_c: 4457.7
      x86_32: lpf_h_sb_y_w8_8bpc_c: 3657.8
      x86_64: lpf_h_sb_y_w8_8bpc_ssse3: 818.9
      x86_32: lpf_h_sb_y_w8_8bpc_ssse3: 927.9
      ---------------------
      x86_64: lpf_h_sb_y_w16_8bpc_c: 1967.9
      x86_32: lpf_h_sb_y_w16_8bpc_c: 3343.5
      x86_64: lpf_h_sb_y_w16_8bpc_ssse3: 1836.7
      x86_32: lpf_h_sb_y_w16_8bpc_ssse3: 1975.0
      ---------------------
      x86_64: lpf_v_sb_uv_w4_8bpc_c: 369.4
      x86_32: lpf_v_sb_uv_w4_8bpc_c: 793.6
      x86_64: lpf_v_sb_uv_w4_8bpc_ssse3: 110.9
      x86_32: lpf_v_sb_uv_w4_8bpc_ssse3: 133.0
      ---------------------
      x86_64: lpf_v_sb_uv_w6_8bpc_c: 769.6
      x86_32: lpf_v_sb_uv_w6_8bpc_c: 1576.7
      x86_64: lpf_v_sb_uv_w6_8bpc_ssse3: 222.2
      x86_32: lpf_v_sb_uv_w6_8bpc_ssse3: 232.2
      ---------------------
      x86_64: lpf_v_sb_y_w4_8bpc_c: 772.4
      x86_32: lpf_v_sb_y_w4_8bpc_c: 2596.5
      x86_64: lpf_v_sb_y_w4_8bpc_ssse3: 179.8
      x86_32: lpf_v_sb_y_w4_8bpc_ssse3: 234.7
      ---------------------
      x86_64: lpf_v_sb_y_w8_8bpc_c: 1660.2
      x86_32: lpf_v_sb_y_w8_8bpc_c: 3979.9
      x86_64: lpf_v_sb_y_w8_8bpc_ssse3: 468.3
      x86_32: lpf_v_sb_y_w8_8bpc_ssse3: 580.9
      ---------------------
      x86_64: lpf_v_sb_y_w16_8bpc_c: 1889.6
      x86_32: lpf_v_sb_y_w16_8bpc_c: 4728.7
      x86_64: lpf_v_sb_y_w16_8bpc_ssse3: 1142.0
      x86_32: lpf_v_sb_y_w16_8bpc_ssse3: 1174.8
      ------------------------------------------
    • x86: add deblocking loopfilters SSSE3 asm (64-bit) · 1e4e6c7a
      Ronald S. Bultje authored and Victorien Le Couviour--Tuffet committed
      ---------------------
      x86_64:
      ------------------------------------------
      lpf_h_sb_uv_w4_8bpc_c: 430.6
      lpf_h_sb_uv_w4_8bpc_ssse3: 322.0
      lpf_h_sb_uv_w4_8bpc_avx2: 200.4
      ---------------------
      lpf_h_sb_uv_w6_8bpc_c: 981.9
      lpf_h_sb_uv_w6_8bpc_ssse3: 421.5
      lpf_h_sb_uv_w6_8bpc_avx2: 270.0
      ---------------------
      lpf_h_sb_y_w4_8bpc_c: 3001.7
      lpf_h_sb_y_w4_8bpc_ssse3: 466.3
      lpf_h_sb_y_w4_8bpc_avx2: 383.1
      ---------------------
      lpf_h_sb_y_w8_8bpc_c: 4457.7
      lpf_h_sb_y_w8_8bpc_ssse3: 818.9
      lpf_h_sb_y_w8_8bpc_avx2: 537.0
      ---------------------
      lpf_h_sb_y_w16_8bpc_c: 1967.9
      lpf_h_sb_y_w16_8bpc_ssse3: 1836.7
      lpf_h_sb_y_w16_8bpc_avx2: 1078.2
      ---------------------
      lpf_v_sb_uv_w4_8bpc_c: 369.4
      lpf_v_sb_uv_w4_8bpc_ssse3: 110.9
      lpf_v_sb_uv_w4_8bpc_avx2: 58.1
      ---------------------
      lpf_v_sb_uv_w6_8bpc_c: 769.6
      lpf_v_sb_uv_w6_8bpc_ssse3: 222.2
      lpf_v_sb_uv_w6_8bpc_avx2: 117.8
      ---------------------
      lpf_v_sb_y_w4_8bpc_c: 772.4
      lpf_v_sb_y_w4_8bpc_ssse3: 179.8
      lpf_v_sb_y_w4_8bpc_avx2: 173.6
      ---------------------
      lpf_v_sb_y_w8_8bpc_c: 1660.2
      lpf_v_sb_y_w8_8bpc_ssse3: 468.3
      lpf_v_sb_y_w8_8bpc_avx2: 345.8
      ---------------------
      lpf_v_sb_y_w16_8bpc_c: 1889.6
      lpf_v_sb_y_w16_8bpc_ssse3: 1142.0
      lpf_v_sb_y_w16_8bpc_avx2: 568.1
      ------------------------------------------
  9. Sep 10, 2019
  10. Sep 06, 2019
  11. Sep 05, 2019
    • Silence some clang-cl warnings · acad1a99
      Henrik Gramner authored
      For some reason the MSVC CRT _wassert() function is not flagged as
      __declspec(noreturn), so when using those headers the compiler will
      expect execution to continue after an assertion has been triggered
      and will therefore complain about the use of uninitialized variables
      when compiled in debug mode in certain code paths.
      
      Reorder some case statements as a workaround.
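The pattern and the workaround can be illustrated with a small C sketch. This is an assumption about the general shape of the fix, not the code changed in this commit; the enum and function names are hypothetical.

```c
#include <assert.h>

enum Kind { KIND_A, KIND_B };

/* If the assertion handler is not known to be noreturn, the compiler
 * assumes control can leave the default case with `v` still unset and
 * may warn about an uninitialized read at the return statement. */
int value_default_last(enum Kind k)
{
    int v;
    switch (k) {
    case KIND_A: v = 1; break;
    case KIND_B: v = 2; break;
    default: assert(0); break;
    }
    return v;
}

/* Workaround: move the asserting case so that it falls through into a
 * case that assigns `v`. Every path out of the switch now sets `v`,
 * whether or not the compiler knows the assertion never returns. */
int value_reordered(enum Kind k)
{
    int v;
    switch (k) {
    case KIND_A: v = 1; break;
    default: assert(0); /* fall through */
    case KIND_B: v = 2; break;
    }
    return v;
}
```

Both functions behave identically for valid inputs; only the second is well defined on every path the compiler can see.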
    • x86: Fix buffer overread in mc put · 69dae683
      Henrik Gramner authored and committed
      For w <= 32 we can't process more than two rows per loop iteration.

      Credit to OSS-Fuzz.
  12. Sep 04, 2019
    • x86: Increase precision of the final inverse ADST transform stages · a9315f5f
      Henrik Gramner authored and committed
      16-bit precision is sufficient for the second pass, but the first pass
      requires 32-bit precision to correctly handle some esoteric edge cases.
    • arm64: itx: Do the final calculation of adst4/adst8/adst16 in 32 bit to avoid too narrow clipping · e2702eaf
      Martin Storsjö authored
      See issue #295, this fixes it for arm64.
      
      Before:                                 Cortex A53      A72      A73
      inv_txfm_add_4x4_adst_adst_1_8bpc_neon:      103.0     63.2     65.2
      inv_txfm_add_4x8_adst_adst_1_8bpc_neon:      197.0    145.0    134.2
      inv_txfm_add_8x8_adst_adst_1_8bpc_neon:      332.0    248.0    247.1
      inv_txfm_add_16x16_adst_adst_2_8bpc_neon:   1676.8   1197.0   1186.8
      After:
      inv_txfm_add_4x4_adst_adst_1_8bpc_neon:      103.0     76.4     67.0
      inv_txfm_add_4x8_adst_adst_1_8bpc_neon:      205.0    155.0    143.8
      inv_txfm_add_8x8_adst_adst_1_8bpc_neon:      358.0    269.0    276.2
      inv_txfm_add_16x16_adst_adst_2_8bpc_neon:   1785.2   1347.8   1312.1
      
      This would probably only be needed for adst in the first pass, but
      the additional code complexity from splitting the implementations
      (as we currently don't have transforms differentiated between first
      and second pass) isn't necessarily worth it (the speedup over C code
      is still 8-10x).
    • Prefer __builtin_unreachable() over __assume() on clang-cl · c0e1988b
      Henrik Gramner authored
      __assume() doesn't work correctly in clang-cl versions prior to 7.0.0
      which causes bogus warnings regarding use of uninitialized variables
      to be printed. Avoid that by using __builtin_unreachable() instead.
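A compiler-selection macro along these lines would give the preference described above. This is a sketch under assumptions, not the macro from dav1d's headers: `ASSUME_UNREACHABLE` and `bytes_per_pixel` are hypothetical names. It relies on clang-cl defining both `__clang__` and `_MSC_VER`, so testing for Clang or GCC first selects the builtin there.

```c
#include <assert.h>

#if defined(__GNUC__) || defined(__clang__)
/* GCC, Clang and clang-cl: use the builtin. */
#define ASSUME_UNREACHABLE() __builtin_unreachable()
#elif defined(_MSC_VER)
/* Plain MSVC: __assume(0) marks the point as unreachable. */
#define ASSUME_UNREACHABLE() __assume(0)
#else
/* Unknown compiler: no hint. */
#define ASSUME_UNREACHABLE() ((void)0)
#endif

/* Example use: the default case can never be taken for valid input. */
int bytes_per_pixel(int bitdepth)
{
    assert(bitdepth == 8 || bitdepth == 10 || bitdepth == 12);
    switch (bitdepth) {
    case 8:
        return 1;
    case 10:
    case 12:
        return 2;
    default:
        ASSUME_UNREACHABLE();
    }
    return 0; /* only reached with the no-hint fallback */
}
```

Note that the macro takes no expression argument, which also sidesteps the separate clang-cl complaint about function calls inside `__assume` statements.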
    • Fix clang-cl assertion warning · 666c71a0
      Henrik Gramner authored
      clang-cl doesn't like function calls in __assume statements, even
      trivial inline ones.
    • arm: Fix assembling with older binutils · e65abadf
      Janne Grunau authored and Martin Storsjö committed
      This large constant needs a movw instruction, which newer binutils can
      figure out, but older versions need it stated explicitly.

      This fixes #296.
  13. Sep 03, 2019
    • TileContext: reorder scratch buffer to avoid conflicts · 863c3731
      Janne Grunau authored
      The chroma part of pal_idx potentially conflicts during intra
      reconstruction with edge_{8,16}bpc. Fixes out of range pixel values
      caused by invalid palette indices in
      clusterfuzz-testcase-minimized-dav1d_fuzzer_mt-5076736684851200.
      Fixes #294. Reported as integer overflows in boxsum5sqr with undefined
      behavior sanitizer. Credits to oss-fuzz.
  14. Sep 01, 2019
  15. Aug 30, 2019
  16. Aug 29, 2019