  1. Jun 18, 2021
  2. Jun 17, 2021
  3. Jun 15, 2021
  4. Jun 12, 2021
    • arm32: filmgrain: Add NEON implementation of fgy and fguv for 16 bpc · ddbbfde1
      Martin Storsjö authored and Jean-Baptiste Kempf committed
      Relative speedup over C code:
                                      Cortex A7     A8     A9    A53    A72    A73
      fguv_32x32xn_16bpc_420_csfl0_neon:   3.47   1.72   2.99   4.18   2.68   6.19
      fguv_32x32xn_16bpc_420_csfl1_neon:   3.24   1.36   2.58   3.78   2.73   5.27
      fguv_32x32xn_16bpc_422_csfl0_neon:   3.57   2.07   3.05   4.32   2.74   6.20
      fguv_32x32xn_16bpc_422_csfl1_neon:   3.33   1.44   2.62   3.89   2.71   5.28
      fguv_32x32xn_16bpc_444_csfl0_neon:   3.48   1.69   3.06   4.48   2.97   6.69
      fguv_32x32xn_16bpc_444_csfl1_neon:   3.06   1.16   2.36   3.85   2.75   5.19
      fgy_32x32xn_16bpc_neon:              2.89   1.05   2.29   3.49   2.49   3.15
      
      Absolute numbers:
                                        Cortex A7       A8       A9      A53      A72      A73
      fguv_32x32xn_16bpc_420_csfl0_neon:   6237.3  12701.0   6687.1   4525.8   3220.8   3195.4
      fguv_32x32xn_16bpc_420_csfl1_neon:   5143.2  11684.8   5926.4   3857.2   2604.7   2556.5
      fguv_32x32xn_16bpc_422_csfl0_neon:   6347.3  11005.2   6797.5   4582.4   3300.4   3250.5
      fguv_32x32xn_16bpc_422_csfl1_neon:   5275.2  11594.8   5992.6   3931.1   2668.7   2607.3
      fguv_32x32xn_16bpc_444_csfl0_neon:   5181.6  11310.0   5575.4   3629.7   2383.8   2530.0
      fguv_32x32xn_16bpc_444_csfl1_neon:   4081.9  10958.8   4868.5   2962.9   1870.3   2034.2
      fgy_32x32xn_16bpc_neon:             15439.1  43129.0  19406.6  11542.3   7463.9   7827.8
      
      Corresponding numbers for arm64:
                                                                  Cortex A53      A72      A73
      fguv_32x32xn_16bpc_420_csfl0_neon:                              4019.2   3247.4   3259.6
      fguv_32x32xn_16bpc_420_csfl1_neon:                              3460.1   2628.7   2640.8
      fguv_32x32xn_16bpc_422_csfl0_neon:                              4034.4   3329.9   3287.5
      fguv_32x32xn_16bpc_422_csfl1_neon:                              3468.3   2749.3   2686.6
      fguv_32x32xn_16bpc_444_csfl0_neon:                              3117.7   2447.4   2539.8
      fguv_32x32xn_16bpc_444_csfl1_neon:                              2641.2   1977.2   2132.8
      fgy_32x32xn_16bpc_neon:                                         9873.5   7605.7   7656.2
  5. Jun 11, 2021
    • Add 10/12-bit deblock SSSE3 implementation · f7043e47
      Ronald S. Bultje authored
      Currently 64-bit only.
    • arm32: filmgrain: Add NEON implementations of fgy and fguv for 8 bpc · c187e704
      Martin Storsjö authored
      Relative speedup over C code:
                                     Cortex A7     A8     A9    A53    A72    A73
      fguv_32x32xn_8bpc_420_csfl0_neon:   4.20   2.19   3.48   4.93   3.60   5.93
      fguv_32x32xn_8bpc_420_csfl1_neon:   3.92   1.52   2.84   4.34   3.82   5.93
      fguv_32x32xn_8bpc_422_csfl0_neon:   4.27   2.13   3.58   5.02   4.04   5.95
      fguv_32x32xn_8bpc_422_csfl1_neon:   3.99   1.56   2.91   4.43   3.89   6.00
      fguv_32x32xn_8bpc_444_csfl0_neon:   4.48   2.08   3.89   5.66   4.07   6.51
      fguv_32x32xn_8bpc_444_csfl1_neon:   4.45   1.41   2.99   5.28   3.63   6.09
      fgy_32x32xn_8bpc_neon:              3.61   1.10   2.62   4.35   3.06   3.74
      
      Absolute numbers:
                                       Cortex A7       A8       A9      A53      A72      A73
      fguv_32x32xn_8bpc_420_csfl0_neon:   5318.8  11167.7   6024.6   3909.9   2945.2   2993.5
      fguv_32x32xn_8bpc_420_csfl1_neon:   4351.0  10929.7   5269.5   3316.8   2166.5   2256.9
      fguv_32x32xn_8bpc_422_csfl0_neon:   5387.9  11746.7   6080.0   3945.8   2988.1   3046.3
      fguv_32x32xn_8bpc_422_csfl1_neon:   4396.0  11083.2   5300.8   3354.9   2216.4   2291.4
      fguv_32x32xn_8bpc_444_csfl0_neon:   4347.9  10595.0   5134.4   3079.1   2277.7   2392.9
      fguv_32x32xn_8bpc_444_csfl1_neon:   3295.0  10518.2   4442.6   2476.3   1716.3   1829.2
      fgy_32x32xn_8bpc_neon:             12376.2  41046.9  17259.7   9153.1   6610.4   7005.3
      
      Corresponding numbers for arm64:
                                                                 Cortex A53      A72      A73
      fguv_32x32xn_8bpc_420_csfl0_neon:                              3822.9   2920.0   2935.7
      fguv_32x32xn_8bpc_420_csfl1_neon:                              3209.7   2231.7   2335.4
      fguv_32x32xn_8bpc_422_csfl0_neon:                              3807.9   2886.5   2966.7
      fguv_32x32xn_8bpc_422_csfl1_neon:                              3197.1   2187.9   2355.9
      fguv_32x32xn_8bpc_444_csfl0_neon:                              2757.8   2227.4   2334.4
      fguv_32x32xn_8bpc_444_csfl1_neon:                              2244.6   1719.1   1786.7
      fgy_32x32xn_8bpc_neon:                                         8192.2   6563.3   6969.1
  6. Jun 10, 2021
  7. Jun 09, 2021
  8. Jun 07, 2021
  9. Jun 05, 2021
  10. Jun 04, 2021
  11. Jun 02, 2021
    • checkasm: Remove an unused variable/parameter · 90dad3ee
      Martin Storsjö authored
      Clang 13 got support for warning about variables that are set but
      not used. We disable warnings for unused parameters, but in this case,
      the parameter variable is updated within the function too, which
      Clang warns about.
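
The pattern described above can be illustrated with a minimal C sketch. The function and parameter names here are hypothetical and are not taken from checkasm; the warning is assumed to be the one Clang 13 reports as `-Wunused-but-set-parameter`. A parameter that is only written to does not count as unused for the purposes of `-Wunused-parameter`, so disabling that warning does not silence this case.

```c
#include <stddef.h>

/* Hypothetical "before": `last` is assigned inside the function, but its
 * value is never read, so the parameter is set but not used. */
int sum_before(const int *buf, size_t n, int last)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += buf[i];
        last = buf[i];  /* written, never read */
    }
    return sum;
}

/* Hypothetical "after": the dead parameter and its updates are removed.
 * The observable behaviour is unchanged. */
int sum_after(const int *buf, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}
```

Removing the parameter outright, rather than casting it to void, also drops the pointless stores inside the loop.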
  12. May 31, 2021
  13. May 27, 2021
  14. May 25, 2021
    • arm64: filmgrain: Fix overflows in gen_grain · c389d895
      Martin Storsjö authored
      After multiplying two int8_t, the maximum possible output is
      -128*-128 = 16384. The sum of two such values (32768) does not fit
      in an int16_t, even though the sum of any other pair of int8_t
      products does.
      
      Previously the summing used 16 bit intermediates for the sum of two
      products and only lengthened the result to 32 bit when accumulating
      three or more products.
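
The overflow can be reproduced in scalar C. This is only a sketch of the arithmetic, not the NEON assembly from the commit: the function names are hypothetical, and `wrap16` is an assumption that models a 16-bit lane wrapping modulo 2^16.

```c
#include <stdint.h>

/* Wrap a value to 16 bits the way a 16-bit two's complement lane would.
 * Written out explicitly so the result does not depend on the compiler. */
static int16_t wrap16(int32_t v)
{
    uint16_t u = (uint16_t)v;  /* reduce modulo 2^16, well defined */
    return (int16_t)(u < 0x8000 ? (int32_t)u : (int32_t)u - 0x10000);
}

/* Hypothetical "before": two products summed in a 16-bit intermediate. */
int16_t sum2_narrow(int8_t a, int8_t b, int8_t c, int8_t d)
{
    return wrap16((int32_t)a * b + (int32_t)c * d);
}

/* Hypothetical "after": products widened to 32 bit before summing. */
int32_t sum2_wide(int8_t a, int8_t b, int8_t c, int8_t d)
{
    return (int32_t)a * b + (int32_t)c * d;
}
```

With all four inputs at -128, `sum2_wide` returns 32768 while `sum2_narrow` wraps to -32768. For every other combination of inputs the sum lies between -32512 and 32640, so the two functions agree.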
      
      Before:                    Cortex A53       A72       A73   Apple M1
      gen_grain_y_ar1_8bpc_neon:   112598.5   71309.2   74889.8   372.2
      gen_grain_y_ar2_8bpc_neon:   139932.4   91442.3   95788.4   387.3
      gen_grain_y_ar3_8bpc_neon:   185607.6  115691.6  131655.8   403.0
      After:
      gen_grain_y_ar1_8bpc_neon:   112968.8   71897.9   76171.2   371.2
      gen_grain_y_ar2_8bpc_neon:   142768.8   94517.9   97934.4   387.5
      gen_grain_y_ar3_8bpc_neon:   191625.2  121083.0  135975.3   405.6
  15. May 18, 2021
  16. May 16, 2021
  17. May 14, 2021
  18. May 13, 2021
  19. May 12, 2021
  20. May 11, 2021
  21. May 10, 2021