  1. Oct 12, 2024
  2. Oct 09, 2024
    • riscv64/mc: warp_8x8 and warp_8x8t 8bpc · b2e7f06c
      Bogdan Gligorijević authored
      Benchmarks:
      - Kendryte K230:
      warp_8x8_8bpc_c:      4549.7 ( 1.00x)
      warp_8x8_8bpc_rvv:    2504.7 ( 1.82x)
      warp_8x8t_8bpc_c:     4414.7 ( 1.00x)
      warp_8x8t_8bpc_rvv:   2465.7 ( 1.79x)
      
      - Banana Pi BPI-F3:
      warp_8x8_8bpc_c:      4431.2 ( 1.00x)
      warp_8x8_8bpc_rvv:    3297.4 ( 1.34x)
      warp_8x8t_8bpc_c:     4299.3 ( 1.00x)
      warp_8x8t_8bpc_rvv:   3255.7 ( 1.32x)
    • riscv64/mc: Re-order instructions · 56f6d166
      Niklas Haas authored
      To avoid read-after-write hazards. The speedup is about 1% for width=4 on a K230.
    • riscv64/mc: Add bidir functions · 3d12677c
      Niklas Haas authored
      This code compromises between the performance of a dedicated kernel per
      VLEN/width pair and the flexibility of a fully VLEN-dynamic loop by
      using a single special case for w=4 and subdividing the rest into an
      unrolled four-line fast path and a general-purpose slow path (for
      large widths on small VLEN).
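
The three-way split described above can be sketched in Python. This is only a scalar model under stated assumptions, not the RVV assembly: `avg_px`, `choose_path`, `avg`, and the `vlmax` parameter (elements one vector operation can cover) are hypothetical names, the 8 bpc rounding `(a + b + 16) >> 5` is assumed, rows are assumed to be packed with stride `w`, and what the real w=4 kernel does internally is not stated in the commit message.

```python
def avg_px(a, b):
    # Assumed 8 bpc rounding of two intermediate samples, clamped to a byte.
    return max(0, min(255, (a + b + 16) >> 5))


def choose_path(w, vlmax):
    # Hypothetical dispatch: one special case, then fast or slow by row fit.
    if w == 4:
        return "w4"
    if w <= vlmax:
        return "fast"   # a whole row fits in one vector operation
    return "slow"       # large width on small VLEN


def avg(tmp1, tmp2, w, h, vlmax):
    path = choose_path(w, vlmax)
    dst = [0] * (w * h)
    if path == "w4":
        # Assumption: packed w=4 rows are handled as one flat run.
        for i in range(w * h):
            dst[i] = avg_px(tmp1[i], tmp2[i])
    elif path == "fast":
        assert h % 4 == 0  # assumption: height is a multiple of four
        for y in range(0, h, 4):
            for r in range(4):  # stands in for four unrolled row operations
                o = (y + r) * w
                dst[o:o + w] = [avg_px(a, b)
                                for a, b in zip(tmp1[o:o + w], tmp2[o:o + w])]
    else:
        for y in range(h):
            x = 0
            while x < w:  # strip-mine each row in chunks of at most vlmax
                vl = min(vlmax, w - x)
                o = y * w + x
                dst[o:o + vl] = [avg_px(a, b)
                                 for a, b in zip(tmp1[o:o + vl], tmp2[o:o + vl])]
                x += vl
    return path, dst
```

All three paths produce the same pixels; they differ only in how much loop overhead is paid per row, which is the trade-off the commit message describes.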
      
      Kendryte K230
      
      avg_w4_8bpc_c:          346.8 ( 1.00x)
      avg_w4_8bpc_rvv:         50.3 ( 6.90x)
      avg_w8_8bpc_c:         1054.9 ( 1.00x)
      avg_w8_8bpc_rvv:        139.1 ( 7.58x)
      avg_w16_8bpc_c:        3396.3 ( 1.00x)
      avg_w16_8bpc_rvv:       350.6 ( 9.69x)
      avg_w32_8bpc_c:       13734.3 ( 1.00x)
      avg_w32_8bpc_rvv:      1226.3 (11.20x)
      avg_w64_8bpc_c:       33260.9 ( 1.00x)
      avg_w64_8bpc_rvv:      3869.4 ( 8.60x)
      avg_w128_8bpc_c:      83441.3 ( 1.00x)
      avg_w128_8bpc_rvv:     9765.1 ( 8.54x)
      
      w_avg_w4_8bpc_c:        444.3 ( 1.00x)
      w_avg_w4_8bpc_rvv:       75.8 ( 5.86x)
      w_avg_w8_8bpc_c:       1365.6 ( 1.00x)
      w_avg_w8_8bpc_rvv:      208.8 ( 6.54x)
      w_avg_w16_8bpc_c:      4420.8 ( 1.00x)
      w_avg_w16_8bpc_rvv:     570.7 ( 7.75x)
      w_avg_w32_8bpc_c:     18010.9 ( 1.00x)
      w_avg_w32_8bpc_rvv:    2074.4 ( 8.68x)
      w_avg_w64_8bpc_c:     43050.4 ( 1.00x)
      w_avg_w64_8bpc_rvv:    5799.5 ( 7.42x)
      w_avg_w128_8bpc_c:   107153.6 ( 1.00x)
      w_avg_w128_8bpc_rvv:  14272.0 ( 7.51x)
      
      mask_w4_8bpc_c:        497.6 ( 1.00x)
      mask_w4_8bpc_rvv:       88.5 ( 5.63x)
      mask_w8_8bpc_c:       1528.5 ( 1.00x)
      mask_w8_8bpc_rvv:      253.1 ( 6.04x)
      mask_w16_8bpc_c:      4953.8 ( 1.00x)
      mask_w16_8bpc_rvv:     679.0 ( 7.30x)
      mask_w32_8bpc_c:     20298.3 ( 1.00x)
      mask_w32_8bpc_rvv:    3012.9 ( 6.74x)
      mask_w64_8bpc_c:     49718.8 ( 1.00x)
      mask_w64_8bpc_rvv:    7291.7 ( 6.82x)
      mask_w128_8bpc_c:   126740.3 ( 1.00x)
      mask_w128_8bpc_rvv:  18351.1 ( 6.91x)
    • riscv: Add $vtype helper definitions · 50ac8260
      Niklas Haas authored
    • riscv64/mc: Branchless vsetvl in blend_v function · cc7d8773
      Nathan E. Egge authored
      Kendryte K230
      
      blend_v_w2_8bpc_c:       221.4 ( 1.00x)
      blend_v_w2_8bpc_rvv:     147.7 ( 1.50x)
      blend_v_w4_8bpc_c:       945.3 ( 1.00x)
      blend_v_w4_8bpc_rvv:     243.3 ( 3.89x)
      blend_v_w8_8bpc_c:      1786.9 ( 1.00x)
      blend_v_w8_8bpc_rvv:     256.1 ( 6.98x)
      blend_v_w16_8bpc_c:     3472.1 ( 1.00x)
      blend_v_w16_8bpc_rvv:    351.1 ( 9.89x)
      blend_v_w32_8bpc_c:     6832.1 ( 1.00x)
      blend_v_w32_8bpc_rvv:    635.4 (10.75x)
      
      SpacemiT K1
      
      blend_v_w2_8bpc_c:       218.0 ( 1.00x)
      blend_v_w2_8bpc_rvv:     144.3 ( 1.51x)
      blend_v_w4_8bpc_c:       921.7 ( 1.00x)
      blend_v_w4_8bpc_rvv:     237.1 ( 3.89x)
      blend_v_w8_8bpc_c:      1739.8 ( 1.00x)
      blend_v_w8_8bpc_rvv:     237.4 ( 7.33x)
      blend_v_w16_8bpc_c:     3376.6 ( 1.00x)
      blend_v_w16_8bpc_rvv:    296.3 (11.40x)
      blend_v_w32_8bpc_c:     6647.2 ( 1.00x)
      blend_v_w32_8bpc_rvv:    408.1 (16.29x)
    • riscv64/mc: Branchless vsetvl in blend_h function · 2da8107e
      Nathan E. Egge authored
      Kendryte K230
      
      blend_h_w2_8bpc_c:        165.9 ( 1.00x)
      blend_h_w2_8bpc_rvv:       83.8 ( 1.98x)
      blend_h_w4_8bpc_c:        295.2 ( 1.00x)
      blend_h_w4_8bpc_rvv:       83.8 ( 3.52x)
      blend_h_w8_8bpc_c:        557.9 ( 1.00x)
      blend_h_w8_8bpc_rvv:       92.5 ( 6.03x)
      blend_h_w16_8bpc_c:      1078.8 ( 1.00x)
      blend_h_w16_8bpc_rvv:     117.3 ( 9.19x)
      blend_h_w32_8bpc_c:      2117.8 ( 1.00x)
      blend_h_w32_8bpc_rvv:     200.5 (10.57x)
      blend_h_w64_8bpc_c:      4194.7 ( 1.00x)
      blend_h_w64_8bpc_rvv:     363.2 (11.55x)
      blend_h_w128_8bpc_c:    10271.4 ( 1.00x)
      blend_h_w128_8bpc_rvv:    844.5 (12.16x)
      
      SpacemiT K1
      
      blend_h_w2_8bpc_c:        162.5 ( 1.00x)
      blend_h_w2_8bpc_rvv:       83.9 ( 1.94x)
      blend_h_w4_8bpc_c:        288.6 ( 1.00x)
      blend_h_w4_8bpc_rvv:       83.7 ( 3.45x)
      blend_h_w8_8bpc_c:        544.7 ( 1.00x)
      blend_h_w8_8bpc_rvv:       84.0 ( 6.48x)
      blend_h_w16_8bpc_c:      1052.8 ( 1.00x)
      blend_h_w16_8bpc_rvv:     102.9 (10.23x)
      blend_h_w32_8bpc_c:      2068.0 ( 1.00x)
      blend_h_w32_8bpc_rvv:     131.4 (15.73x)
      blend_h_w64_8bpc_c:      4093.7 ( 1.00x)
      blend_h_w64_8bpc_rvv:     220.3 (18.58x)
      blend_h_w128_8bpc_c:    10023.1 ( 1.00x)
      blend_h_w128_8bpc_rvv:    467.3 (21.45x)
    • riscv64/mc: Branchless vsetvl in blend function · b374b24c
      Nathan E. Egge authored
      Kendryte K230
      
      blend_w4_8bpc_c:       204.8 ( 1.00x)
      blend_w4_8bpc_rvv:      59.8 ( 3.42x)
      blend_w8_8bpc_c:       608.9 ( 1.00x)
      blend_w8_8bpc_rvv:      87.2 ( 6.98x)
      blend_w16_8bpc_c:     2362.4 ( 1.00x)
      blend_w16_8bpc_rvv:    225.2 (10.49x)
      blend_w32_8bpc_c:     5990.4 ( 1.00x)
      blend_w32_8bpc_rvv:    518.3 (11.56x)
      
      SpacemiT K1
      
      blend_w4_8bpc_c:       201.6 ( 1.00x)
      blend_w4_8bpc_rvv:      58.0 ( 3.48x)
      blend_w8_8bpc_c:       595.1 ( 1.00x)
      blend_w8_8bpc_rvv:      82.1 ( 7.25x)
      blend_w16_8bpc_c:     2308.8 ( 1.00x)
      blend_w16_8bpc_rvv:    189.0 (12.22x)
      blend_w32_8bpc_c:     5853.1 ( 1.00x)
      blend_w32_8bpc_rvv:    339.5 (17.24x)
    • riscv64/mc: Add VLEN=256 8bpc RVV blend_v function · 0e3f70e8
      Nathan E. Egge authored
      SpacemiT K1
      
      blend_v_w2_8bpc_c:       217.0 ( 1.00x)
      blend_v_w2_8bpc_rvv:     143.3 ( 1.51x)
      blend_v_w4_8bpc_c:       921.6 ( 1.00x)
      blend_v_w4_8bpc_rvv:     236.3 ( 3.90x)
      blend_v_w8_8bpc_c:      1738.2 ( 1.00x)
      blend_v_w8_8bpc_rvv:     238.1 ( 7.30x)
      blend_v_w16_8bpc_c:     3376.1 ( 1.00x)
      blend_v_w16_8bpc_rvv:    298.0 (11.33x)
      blend_v_w32_8bpc_c:     6648.0 ( 1.00x)
      blend_v_w32_8bpc_rvv:    409.5 (16.24x)
    • riscv64/mc: Add VLEN=256 8bpc RVV blend_h function · a5b95448
      Nathan E. Egge authored
      SpacemiT K1
      
      blend_h_w2_8bpc_c:        161.8 ( 1.00x)
      blend_h_w2_8bpc_rvv:       83.5 ( 1.94x)
      blend_h_w4_8bpc_c:        288.4 ( 1.00x)
      blend_h_w4_8bpc_rvv:       83.7 ( 3.45x)
      blend_h_w8_8bpc_c:        543.9 ( 1.00x)
      blend_h_w8_8bpc_rvv:       84.5 ( 6.44x)
      blend_h_w16_8bpc_c:      1051.6 ( 1.00x)
      blend_h_w16_8bpc_rvv:     103.8 (10.13x)
      blend_h_w32_8bpc_c:      2066.0 ( 1.00x)
      blend_h_w32_8bpc_rvv:     133.8 (15.44x)
      blend_h_w64_8bpc_c:      4092.7 ( 1.00x)
      blend_h_w64_8bpc_rvv:     225.2 (18.18x)
      blend_h_w128_8bpc_c:    10011.3 ( 1.00x)
      blend_h_w128_8bpc_rvv:    474.7 (21.09x)
    • riscv64/mc: Add VLEN=256 8bpc RVV blend function · 83485c50
      Nathan E. Egge authored
      SpacemiT K1
      
      blend_w4_8bpc_c:       201.3 ( 1.00x)
      blend_w4_8bpc_rvv:      59.3 ( 3.40x)
      blend_w8_8bpc_c:       595.1 ( 1.00x)
      blend_w8_8bpc_rvv:      84.1 ( 7.07x)
      blend_w16_8bpc_c:     2309.0 ( 1.00x)
      blend_w16_8bpc_rvv:    190.5 (12.12x)
      blend_w32_8bpc_c:     5854.7 ( 1.00x)
      blend_w32_8bpc_rvv:    341.6 (17.14x)
    • Nathan E. Egge's avatar
      7f2bb2fb
    • riscv64/mc: Add 8bpc RVV blend_v function · 01da36eb
      Nathan E. Egge authored
      Kendryte K230
      
      blend_v_w2_8bpc_c:       219.6 ( 1.00x)
      blend_v_w2_8bpc_rvv:     141.8 ( 1.55x)
      blend_v_w4_8bpc_c:       942.9 ( 1.00x)
      blend_v_w4_8bpc_rvv:     240.9 ( 3.91x)
      blend_v_w8_8bpc_c:      1783.5 ( 1.00x)
      blend_v_w8_8bpc_rvv:     254.7 ( 7.00x)
      blend_v_w16_8bpc_c:     3466.5 ( 1.00x)
      blend_v_w16_8bpc_rvv:    350.5 ( 9.89x)
      blend_v_w32_8bpc_c:     6825.2 ( 1.00x)
      blend_v_w32_8bpc_rvv:    635.1 (10.75x)
    • riscv64/mc: Add 8bpc RVV blend_h function · d3a94f11
      Nathan E. Egge authored
      Kendryte K230
      
      blend_h_w2_8bpc_c:        165.4 ( 1.00x)
      blend_h_w2_8bpc_rvv:       79.4 ( 2.08x)
      blend_h_w4_8bpc_c:        294.6 ( 1.00x)
      blend_h_w4_8bpc_rvv:       81.5 ( 3.61x)
      blend_h_w8_8bpc_c:        556.9 ( 1.00x)
      blend_h_w8_8bpc_rvv:       90.2 ( 6.17x)
      blend_h_w16_8bpc_c:      1077.6 ( 1.00x)
      blend_h_w16_8bpc_rvv:     116.1 ( 9.29x)
      blend_h_w32_8bpc_c:      2116.2 ( 1.00x)
      blend_h_w32_8bpc_rvv:     200.5 (10.55x)
      blend_h_w64_8bpc_c:      4191.8 ( 1.00x)
      blend_h_w64_8bpc_rvv:     363.3 (11.54x)
      blend_h_w128_8bpc_c:    10264.6 ( 1.00x)
      blend_h_w128_8bpc_rvv:    844.1 (12.16x)
    • riscv64/mc: Add 8bpc RVV blend function · f851fcd0
      Nathan E. Egge authored
      Kendryte K230
      
      blend_w4_8bpc_c:       204.5 ( 1.00x)
      blend_w4_8bpc_rvv:      56.4 ( 3.62x)
      blend_w8_8bpc_c:       608.6 ( 1.00x)
      blend_w8_8bpc_rvv:      87.3 ( 6.97x)
      blend_w16_8bpc_c:     2363.8 ( 1.00x)
      blend_w16_8bpc_rvv:    225.1 (10.50x)
      blend_w32_8bpc_c:     5990.3 ( 1.00x)
      blend_w32_8bpc_rvv:    518.8 (11.55x)
    • Tone down loop to only 2 iterations · 848c5a2d
      Bogdan Gligorijević authored
      Benchmark pending
    • Scalar dc calculation · a0a08d85
      Bogdan Gligorijević authored
      Current benchmark:
      
      - Kendryte K230:
      inv_txfm_add_16x16_dct_dct_0_8bpc_c:     1729.4 ( 1.00x)
      inv_txfm_add_16x16_dct_dct_0_8bpc_rvv:    153.2 (11.29x)
      
      - SpacemiT K1:
      inv_txfm_add_16x16_dct_dct_0_8bpc_c:     1533.4 ( 1.00x)
      inv_txfm_add_16x16_dct_dct_0_8bpc_rvv:    176.8 ( 8.67x)
    • riscv64/itx: Special case 16x16 8bpc dct_dct eob=0 · c8749f06
      Bogdan Gligorijević authored
      Performance comparison:
      
      - SpacemiT K1:                             Master branch:       itx_16x16:
        inv_txfm_add_16x16_dct_dct_0_8bpc_c:     1534.1 ( 1.00x)      1534.9 ( 1.00x)
        inv_txfm_add_16x16_dct_dct_0_8bpc_rvv:   1173.6 ( 1.31x)       173.1 ( 8.87x)
      
      - Kendryte K230:                           Master branch:       itx_16x16:
        inv_txfm_add_16x16_dct_dct_0_8bpc_c:     1576.0 ( 1.00x)      1579.1 ( 1.00x)
        inv_txfm_add_16x16_dct_dct_0_8bpc_rvv:   1095.5 ( 1.44x)       146.8 (10.75x)
    • ipred_paeth · 0cdf1b4b
      Bogdan Gligorijević authored
      Benchmarks:
      - Kendryte K230:
      intra_pred_paeth_w4_8bpc_c:       412.9 ( 1.00x)
      intra_pred_paeth_w4_8bpc_rvv:     688.0 ( 0.60x)
      intra_pred_paeth_w8_8bpc_c:      1206.6 ( 1.00x)
      intra_pred_paeth_w8_8bpc_rvv:    1094.3 ( 1.10x)
      intra_pred_paeth_w16_8bpc_c:     3889.7 ( 1.00x)
      intra_pred_paeth_w16_8bpc_rvv:   1796.7 ( 2.16x)
      intra_pred_paeth_w32_8bpc_c:     9797.2 ( 1.00x)
      intra_pred_paeth_w32_8bpc_rvv:   4323.9 ( 2.27x)
      intra_pred_paeth_w64_8bpc_c:    24242.5 ( 1.00x)
      intra_pred_paeth_w64_8bpc_rvv:  10739.8 ( 2.26x)
      
      - Banana Pi BPI-F3:
      intra_pred_paeth_w4_8bpc_c:       395.1 ( 1.00x)
      intra_pred_paeth_w4_8bpc_rvv:     705.4 ( 0.56x)
      intra_pred_paeth_w8_8bpc_c:      1184.9 ( 1.00x)
      intra_pred_paeth_w8_8bpc_rvv:    1125.3 ( 1.05x)
      intra_pred_paeth_w16_8bpc_c:     3807.8 ( 1.00x)
      intra_pred_paeth_w16_8bpc_rvv:   1850.8 ( 2.06x)
      intra_pred_paeth_w32_8bpc_c:     9985.1 ( 1.00x)
      intra_pred_paeth_w32_8bpc_rvv:   2235.5 ( 4.47x)
      intra_pred_paeth_w64_8bpc_c:    24040.4 ( 1.00x)
      intra_pred_paeth_w64_8bpc_rvv:   5450.0 ( 4.41x)
    • pal_pred · b830ac82
      Bogdan Gligorijević authored
      Benchmarks:
      
      - Kendryte K230:
      pal_pred_w4_8bpc_c:        115.6 ( 1.00x)
      pal_pred_w4_8bpc_rvv:      331.4 ( 0.35x)
      pal_pred_w4_16bpc_c:       140.8 ( 1.00x)
      pal_pred_w4_16bpc_rvv:     247.9 ( 0.57x)
      pal_pred_w8_8bpc_c:        334.9 ( 1.00x)
      pal_pred_w8_8bpc_rvv:      520.8 ( 0.64x)
      pal_pred_w8_16bpc_c:       412.7 ( 1.00x)
      pal_pred_w8_16bpc_rvv:     386.2 ( 1.07x)
      pal_pred_w16_8bpc_c:      1044.4 ( 1.00x)
      pal_pred_w16_8bpc_rvv:     842.8 ( 1.24x)
      pal_pred_w16_16bpc_c:     1300.3 ( 1.00x)
      pal_pred_w16_16bpc_rvv:    619.9 ( 2.10x)
      pal_pred_w32_8bpc_c:      2452.8 ( 1.00x)
      pal_pred_w32_8bpc_rvv:    1016.1 ( 2.41x)
      pal_pred_w32_16bpc_c:     3072.1 ( 1.00x)
      pal_pred_w32_16bpc_rvv:   1440.5 ( 2.13x)
      pal_pred_w64_8bpc_c:      6015.8 ( 1.00x)
      pal_pred_w64_8bpc_rvv:    2505.5 ( 2.40x)
      pal_pred_w64_16bpc_c:     7552.4 ( 1.00x)
      pal_pred_w64_16bpc_rvv:   3512.7 ( 2.15x)
      
      - Banana Pi BPI-F3:
      pal_pred_w4_8bpc_c:        102.2 ( 1.00x)
      pal_pred_w4_8bpc_rvv:      511.2 ( 0.20x)
      pal_pred_w4_16bpc_c:       137.7 ( 1.00x)
      pal_pred_w4_16bpc_rvv:     330.9 ( 0.42x)
      pal_pred_w8_8bpc_c:        289.2 ( 1.00x)
      pal_pred_w8_8bpc_rvv:      819.6 ( 0.35x)
      pal_pred_w8_16bpc_c:       402.6 ( 1.00x)
      pal_pred_w8_16bpc_rvv:     520.7 ( 0.77x)
      pal_pred_w16_8bpc_c:       894.5 ( 1.00x)
      pal_pred_w16_8bpc_rvv:    1326.6 ( 0.67x)
      pal_pred_w16_16bpc_c:     1268.6 ( 1.00x)
      pal_pred_w16_16bpc_rvv:    845.8 ( 1.50x)
      pal_pred_w32_8bpc_c:      2094.5 ( 1.00x)
      pal_pred_w32_8bpc_rvv:    1610.9 ( 1.30x)
      pal_pred_w32_16bpc_c:     2999.4 ( 1.00x)
      pal_pred_w32_16bpc_rvv:   1029.8 ( 2.91x)
      pal_pred_w64_8bpc_c:      5128.0 ( 1.00x)
      pal_pred_w64_8bpc_rvv:    2000.8 ( 2.56x)
      pal_pred_w64_16bpc_c:     7375.0 ( 1.00x)
      pal_pred_w64_16bpc_rvv:   2518.2 ( 2.93x)
    • ipred_smooth · 44541dfa
      Bogdan Gligorijević authored
      Benchmarks:
      - Kendryte K230:
      intra_pred_smooth_w4_8bpc_c:        392.6 ( 1.00x)
      intra_pred_smooth_w4_8bpc_rvv:      311.3 ( 1.26x)
      intra_pred_smooth_w8_8bpc_c:       1204.1 ( 1.00x)
      intra_pred_smooth_w8_8bpc_rvv:      488.9 ( 2.46x)
      intra_pred_smooth_w16_8bpc_c:      3885.9 ( 1.00x)
      intra_pred_smooth_w16_8bpc_rvv:     796.6 ( 4.88x)
      intra_pred_smooth_w32_8bpc_c:      9305.7 ( 1.00x)
      intra_pred_smooth_w32_8bpc_rvv:    1806.7 ( 5.15x)
      intra_pred_smooth_w64_8bpc_c:     23043.0 ( 1.00x)
      intra_pred_smooth_w64_8bpc_rvv:    4344.3 ( 5.30x)
      
      - SpacemiT K1:
      intra_pred_smooth_w4_8bpc_c:        384.1 ( 1.00x)
      intra_pred_smooth_w4_8bpc_rvv:      322.2 ( 1.19x)
      intra_pred_smooth_w8_8bpc_c:       1177.6 ( 1.00x)
      intra_pred_smooth_w8_8bpc_rvv:      507.1 ( 2.32x)
      intra_pred_smooth_w16_8bpc_c:      3801.2 ( 1.00x)
      intra_pred_smooth_w16_8bpc_rvv:     814.4 ( 4.67x)
      intra_pred_smooth_w32_8bpc_c:      9103.1 ( 1.00x)
      intra_pred_smooth_w32_8bpc_rvv:     980.8 ( 9.28x)
      intra_pred_smooth_w64_8bpc_c:     22540.1 ( 1.00x)
      intra_pred_smooth_w64_8bpc_rvv:    2319.3 ( 9.72x)
    • ipred cfl functions · d711f974
      Bogdan Gligorijević authored
      Benchmarks:
      
      - Kendryte K230:
      cfl_pred_cfl_128_w4_8bpc_c:         497.3 ( 1.00x)
      cfl_pred_cfl_128_w4_8bpc_rvv:       369.6 ( 1.35x)
      cfl_pred_cfl_128_w4_16bpc_c:        425.2 ( 1.00x)
      cfl_pred_cfl_128_w4_16bpc_rvv:      385.5 ( 1.10x)
      cfl_pred_cfl_128_w8_8bpc_c:        1544.2 ( 1.00x)
      cfl_pred_cfl_128_w8_8bpc_rvv:       584.2 ( 2.64x)
      cfl_pred_cfl_128_w8_16bpc_c:       1306.2 ( 1.00x)
      cfl_pred_cfl_128_w8_16bpc_rvv:      608.8 ( 2.15x)
      cfl_pred_cfl_128_w16_8bpc_c:       3085.6 ( 1.00x)
      cfl_pred_cfl_128_w16_8bpc_rvv:      584.2 ( 5.28x)
      cfl_pred_cfl_128_w16_16bpc_c:      2657.1 ( 1.00x)
      cfl_pred_cfl_128_w16_16bpc_rvv:     608.9 ( 4.36x)
      cfl_pred_cfl_128_w32_8bpc_c:       8405.6 ( 1.00x)
      cfl_pred_cfl_128_w32_8bpc_rvv:     1416.1 ( 5.94x)
      cfl_pred_cfl_128_w32_16bpc_c:      7199.9 ( 1.00x)
      cfl_pred_cfl_128_w32_16bpc_rvv:    1479.8 ( 4.87x)
      cfl_pred_cfl_left_w4_8bpc_c:        553.1 ( 1.00x)
      cfl_pred_cfl_left_w4_8bpc_rvv:      395.6 ( 1.40x)
      cfl_pred_cfl_left_w4_16bpc_c:       486.7 ( 1.00x)
      cfl_pred_cfl_left_w4_16bpc_rvv:     409.1 ( 1.19x)
      cfl_pred_cfl_left_w8_8bpc_c:       1610.8 ( 1.00x)
      cfl_pred_cfl_left_w8_8bpc_rvv:      610.4 ( 2.64x)
      cfl_pred_cfl_left_w8_16bpc_c:      1378.0 ( 1.00x)
      cfl_pred_cfl_left_w8_16bpc_rvv:     636.2 ( 2.17x)
      cfl_pred_cfl_left_w16_8bpc_c:      3154.4 ( 1.00x)
      cfl_pred_cfl_left_w16_8bpc_rvv:     610.4 ( 5.17x)
      cfl_pred_cfl_left_w16_16bpc_c:     2733.2 ( 1.00x)
      cfl_pred_cfl_left_w16_16bpc_rvv:    636.3 ( 4.30x)
      cfl_pred_cfl_left_w32_8bpc_c:      8451.7 ( 1.00x)
      cfl_pred_cfl_left_w32_8bpc_rvv:    1442.5 ( 5.86x)
      cfl_pred_cfl_left_w32_16bpc_c:     7267.2 ( 1.00x)
      cfl_pred_cfl_left_w32_16bpc_rvv:   1509.4 ( 4.81x)
      cfl_pred_cfl_top_w4_8bpc_c:         544.7 ( 1.00x)
      cfl_pred_cfl_top_w4_8bpc_rvv:       395.8 ( 1.38x)
      cfl_pred_cfl_top_w4_16bpc_c:        475.1 ( 1.00x)
      cfl_pred_cfl_top_w4_16bpc_rvv:      406.7 ( 1.17x)
      cfl_pred_cfl_top_w8_8bpc_c:        1599.3 ( 1.00x)
      cfl_pred_cfl_top_w8_8bpc_rvv:       610.4 ( 2.62x)
      cfl_pred_cfl_top_w8_16bpc_c:       1363.8 ( 1.00x)
      cfl_pred_cfl_top_w8_16bpc_rvv:      630.3 ( 2.16x)
      cfl_pred_cfl_top_w16_8bpc_c:       3161.0 ( 1.00x)
      cfl_pred_cfl_top_w16_8bpc_rvv:      610.5 ( 5.18x)
      cfl_pred_cfl_top_w16_16bpc_c:      2735.9 ( 1.00x)
      cfl_pred_cfl_top_w16_16bpc_rvv:     634.3 ( 4.31x)
      cfl_pred_cfl_top_w32_8bpc_c:       8564.4 ( 1.00x)
      cfl_pred_cfl_top_w32_8bpc_rvv:     1442.8 ( 5.94x)
      cfl_pred_cfl_top_w32_16bpc_c:      7294.9 ( 1.00x)
      cfl_pred_cfl_top_w32_16bpc_rvv:    1511.5 ( 4.83x)
      cfl_pred_cfl_w4_8bpc_c:             571.5 ( 1.00x)
      cfl_pred_cfl_w4_8bpc_rvv:           421.0 ( 1.36x)
      cfl_pred_cfl_w4_16bpc_c:            499.1 ( 1.00x)
      cfl_pred_cfl_w4_16bpc_rvv:          462.8 ( 1.08x)
      cfl_pred_cfl_w8_8bpc_c:            1642.0 ( 1.00x)
      cfl_pred_cfl_w8_8bpc_rvv:           635.8 ( 2.58x)
      cfl_pred_cfl_w8_16bpc_c:           1401.4 ( 1.00x)
      cfl_pred_cfl_w8_16bpc_rvv:          686.1 ( 2.04x)
      cfl_pred_cfl_w16_8bpc_c:           3204.3 ( 1.00x)
      cfl_pred_cfl_w16_8bpc_rvv:          635.8 ( 5.04x)
      cfl_pred_cfl_w16_16bpc_c:          2784.8 ( 1.00x)
      cfl_pred_cfl_w16_16bpc_rvv:         686.1 ( 4.06x)
      cfl_pred_cfl_w32_8bpc_c:           8623.9 ( 1.00x)
      cfl_pred_cfl_w32_8bpc_rvv:         1465.9 ( 5.88x)
      cfl_pred_cfl_w32_16bpc_c:          7357.8 ( 1.00x)
      cfl_pred_cfl_w32_16bpc_rvv:        1556.3 ( 4.73x)
      
      - Banana Pi BPI-F3:
      cfl_pred_cfl_128_w4_8bpc_c:         485.5 ( 1.00x)
      cfl_pred_cfl_128_w4_8bpc_rvv:       366.4 ( 1.33x)
      cfl_pred_cfl_128_w4_16bpc_c:        393.5 ( 1.00x)
      cfl_pred_cfl_128_w4_16bpc_rvv:      378.7 ( 1.04x)
      cfl_pred_cfl_128_w8_8bpc_c:        1507.9 ( 1.00x)
      cfl_pred_cfl_128_w8_8bpc_rvv:       577.4 ( 2.61x)
      cfl_pred_cfl_128_w8_16bpc_c:       1205.7 ( 1.00x)
      cfl_pred_cfl_128_w8_16bpc_rvv:      605.1 ( 1.99x)
      cfl_pred_cfl_128_w16_8bpc_c:       3019.3 ( 1.00x)
      cfl_pred_cfl_128_w16_8bpc_rvv:      577.4 ( 5.23x)
      cfl_pred_cfl_128_w16_16bpc_c:      2506.5 ( 1.00x)
      cfl_pred_cfl_128_w16_16bpc_rvv:     605.1 ( 4.14x)
      cfl_pred_cfl_128_w32_8bpc_c:       8170.0 ( 1.00x)
      cfl_pred_cfl_128_w32_8bpc_rvv:      715.6 (11.42x)
      cfl_pred_cfl_128_w32_16bpc_c:      6686.7 ( 1.00x)
      cfl_pred_cfl_128_w32_16bpc_rvv:     749.7 ( 8.92x)
      cfl_pred_cfl_left_w4_8bpc_c:        539.4 ( 1.00x)
      cfl_pred_cfl_left_w4_8bpc_rvv:      393.2 ( 1.37x)
      cfl_pred_cfl_left_w4_16bpc_c:       452.0 ( 1.00x)
      cfl_pred_cfl_left_w4_16bpc_rvv:     401.2 ( 1.13x)
      cfl_pred_cfl_left_w8_8bpc_c:       1572.4 ( 1.00x)
      cfl_pred_cfl_left_w8_8bpc_rvv:      604.1 ( 2.60x)
      cfl_pred_cfl_left_w8_16bpc_c:      1274.5 ( 1.00x)
      cfl_pred_cfl_left_w8_16bpc_rvv:     629.0 ( 2.03x)
      cfl_pred_cfl_left_w16_8bpc_c:      3096.0 ( 1.00x)
      cfl_pred_cfl_left_w16_8bpc_rvv:     604.1 ( 5.13x)
      cfl_pred_cfl_left_w16_16bpc_c:     2591.4 ( 1.00x)
      cfl_pred_cfl_left_w16_16bpc_rvv:    629.0 ( 4.12x)
      cfl_pred_cfl_left_w32_8bpc_c:      8266.0 ( 1.00x)
      cfl_pred_cfl_left_w32_8bpc_rvv:     742.4 (11.13x)
      cfl_pred_cfl_left_w32_16bpc_c:     6758.0 ( 1.00x)
      cfl_pred_cfl_left_w32_16bpc_rvv:    773.9 ( 8.73x)
      cfl_pred_cfl_top_w4_8bpc_c:         532.3 ( 1.00x)
      cfl_pred_cfl_top_w4_8bpc_rvv:       392.6 ( 1.36x)
      cfl_pred_cfl_top_w4_16bpc_c:        440.4 ( 1.00x)
      cfl_pred_cfl_top_w4_16bpc_rvv:      399.6 ( 1.10x)
      cfl_pred_cfl_top_w8_8bpc_c:        1563.3 ( 1.00x)
      cfl_pred_cfl_top_w8_8bpc_rvv:       603.6 ( 2.59x)
      cfl_pred_cfl_top_w8_16bpc_c:       1271.6 ( 1.00x)
      cfl_pred_cfl_top_w8_16bpc_rvv:      626.1 ( 2.03x)
      cfl_pred_cfl_top_w16_8bpc_c:       3098.6 ( 1.00x)
      cfl_pred_cfl_top_w16_8bpc_rvv:      603.6 ( 5.13x)
      cfl_pred_cfl_top_w16_16bpc_c:      2562.8 ( 1.00x)
      cfl_pred_cfl_top_w16_16bpc_rvv:     626.0 ( 4.09x)
      cfl_pred_cfl_top_w32_8bpc_c:       8278.1 ( 1.00x)
      cfl_pred_cfl_top_w32_8bpc_rvv:      741.8 (11.16x)
      cfl_pred_cfl_top_w32_16bpc_c:      6799.1 ( 1.00x)
      cfl_pred_cfl_top_w32_16bpc_rvv:     775.0 ( 8.77x)
      cfl_pred_cfl_w4_8bpc_c:             559.8 ( 1.00x)
      cfl_pred_cfl_w4_8bpc_rvv:           421.7 ( 1.33x)
      cfl_pred_cfl_w4_16bpc_c:            470.2 ( 1.00x)
      cfl_pred_cfl_w4_16bpc_rvv:          451.3 ( 1.04x)
      cfl_pred_cfl_w8_8bpc_c:            1605.5 ( 1.00x)
      cfl_pred_cfl_w8_8bpc_rvv:           632.8 ( 2.54x)
      cfl_pred_cfl_w8_16bpc_c:           1308.5 ( 1.00x)
      cfl_pred_cfl_w8_16bpc_rvv:          677.9 ( 1.93x)
      cfl_pred_cfl_w16_8bpc_c:           3135.0 ( 1.00x)
      cfl_pred_cfl_w16_8bpc_rvv:          632.9 ( 4.95x)
      cfl_pred_cfl_w16_16bpc_c:          2625.9 ( 1.00x)
      cfl_pred_cfl_w16_16bpc_rvv:         677.9 ( 3.87x)
      cfl_pred_cfl_w32_8bpc_c:           8376.6 ( 1.00x)
      cfl_pred_cfl_w32_8bpc_rvv:          770.4 (10.87x)
      cfl_pred_cfl_w32_16bpc_c:          6866.4 ( 1.00x)
      cfl_pred_cfl_w32_16bpc_rvv:         822.7 ( 8.35x)
    • riscv64/cdef: filter functions · 2f5bfc37
      Bogdan Gligorijević authored
      Benchmarks:
      - Kendryte K230:
      cdef_filter_4x4_01_8bpc_c:       1339.4 ( 1.00x)
      cdef_filter_4x4_01_8bpc_rvv:      836.2 ( 1.60x)
      cdef_filter_4x4_01_16bpc_c:      1369.1 ( 1.00x)
      cdef_filter_4x4_01_16bpc_rvv:     824.7 ( 1.66x)
      cdef_filter_4x4_10_8bpc_c:        872.8 ( 1.00x)
      cdef_filter_4x4_10_8bpc_rvv:      523.9 ( 1.67x)
      cdef_filter_4x4_10_16bpc_c:       938.2 ( 1.00x)
      cdef_filter_4x4_10_16bpc_rvv:     517.1 ( 1.81x)
      cdef_filter_4x4_11_8bpc_c:       2668.3 ( 1.00x)
      cdef_filter_4x4_11_8bpc_rvv:     1285.0 ( 2.08x)
      cdef_filter_4x4_11_16bpc_c:      2922.1 ( 1.00x)
      cdef_filter_4x4_11_16bpc_rvv:    1291.0 ( 2.26x)
      cdef_filter_4x8_01_8bpc_c:       2489.1 ( 1.00x)
      cdef_filter_4x8_01_8bpc_rvv:     1594.3 ( 1.56x)
      cdef_filter_4x8_01_16bpc_c:      2528.1 ( 1.00x)
      cdef_filter_4x8_01_16bpc_rvv:    1566.6 ( 1.61x)
      cdef_filter_4x8_10_8bpc_c:       1576.9 ( 1.00x)
      cdef_filter_4x8_10_8bpc_rvv:      967.1 ( 1.63x)
      cdef_filter_4x8_10_16bpc_c:      1641.3 ( 1.00x)
      cdef_filter_4x8_10_16bpc_rvv:     947.1 ( 1.73x)
      cdef_filter_4x8_11_8bpc_c:       5164.0 ( 1.00x)
      cdef_filter_4x8_11_8bpc_rvv:     2490.7 ( 2.07x)
      cdef_filter_4x8_11_16bpc_c:      5732.3 ( 1.00x)
      cdef_filter_4x8_11_16bpc_rvv:    2499.2 ( 2.29x)
      cdef_filter_8x8_01_8bpc_c:       4742.3 ( 1.00x)
      cdef_filter_8x8_01_8bpc_rvv:     1628.6 ( 2.91x)
      cdef_filter_8x8_01_16bpc_c:      4785.0 ( 1.00x)
      cdef_filter_8x8_01_16bpc_rvv:    1595.5 ( 3.00x)
      cdef_filter_8x8_10_8bpc_c:       2962.4 ( 1.00x)
      cdef_filter_8x8_10_8bpc_rvv:     1000.8 ( 2.96x)
      cdef_filter_8x8_10_16bpc_c:      3022.4 ( 1.00x)
      cdef_filter_8x8_10_16bpc_rvv:     975.7 ( 3.10x)
      cdef_filter_8x8_11_8bpc_c:      12623.9 ( 1.00x)
      cdef_filter_8x8_11_8bpc_rvv:     2525.4 ( 5.00x)
      cdef_filter_8x8_11_16bpc_c:     12470.7 ( 1.00x)
      cdef_filter_8x8_11_16bpc_rvv:    2528.2 ( 4.93x)
      
      - Banana Pi BPI-F3:
      cdef_filter_4x4_01_8bpc_c:       1281.2 ( 1.00x)
      cdef_filter_4x4_01_8bpc_rvv:      813.0 ( 1.58x)
      cdef_filter_4x4_01_16bpc_c:      1300.8 ( 1.00x)
      cdef_filter_4x4_01_16bpc_rvv:     808.9 ( 1.61x)
      cdef_filter_4x4_10_8bpc_c:        843.0 ( 1.00x)
      cdef_filter_4x4_10_8bpc_rvv:      498.4 ( 1.69x)
      cdef_filter_4x4_10_16bpc_c:       903.6 ( 1.00x)
      cdef_filter_4x4_10_16bpc_rvv:     497.9 ( 1.81x)
      cdef_filter_4x4_11_8bpc_c:       2614.1 ( 1.00x)
      cdef_filter_4x4_11_8bpc_rvv:     1219.6 ( 2.14x)
      cdef_filter_4x4_11_16bpc_c:      2795.6 ( 1.00x)
      cdef_filter_4x4_11_16bpc_rvv:    1243.1 ( 2.25x)
      cdef_filter_4x8_01_8bpc_c:       2405.4 ( 1.00x)
      cdef_filter_4x8_01_8bpc_rvv:     1548.5 ( 1.55x)
      cdef_filter_4x8_01_16bpc_c:      2402.7 ( 1.00x)
      cdef_filter_4x8_01_16bpc_rvv:    1542.7 ( 1.56x)
      cdef_filter_4x8_10_8bpc_c:       1522.0 ( 1.00x)
      cdef_filter_4x8_10_8bpc_rvv:      917.4 ( 1.66x)
      cdef_filter_4x8_10_16bpc_c:      1589.2 ( 1.00x)
      cdef_filter_4x8_10_16bpc_rvv:     915.9 ( 1.74x)
      cdef_filter_4x8_11_8bpc_c:       5050.7 ( 1.00x)
      cdef_filter_4x8_11_8bpc_rvv:     2358.7 ( 2.14x)
      cdef_filter_4x8_11_16bpc_c:      5510.5 ( 1.00x)
      cdef_filter_4x8_11_16bpc_rvv:    2411.6 ( 2.28x)
      cdef_filter_8x8_01_8bpc_c:       4558.3 ( 1.00x)
      cdef_filter_8x8_01_8bpc_rvv:     1579.7 ( 2.89x)
      cdef_filter_8x8_01_16bpc_c:      4551.1 ( 1.00x)
      cdef_filter_8x8_01_16bpc_rvv:    1571.1 ( 2.90x)
      cdef_filter_8x8_10_8bpc_c:       2869.3 ( 1.00x)
      cdef_filter_8x8_10_8bpc_rvv:      948.4 ( 3.03x)
      cdef_filter_8x8_10_16bpc_c:      2928.6 ( 1.00x)
      cdef_filter_8x8_10_16bpc_rvv:     944.2 ( 3.10x)
      cdef_filter_8x8_11_8bpc_c:      12317.5 ( 1.00x)
      cdef_filter_8x8_11_8bpc_rvv:     2389.7 ( 5.15x)
      cdef_filter_8x8_11_16bpc_c:     11950.6 ( 1.00x)
      cdef_filter_8x8_11_16bpc_rvv:    2440.1 ( 4.90x)
    • pal_idx_finish · f223436b
      Bogdan Gligorijević authored
      Benchmarks:
      
      - Kendryte K230:
      pal_idx_finish_w4_c:       122.5 ( 1.00x)
      pal_idx_finish_w4_rvv:     107.2 ( 1.14x)
      pal_idx_finish_w8_c:       302.8 ( 1.00x)
      pal_idx_finish_w8_rvv:     197.9 ( 1.53x)
      pal_idx_finish_w16_c:      868.2 ( 1.00x)
      pal_idx_finish_w16_rvv:    438.5 ( 1.98x)
      pal_idx_finish_w32_c:     1966.5 ( 1.00x)
      pal_idx_finish_w32_rvv:    833.0 ( 2.36x)
      pal_idx_finish_w64_c:     4737.5 ( 1.00x)
      pal_idx_finish_w64_rvv:   1818.3 ( 2.61x)
      
      - Banana Pi BPI-F3:
      pal_idx_finish_w4_c:       122.4 ( 1.00x)
      pal_idx_finish_w4_rvv:     132.0 ( 0.93x)
      pal_idx_finish_w8_c:       289.4 ( 1.00x)
      pal_idx_finish_w8_rvv:     195.8 ( 1.48x)
      pal_idx_finish_w16_c:      788.0 ( 1.00x)
      pal_idx_finish_w16_rvv:    430.6 ( 1.83x)
      pal_idx_finish_w32_c:     1699.2 ( 1.00x)
      pal_idx_finish_w32_rvv:    816.3 ( 2.08x)
      pal_idx_finish_w64_c:     3977.7 ( 1.00x)
      pal_idx_finish_w64_rvv:   1779.4 ( 2.24x)
    • Nathan E. Egge's avatar
      38f74bdc
  3. Oct 07, 2024
    • x86: Make AVX2 SGR gatherless · 7072e79f
      Henrik Gramner authored
      Instead of using gathers we can calculate the value of
      sgr_x_by_x[min(z, 255)] by doing 256 / (z + 1) in floating-point
      with some clipping for z == 0 and z >= 255.
      
      As the required precision of the division is fairly small it can be
      performed using an approximate reciprocal, which is significantly
      faster than a regular division.
      
      Gather instructions are slow on all AMD CPUs, and on most Intel
      CPUs ever since µcode updates were issued as a workaround for
      the Gather Data Sampling side channel vulnerability.
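
A minimal sketch of the arithmetic described above, in Python rather than AVX2 assembly. The function name `sgr_x_by_x_approx`, the round-to-nearest step, and the clip values (255 for z == 0, 0 for z >= 255) are assumptions for illustration; the commit message does not spell them out. The real code would use an approximate reciprocal instead of the exact division shown here.

```python
def sgr_x_by_x_approx(z):
    """Compute a table-free stand-in for sgr_x_by_x[min(z, 255)].

    Assumptions (not taken from the commit message): results are rounded
    to nearest, z == 0 clips to 255, and z >= 255 clips to 0.
    """
    if z == 0:
        return 255          # 256 / 1 would not fit in a byte
    if z >= 255:
        return 0            # assumed value of the saturated last entry
    # Exact floating-point division; a SIMD version would multiply by an
    # approximate reciprocal of (z + 1) instead.
    return int(256.0 / (z + 1) + 0.5)


# Print a few values for inspection.
print([sgr_x_by_x_approx(z) for z in (0, 1, 2, 3, 254, 255)])
```

Because the quotient is never exactly halfway between two integers for z in 1..254, a small error in the reciprocal is unlikely to change the rounded result, which is why a low-precision reciprocal can be sufficient.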
  4. Oct 02, 2024
  5. Sep 30, 2024
    • loongarch: minor improvement on decode_symbol_adapt · ed004fe9
      jinbo authored and Hecai Yuan committed
      Change-Id: I78fe788113ff2487ba1ce2e7d0c7d7c78c5a8c58
    • loongarch: rewrite optimization functions in loongarch/itx.S · 62a51df1
      Hecai Yuan authored and Hecai Yuan committed
      Change-Id: I1566e8145d36296f2c76107cf15fc2cc7ac0ecc7
    • LoongArch: Add save_tmvs_lsx · 757f294a
      guxiwei authored and Hecai Yuan committed
      The performance data is as follows:
      save_tmvs_c:        3938.6 ( 1.00x)
      save_tmvs_lsx:      1355.3 ( 2.91x)
    • loongarch: refactor loopfilter · 3d96175d
      jinbo authored and Hecai Yuan committed
      bench performance before:
      lpf_h_sb_y_w16_8bpc_c:      117.0 ( 1.00x)
      lpf_h_sb_y_w16_8bpc_lsx:     33.9 ( 3.46x)
      lpf_v_sb_y_w16_8bpc_c:      132.1 ( 1.00x)
      lpf_v_sb_y_w16_8bpc_lsx:     59.7 ( 2.21x)
      
      bench performance after:
      lpf_h_sb_y_w16_8bpc_c:      114.9 ( 1.00x)
      lpf_h_sb_y_w16_8bpc_lsx:     32.0 ( 3.59x)
      lpf_v_sb_y_w16_8bpc_c:      132.5 ( 1.00x)
      lpf_v_sb_y_w16_8bpc_lsx:     28.1 ( 4.72x)
      
      Change-Id: Ie64e164a9416c438f6b3881ce18fb42e2ddd073d
    • loongarch: add lasx implementation of sgr_3x3 for 8 bpc · 70582027
      Hecai Yuan authored and Hecai Yuan committed
      sgr_3x3_8bpc_c:                                   27233.1 ( 1.00x)
      sgr_3x3_8bpc_lsx:                                 12874.7 ( 2.12x)
      sgr_3x3_8bpc_lasx:                                10183.7 ( 2.67x)
      
      Change-Id: I2aa469e8560733d6191396186bf776a12ad6e4a3
      70582027
    • Hecai Yuan's avatar
      loongarch: rewrite warp_8x8/8x8t_lsx for 8 bpc · 96d6e472
      Hecai Yuan authored and Hecai Yuan's avatar Hecai Yuan committed
      before:
      warp_8x8_8bpc_c:                                    109.8 ( 1.00x)
      warp_8x8_8bpc_lsx:                                   44.6 ( 2.46x)
      warp_8x8t_8bpc_c:                                    97.5 ( 1.00x)
      warp_8x8t_8bpc_lsx:                                  43.7 ( 2.23x)
      
      after:
      warp_8x8_8bpc_c:                                    109.8 ( 1.00x)
      warp_8x8_8bpc_lsx:                                   39.2 ( 2.80x)
      warp_8x8t_8bpc_c:                                    97.5 ( 1.00x)
      warp_8x8t_8bpc_lsx:                                  37.9 ( 2.57x)
      
      Change-Id: I11728c2c30821b8e2b1c85208710dfe5d1c1269c
      96d6e472
    • jinbo's avatar
      loongarch: Refine prep_8tap_8bpc_lasx · b9e9a0ef
      jinbo authored and Hecai Yuan's avatar Hecai Yuan committed
      mct_8tap_regular_w8_h_8bpc_c:                  47.1 ( 1.00x)
      mct_8tap_regular_w8_h_8bpc_lsx:                 6.3 ( 7.46x)
      mct_8tap_regular_w8_h_8bpc_lasx:                4.4 (10.80x)
      mct_8tap_regular_w8_hv_8bpc_c:                118.9 ( 1.00x)
      mct_8tap_regular_w8_hv_8bpc_lsx:               19.2 ( 6.20x)
      mct_8tap_regular_w8_hv_8bpc_lasx:              13.7 ( 8.69x)
      mct_8tap_regular_w8_v_8bpc_c:                  60.3 ( 1.00x)
      mct_8tap_regular_w8_v_8bpc_lsx:                 5.4 (11.08x)
      mct_8tap_regular_w8_v_8bpc_lasx:                3.3 (18.33x)
      
      Change-Id: I1140f6ffbd738166f2581bc9111ebbdf6f9fa72c
      b9e9a0ef
    • Hecai Yuan's avatar
      loongarch: add lasx implementation of wiener filter for 8 bpc · af11a10a
      Hecai Yuan authored and Hecai Yuan's avatar Hecai Yuan committed
      wiener_5tap_8bpc_c:                               18382.0 ( 1.00x)
      wiener_5tap_8bpc_lsx:                              4166.9 ( 4.41x)
      wiener_5tap_8bpc_lasx:                             2832.2 ( 6.49x)
      wiener_7tap_8bpc_c:                               18339.6 ( 1.00x)
      wiener_7tap_8bpc_lsx:                              4168.3 ( 4.40x)
      wiener_7tap_8bpc_lasx:                             2832.5 ( 6.47x)
      
      Change-Id: I183a8cb008203fb61683b0543d9409d58d141a2e
      af11a10a
    • zhoupeng's avatar
      Loongarch: Optimized load_tmvs_c function by LSX · 90a9549b
      zhoupeng authored and Hecai Yuan's avatar Hecai Yuan committed
      load_tmvs_c:     9702.0 ( 1.00x)
      load_tmvs_lsx:   7857.0 ( 1.23x)
      90a9549b
    • pengxu's avatar
      Loongarch: Optimized ipred_z1 8bpc functions by LSX · 411fc219
      pengxu authored and Hecai Yuan's avatar Hecai Yuan committed
      intra_pred_z1_w4_8bpc_c:        16.5 ( 1.00x)
      intra_pred_z1_w4_8bpc_lsx:       7.1 ( 2.31x)
      intra_pred_z1_w8_8bpc_c:        31.9 ( 1.00x)
      intra_pred_z1_w8_8bpc_lsx:      10.0 ( 3.20x)
      intra_pred_z1_w16_8bpc_c:       80.1 ( 1.00x)
      intra_pred_z1_w16_8bpc_lsx:     20.2 ( 3.96x)
      intra_pred_z1_w32_8bpc_c:      185.8 ( 1.00x)
      intra_pred_z1_w32_8bpc_lsx:     40.8 ( 4.55x)
      intra_pred_z1_w64_8bpc_c:      511.1 ( 1.00x)
      intra_pred_z1_w64_8bpc_lsx:     99.0 ( 5.16x)
      
      Change-Id: Id7591e9b87e5b4d7fc3f438397e25dc6ca8e7f91
      411fc219
    • zhoupeng's avatar
      Loongarch: Optimized emu_edge_c function by LSX · 7c63bb1b
      zhoupeng authored and Hecai Yuan's avatar Hecai Yuan committed
      emu_edge_w4_8bpc_c:        9.0 ( 1.00x)
      emu_edge_w4_8bpc_lsx:      6.7 ( 1.34x)
      emu_edge_w8_8bpc_c:       12.9 ( 1.00x)
      emu_edge_w8_8bpc_lsx:      9.2 ( 1.40x)
      emu_edge_w16_8bpc_c:       20.0 ( 1.00x)
      emu_edge_w16_8bpc_lsx:     16.3 ( 1.23x)
      emu_edge_w32_8bpc_c:       44.6 ( 1.00x)
      emu_edge_w32_8bpc_lsx:     33.3 ( 1.34x)
      emu_edge_w64_8bpc_c:       79.9 ( 1.00x)
      emu_edge_w64_8bpc_lsx:     66.2 ( 1.21x)
      emu_edge_w128_8bpc_c:      193.9 ( 1.00x)
      emu_edge_w128_8bpc_lsx:    197.8 ( 0.98x)
      
      Change-Id: I180c94d311509740b03793419d5790a931532980
      7c63bb1b
    • guxiwei's avatar
      LoongArch64: Implement checked_call() · e3101ddc
      guxiwei authored and Hecai Yuan's avatar Hecai Yuan committed
      checkasm now calls the function under test, 'func_new', through
      the wrapper 'checked_call' instead of calling it directly.
      The wrapper verifies that 'func_new' correctly saves and
      restores the static registers: it writes dirty values to the
      static registers, calls 'func_new', and then checks that the
      registers still hold those dirty values.
      
      Change-Id: Ia9290b55ab0f2dd87801f6fd175813d3f717d851
      e3101ddc
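      The idea behind the wrapper can be sketched in plain C. This is only a software model under stated assumptions: the real 'checked_call' is written in assembly and works on hardware registers, whereas here the static registers are imitated by a hypothetical array, and every name (`static_regs`, `checked_call_model`, `well_behaved`, `clobbers_reg`) is invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Register count chosen purely for illustration. */
#define NUM_STATIC_REGS 9

/* Hypothetical stand-in for the callee-saved ("static") registers. */
static uint64_t static_regs[NUM_STATIC_REGS];

typedef void (*test_fn)(void);

/* Fills the modelled registers with dirty values, calls fn, and then
 * checks that every register still holds its dirty value.
 * Returns 1 if all registers were preserved, 0 otherwise. */
static int checked_call_model(test_fn fn)
{
    uint64_t expected[NUM_STATIC_REGS];

    for (size_t i = 0; i < NUM_STATIC_REGS; i++) {
        expected[i] = 0xDEADBEEF00000000ULL | (uint64_t)i;
        static_regs[i] = expected[i];
    }

    fn();

    for (size_t i = 0; i < NUM_STATIC_REGS; i++) {
        if (static_regs[i] != expected[i])
            return 0;
    }
    return 1;
}

/* Uses a register but restores it before returning. */
static void well_behaved(void)
{
    uint64_t saved = static_regs[0];
    static_regs[0] = 42;
    static_regs[0] = saved;
}

/* Overwrites a register and never restores it. */
static void clobbers_reg(void)
{
    static_regs[3] = 0;
}
```

      In this model, `checked_call_model(well_behaved)` reports success, while `checked_call_model(clobbers_reg)` reports a failure because one modelled register no longer holds its dirty value.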
    • pengxu's avatar
      Loongarch: Optimized ipred_filter 8bpc functions by LSX · 7f891597
      pengxu authored and Hecai Yuan's avatar Hecai Yuan committed
      intra_pred_filter_w4_8bpc_c:          17.9 ( 1.00x)
      intra_pred_filter_w4_8bpc_lsx:         8.9 ( 2.00x)
      intra_pred_filter_w8_8bpc_c:          55.3 ( 1.00x)
      intra_pred_filter_w8_8bpc_lsx:        23.8 ( 2.33x)
      intra_pred_filter_w16_8bpc_c:        109.4 ( 1.00x)
      intra_pred_filter_w16_8bpc_lsx:       49.1 ( 2.23x)
      intra_pred_filter_w32_8bpc_c:        270.2 ( 1.00x)
      intra_pred_filter_w32_8bpc_lsx:      126.1 ( 2.14x)
      
      Change-Id: Ic4c23cb1d54d5f8557c31cdfbbd54f8beaaa32c2
      7f891597