- Oct 12, 2024
-
-
Hecai Yuan authored
-
- Oct 09, 2024
-
-
Bogdan Gligorijević authored
Benchmarks:
- Kendryte K230:
  warp_8x8_8bpc_c: 4549.7 ( 1.00x)
  warp_8x8_8bpc_rvv: 2504.7 ( 1.82x)
  warp_8x8t_8bpc_c: 4414.7 ( 1.00x)
  warp_8x8t_8bpc_rvv: 2465.7 ( 1.79x)
- Banana Pi BPI-F3:
  warp_8x8_8bpc_c: 4431.2 ( 1.00x)
  warp_8x8_8bpc_rvv: 3297.4 ( 1.34x)
  warp_8x8t_8bpc_c: 4299.3 ( 1.00x)
  warp_8x8t_8bpc_rvv: 3255.7 ( 1.32x)
-
Niklas Haas authored
This avoids a read-after-write dependency. The speedup is about 1% for width=4 on a K230.
-
Niklas Haas authored
This code strikes a compromise between the performance of a dedicated kernel per VLEN/width pair and the flexibility of a fully VLEN-dynamic loop. It uses a single special case for w=4 and splits the remaining widths between an unrolled four-line fast path and a general-purpose slow path (for large widths on small VLEN).

Kendryte K230
  avg_w4_8bpc_c: 346.8 ( 1.00x)
  avg_w4_8bpc_rvv: 50.3 ( 6.90x)
  avg_w8_8bpc_c: 1054.9 ( 1.00x)
  avg_w8_8bpc_rvv: 139.1 ( 7.58x)
  avg_w16_8bpc_c: 3396.3 ( 1.00x)
  avg_w16_8bpc_rvv: 350.6 ( 9.69x)
  avg_w32_8bpc_c: 13734.3 ( 1.00x)
  avg_w32_8bpc_rvv: 1226.3 (11.20x)
  avg_w64_8bpc_c: 33260.9 ( 1.00x)
  avg_w64_8bpc_rvv: 3869.4 ( 8.60x)
  avg_w128_8bpc_c: 83441.3 ( 1.00x)
  avg_w128_8bpc_rvv: 9765.1 ( 8.54x)
  w_avg_w4_8bpc_c: 444.3 ( 1.00x)
  w_avg_w4_8bpc_rvv: 75.8 ( 5.86x)
  w_avg_w8_8bpc_c: 1365.6 ( 1.00x)
  w_avg_w8_8bpc_rvv: 208.8 ( 6.54x)
  w_avg_w16_8bpc_c: 4420.8 ( 1.00x)
  w_avg_w16_8bpc_rvv: 570.7 ( 7.75x)
  w_avg_w32_8bpc_c: 18010.9 ( 1.00x)
  w_avg_w32_8bpc_rvv: 2074.4 ( 8.68x)
  w_avg_w64_8bpc_c: 43050.4 ( 1.00x)
  w_avg_w64_8bpc_rvv: 5799.5 ( 7.42x)
  w_avg_w128_8bpc_c: 107153.6 ( 1.00x)
  w_avg_w128_8bpc_rvv: 14272.0 ( 7.51x)
  mask_w4_8bpc_c: 497.6 ( 1.00x)
  mask_w4_8bpc_rvv: 88.5 ( 5.63x)
  mask_w8_8bpc_c: 1528.5 ( 1.00x)
  mask_w8_8bpc_rvv: 253.1 ( 6.04x)
  mask_w16_8bpc_c: 4953.8 ( 1.00x)
  mask_w16_8bpc_rvv: 679.0 ( 7.30x)
  mask_w32_8bpc_c: 20298.3 ( 1.00x)
  mask_w32_8bpc_rvv: 3012.9 ( 6.74x)
  mask_w64_8bpc_c: 49718.8 ( 1.00x)
  mask_w64_8bpc_rvv: 7291.7 ( 6.82x)
  mask_w128_8bpc_c: 126740.3 ( 1.00x)
  mask_w128_8bpc_rvv: 18351.1 ( 6.91x)
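The three-way split described in this commit message (w=4 special case, unrolled four-line fast path, general slow path) can be illustrated with a scalar C sketch. This is not the RVV assembly from the commit: the names `avg_pick_path`, `avg_sketch` and `vlmax` are hypothetical, `vlmax` stands in for the number of elements one vector operation can cover, and the 8 bpc rounding `(a + b + 16) >> 5` and the "height is a multiple of 4" precondition are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical path labels; not taken from the dav1d source. */
enum avg_path { AVG_PATH_W4, AVG_PATH_FAST, AVG_PATH_SLOW };

/* vlmax: elements one vector operation can cover (assumed parameter). */
static enum avg_path avg_pick_path(int w, int vlmax)
{
    if (w == 4) return AVG_PATH_W4;        /* single special case */
    if (w <= vlmax) return AVG_PATH_FAST;  /* a whole row fits in one vector op */
    return AVG_PATH_SLOW;                  /* large width on small VLEN */
}

/* Assumed 8 bpc averaging: round, shift, clip to 0..255. */
static uint8_t avg_px(int16_t a, int16_t b)
{
    int v = (a + b + 16) >> 5;
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Stands in for one vector operation over n elements. */
static void avg_row(uint8_t *dst, const int16_t *t1, const int16_t *t2, int n)
{
    for (int x = 0; x < n; x++) dst[x] = avg_px(t1[x], t2[x]);
}

static enum avg_path avg_sketch(uint8_t *dst, ptrdiff_t stride,
                                const int16_t *t1, const int16_t *t2,
                                int w, int h, int vlmax)
{
    assert(w >= 4 && h > 0 && h % 4 == 0 && vlmax > 0);
    enum avg_path path = avg_pick_path(w, vlmax);
    if (path == AVG_PATH_SLOW) {
        /* Slow path: strip-mine every row in chunks of at most vlmax. */
        for (int y = 0; y < h; y++, dst += stride, t1 += w, t2 += w)
            for (int x = 0; x < w; x += vlmax)
                avg_row(dst + x, t1 + x, t2 + x,
                        w - x < vlmax ? w - x : vlmax);
    } else {
        /* w=4 and fast path: four rows per loop iteration. */
        for (int y = 0; y < h; y += 4) {
            for (int r = 0; r < 4; r++)
                avg_row(dst + r * stride, t1 + r * w, t2 + r * w, w);
            dst += 4 * stride; t1 += 4 * w; t2 += 4 * w;
        }
    }
    return path;
}
```

In scalar C all three paths compute the same pixels; the point of the sketch is only the dispatch structure, where the common widths avoid the inner strip-mining loop.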
-
Niklas Haas authored
-
Nathan E. Egge authored
Kendryte K230
  blend_v_w2_8bpc_c: 221.4 ( 1.00x)
  blend_v_w2_8bpc_rvv: 147.7 ( 1.50x)
  blend_v_w4_8bpc_c: 945.3 ( 1.00x)
  blend_v_w4_8bpc_rvv: 243.3 ( 3.89x)
  blend_v_w8_8bpc_c: 1786.9 ( 1.00x)
  blend_v_w8_8bpc_rvv: 256.1 ( 6.98x)
  blend_v_w16_8bpc_c: 3472.1 ( 1.00x)
  blend_v_w16_8bpc_rvv: 351.1 ( 9.89x)
  blend_v_w32_8bpc_c: 6832.1 ( 1.00x)
  blend_v_w32_8bpc_rvv: 635.4 (10.75x)
SpacemiT K1
  blend_v_w2_8bpc_c: 218.0 ( 1.00x)
  blend_v_w2_8bpc_rvv: 144.3 ( 1.51x)
  blend_v_w4_8bpc_c: 921.7 ( 1.00x)
  blend_v_w4_8bpc_rvv: 237.1 ( 3.89x)
  blend_v_w8_8bpc_c: 1739.8 ( 1.00x)
  blend_v_w8_8bpc_rvv: 237.4 ( 7.33x)
  blend_v_w16_8bpc_c: 3376.6 ( 1.00x)
  blend_v_w16_8bpc_rvv: 296.3 (11.40x)
  blend_v_w32_8bpc_c: 6647.2 ( 1.00x)
  blend_v_w32_8bpc_rvv: 408.1 (16.29x)
-
Nathan E. Egge authored
Kendryte K230
  blend_h_w2_8bpc_c: 165.9 ( 1.00x)
  blend_h_w2_8bpc_rvv: 83.8 ( 1.98x)
  blend_h_w4_8bpc_c: 295.2 ( 1.00x)
  blend_h_w4_8bpc_rvv: 83.8 ( 3.52x)
  blend_h_w8_8bpc_c: 557.9 ( 1.00x)
  blend_h_w8_8bpc_rvv: 92.5 ( 6.03x)
  blend_h_w16_8bpc_c: 1078.8 ( 1.00x)
  blend_h_w16_8bpc_rvv: 117.3 ( 9.19x)
  blend_h_w32_8bpc_c: 2117.8 ( 1.00x)
  blend_h_w32_8bpc_rvv: 200.5 (10.57x)
  blend_h_w64_8bpc_c: 4194.7 ( 1.00x)
  blend_h_w64_8bpc_rvv: 363.2 (11.55x)
  blend_h_w128_8bpc_c: 10271.4 ( 1.00x)
  blend_h_w128_8bpc_rvv: 844.5 (12.16x)
SpacemiT K1
  blend_h_w2_8bpc_c: 162.5 ( 1.00x)
  blend_h_w2_8bpc_rvv: 83.9 ( 1.94x)
  blend_h_w4_8bpc_c: 288.6 ( 1.00x)
  blend_h_w4_8bpc_rvv: 83.7 ( 3.45x)
  blend_h_w8_8bpc_c: 544.7 ( 1.00x)
  blend_h_w8_8bpc_rvv: 84.0 ( 6.48x)
  blend_h_w16_8bpc_c: 1052.8 ( 1.00x)
  blend_h_w16_8bpc_rvv: 102.9 (10.23x)
  blend_h_w32_8bpc_c: 2068.0 ( 1.00x)
  blend_h_w32_8bpc_rvv: 131.4 (15.73x)
  blend_h_w64_8bpc_c: 4093.7 ( 1.00x)
  blend_h_w64_8bpc_rvv: 220.3 (18.58x)
  blend_h_w128_8bpc_c: 10023.1 ( 1.00x)
  blend_h_w128_8bpc_rvv: 467.3 (21.45x)
-
Nathan E. Egge authored
Kendryte K230
  blend_w4_8bpc_c: 204.8 ( 1.00x)
  blend_w4_8bpc_rvv: 59.8 ( 3.42x)
  blend_w8_8bpc_c: 608.9 ( 1.00x)
  blend_w8_8bpc_rvv: 87.2 ( 6.98x)
  blend_w16_8bpc_c: 2362.4 ( 1.00x)
  blend_w16_8bpc_rvv: 225.2 (10.49x)
  blend_w32_8bpc_c: 5990.4 ( 1.00x)
  blend_w32_8bpc_rvv: 518.3 (11.56x)
SpacemiT K1
  blend_w4_8bpc_c: 201.6 ( 1.00x)
  blend_w4_8bpc_rvv: 58.0 ( 3.48x)
  blend_w8_8bpc_c: 595.1 ( 1.00x)
  blend_w8_8bpc_rvv: 82.1 ( 7.25x)
  blend_w16_8bpc_c: 2308.8 ( 1.00x)
  blend_w16_8bpc_rvv: 189.0 (12.22x)
  blend_w32_8bpc_c: 5853.1 ( 1.00x)
  blend_w32_8bpc_rvv: 339.5 (17.24x)
-
Nathan E. Egge authored
SpacemiT K1
  blend_v_w2_8bpc_c: 217.0 ( 1.00x)
  blend_v_w2_8bpc_rvv: 143.3 ( 1.51x)
  blend_v_w4_8bpc_c: 921.6 ( 1.00x)
  blend_v_w4_8bpc_rvv: 236.3 ( 3.90x)
  blend_v_w8_8bpc_c: 1738.2 ( 1.00x)
  blend_v_w8_8bpc_rvv: 238.1 ( 7.30x)
  blend_v_w16_8bpc_c: 3376.1 ( 1.00x)
  blend_v_w16_8bpc_rvv: 298.0 (11.33x)
  blend_v_w32_8bpc_c: 6648.0 ( 1.00x)
  blend_v_w32_8bpc_rvv: 409.5 (16.24x)
-
Nathan E. Egge authored
SpacemiT K1
  blend_h_w2_8bpc_c: 161.8 ( 1.00x)
  blend_h_w2_8bpc_rvv: 83.5 ( 1.94x)
  blend_h_w4_8bpc_c: 288.4 ( 1.00x)
  blend_h_w4_8bpc_rvv: 83.7 ( 3.45x)
  blend_h_w8_8bpc_c: 543.9 ( 1.00x)
  blend_h_w8_8bpc_rvv: 84.5 ( 6.44x)
  blend_h_w16_8bpc_c: 1051.6 ( 1.00x)
  blend_h_w16_8bpc_rvv: 103.8 (10.13x)
  blend_h_w32_8bpc_c: 2066.0 ( 1.00x)
  blend_h_w32_8bpc_rvv: 133.8 (15.44x)
  blend_h_w64_8bpc_c: 4092.7 ( 1.00x)
  blend_h_w64_8bpc_rvv: 225.2 (18.18x)
  blend_h_w128_8bpc_c: 10011.3 ( 1.00x)
  blend_h_w128_8bpc_rvv: 474.7 (21.09x)
-
Nathan E. Egge authored
SpacemiT K1
  blend_w4_8bpc_c: 201.3 ( 1.00x)
  blend_w4_8bpc_rvv: 59.3 ( 3.40x)
  blend_w8_8bpc_c: 595.1 ( 1.00x)
  blend_w8_8bpc_rvv: 84.1 ( 7.07x)
  blend_w16_8bpc_c: 2309.0 ( 1.00x)
  blend_w16_8bpc_rvv: 190.5 (12.12x)
  blend_w32_8bpc_c: 5854.7 ( 1.00x)
  blend_w32_8bpc_rvv: 341.6 (17.14x)
-
Nathan E. Egge authored
-
Nathan E. Egge authored
Kendryte K230
  blend_v_w2_8bpc_c: 219.6 ( 1.00x)
  blend_v_w2_8bpc_rvv: 141.8 ( 1.55x)
  blend_v_w4_8bpc_c: 942.9 ( 1.00x)
  blend_v_w4_8bpc_rvv: 240.9 ( 3.91x)
  blend_v_w8_8bpc_c: 1783.5 ( 1.00x)
  blend_v_w8_8bpc_rvv: 254.7 ( 7.00x)
  blend_v_w16_8bpc_c: 3466.5 ( 1.00x)
  blend_v_w16_8bpc_rvv: 350.5 ( 9.89x)
  blend_v_w32_8bpc_c: 6825.2 ( 1.00x)
  blend_v_w32_8bpc_rvv: 635.1 (10.75x)
-
Nathan E. Egge authored
Kendryte K230
  blend_h_w2_8bpc_c: 165.4 ( 1.00x)
  blend_h_w2_8bpc_rvv: 79.4 ( 2.08x)
  blend_h_w4_8bpc_c: 294.6 ( 1.00x)
  blend_h_w4_8bpc_rvv: 81.5 ( 3.61x)
  blend_h_w8_8bpc_c: 556.9 ( 1.00x)
  blend_h_w8_8bpc_rvv: 90.2 ( 6.17x)
  blend_h_w16_8bpc_c: 1077.6 ( 1.00x)
  blend_h_w16_8bpc_rvv: 116.1 ( 9.29x)
  blend_h_w32_8bpc_c: 2116.2 ( 1.00x)
  blend_h_w32_8bpc_rvv: 200.5 (10.55x)
  blend_h_w64_8bpc_c: 4191.8 ( 1.00x)
  blend_h_w64_8bpc_rvv: 363.3 (11.54x)
  blend_h_w128_8bpc_c: 10264.6 ( 1.00x)
  blend_h_w128_8bpc_rvv: 844.1 (12.16x)
-
Nathan E. Egge authored
Kendryte K230
  blend_w4_8bpc_c: 204.5 ( 1.00x)
  blend_w4_8bpc_rvv: 56.4 ( 3.62x)
  blend_w8_8bpc_c: 608.6 ( 1.00x)
  blend_w8_8bpc_rvv: 87.3 ( 6.97x)
  blend_w16_8bpc_c: 2363.8 ( 1.00x)
  blend_w16_8bpc_rvv: 225.1 (10.50x)
  blend_w32_8bpc_c: 5990.3 ( 1.00x)
  blend_w32_8bpc_rvv: 518.8 (11.55x)
-
Bogdan Gligorijević authored
Benchmark pending
-
Bogdan Gligorijević authored
Current benchmark:
- Kendryte K230:
  inv_txfm_add_16x16_dct_dct_0_8bpc_c: 1729.4 ( 1.00x)
  inv_txfm_add_16x16_dct_dct_0_8bpc_rvv: 153.2 (11.29x)
- SpacemiT K1:
  inv_txfm_add_16x16_dct_dct_0_8bpc_c: 1533.4 ( 1.00x)
  inv_txfm_add_16x16_dct_dct_0_8bpc_rvv: 176.8 ( 8.67x)
-
Bogdan Gligorijević authored
Performance comparison:
- SpacemiT K1:
                                          Master branch:      itx_16x16:
  inv_txfm_add_16x16_dct_dct_0_8bpc_c:    1534.1 ( 1.00x)     1534.9 ( 1.00x)
  inv_txfm_add_16x16_dct_dct_0_8bpc_rvv:  1173.6 ( 1.31x)      173.1 ( 8.87x)
- Kendryte K230:
                                          Master branch:      itx_16x16:
  inv_txfm_add_16x16_dct_dct_0_8bpc_c:    1576.0 ( 1.00x)     1579.1 ( 1.00x)
  inv_txfm_add_16x16_dct_dct_0_8bpc_rvv:  1095.5 ( 1.44x)      146.8 (10.75x)
-
Bogdan Gligorijević authored
Benchmarks:
- Kendryte K230:
  intra_pred_paeth_w4_8bpc_c: 412.9 ( 1.00x)
  intra_pred_paeth_w4_8bpc_rvv: 688.0 ( 0.60x)
  intra_pred_paeth_w8_8bpc_c: 1206.6 ( 1.00x)
  intra_pred_paeth_w8_8bpc_rvv: 1094.3 ( 1.10x)
  intra_pred_paeth_w16_8bpc_c: 3889.7 ( 1.00x)
  intra_pred_paeth_w16_8bpc_rvv: 1796.7 ( 2.16x)
  intra_pred_paeth_w32_8bpc_c: 9797.2 ( 1.00x)
  intra_pred_paeth_w32_8bpc_rvv: 4323.9 ( 2.27x)
  intra_pred_paeth_w64_8bpc_c: 24242.5 ( 1.00x)
  intra_pred_paeth_w64_8bpc_rvv: 10739.8 ( 2.26x)
- Banana Pi BPI-F3:
  intra_pred_paeth_w4_8bpc_c: 395.1 ( 1.00x)
  intra_pred_paeth_w4_8bpc_rvv: 705.4 ( 0.56x)
  intra_pred_paeth_w8_8bpc_c: 1184.9 ( 1.00x)
  intra_pred_paeth_w8_8bpc_rvv: 1125.3 ( 1.05x)
  intra_pred_paeth_w16_8bpc_c: 3807.8 ( 1.00x)
  intra_pred_paeth_w16_8bpc_rvv: 1850.8 ( 2.06x)
  intra_pred_paeth_w32_8bpc_c: 9985.1 ( 1.00x)
  intra_pred_paeth_w32_8bpc_rvv: 2235.5 ( 4.47x)
  intra_pred_paeth_w64_8bpc_c: 24040.4 ( 1.00x)
  intra_pred_paeth_w64_8bpc_rvv: 5450.0 ( 4.41x)
-
Bogdan Gligorijević authored
Benchmarks: - Kendryte K230: pal_pred_w4_8bpc_c: 115.6 ( 1.00x) pal_pred_w4_8bpc_rvv: 331.4 ( 0.35x) pal_pred_w4_16bpc_c: 140.8 ( 1.00x) pal_pred_w4_16bpc_rvv: 247.9 ( 0.57x) pal_pred_w8_8bpc_c: 334.9 ( 1.00x) pal_pred_w8_8bpc_rvv: 520.8 ( 0.64x) pal_pred_w8_16bpc_c: 412.7 ( 1.00x) pal_pred_w8_16bpc_rvv: 386.2 ( 1.07x) pal_pred_w16_8bpc_c: 1044.4 ( 1.00x) pal_pred_w16_8bpc_rvv: 842.8 ( 1.24x) pal_pred_w16_16bpc_c: 1300.3 ( 1.00x) pal_pred_w16_16bpc_rvv: 619.9 ( 2.10x) pal_pred_w32_8bpc_c: 2452.8 ( 1.00x) pal_pred_w32_8bpc_rvv: 1016.1 ( 2.41x) pal_pred_w32_16bpc_c: 3072.1 ( 1.00x) pal_pred_w32_16bpc_rvv: 1440.5 ( 2.13x) pal_pred_w64_8bpc_c: 6015.8 ( 1.00x) pal_pred_w64_8bpc_rvv: 2505.5 ( 2.40x) pal_pred_w64_16bpc_c: 7552.4 ( 1.00x) pal_pred_w64_16bpc_rvv: 3512.7 ( 2.15x) - Banana Pi BPI-F3: pal_pred_w4_8bpc_c: 102.2 ( 1.00x) pal_pred_w4_8bpc_rvv: 511.2 ( 0.20x) pal_pred_w4_16bpc_c: 137.7 ( 1.00x) pal_pred_w4_16bpc_rvv: 330.9 ( 0.42x) pal_pred_w8_8bpc_c: 289.2 ( 1.00x) pal_pred_w8_8bpc_rvv: 819.6 ( 0.35x) pal_pred_w8_16bpc_c: 402.6 ( 1.00x) pal_pred_w8_16bpc_rvv: 520.7 ( 0.77x) pal_pred_w16_8bpc_c: 894.5 ( 1.00x) pal_pred_w16_8bpc_rvv: 1326.6 ( 0.67x) pal_pred_w16_16bpc_c: 1268.6 ( 1.00x) pal_pred_w16_16bpc_rvv: 845.8 ( 1.50x) pal_pred_w32_8bpc_c: 2094.5 ( 1.00x) pal_pred_w32_8bpc_rvv: 1610.9 ( 1.30x) pal_pred_w32_16bpc_c: 2999.4 ( 1.00x) pal_pred_w32_16bpc_rvv: 1029.8 ( 2.91x) pal_pred_w64_8bpc_c: 5128.0 ( 1.00x) pal_pred_w64_8bpc_rvv: 2000.8 ( 2.56x) pal_pred_w64_16bpc_c: 7375.0 ( 1.00x) pal_pred_w64_16bpc_rvv: 2518.2 ( 2.93x)
-
Bogdan Gligorijević authored
Benchmarks:
- Kendryte K230:
  intra_pred_smooth_w4_8bpc_c: 392.6 ( 1.00x)
  intra_pred_smooth_w4_8bpc_rvv: 311.3 ( 1.26x)
  intra_pred_smooth_w8_8bpc_c: 1204.1 ( 1.00x)
  intra_pred_smooth_w8_8bpc_rvv: 488.9 ( 2.46x)
  intra_pred_smooth_w16_8bpc_c: 3885.9 ( 1.00x)
  intra_pred_smooth_w16_8bpc_rvv: 796.6 ( 4.88x)
  intra_pred_smooth_w32_8bpc_c: 9305.7 ( 1.00x)
  intra_pred_smooth_w32_8bpc_rvv: 1806.7 ( 5.15x)
  intra_pred_smooth_w64_8bpc_c: 23043.0 ( 1.00x)
  intra_pred_smooth_w64_8bpc_rvv: 4344.3 ( 5.30x)
- SpacemiT K1:
  intra_pred_smooth_w4_8bpc_c: 384.1 ( 1.00x)
  intra_pred_smooth_w4_8bpc_rvv: 322.2 ( 1.19x)
  intra_pred_smooth_w8_8bpc_c: 1177.6 ( 1.00x)
  intra_pred_smooth_w8_8bpc_rvv: 507.1 ( 2.32x)
  intra_pred_smooth_w16_8bpc_c: 3801.2 ( 1.00x)
  intra_pred_smooth_w16_8bpc_rvv: 814.4 ( 4.67x)
  intra_pred_smooth_w32_8bpc_c: 9103.1 ( 1.00x)
  intra_pred_smooth_w32_8bpc_rvv: 980.8 ( 9.28x)
  intra_pred_smooth_w64_8bpc_c: 22540.1 ( 1.00x)
  intra_pred_smooth_w64_8bpc_rvv: 2319.3 ( 9.72x)
-
Bogdan Gligorijević authored
Benchmarks: - Kendryte K230: cfl_pred_cfl_128_w4_8bpc_c: 497.3 ( 1.00x) cfl_pred_cfl_128_w4_8bpc_rvv: 369.6 ( 1.35x) cfl_pred_cfl_128_w4_16bpc_c: 425.2 ( 1.00x) cfl_pred_cfl_128_w4_16bpc_rvv: 385.5 ( 1.10x) cfl_pred_cfl_128_w8_8bpc_c: 1544.2 ( 1.00x) cfl_pred_cfl_128_w8_8bpc_rvv: 584.2 ( 2.64x) cfl_pred_cfl_128_w8_16bpc_c: 1306.2 ( 1.00x) cfl_pred_cfl_128_w8_16bpc_rvv: 608.8 ( 2.15x) cfl_pred_cfl_128_w16_8bpc_c: 3085.6 ( 1.00x) cfl_pred_cfl_128_w16_8bpc_rvv: 584.2 ( 5.28x) cfl_pred_cfl_128_w16_16bpc_c: 2657.1 ( 1.00x) cfl_pred_cfl_128_w16_16bpc_rvv: 608.9 ( 4.36x) cfl_pred_cfl_128_w32_8bpc_c: 8405.6 ( 1.00x) cfl_pred_cfl_128_w32_8bpc_rvv: 1416.1 ( 5.94x) cfl_pred_cfl_128_w32_16bpc_c: 7199.9 ( 1.00x) cfl_pred_cfl_128_w32_16bpc_rvv: 1479.8 ( 4.87x) cfl_pred_cfl_left_w4_8bpc_c: 553.1 ( 1.00x) cfl_pred_cfl_left_w4_8bpc_rvv: 395.6 ( 1.40x) cfl_pred_cfl_left_w4_16bpc_c: 486.7 ( 1.00x) cfl_pred_cfl_left_w4_16bpc_rvv: 409.1 ( 1.19x) cfl_pred_cfl_left_w8_8bpc_c: 1610.8 ( 1.00x) cfl_pred_cfl_left_w8_8bpc_rvv: 610.4 ( 2.64x) cfl_pred_cfl_left_w8_16bpc_c: 1378.0 ( 1.00x) cfl_pred_cfl_left_w8_16bpc_rvv: 636.2 ( 2.17x) cfl_pred_cfl_left_w16_8bpc_c: 3154.4 ( 1.00x) cfl_pred_cfl_left_w16_8bpc_rvv: 610.4 ( 5.17x) cfl_pred_cfl_left_w16_16bpc_c: 2733.2 ( 1.00x) cfl_pred_cfl_left_w16_16bpc_rvv: 636.3 ( 4.30x) cfl_pred_cfl_left_w32_8bpc_c: 8451.7 ( 1.00x) cfl_pred_cfl_left_w32_8bpc_rvv: 1442.5 ( 5.86x) cfl_pred_cfl_left_w32_16bpc_c: 7267.2 ( 1.00x) cfl_pred_cfl_left_w32_16bpc_rvv: 1509.4 ( 4.81x) cfl_pred_cfl_top_w4_8bpc_c: 544.7 ( 1.00x) cfl_pred_cfl_top_w4_8bpc_rvv: 395.8 ( 1.38x) cfl_pred_cfl_top_w4_16bpc_c: 475.1 ( 1.00x) cfl_pred_cfl_top_w4_16bpc_rvv: 406.7 ( 1.17x) cfl_pred_cfl_top_w8_8bpc_c: 1599.3 ( 1.00x) cfl_pred_cfl_top_w8_8bpc_rvv: 610.4 ( 2.62x) cfl_pred_cfl_top_w8_16bpc_c: 1363.8 ( 1.00x) cfl_pred_cfl_top_w8_16bpc_rvv: 630.3 ( 2.16x) cfl_pred_cfl_top_w16_8bpc_c: 3161.0 ( 1.00x) cfl_pred_cfl_top_w16_8bpc_rvv: 610.5 ( 5.18x) cfl_pred_cfl_top_w16_16bpc_c: 2735.9 ( 1.00x) 
cfl_pred_cfl_top_w16_16bpc_rvv: 634.3 ( 4.31x) cfl_pred_cfl_top_w32_8bpc_c: 8564.4 ( 1.00x) cfl_pred_cfl_top_w32_8bpc_rvv: 1442.8 ( 5.94x) cfl_pred_cfl_top_w32_16bpc_c: 7294.9 ( 1.00x) cfl_pred_cfl_top_w32_16bpc_rvv: 1511.5 ( 4.83x) cfl_pred_cfl_w4_8bpc_c: 571.5 ( 1.00x) cfl_pred_cfl_w4_8bpc_rvv: 421.0 ( 1.36x) cfl_pred_cfl_w4_16bpc_c: 499.1 ( 1.00x) cfl_pred_cfl_w4_16bpc_rvv: 462.8 ( 1.08x) cfl_pred_cfl_w8_8bpc_c: 1642.0 ( 1.00x) cfl_pred_cfl_w8_8bpc_rvv: 635.8 ( 2.58x) cfl_pred_cfl_w8_16bpc_c: 1401.4 ( 1.00x) cfl_pred_cfl_w8_16bpc_rvv: 686.1 ( 2.04x) cfl_pred_cfl_w16_8bpc_c: 3204.3 ( 1.00x) cfl_pred_cfl_w16_8bpc_rvv: 635.8 ( 5.04x) cfl_pred_cfl_w16_16bpc_c: 2784.8 ( 1.00x) cfl_pred_cfl_w16_16bpc_rvv: 686.1 ( 4.06x) cfl_pred_cfl_w32_8bpc_c: 8623.9 ( 1.00x) cfl_pred_cfl_w32_8bpc_rvv: 1465.9 ( 5.88x) cfl_pred_cfl_w32_16bpc_c: 7357.8 ( 1.00x) cfl_pred_cfl_w32_16bpc_rvv: 1556.3 ( 4.73x) - Banana Pi BPI-F3: cfl_pred_cfl_128_w4_8bpc_c: 485.5 ( 1.00x) cfl_pred_cfl_128_w4_8bpc_rvv: 366.4 ( 1.33x) cfl_pred_cfl_128_w4_16bpc_c: 393.5 ( 1.00x) cfl_pred_cfl_128_w4_16bpc_rvv: 378.7 ( 1.04x) cfl_pred_cfl_128_w8_8bpc_c: 1507.9 ( 1.00x) cfl_pred_cfl_128_w8_8bpc_rvv: 577.4 ( 2.61x) cfl_pred_cfl_128_w8_16bpc_c: 1205.7 ( 1.00x) cfl_pred_cfl_128_w8_16bpc_rvv: 605.1 ( 1.99x) cfl_pred_cfl_128_w16_8bpc_c: 3019.3 ( 1.00x) cfl_pred_cfl_128_w16_8bpc_rvv: 577.4 ( 5.23x) cfl_pred_cfl_128_w16_16bpc_c: 2506.5 ( 1.00x) cfl_pred_cfl_128_w16_16bpc_rvv: 605.1 ( 4.14x) cfl_pred_cfl_128_w32_8bpc_c: 8170.0 ( 1.00x) cfl_pred_cfl_128_w32_8bpc_rvv: 715.6 (11.42x) cfl_pred_cfl_128_w32_16bpc_c: 6686.7 ( 1.00x) cfl_pred_cfl_128_w32_16bpc_rvv: 749.7 ( 8.92x) cfl_pred_cfl_left_w4_8bpc_c: 539.4 ( 1.00x) cfl_pred_cfl_left_w4_8bpc_rvv: 393.2 ( 1.37x) cfl_pred_cfl_left_w4_16bpc_c: 452.0 ( 1.00x) cfl_pred_cfl_left_w4_16bpc_rvv: 401.2 ( 1.13x) cfl_pred_cfl_left_w8_8bpc_c: 1572.4 ( 1.00x) cfl_pred_cfl_left_w8_8bpc_rvv: 604.1 ( 2.60x) cfl_pred_cfl_left_w8_16bpc_c: 1274.5 ( 1.00x) cfl_pred_cfl_left_w8_16bpc_rvv: 629.0 
( 2.03x) cfl_pred_cfl_left_w16_8bpc_c: 3096.0 ( 1.00x) cfl_pred_cfl_left_w16_8bpc_rvv: 604.1 ( 5.13x) cfl_pred_cfl_left_w16_16bpc_c: 2591.4 ( 1.00x) cfl_pred_cfl_left_w16_16bpc_rvv: 629.0 ( 4.12x) cfl_pred_cfl_left_w32_8bpc_c: 8266.0 ( 1.00x) cfl_pred_cfl_left_w32_8bpc_rvv: 742.4 (11.13x) cfl_pred_cfl_left_w32_16bpc_c: 6758.0 ( 1.00x) cfl_pred_cfl_left_w32_16bpc_rvv: 773.9 ( 8.73x) cfl_pred_cfl_top_w4_8bpc_c: 532.3 ( 1.00x) cfl_pred_cfl_top_w4_8bpc_rvv: 392.6 ( 1.36x) cfl_pred_cfl_top_w4_16bpc_c: 440.4 ( 1.00x) cfl_pred_cfl_top_w4_16bpc_rvv: 399.6 ( 1.10x) cfl_pred_cfl_top_w8_8bpc_c: 1563.3 ( 1.00x) cfl_pred_cfl_top_w8_8bpc_rvv: 603.6 ( 2.59x) cfl_pred_cfl_top_w8_16bpc_c: 1271.6 ( 1.00x) cfl_pred_cfl_top_w8_16bpc_rvv: 626.1 ( 2.03x) cfl_pred_cfl_top_w16_8bpc_c: 3098.6 ( 1.00x) cfl_pred_cfl_top_w16_8bpc_rvv: 603.6 ( 5.13x) cfl_pred_cfl_top_w16_16bpc_c: 2562.8 ( 1.00x) cfl_pred_cfl_top_w16_16bpc_rvv: 626.0 ( 4.09x) cfl_pred_cfl_top_w32_8bpc_c: 8278.1 ( 1.00x) cfl_pred_cfl_top_w32_8bpc_rvv: 741.8 (11.16x) cfl_pred_cfl_top_w32_16bpc_c: 6799.1 ( 1.00x) cfl_pred_cfl_top_w32_16bpc_rvv: 775.0 ( 8.77x) cfl_pred_cfl_w4_8bpc_c: 559.8 ( 1.00x) cfl_pred_cfl_w4_8bpc_rvv: 421.7 ( 1.33x) cfl_pred_cfl_w4_16bpc_c: 470.2 ( 1.00x) cfl_pred_cfl_w4_16bpc_rvv: 451.3 ( 1.04x) cfl_pred_cfl_w8_8bpc_c: 1605.5 ( 1.00x) cfl_pred_cfl_w8_8bpc_rvv: 632.8 ( 2.54x) cfl_pred_cfl_w8_16bpc_c: 1308.5 ( 1.00x) cfl_pred_cfl_w8_16bpc_rvv: 677.9 ( 1.93x) cfl_pred_cfl_w16_8bpc_c: 3135.0 ( 1.00x) cfl_pred_cfl_w16_8bpc_rvv: 632.9 ( 4.95x) cfl_pred_cfl_w16_16bpc_c: 2625.9 ( 1.00x) cfl_pred_cfl_w16_16bpc_rvv: 677.9 ( 3.87x) cfl_pred_cfl_w32_8bpc_c: 8376.6 ( 1.00x) cfl_pred_cfl_w32_8bpc_rvv: 770.4 (10.87x) cfl_pred_cfl_w32_16bpc_c: 6866.4 ( 1.00x) cfl_pred_cfl_w32_16bpc_rvv: 822.7 ( 8.35x)
-
Bogdan Gligorijević authored
Benchmarks: - Kendryte K230: cdef_filter_4x4_01_8bpc_c: 1339.4 ( 1.00x) cdef_filter_4x4_01_8bpc_rvv: 836.2 ( 1.60x) cdef_filter_4x4_01_16bpc_c: 1369.1 ( 1.00x) cdef_filter_4x4_01_16bpc_rvv: 824.7 ( 1.66x) cdef_filter_4x4_10_8bpc_c: 872.8 ( 1.00x) cdef_filter_4x4_10_8bpc_rvv: 523.9 ( 1.67x) cdef_filter_4x4_10_16bpc_c: 938.2 ( 1.00x) cdef_filter_4x4_10_16bpc_rvv: 517.1 ( 1.81x) cdef_filter_4x4_11_8bpc_c: 2668.3 ( 1.00x) cdef_filter_4x4_11_8bpc_rvv: 1285.0 ( 2.08x) cdef_filter_4x4_11_16bpc_c: 2922.1 ( 1.00x) cdef_filter_4x4_11_16bpc_rvv: 1291.0 ( 2.26x) cdef_filter_4x8_01_8bpc_c: 2489.1 ( 1.00x) cdef_filter_4x8_01_8bpc_rvv: 1594.3 ( 1.56x) cdef_filter_4x8_01_16bpc_c: 2528.1 ( 1.00x) cdef_filter_4x8_01_16bpc_rvv: 1566.6 ( 1.61x) cdef_filter_4x8_10_8bpc_c: 1576.9 ( 1.00x) cdef_filter_4x8_10_8bpc_rvv: 967.1 ( 1.63x) cdef_filter_4x8_10_16bpc_c: 1641.3 ( 1.00x) cdef_filter_4x8_10_16bpc_rvv: 947.1 ( 1.73x) cdef_filter_4x8_11_8bpc_c: 5164.0 ( 1.00x) cdef_filter_4x8_11_8bpc_rvv: 2490.7 ( 2.07x) cdef_filter_4x8_11_16bpc_c: 5732.3 ( 1.00x) cdef_filter_4x8_11_16bpc_rvv: 2499.2 ( 2.29x) cdef_filter_8x8_01_8bpc_c: 4742.3 ( 1.00x) cdef_filter_8x8_01_8bpc_rvv: 1628.6 ( 2.91x) cdef_filter_8x8_01_16bpc_c: 4785.0 ( 1.00x) cdef_filter_8x8_01_16bpc_rvv: 1595.5 ( 3.00x) cdef_filter_8x8_10_8bpc_c: 2962.4 ( 1.00x) cdef_filter_8x8_10_8bpc_rvv: 1000.8 ( 2.96x) cdef_filter_8x8_10_16bpc_c: 3022.4 ( 1.00x) cdef_filter_8x8_10_16bpc_rvv: 975.7 ( 3.10x) cdef_filter_8x8_11_8bpc_c: 12623.9 ( 1.00x) cdef_filter_8x8_11_8bpc_rvv: 2525.4 ( 5.00x) cdef_filter_8x8_11_16bpc_c: 12470.7 ( 1.00x) cdef_filter_8x8_11_16bpc_rvv: 2528.2 ( 4.93x) - Banana Pi BPI-F3: cdef_filter_4x4_01_8bpc_c: 1281.2 ( 1.00x) cdef_filter_4x4_01_8bpc_rvv: 813.0 ( 1.58x) cdef_filter_4x4_01_16bpc_c: 1300.8 ( 1.00x) cdef_filter_4x4_01_16bpc_rvv: 808.9 ( 1.61x) cdef_filter_4x4_10_8bpc_c: 843.0 ( 1.00x) cdef_filter_4x4_10_8bpc_rvv: 498.4 ( 1.69x) cdef_filter_4x4_10_16bpc_c: 903.6 ( 1.00x) cdef_filter_4x4_10_16bpc_rvv: 497.9 ( 1.81x) 
cdef_filter_4x4_11_8bpc_c: 2614.1 ( 1.00x) cdef_filter_4x4_11_8bpc_rvv: 1219.6 ( 2.14x) cdef_filter_4x4_11_16bpc_c: 2795.6 ( 1.00x) cdef_filter_4x4_11_16bpc_rvv: 1243.1 ( 2.25x) cdef_filter_4x8_01_8bpc_c: 2405.4 ( 1.00x) cdef_filter_4x8_01_8bpc_rvv: 1548.5 ( 1.55x) cdef_filter_4x8_01_16bpc_c: 2402.7 ( 1.00x) cdef_filter_4x8_01_16bpc_rvv: 1542.7 ( 1.56x) cdef_filter_4x8_10_8bpc_c: 1522.0 ( 1.00x) cdef_filter_4x8_10_8bpc_rvv: 917.4 ( 1.66x) cdef_filter_4x8_10_16bpc_c: 1589.2 ( 1.00x) cdef_filter_4x8_10_16bpc_rvv: 915.9 ( 1.74x) cdef_filter_4x8_11_8bpc_c: 5050.7 ( 1.00x) cdef_filter_4x8_11_8bpc_rvv: 2358.7 ( 2.14x) cdef_filter_4x8_11_16bpc_c: 5510.5 ( 1.00x) cdef_filter_4x8_11_16bpc_rvv: 2411.6 ( 2.28x) cdef_filter_8x8_01_8bpc_c: 4558.3 ( 1.00x) cdef_filter_8x8_01_8bpc_rvv: 1579.7 ( 2.89x) cdef_filter_8x8_01_16bpc_c: 4551.1 ( 1.00x) cdef_filter_8x8_01_16bpc_rvv: 1571.1 ( 2.90x) cdef_filter_8x8_10_8bpc_c: 2869.3 ( 1.00x) cdef_filter_8x8_10_8bpc_rvv: 948.4 ( 3.03x) cdef_filter_8x8_10_16bpc_c: 2928.6 ( 1.00x) cdef_filter_8x8_10_16bpc_rvv: 944.2 ( 3.10x) cdef_filter_8x8_11_8bpc_c: 12317.5 ( 1.00x) cdef_filter_8x8_11_8bpc_rvv: 2389.7 ( 5.15x) cdef_filter_8x8_11_16bpc_c: 11950.6 ( 1.00x) cdef_filter_8x8_11_16bpc_rvv: 2440.1 ( 4.90x)
-
Bogdan Gligorijević authored
Benchmarks:
- Kendryte K230:
  pal_idx_finish_w4_c: 122.5 ( 1.00x)
  pal_idx_finish_w4_rvv: 107.2 ( 1.14x)
  pal_idx_finish_w8_c: 302.8 ( 1.00x)
  pal_idx_finish_w8_rvv: 197.9 ( 1.53x)
  pal_idx_finish_w16_c: 868.2 ( 1.00x)
  pal_idx_finish_w16_rvv: 438.5 ( 1.98x)
  pal_idx_finish_w32_c: 1966.5 ( 1.00x)
  pal_idx_finish_w32_rvv: 833.0 ( 2.36x)
  pal_idx_finish_w64_c: 4737.5 ( 1.00x)
  pal_idx_finish_w64_rvv: 1818.3 ( 2.61x)
- Banana Pi BPI-F3:
  pal_idx_finish_w4_c: 122.4 ( 1.00x)
  pal_idx_finish_w4_rvv: 132.0 ( 0.93x)
  pal_idx_finish_w8_c: 289.4 ( 1.00x)
  pal_idx_finish_w8_rvv: 195.8 ( 1.48x)
  pal_idx_finish_w16_c: 788.0 ( 1.00x)
  pal_idx_finish_w16_rvv: 430.6 ( 1.83x)
  pal_idx_finish_w32_c: 1699.2 ( 1.00x)
  pal_idx_finish_w32_rvv: 816.3 ( 2.08x)
  pal_idx_finish_w64_c: 3977.7 ( 1.00x)
  pal_idx_finish_w64_rvv: 1779.4 ( 2.24x)
-
Nathan E. Egge authored
-
- Oct 07, 2024
-
-
Henrik Gramner authored
Instead of using gathers, we can calculate the value of sgr_x_by_x[min(z, 255)] by computing 256 / (z + 1) in floating point, with some clipping for z == 0 and z >= 255. Because the division needs only modest precision, it can be performed using an approximate reciprocal, which is significantly faster than a regular division. Gather instructions are slow on all AMD CPUs, and on most Intel CPUs ever since microcode updates were issued as a workaround for the Gather Data Sampling side-channel vulnerability.
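As a rough scalar illustration of the idea, the table lookup can be replaced by arithmetic along the following lines. The function name `sgr_x_approx` is hypothetical, the round-to-nearest step is an assumption, and an exact division stands in for the approximate reciprocal that SIMD code would use. The commit message mentions clipping for z >= 255 without giving its exact form, so this sketch only clamps the index and makes no claim about matching the table's last entry.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: approximates sgr_x_by_x[min(z, 255)] as 256 / (z + 1). */
static uint8_t sgr_x_approx(uint32_t z)
{
    if (z > 255) z = 255;               /* mirrors the min(z, 255) index clamp */
    float x = 256.0f / (float)(z + 1);  /* SIMD code would use an approximate reciprocal */
    int r = (int)(x + 0.5f);            /* round to nearest (assumed rounding) */
    if (r > 255) r = 255;               /* z == 0 yields 256, which does not fit in 8 bits */
    assert(r >= 1);                     /* holds for every clamped z in this sketch */
    return (uint8_t)r;
}
```

For example, `sgr_x_approx(3)` evaluates 256 / 4 and returns 64, with no memory access involved, which is what lets a vectorized version drop the gather.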
-
- Oct 02, 2024
-
-
Luca Barbato authored
-
- Sep 30, 2024
-
-
Change-Id: I78fe788113ff2487ba1ce2e7d0c7d7c78c5a8c58
-
Change-Id: I1566e8145d36296f2c76107cf15fc2cc7ac0ecc7
-
The performance data is as follows:
  save_tmvs_c: 3938.6 ( 1.00x)
  save_tmvs_lsx: 1355.3 ( 2.91x)
-
bench performance before:
  lpf_h_sb_y_w16_8bpc_c: 117.0 ( 1.00x)
  lpf_h_sb_y_w16_8bpc_lsx: 33.9 ( 3.46x)
  lpf_v_sb_y_w16_8bpc_c: 132.1 ( 1.00x)
  lpf_v_sb_y_w16_8bpc_lsx: 59.7 ( 2.21x)
bench performance after:
  lpf_h_sb_y_w16_8bpc_c: 114.9 ( 1.00x)
  lpf_h_sb_y_w16_8bpc_lsx: 32.0 ( 3.59x)
  lpf_v_sb_y_w16_8bpc_c: 132.5 ( 1.00x)
  lpf_v_sb_y_w16_8bpc_lsx: 28.1 ( 4.72x)
Change-Id: Ie64e164a9416c438f6b3881ce18fb42e2ddd073d
-
sgr_3x3_8bpc_c: 27233.1 ( 1.00x)
sgr_3x3_8bpc_lsx: 12874.7 ( 2.12x)
sgr_3x3_8bpc_lasx: 10183.7 ( 2.67x)
Change-Id: I2aa469e8560733d6191396186bf776a12ad6e4a3
-
before:
  warp_8x8_8bpc_c: 109.8 ( 1.00x)
  warp_8x8_8bpc_lsx: 44.6 ( 2.46x)
  warp_8x8t_8bpc_c: 97.5 ( 1.00x)
  warp_8x8t_8bpc_lsx: 43.7 ( 2.23x)
after:
  warp_8x8_8bpc_c: 109.8 ( 1.00x)
  warp_8x8_8bpc_lsx: 39.2 ( 2.80x)
  warp_8x8t_8bpc_c: 97.5 ( 1.00x)
  warp_8x8t_8bpc_lsx: 37.9 ( 2.57x)
Change-Id: I11728c2c30821b8e2b1c85208710dfe5d1c1269c
-
mct_8tap_regular_w8_h_8bpc_c: 47.1 ( 1.00x)
mct_8tap_regular_w8_h_8bpc_lsx: 6.3 ( 7.46x)
mct_8tap_regular_w8_h_8bpc_lasx: 4.4 (10.80x)
mct_8tap_regular_w8_hv_8bpc_c: 118.9 ( 1.00x)
mct_8tap_regular_w8_hv_8bpc_lsx: 19.2 ( 6.20x)
mct_8tap_regular_w8_hv_8bpc_lasx: 13.7 ( 8.69x)
mct_8tap_regular_w8_v_8bpc_c: 60.3 ( 1.00x)
mct_8tap_regular_w8_v_8bpc_lsx: 5.4 (11.08x)
mct_8tap_regular_w8_v_8bpc_lasx: 3.3 (18.33x)
Change-Id: I1140f6ffbd738166f2581bc9111ebbdf6f9fa72c
-
wiener_5tap_8bpc_c: 18382.0 ( 1.00x)
wiener_5tap_8bpc_lsx: 4166.9 ( 4.41x)
wiener_5tap_8bpc_lasx: 2832.2 ( 6.49x)
wiener_7tap_8bpc_c: 18339.6 ( 1.00x)
wiener_7tap_8bpc_lsx: 4168.3 ( 4.40x)
wiener_7tap_8bpc_lasx: 2832.5 ( 6.47x)
Change-Id: I183a8cb008203fb61683b0543d9409d58d141a2e
-
load_tmvs_c: 9702.0 ( 1.00x)
load_tmvs_lsx: 7857.0 ( 1.23x)
-
intra_pred_z1_w4_8bpc_c: 16.5 ( 1.00x)
intra_pred_z1_w4_8bpc_lsx: 7.1 ( 2.31x)
intra_pred_z1_w8_8bpc_c: 31.9 ( 1.00x)
intra_pred_z1_w8_8bpc_lsx: 10.0 ( 3.20x)
intra_pred_z1_w16_8bpc_c: 80.1 ( 1.00x)
intra_pred_z1_w16_8bpc_lsx: 20.2 ( 3.96x)
intra_pred_z1_w32_8bpc_c: 185.8 ( 1.00x)
intra_pred_z1_w32_8bpc_lsx: 40.8 ( 4.55x)
intra_pred_z1_w64_8bpc_c: 511.1 ( 1.00x)
intra_pred_z1_w64_8bpc_lsx: 99.0 ( 5.16x)
Change-Id: Id7591e9b87e5b4d7fc3f438397e25dc6ca8e7f91
-
emu_edge_w4_8bpc_c: 9.0 ( 1.00x)
emu_edge_w4_8bpc_lsx: 6.7 ( 1.34x)
emu_edge_w8_8bpc_c: 12.9 ( 1.00x)
emu_edge_w8_8bpc_lsx: 9.2 ( 1.40x)
emu_edge_w16_8bpc_c: 20.0 ( 1.00x)
emu_edge_w16_8bpc_lsx: 16.3 ( 1.23x)
emu_edge_w32_8bpc_c: 44.6 ( 1.00x)
emu_edge_w32_8bpc_lsx: 33.3 ( 1.34x)
emu_edge_w64_8bpc_c: 79.9 ( 1.00x)
emu_edge_w64_8bpc_lsx: 66.2 ( 1.21x)
emu_edge_w128_8bpc_c: 193.9 ( 1.00x)
emu_edge_w128_8bpc_lsx: 197.8 ( 0.98x)
Change-Id: I180c94d311509740b03793419d5790a931532980
-
checkasm now calls the test function 'func_new' through the wrapper 'checked_call' instead of calling it directly. The wrapper checks whether 'func_new' correctly saves and restores the static registers: it writes dirty values to the static registers and, after 'func_new' returns, verifies that those registers still hold the same values.
Change-Id: Ia9290b55ab0f2dd87801f6fd175813d3f717d851
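The check performed by such a wrapper can be modelled in portable C. This is only a simulation under stated assumptions: the real 'checked_call' has to be written in assembly to touch actual registers, whereas here a global array `fake_static_regs` stands in for them, and the names `checked_call_sketch` and `NUM_STATIC_REGS`, the register count, and the sentinel pattern are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_STATIC_REGS 9  /* assumed count, for illustration only */

/* Simulated register file; real code would use the CPU's static registers. */
static uint64_t fake_static_regs[NUM_STATIC_REGS];

typedef void (*test_fn)(void);

/* Returns 1 if func left every simulated static register unchanged. */
static int checked_call_sketch(test_fn func)
{
    uint64_t expected[NUM_STATIC_REGS];
    assert(func != NULL);
    /* Step 1: write recognizable dirty values into the registers. */
    for (int i = 0; i < NUM_STATIC_REGS; i++) {
        expected[i] = 0xdeadbeef00000000ULL | (uint64_t)i;
        fake_static_regs[i] = expected[i];
    }
    /* Step 2: call the function under test. */
    func();
    /* Step 3: the dirty values must have survived the call. */
    return memcmp(fake_static_regs, expected, sizeof(expected)) == 0;
}

/* Uses a register but restores it before returning. */
static void well_behaved(void)
{
    uint64_t saved = fake_static_regs[0];
    fake_static_regs[0] = 42;
    fake_static_regs[0] = saved;
}

/* Overwrites a register without restoring it. */
static void clobbers(void)
{
    fake_static_regs[3] = 0;
}
```

Here `checked_call_sketch(well_behaved)` reports success, while `checked_call_sketch(clobbers)` reports a failure, which is the kind of bug the wrapper is meant to catch.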
-
intra_pred_filter_w4_8bpc_c: 17.9 ( 1.00x)
intra_pred_filter_w4_8bpc_lsx: 8.9 ( 2.00x)
intra_pred_filter_w8_8bpc_c: 55.3 ( 1.00x)
intra_pred_filter_w8_8bpc_lsx: 23.8 ( 2.33x)
intra_pred_filter_w16_8bpc_c: 109.4 ( 1.00x)
intra_pred_filter_w16_8bpc_lsx: 49.1 ( 2.23x)
intra_pred_filter_w32_8bpc_c: 270.2 ( 1.00x)
intra_pred_filter_w32_8bpc_lsx: 126.1 ( 2.14x)
Change-Id: Ic4c23cb1d54d5f8557c31cdfbbd54f8beaaa32c2
-