- Nov 15, 2019
-
-
                                       A73              A53
                                  Earlier    Now   Earlier    Now
intra_pred_dc_top_w64_8bpc_neon:    344.4  344.6     253.4  252.3
-
- Nov 12, 2019
-
-
Martin Storsjö authored
This was requested in the review of the arm32 version of the same.
-
Martin Storsjö authored
This removes one redundant instruction for loop filters smaller than 16.
-
Martin Storsjö authored
This doesn't change performance measurably, but eases potential future maintenance of the code.
-
Martin Storsjö authored
-
Martin Storsjö authored
The code is a fairly exact 1:1 port of the ARM64 code, but operating on 8 pixels at a time, instead of 16.

Relative speedup over C code according to checkasm:
Cortex                       A7    A8    A9   A53   A72   A73
lpf_h_sb_uv_w4_8bpc_neon:  1.36  1.40  1.25  1.71  1.55  1.59
lpf_h_sb_uv_w6_8bpc_neon:  2.18  2.11  1.74  2.65  2.32  2.34
lpf_h_sb_y_w4_8bpc_neon:   1.48  1.43  1.20  1.91  1.49  1.64
lpf_h_sb_y_w8_8bpc_neon:   2.34  2.05  1.78  2.84  2.35  2.69
lpf_h_sb_y_w16_8bpc_neon:  2.13  1.83  1.63  2.51  2.10  2.35
lpf_v_sb_uv_w4_8bpc_neon:  1.69  1.66  1.60  2.16  2.24  2.24
lpf_v_sb_uv_w6_8bpc_neon:  2.68  2.43  2.22  3.53  3.44  3.35
lpf_v_sb_y_w4_8bpc_neon:   1.74  1.74  1.43  2.34  2.14  2.18
lpf_v_sb_y_w8_8bpc_neon:   2.92  2.47  2.19  3.55  3.22  3.54
lpf_v_sb_y_w16_8bpc_neon:  2.62  2.19  1.98  3.25  2.80  3.10

Comparison to the original ARM64 assembly:
ARM64:                        A53     A72     A73
lpf_h_sb_uv_w4_8bpc_neon:   702.5   518.2   529.1
lpf_h_sb_uv_w6_8bpc_neon:  1007.3   672.6   736.6
lpf_h_sb_y_w4_8bpc_neon:   1652.8  1261.2  1276.5
lpf_h_sb_y_w8_8bpc_neon:   2144.7  1559.8  1638.7
lpf_h_sb_y_w16_8bpc_neon:  2318.3  1757.2  1792.8
lpf_v_sb_uv_w4_8bpc_neon:   447.1   302.0   292.4
lpf_v_sb_uv_w6_8bpc_neon:   600.0   397.7   406.9
lpf_v_sb_y_w4_8bpc_neon:   1212.6   840.1   818.4
lpf_v_sb_y_w8_8bpc_neon:   1623.3  1167.4  1156.7
lpf_v_sb_y_w16_8bpc_neon:  1694.9  1237.9  1182.3
ARM32:
lpf_h_sb_uv_w4_8bpc_neon:   821.2   501.1   500.8
lpf_h_sb_uv_w6_8bpc_neon:  1232.0   715.7   746.6
lpf_h_sb_y_w4_8bpc_neon:   2208.1  1373.2  1414.7
lpf_h_sb_y_w8_8bpc_neon:   3138.3  1843.1  1915.2
lpf_h_sb_y_w16_8bpc_neon:  3293.1  1842.5  1975.9
lpf_v_sb_uv_w4_8bpc_neon:   619.9   326.7   324.9
lpf_v_sb_uv_w6_8bpc_neon:   855.9   446.7   468.2
lpf_v_sb_y_w4_8bpc_neon:   1737.6   935.5  1007.0
lpf_v_sb_y_w8_8bpc_neon:   2346.7  1232.8  1298.3
lpf_v_sb_y_w16_8bpc_neon:  2353.4  1283.4  1379.9
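To read the ARM64/ARM32 comparison above as a relative cost rather than raw cycle counts, the ARM32 figures can be divided by the ARM64 ones. The following is only a sketch: the two rows are copied from the tables in this commit message, while the dictionary layout and the helper name `arm32_over_arm64` are made up for illustration and are not part of dav1d or checkasm.

```python
# Cycle counts copied from the commit message (checkasm, lower is better).
# Tuple order matches the table columns: Cortex A53, A72, A73.
CORES = ("A53", "A72", "A73")

ARM64 = {
    "lpf_h_sb_y_w16_8bpc_neon": (2318.3, 1757.2, 1792.8),
    "lpf_v_sb_y_w16_8bpc_neon": (1694.9, 1237.9, 1182.3),
}
ARM32 = {
    "lpf_h_sb_y_w16_8bpc_neon": (3293.1, 1842.5, 1975.9),
    "lpf_v_sb_y_w16_8bpc_neon": (2353.4, 1283.4, 1379.9),
}


def arm32_over_arm64(name):
    """Hypothetical helper: per-core ratio of ARM32 to ARM64 cycles.

    A value above 1.0 means the ARM32 version needs more cycles.
    """
    return {
        core: a32 / a64
        for core, a32, a64 in zip(CORES, ARM32[name], ARM64[name])
    }


for name in ARM64:
    ratios = arm32_over_arm64(name)
    print(name, " ".join(f"{core}={ratio:.2f}" for core, ratio in ratios.items()))
```

A ratio close to 1.0 on a given core suggests the port loses little there despite handling 8 pixels per iteration instead of 16; the remaining rows can be added to the dictionaries in the same way.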
-
- Nov 10, 2019
-
-
Martin Storsjö authored
Otherwise the macro would interfere with local labels 1 and 2 in the context where the macro is expanded.
-
- Nov 01, 2019
-
-
Ronald S. Bultje authored
Before:
gen_grain_uv_ar2_8bpc_420_avx2: 29176.2
After:
gen_grain_uv_ar2_8bpc_420_avx2: 26794.0
-
- Oct 28, 2019
-
-
Jean-Baptiste Kempf authored
-
- Oct 25, 2019
-
-
Henrik Gramner authored
-
Jean-Baptiste Kempf authored
-
- Oct 24, 2019
-
-
-
-
Victorien Le Couviour--Tuffet authored
Also slightly optimized the 32-bit SSSE3, especially by the removal of an XMM store/load.

---------------------
x86_64:
------------------------------------------
wiener_chroma_8bpc_c:     193155.1
wiener_chroma_8bpc_sse2:   48973.4
wiener_chroma_8bpc_ssse3:  31486.3
---------------------
wiener_luma_8bpc_c:       192787.5
wiener_luma_8bpc_sse2:     48674.9
wiener_luma_8bpc_ssse3:    30446.3
------------------------------------------

---------------------
x86_32:
------------------------------------------
wiener_chroma_8bpc_c:     309861.0
wiener_chroma_8bpc_sse2:   52345.9
wiener_chroma_8bpc_ssse3:  32983.2
---------------------
wiener_luma_8bpc_c:       317909.1
wiener_luma_8bpc_sse2:     52522.1
wiener_luma_8bpc_ssse3:    33323.1
------------------------------------------
-
Victorien Le Couviour--Tuffet authored
---------------------
x86_64:
------------------------------------------
warp_8x8_8bpc_c:       1761.5
warp_8x8_8bpc_sse2:     583.0
warp_8x8_8bpc_ssse3:    329.3
---------------------
warp_8x8t_8bpc_c:      1694.3
warp_8x8t_8bpc_sse2:    577.6
warp_8x8t_8bpc_ssse3:   334.1
------------------------------------------

---------------------
x86_32:
------------------------------------------
warp_8x8_8bpc_c:       1842.6
warp_8x8_8bpc_sse2:     677.1
warp_8x8_8bpc_ssse3:    394.9
---------------------
warp_8x8t_8bpc_c:      1741.1
warp_8x8t_8bpc_sse2:    648.5
warp_8x8t_8bpc_ssse3:   372.6
------------------------------------------
-
-
-
The actual register used for this operand doesn't matter, but make it consistent with the others.
-
-
-
As the neon function processes two rows at a time, this has passed unnoticed.
-
The code is mostly a 1:1 port of the ARM64 code, with slightly worse scheduling due to fewer temporary registers available. The sgr_finish_filter1_neon function (used in the 3x3 and mix cases) processes 4 pixels at a time while the ARM64 version processes 8, due to not having enough registers available.

Relative speedup over C code:
Cortex                       A7    A8    A9   A53   A72   A73
selfguided_3x3_8bpc_neon:  2.12  2.89  1.79  2.61  2.03  3.87
selfguided_5x5_8bpc_neon:  2.50  3.41  2.16  3.14  2.74  4.64
selfguided_mix_8bpc_neon:  2.24  2.98  1.94  2.82  2.28  4.14

Comparison to the original ARM64 assembly:
ARM64:             Cortex       A53       A72       A73
selfguided_3x3_8bpc_neon:  486215.5  359445.6  341317.7
selfguided_5x5_8bpc_neon:  351210.8  267427.2  243399.3
selfguided_mix_8bpc_neon:  820489.1  610909.8  569946.6
ARM32:
selfguided_3x3_8bpc_neon:  542958.8  379448.8  353229.1
selfguided_5x5_8bpc_neon:  351299.6  263685.2  242415.9
selfguided_mix_8bpc_neon:  881587.6  629934.0  580121.2
-
-
- Oct 22, 2019
-
-
Only the upsampling code requires a size of width/height * 2, but upsampling is never performed when width or height is above 8.
-
-
- Oct 21, 2019
-
-
Pre-permuting the registers in INIT_*MM avx512 (AVX512_MM_PERMUTATION) is redundant. It causes the register mapping to be the same as without the initial AVX512_MM_PERMUTATION, with the user SWAPs applied.

For example...

    INIT_YMM avx512
    SWAP m0, m16
    SAVE_MM_PERMUTATION
    ; do whatever
    LOAD_MM_PERMUTATION

... would result in m0 mapping to ymm16 instead of ymm0 and m1 to ymm1 instead of ymm17.
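The two mappings contrasted in this message can be reproduced with a toy model. This is only a sketch and not how x86inc implements anything: it assumes that AVX512_MM_PERMUTATION exchanges m0..m15 with m16..m31, and the Python names (`identity`, `swap`, `avx512_mm_permutation`) are invented for illustration.

```python
# Toy model: mapping[i] is the physical register index that logical name mI
# currently refers to (m3 -> ymm<mapping[3]>).


def identity(count=32):
    """Start with every logical name pointing at its own physical register."""
    return list(range(count))


def swap(mapping, a, b):
    """Model of SWAP ma, mb: exchange what the two logical names refer to."""
    mapping[a], mapping[b] = mapping[b], mapping[a]


def avx512_mm_permutation(mapping):
    """Assumed behaviour of the pre-permutation: exchange m0..m15 with m16..m31."""
    for i in range(16):
        swap(mapping, i, i + 16)


# Mapping the message treats as correct: pre-permutation, then the user SWAP.
intended = identity()
avx512_mm_permutation(intended)
swap(intended, 0, 16)

# Mapping the message describes as the faulty result: the user SWAP applied
# as if the initial pre-permutation had never happened.
faulty = identity()
swap(faulty, 0, 16)

print("intended: m0 -> ymm%d, m1 -> ymm%d" % (intended[0], intended[1]))
print("faulty:   m0 -> ymm%d, m1 -> ymm%d" % (faulty[0], faulty[1]))
```

Under these assumptions the model gives m0 -> ymm0 and m1 -> ymm17 for the intended mapping, and m0 -> ymm16 and m1 -> ymm1 for the faulty one, which are the four register numbers named in the message.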
-
- Oct 18, 2019
-
-
Victorien Le Couviour--Tuffet authored
---------------------
x86_64:
------------------------------------------
cdef_filter_4x4_8bpc_c:      1376.0
cdef_filter_4x4_8bpc_sse2:    177.6
cdef_filter_4x4_8bpc_ssse3:   132.5
---------------------
cdef_filter_4x8_8bpc_c:      2725.0
cdef_filter_4x8_8bpc_sse2:    327.6
cdef_filter_4x8_8bpc_ssse3:   234.9
---------------------
cdef_filter_8x8_8bpc_c:      5938.8
cdef_filter_8x8_8bpc_sse2:    556.8
cdef_filter_8x8_8bpc_ssse3:   388.1
------------------------------------------

---------------------
x86_32:
------------------------------------------
cdef_filter_4x4_8bpc_c:      1569.5
cdef_filter_4x4_8bpc_sse2:    201.9
cdef_filter_4x4_8bpc_ssse3:   162.3
---------------------
cdef_filter_4x8_8bpc_c:      3141.6
cdef_filter_4x8_8bpc_sse2:    368.3
cdef_filter_4x8_8bpc_ssse3:   283.4
---------------------
cdef_filter_8x8_8bpc_c:      6534.5
cdef_filter_8x8_8bpc_sse2:    666.7
cdef_filter_8x8_8bpc_ssse3:   503.5
------------------------------------------
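The x86 commit messages list raw cycle counts rather than speedups. A small script along these lines could turn "name: cycles" lines into speedups over the C version. It is a sketch only: the sample lines are copied from the x86_64 figures above, and the regular expression and the `speedups` helper are assumptions made for illustration, not a checkasm feature.

```python
import re

# Three lines in the "name: cycles" form used in the commit message.
SAMPLE = """\
cdef_filter_8x8_8bpc_c: 5938.8
cdef_filter_8x8_8bpc_sse2: 556.8
cdef_filter_8x8_8bpc_ssse3: 388.1
"""

# Split "<function>_<implementation>: <cycles>"; only the three suffixes that
# appear in this message are recognised.
LINE = re.compile(r"^(\w+)_(c|sse2|ssse3):\s+([\d.]+)$")


def speedups(text):
    """Hypothetical helper: {function: {implementation: C cycles / its cycles}}."""
    cycles = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            func, impl, value = match.groups()
            cycles.setdefault(func, {})[impl] = float(value)

    result = {}
    for func, by_impl in cycles.items():
        base = by_impl.get("c")
        if base is None:
            continue  # no C baseline listed, nothing to compare against
        result[func] = {
            impl: base / value for impl, value in by_impl.items() if impl != "c"
        }
    return result


for func, by_impl in speedups(SAMPLE).items():
    for impl, factor in by_impl.items():
        print(f"{func} {impl}: {factor:.2f}x")
```

Separator lines made of dashes and the `x86_64:` / `x86_32:` headers do not match the pattern and are skipped, so a whole section of the message can be passed in unchanged, one architecture at a time.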
-
- Oct 16, 2019
-
-
Victorien Le Couviour--Tuffet authored
-
- Oct 11, 2019
-
-
Martin Storsjö authored
This should expose issues like #300, where labels referenced from debug info are misaligned.
-
Martin Storsjö authored
If building with debug information enabled, binutils errors out with "unaligned opcodes detected in executable segment" if there are symbols (even local ones that don't end up in the symbol table) that point to unaligned addresses in the text section. This fixes issue #300.
-
Jean-Baptiste Kempf authored
-
Martin Storsjö authored
-
- Oct 10, 2019
-
-
Luc Trudeau authored
-
Relative speedup over the C code:
Cortex                      A53   A72   A73
cfl_ac_420_w4_8bpc_neon:   7.73  6.48  9.22
cfl_ac_420_w8_8bpc_neon:   6.70  5.56  6.95
cfl_ac_420_w16_8bpc_neon:  6.51  6.93  6.67
cfl_ac_422_w4_8bpc_neon:   9.25  7.70  9.75
cfl_ac_422_w8_8bpc_neon:   8.53  5.95  7.13
cfl_ac_422_w16_8bpc_neon:  7.08  6.87  6.06
-
Relative speedup over the C code:
Cortex                              A53    A72    A73
cfl_pred_cfl_128_w4_8bpc_neon:    10.81   7.90   9.80
cfl_pred_cfl_128_w8_8bpc_neon:    18.38  11.15  13.24
cfl_pred_cfl_128_w16_8bpc_neon:   16.52  10.83  16.00
cfl_pred_cfl_128_w32_8bpc_neon:    3.27   3.60   3.70
cfl_pred_cfl_left_w4_8bpc_neon:    9.82   7.38   8.76
cfl_pred_cfl_left_w8_8bpc_neon:   17.22  10.63  11.97
cfl_pred_cfl_left_w16_8bpc_neon:  16.03  10.49  15.66
cfl_pred_cfl_left_w32_8bpc_neon:   3.28   3.61   3.72
cfl_pred_cfl_top_w4_8bpc_neon:     9.74   7.39   9.29
cfl_pred_cfl_top_w8_8bpc_neon:    17.48  10.89  12.58
cfl_pred_cfl_top_w16_8bpc_neon:   16.01  10.62  15.31
cfl_pred_cfl_top_w32_8bpc_neon:    3.25   3.62   3.75
cfl_pred_cfl_w4_8bpc_neon:         8.39   6.34   8.04
cfl_pred_cfl_w8_8bpc_neon:        15.99  10.12  12.42
cfl_pred_cfl_w16_8bpc_neon:       15.25  10.40  15.12
cfl_pred_cfl_w32_8bpc_neon:        3.23   3.58   3.71

The C code gets autovectorized for w >= 32, which is why the relative speedup looks strange (but the performance of the NEON functions is completely as expected).
-
Use a different layout of the filter_intra_taps depending on architecture; the current one is optimized for the x86 SIMD implementation.

Relative speedups over the C code:
Cortex                              A53    A72    A73
intra_pred_filter_w4_8bpc_neon:    6.38   2.81   4.43
intra_pred_filter_w8_8bpc_neon:    9.30   3.62   5.71
intra_pred_filter_w16_8bpc_neon:   9.85   3.98   6.42
intra_pred_filter_w32_8bpc_neon:  10.77   4.08   7.09
-
Relative speedups over the C code:
Cortex                     A53    A72    A73
pal_pred_w4_8bpc_neon:    8.75   6.15   7.60
pal_pred_w8_8bpc_neon:   19.93  11.79  10.98
pal_pred_w16_8bpc_neon:  24.68  13.28  16.06
pal_pred_w32_8bpc_neon:  23.56  11.81  16.74
pal_pred_w64_8bpc_neon:  23.16  12.19  17.60
-
Relative speedups over the C code:
Cortex                                A53    A72    A73
intra_pred_smooth_h_w4_8bpc_neon:    8.02   4.53   7.09
intra_pred_smooth_h_w8_8bpc_neon:   16.59   5.91   9.32
intra_pred_smooth_h_w16_8bpc_neon:  18.80   5.54  10.10
intra_pred_smooth_h_w32_8bpc_neon:   5.07   4.43   4.60
intra_pred_smooth_h_w64_8bpc_neon:   5.03   4.26   4.34
intra_pred_smooth_v_w4_8bpc_neon:    9.11   5.51   7.75
intra_pred_smooth_v_w8_8bpc_neon:   17.07   6.86  10.55
intra_pred_smooth_v_w16_8bpc_neon:  17.98   6.38  11.52
intra_pred_smooth_v_w32_8bpc_neon:  11.69   5.66   8.09
intra_pred_smooth_v_w64_8bpc_neon:   8.44   4.34   5.72
intra_pred_smooth_w4_8bpc_neon:      9.81   4.85   6.93
intra_pred_smooth_w8_8bpc_neon:     16.05   5.60   9.26
intra_pred_smooth_w16_8bpc_neon:    14.01   5.02   8.96
intra_pred_smooth_w32_8bpc_neon:     9.29   5.02   7.25
intra_pred_smooth_w64_8bpc_neon:     6.53   3.94   5.26
-
Relative speedups over the C code:
Cortex                             A53    A72    A73
intra_pred_paeth_w4_8bpc_neon:    8.36   6.55   7.27
intra_pred_paeth_w8_8bpc_neon:   15.24  11.36  11.34
intra_pred_paeth_w16_8bpc_neon:  16.63  13.20  14.17
intra_pred_paeth_w32_8bpc_neon:  10.83   9.21   9.87
intra_pred_paeth_w64_8bpc_neon:   8.37   7.07   7.45
-
-