- Jun 18, 2021
-
-
The stack size calculation ended up being incorrect when the stack alignment was larger than 16 due to auto-generated alignment padding.
-
Matthias Dressel authored
-
- Jun 17, 2021
-
-
Henrik Gramner authored
-
- Jun 15, 2021
-
-
Particularly in code that makes heavy use of macros it's possible to end up with 3-operand instructions with a memory operand in src1. In the case of SSE this works fine due to automatic move insertions, but in AVX that fails since memory operands are only allowed in src2. The main purpose of this feature is to minimize the amount of code changes required to facilitate conversion of existing SSE code to AVX.
-
Speeds up decoding by ~4% when TMVP is disabled in the sequence header.
-
Ronald S. Bultje authored
-
- Jun 12, 2021
-
-
Relative speedup over C code:
                            Cortex       A7       A8       A9      A53      A72      A73
fguv_32x32xn_16bpc_420_csfl0_neon:     3.47     1.72     2.99     4.18     2.68     6.19
fguv_32x32xn_16bpc_420_csfl1_neon:     3.24     1.36     2.58     3.78     2.73     5.27
fguv_32x32xn_16bpc_422_csfl0_neon:     3.57     2.07     3.05     4.32     2.74     6.20
fguv_32x32xn_16bpc_422_csfl1_neon:     3.33     1.44     2.62     3.89     2.71     5.28
fguv_32x32xn_16bpc_444_csfl0_neon:     3.48     1.69     3.06     4.48     2.97     6.69
fguv_32x32xn_16bpc_444_csfl1_neon:     3.06     1.16     2.36     3.85     2.75     5.19
fgy_32x32xn_16bpc_neon:                2.89     1.05     2.29     3.49     2.49     3.15

Absolute numbers:
                            Cortex       A7       A8       A9      A53      A72      A73
fguv_32x32xn_16bpc_420_csfl0_neon:   6237.3  12701.0   6687.1   4525.8   3220.8   3195.4
fguv_32x32xn_16bpc_420_csfl1_neon:   5143.2  11684.8   5926.4   3857.2   2604.7   2556.5
fguv_32x32xn_16bpc_422_csfl0_neon:   6347.3  11005.2   6797.5   4582.4   3300.4   3250.5
fguv_32x32xn_16bpc_422_csfl1_neon:   5275.2  11594.8   5992.6   3931.1   2668.7   2607.3
fguv_32x32xn_16bpc_444_csfl0_neon:   5181.6  11310.0   5575.4   3629.7   2383.8   2530.0
fguv_32x32xn_16bpc_444_csfl1_neon:   4081.9  10958.8   4868.5   2962.9   1870.3   2034.2
fgy_32x32xn_16bpc_neon:             15439.1  43129.0  19406.6  11542.3   7463.9   7827.8

Corresponding numbers for arm64:
                            Cortex      A53      A72      A73
fguv_32x32xn_16bpc_420_csfl0_neon:   4019.2   3247.4   3259.6
fguv_32x32xn_16bpc_420_csfl1_neon:   3460.1   2628.7   2640.8
fguv_32x32xn_16bpc_422_csfl0_neon:   4034.4   3329.9   3287.5
fguv_32x32xn_16bpc_422_csfl1_neon:   3468.3   2749.3   2686.6
fguv_32x32xn_16bpc_444_csfl0_neon:   3117.7   2447.4   2539.8
fguv_32x32xn_16bpc_444_csfl1_neon:   2641.2   1977.2   2132.8
fgy_32x32xn_16bpc_neon:              9873.5   7605.7   7656.2
-
- Jun 11, 2021
-
-
Ronald S. Bultje authored
Currently 64-bit only.
-
Martin Storsjö authored
Relative speedup over C code:
                           Cortex       A7       A8       A9      A53      A72      A73
fguv_32x32xn_8bpc_420_csfl0_neon:     4.20     2.19     3.48     4.93     3.60     5.93
fguv_32x32xn_8bpc_420_csfl1_neon:     3.92     1.52     2.84     4.34     3.82     5.93
fguv_32x32xn_8bpc_422_csfl0_neon:     4.27     2.13     3.58     5.02     4.04     5.95
fguv_32x32xn_8bpc_422_csfl1_neon:     3.99     1.56     2.91     4.43     3.89     6.00
fguv_32x32xn_8bpc_444_csfl0_neon:     4.48     2.08     3.89     5.66     4.07     6.51
fguv_32x32xn_8bpc_444_csfl1_neon:     4.45     1.41     2.99     5.28     3.63     6.09
fgy_32x32xn_8bpc_neon:                3.61     1.10     2.62     4.35     3.06     3.74

Absolute numbers:
                           Cortex       A7       A8       A9      A53      A72      A73
fguv_32x32xn_8bpc_420_csfl0_neon:   5318.8  11167.7   6024.6   3909.9   2945.2   2993.5
fguv_32x32xn_8bpc_420_csfl1_neon:   4351.0  10929.7   5269.5   3316.8   2166.5   2256.9
fguv_32x32xn_8bpc_422_csfl0_neon:   5387.9  11746.7   6080.0   3945.8   2988.1   3046.3
fguv_32x32xn_8bpc_422_csfl1_neon:   4396.0  11083.2   5300.8   3354.9   2216.4   2291.4
fguv_32x32xn_8bpc_444_csfl0_neon:   4347.9  10595.0   5134.4   3079.1   2277.7   2392.9
fguv_32x32xn_8bpc_444_csfl1_neon:   3295.0  10518.2   4442.6   2476.3   1716.3   1829.2
fgy_32x32xn_8bpc_neon:             12376.2  41046.9  17259.7   9153.1   6610.4   7005.3

Corresponding numbers for arm64:
                           Cortex      A53      A72      A73
fguv_32x32xn_8bpc_420_csfl0_neon:   3822.9   2920.0   2935.7
fguv_32x32xn_8bpc_420_csfl1_neon:   3209.7   2231.7   2335.4
fguv_32x32xn_8bpc_422_csfl0_neon:   3807.9   2886.5   2966.7
fguv_32x32xn_8bpc_422_csfl1_neon:   3197.1   2187.9   2355.9
fguv_32x32xn_8bpc_444_csfl0_neon:   2757.8   2227.4   2334.4
fguv_32x32xn_8bpc_444_csfl1_neon:   2244.6   1719.1   1786.7
fgy_32x32xn_8bpc_neon:              8192.2   6563.3   6969.1
-
- Jun 10, 2021
-
-
Martin Storsjö authored
This should help catch issues like the one fixed in 185194be, by making sure that we call the benchmarked function at least once with the given parameters, even if not benchmarking. Otherwise the benchmark codepath is essentially dead untested code until somebody works on that piece of code.
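The idea above can be sketched in C as follows. This is a minimal illustration, not the actual test harness code: `bench_fn`, `sample_fn`, `call_count` and `bench_or_smoke_test` are hypothetical names introduced only for this example.

```c
#include <stddef.h>

/* Hypothetical function-pointer type for the routine under test. */
typedef void (*bench_fn)(void *ctx);

/* Counter bumped by the sample routine so the number of calls is observable. */
static int call_count;

/* Hypothetical stand-in for the benchmarked function. */
static void sample_fn(void *ctx) {
    (void)ctx;
    call_count++;
}

/* When benchmarking, call fn `runs` times. Otherwise still call it exactly
 * once, so the benchmark codepath gets exercised with the given parameters
 * on every test run. Returns the number of calls made. */
static int bench_or_smoke_test(bench_fn fn, void *ctx, int benchmarking, int runs) {
    const int n = benchmarking ? runs : 1;
    for (int i = 0; i < n; i++)
        fn(ctx);
    return n;
}
```

With this shape, a broken argument setup in the benchmark path fails during an ordinary test run instead of only surfacing when somebody enables benchmarking.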
-
Relative speedup over C code:
                            Cortex      A53      A72      A73 Apple M1
fguv_32x32xn_16bpc_420_csfl0_neon:     4.57     2.08     3.57     7.61
fguv_32x32xn_16bpc_420_csfl1_neon:     4.92     2.89     3.96     4.26
fguv_32x32xn_16bpc_422_csfl0_neon:     4.59     2.14     3.61     5.88
fguv_32x32xn_16bpc_422_csfl1_neon:     4.92     2.90     3.90     5.00
fguv_32x32xn_16bpc_444_csfl0_neon:     3.64     1.89     2.86     4.72
fguv_32x32xn_16bpc_444_csfl1_neon:     3.59     2.26     2.76     3.22
-
-
-
Previously we did 32 gathers even though only 16 are needed.

Before:
                           Cortex      A53      A72      A73 Apple M1
fguv_32x32xn_8bpc_420_csfl0_neon:   5352.1   3985.0   4068.9      8.3
fguv_32x32xn_8bpc_420_csfl1_neon:   4738.2   3297.8   3633.0      8.2
fguv_32x32xn_8bpc_422_csfl0_neon:   5386.0   4036.8   4093.5      8.3
fguv_32x32xn_8bpc_422_csfl1_neon:   4779.9   3392.6   3641.6      8.2
fguv_32x32xn_8bpc_444_csfl0_neon:   3068.4   2422.0   2436.5      4.9
fguv_32x32xn_8bpc_444_csfl1_neon:   2558.3   1908.4   1926.6      4.4

After:
fguv_32x32xn_8bpc_420_csfl0_neon:   4330.4   3118.5   3224.6      5.3
fguv_32x32xn_8bpc_420_csfl1_neon:   3731.8   2416.9   2619.6      4.7
fguv_32x32xn_8bpc_422_csfl0_neon:   4364.7   3129.3   3247.6      5.4
fguv_32x32xn_8bpc_422_csfl1_neon:   3762.5   2450.2   2661.8      4.7
fguv_32x32xn_8bpc_444_csfl0_neon:   3075.1   2376.4   2429.4      4.9
fguv_32x32xn_8bpc_444_csfl1_neon:   2564.5   1865.9   1952.8      4.4
-
- Jun 09, 2021
-
-
Ronald S. Bultje authored
-
Victorien Le Couviour--Tuffet authored
-
- Jun 07, 2021
-
-
Martin Storsjö authored
The pixel data as initialized by the test above only have proper pixels up to whatever random 'w' it used last.
-
- Jun 05, 2021
-
-
Ronald S. Bultje authored
-
Martin Storsjö authored
Before:
                 Cortex      A53      A72      A73 Apple M1
fgy_32x32xn_16bpc_neon:  10396.8   8150.8   8718.3     19.5

After:
fgy_32x32xn_16bpc_neon:   9665.1   7558.8   7652.8     19.5
-
- Jun 04, 2021
-
-
Ronald S. Bultje authored
-
- Jun 02, 2021
-
-
Martin Storsjö authored
Clang 13 got support for warning about variables that are set but not used. We disable warnings for unused parameters, but in this case, the parameter variable is updated within the function too, which Clang warns about.
-
- May 31, 2021
-
-
- May 27, 2021
-
-
Vibhoothi authored
-
- May 25, 2021
-
-
Martin Storsjö authored
After multiplying two int8_t values, the largest possible product is -128*-128 = 16384. Two such products can't be added in an int16_t (even though the sums of products of all other int8_t combinations fit). Previously the summing used 16 bit intermediates for the sum of two products and only lengthened the result to 32 bit when accumulating three or more products.

Before:
                     Cortex       A53       A72       A73  Apple M1
gen_grain_y_ar1_8bpc_neon:   112598.5   71309.2   74889.8     372.2
gen_grain_y_ar2_8bpc_neon:   139932.4   91442.3   95788.4     387.3
gen_grain_y_ar3_8bpc_neon:   185607.6  115691.6  131655.8     403.0

After:
gen_grain_y_ar1_8bpc_neon:   112968.8   71897.9   76171.2     371.2
gen_grain_y_ar2_8bpc_neon:   142768.8   94517.9   97934.4     387.5
gen_grain_y_ar3_8bpc_neon:   191625.2  121083.0  135975.3     405.6
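The overflow can be reproduced in scalar C. The sketch below only illustrates the arithmetic, it is not the NEON code; `sum_products_wide` and `sum_products_narrow` are hypothetical helper names. The narrowing cast relies on the usual two's-complement wraparound, which is implementation-defined in C before C23 but is what common compilers do.

```c
#include <stdint.h>

/* Widen each product to 32 bits before adding: always exact. */
static int32_t sum_products_wide(int8_t a, int8_t b, int8_t c, int8_t d) {
    return (int32_t)a * b + (int32_t)c * d;
}

/* Mimic a 16 bit intermediate: the exact sum is truncated to 16 bits.
 * For -128*-128 + -128*-128 = 32768 this wraps around to -32768. */
static int16_t sum_products_narrow(int8_t a, int8_t b, int8_t c, int8_t d) {
    const int32_t sum = (int32_t)a * b + (int32_t)c * d;
    return (int16_t)(uint16_t)sum;
}
```

Only the all -128 case overflows: the next largest sum, -128*-128 + -128*-127 = 32640, still fits in an int16_t.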
-
- May 18, 2021
-
-
Matthias Dressel authored
-
- May 16, 2021
-
-
Jean-Baptiste Kempf authored
-
- May 14, 2021
-
-
Martin Storsjö authored
Use the mvni instruction instead of setting the constant in a GPR first.
-
- May 13, 2021
-
-
Matthias Dressel authored
-
James Almer authored
-
In 16 bpc, the pixels are stored as 16 bit integers, but valid pixels are at most 12 bits wide, and the scaling buffer only contains 4096 elements. The src pixels are normally supposed to be valid, but when processing blocks of 32 pixels at a time, the code can operate on uninitialized pixels past the right edge.

Before:
                 Cortex      A53      A72      A73 Apple M1
fgy_32x32xn_16bpc_neon:  10372.5   8194.4   8612.1     24.2

After:
fgy_32x32xn_16bpc_neon:  10837.9   8469.5   8885.1     24.6
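One way to guard such a lookup, sketched in scalar C, is to clamp the pixel value to the last valid table index before using it. This is an assumption about the general technique, not a description of what the NEON code does; `SCALING_SIZE`, `clamp_px_index` and `lookup_scaling` are hypothetical names.

```c
#include <stdint.h>

/* Number of entries in the scaling table, as stated above. */
#define SCALING_SIZE 4096

/* Limit a raw 16 bit pixel value to [0, max_index]. */
static unsigned clamp_px_index(uint16_t px, unsigned max_index) {
    return px < max_index ? px : max_index;
}

/* Table lookup that stays in bounds even if px is uninitialized garbage
 * read from past the right edge of the block. */
static uint8_t lookup_scaling(const uint8_t scaling[SCALING_SIZE], uint16_t px) {
    return scaling[clamp_px_index(px, SCALING_SIZE - 1)];
}
```

The value looked up for an out-of-range pixel is meaningless, but those pixels lie outside the visible block, so the guard only has to keep the read inside the table.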
-
Jean-Baptiste Kempf authored
-
- May 12, 2021
-
-
Martin Storsjö authored
Relative speedup over C code:
                 Cortex      A53      A72      A73 Apple M1
fgy_32x32xn_16bpc_neon:     3.87     2.28     2.78     3.45
-
Martin Storsjö authored
Don't call them when targeting e.g. UWP. This requires building with a new enough SDK that does have the winapifamily.h header (and that it's included implicitly by regular platform headers); it's been available since the Windows 8.0 SDK (and since mingw-w64 v3.0.0) so it should be safe. Also rewrite the GetProcAddress call to avoid calling it if GetModuleHandleW(L"kernel32.dll") would return NULL for some reason.
-
- May 11, 2021
-
-
Ronald S. Bultje authored
-
- May 10, 2021
-
-