- Jun 20, 2020
-
-
Jean-Baptiste Kempf authored
-
Luc Trudeau authored
-
- Jun 19, 2020
-
-
Some specific Haswell CPUs have a hardware bug where the popcnt instruction doesn't set the zero flag correctly, which causes the wrong branch to be taken. popcnt also has a 3-cycle latency on Intel CPUs, so branching on the input value instead of the output reduces the amount of time wasted going down the wrong code path in case of a branch misprediction.
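A minimal sketch of the branch reordering, in C for illustration (`__builtin_popcountll` stands in for the popcnt instruction; `do_work` is a hypothetical consumer, not dav1d's msac code):

```c
#include <stdint.h>

int do_work(unsigned n);  /* hypothetical consumer of the count */

/* Before: the branch depends on the popcnt result (and, with the Haswell
 * erratum, potentially on its buggy zero flag), so a mispredicted branch
 * also pays the 3-cycle popcnt latency before it can be resolved. */
int count_then_branch(uint64_t mask) {
    unsigned n = (unsigned)__builtin_popcountll(mask);
    if (!n)
        return 0;
    return do_work(n);
}

/* After: branch on the input value itself; the flags come from a plain
 * compare, and popcnt only executes on the taken path. */
int branch_then_count(uint64_t mask) {
    if (!mask)
        return 0;
    return do_work((unsigned)__builtin_popcountll(mask));
}
```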
-
Martin Storsjö authored
Only use this in cases where NEON can be used unconditionally, without runtime detection (i.e. when __ARM_NEON is defined). The speedup over the C code is very modest for the smaller functions (and the NEON version is actually a little slower than the C code on Cortex A7 for adapt4), but the speedup is around 2x for adapt16.

                                 Cortex A7     A8     A9    A53    A72    A73
msac_decode_bool_c:                   41.1   43.0   43.0   37.3   26.2   31.3
msac_decode_bool_neon:                40.2   42.0   37.2   32.8   19.9   25.5
msac_decode_bool_adapt_c:             65.1   70.4   58.5   54.3   33.2   40.8
msac_decode_bool_adapt_neon:          56.8   52.4   49.3   42.6   27.1   33.7
msac_decode_bool_equi_c:              36.9   37.2   42.8   32.6   22.7   42.3
msac_decode_bool_equi_neon:           34.9   35.1   36.4   29.7   19.5   36.4
msac_decode_symbol_adapt4_c:         114.2  139.0  111.6   99.9   65.5   83.5
msac_decode_symbol_adapt4_neon:      119.2  128.3   95.7   82.2   58.2   57.5
msac_decode_symbol_adapt8_c:         176.0  207.9  164.0  154.4   88.0  117.0
msac_decode_symbol_adapt8_neon:      128.3  130.3  110.7   85.1   59.9   61.4
msac_decode_symbol_adapt16_c:        292.1  320.5  256.4  246.4  129.1  173.3
msac_decode_symbol_adapt16_neon:     162.2  144.3  129.0  104.2   69.2   69.9

(Omitting msac_decode_hi_tok from the benchmark, as the "C" version measured there uses the NEON version of msac_decode_symbol_adapt4.)
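A compile-time dispatch along those lines might look like the following sketch (the context type and function names are illustrative, not dav1d's actual symbols):

```c
#include <stdint.h>

typedef struct MsacContext MsacContext;  /* opaque here, for illustration */

unsigned msac_decode_bool_adapt_c(MsacContext *s, uint16_t *cdf);

#if defined(__ARM_NEON)
/* NEON is guaranteed by the target (e.g. AArch64, or ARMv7 built with NEON),
 * so no runtime CPU-flag check is needed before using the asm version. */
unsigned msac_decode_bool_adapt_neon(MsacContext *s, uint16_t *cdf);
#define msac_decode_bool_adapt msac_decode_bool_adapt_neon
#else
#define msac_decode_bool_adapt msac_decode_bool_adapt_c
#endif
```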
-
- Jun 18, 2020
-
-
Martin Storsjö authored
The speedup (over the normal version, which just calls the existing assembly version of symbol_adapt4) is not very impressive on bigger cores, but looks decent on small cores. It's an improvement in any case.

                           Cortex A53    A72    A73
msac_decode_hi_tok_c:           175.7  136.2  138.1
msac_decode_hi_tok_neon:        146.8  129.4  125.9
-
Martin Storsjö authored
-
Martin Storsjö authored
Include the letter prefix when calling the macro, making it slightly less obscure.
-
By multiplying the performance counter value (within its own time base) by the intended target time base, and only then dividing, we reduce the available numeric range by a factor of the original time base times the new time base. On Windows 10 on ARM64, the performance counter frequency is 19200000 (on x86_64 in a virtual machine, it's 10000000), making the calculation overflow every (1 << 64) / (19200000 * 1000000000) = 960 seconds, i.e. 16 minutes - long before the actual uint64_t nanosecond return value wraps around.
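A sketch of the two orderings, with hypothetical helper names (not the actual wrapper in dav1d's tools):

```c
#include <stdint.h>

/* Overflowing order: the intermediate product ticks * 1e9 wraps 64 bits
 * after roughly (1 << 64) / (freq * 1e9) seconds of counter time
 * (~960 s for freq = 19200000). */
uint64_t ticks_to_ns_naive(uint64_t ticks, uint64_t freq) {
    return ticks * 1000000000 / freq;
}

/* Safer order: split off whole seconds first, so the only multiplication
 * left operates on the remainder (< freq) and stays below freq * 1e9; the
 * result wraps only when the uint64_t nanosecond value itself would. */
uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq) {
    return (ticks / freq) * 1000000000 +
           (ticks % freq) * 1000000000 / freq;
}
```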
-
Martin Storsjö authored
Even if we don't want to throttle decoding to realtime, and even if the file itself didn't contain a valid fps value, we may want to call the synchronize function to fetch the current elapsed decoding time, for displaying the fps value.
-
Jean-Baptiste Kempf authored
-
Victorien Le Couviour--Tuffet authored
Since scaled bilin is very rarely used, add a new table entry to mc_subpel_filters and jump to the put/prep_8tap_scaled code. AVX2 performance is obviously the same as for the 8tap code; the speedup is much smaller though, as the C code is a true bilinear codepath that gets auto-vectorized. Still, the AVX2 version is always faster.
-
Victorien Le Couviour--Tuffet authored
mct_scaled_8tap_regular_w4_8bpc_c:         872.1
mct_scaled_8tap_regular_w4_8bpc_avx2:      125.6
mct_scaled_8tap_regular_w4_dy1_8bpc_c:     886.3
mct_scaled_8tap_regular_w4_dy1_8bpc_avx2:   84.0
mct_scaled_8tap_regular_w4_dy2_8bpc_c:    1189.1
mct_scaled_8tap_regular_w4_dy2_8bpc_avx2:   84.7
mct_scaled_8tap_regular_w8_8bpc_c:        2261.0
mct_scaled_8tap_regular_w8_8bpc_avx2:      306.2
mct_scaled_8tap_regular_w8_dy1_8bpc_c:    2189.9
mct_scaled_8tap_regular_w8_dy1_8bpc_avx2:  233.8
mct_scaled_8tap_regular_w8_dy2_8bpc_c:    3060.3
mct_scaled_8tap_regular_w8_dy2_8bpc_avx2:  282.8
mct_scaled_8tap_regular_w16_8bpc_c:       4335.3
mct_scaled_8tap_regular_w16_8bpc_avx2:     680.7
mct_scaled_8tap_regular_w16_dy1_8bpc_c:   5137.2
mct_scaled_8tap_regular_w16_dy1_8bpc_avx2: 578.6
mct_scaled_8tap_regular_w16_dy2_8bpc_c:   7878.4
mct_scaled_8tap_regular_w16_dy2_8bpc_avx2: 774.6
mct_scaled_8tap_regular_w32_8bpc_c:       17871.9
mct_scaled_8tap_regular_w32_8bpc_avx2:     2954.8
mct_scaled_8tap_regular_w32_dy1_8bpc_c:   18594.7
mct_scaled_8tap_regular_w32_dy1_8bpc_avx2: 2073.9
mct_scaled_8tap_regular_w32_dy2_8bpc_c:   28696.0
mct_scaled_8tap_regular_w32_dy2_8bpc_avx2: 2852.1
mct_scaled_8tap_regular_w64_8bpc_c:       46967.5
mct_scaled_8tap_regular_w64_8bpc_avx2:     7527.5
mct_scaled_8tap_regular_w64_dy1_8bpc_c:   45564.2
mct_scaled_8tap_regular_w64_dy1_8bpc_avx2: 5262.9
mct_scaled_8tap_regular_w64_dy2_8bpc_c:   72793.3
mct_scaled_8tap_regular_w64_dy2_8bpc_avx2: 7535.9
mct_scaled_8tap_regular_w128_8bpc_c:      111190.8
mct_scaled_8tap_regular_w128_8bpc_avx2:    19386.8
mct_scaled_8tap_regular_w128_dy1_8bpc_c:  122625.0
mct_scaled_8tap_regular_w128_dy1_8bpc_avx2: 15376.1
mct_scaled_8tap_regular_w128_dy2_8bpc_c:  197120.6
mct_scaled_8tap_regular_w128_dy2_8bpc_avx2: 21871.0
-
- Jun 16, 2020
-
-
Colin Lee authored
-
Martin Storsjö authored
Add an .error case for Windows when subtracting more than 8 KB, and simplify the generic subtraction case.
-
Martin Storsjö authored
The transforms process vectors of up to 8 elements at a time, for transforms up to size 8; for larger transforms, it uses vectors of 4 elements. Overall, the speedup over C code seems to be around 8-14x for the larger transforms, and 10-19x for the smaller ones. Relative speedup over C code (built with GCC 7.5) for a few functions:

                                          Cortex A7     A8     A9    A53    A72    A73
inv_txfm_add_4x4_dct_dct_0_8bpc_neon:          3.83   3.42   2.57   3.36   2.97   7.47
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:          7.25  13.53   8.38   8.82   7.96  12.37
inv_txfm_add_8x8_dct_dct_0_8bpc_neon:          4.78   6.61   4.82   4.65   5.27   9.76
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:         10.20  19.07  13.07  14.69  11.45  15.50
inv_txfm_add_16x16_dct_dct_0_8bpc_neon:        4.26   5.06   3.00   3.74   4.05   4.49
inv_txfm_add_16x16_dct_dct_1_8bpc_neon:       10.51  16.02  13.57  14.03  12.86  18.16
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:        7.95  11.75   9.09  10.64  10.06  14.07
inv_txfm_add_32x32_dct_dct_0_8bpc_neon:        5.31   5.58   3.14   4.18   4.80   4.57
inv_txfm_add_32x32_dct_dct_1_8bpc_neon:       12.66  16.07  14.34  16.00  15.24  21.32
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:        8.25  10.69   8.90  10.59  10.41  14.39
inv_txfm_add_64x64_dct_dct_0_8bpc_neon:        4.69   5.97   3.17   3.96   4.57   4.34
inv_txfm_add_64x64_dct_dct_1_8bpc_neon:       11.47  12.68  10.18  14.73  14.20  17.95
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:        8.84  10.13   7.94  11.25  10.58  13.88
-
- Jun 11, 2020
-
-
Matthias Dressel authored
-
Henrik Gramner authored
The struct is already zero-initialized when the function is called except for the checkasm test, so move the zeroing there instead.
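For illustration, a minimal sketch of that move, with hypothetical names (not the actual dav1d struct or function):

```c
#include <string.h>

typedef struct Scratch { int buf[16]; } Scratch;  /* hypothetical */

void process(Scratch *s);  /* assumed to rely on *s starting out zeroed */

/* Real callers already hand in a zero-initialized struct, so the memset
 * is done once in the checkasm harness rather than inside process(). */
void checkasm_check_process(void) {
    Scratch s;
    memset(&s, 0, sizeof(s));
    process(&s);
}
```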
-
Matthias Dressel authored
meson.source_root() returns the root of a parent project if dav1d is embedded as a subproject.
-
Victorien Le Couviour--Tuffet authored
x86_64:
------------------------------------------
mct_8tap_regular_w4_h_8bpc_c: 302.3
mct_8tap_regular_w4_h_8bpc_sse2: 47.3
mct_8tap_regular_w4_h_8bpc_ssse3: 19.5
---------------------
mct_8tap_regular_w8_h_8bpc_c: 745.5
mct_8tap_regular_w8_h_8bpc_sse2: 235.2
mct_8tap_regular_w8_h_8bpc_ssse3: 70.4
---------------------
mct_8tap_regular_w16_h_8bpc_c: 1844.3
mct_8tap_regular_w16_h_8bpc_sse2: 755.6
mct_8tap_regular_w16_h_8bpc_ssse3: 225.9
---------------------
mct_8tap_regular_w32_h_8bpc_c: 6685.5
mct_8tap_regular_w32_h_8bpc_sse2: 2954.4
mct_8tap_regular_w32_h_8bpc_ssse3: 795.8
---------------------
mct_8tap_regular_w64_h_8bpc_c: 15633.5
mct_8tap_regular_w64_h_8bpc_sse2: 7120.4
mct_8tap_regular_w64_h_8bpc_ssse3: 1900.4
---------------------
mct_8tap_regular_w128_h_8bpc_c: 37772.1
mct_8tap_regular_w128_h_8bpc_sse2: 17698.1
mct_8tap_regular_w128_h_8bpc_ssse3: 4665.5
------------------------------------------
mct_8tap_regular_w4_v_8bpc_c: 306.5
mct_8tap_regular_w4_v_8bpc_sse2: 71.7
mct_8tap_regular_w4_v_8bpc_ssse3: 37.9
---------------------
mct_8tap_regular_w8_v_8bpc_c: 923.3
mct_8tap_regular_w8_v_8bpc_sse2: 168.7
mct_8tap_regular_w8_v_8bpc_ssse3: 71.3
---------------------
mct_8tap_regular_w16_v_8bpc_c: 3040.1
mct_8tap_regular_w16_v_8bpc_sse2: 505.1
mct_8tap_regular_w16_v_8bpc_ssse3: 199.7
---------------------
mct_8tap_regular_w32_v_8bpc_c: 12354.8
mct_8tap_regular_w32_v_8bpc_sse2: 1942.0
mct_8tap_regular_w32_v_8bpc_ssse3: 714.2
---------------------
mct_8tap_regular_w64_v_8bpc_c: 29427.9
mct_8tap_regular_w64_v_8bpc_sse2: 4637.4
mct_8tap_regular_w64_v_8bpc_ssse3: 1829.2
---------------------
mct_8tap_regular_w128_v_8bpc_c: 72756.9
mct_8tap_regular_w128_v_8bpc_sse2: 11301.0
mct_8tap_regular_w128_v_8bpc_ssse3: 5020.6
------------------------------------------
mct_8tap_regular_w4_hv_8bpc_c: 876.9
mct_8tap_regular_w4_hv_8bpc_sse2: 171.7
mct_8tap_regular_w4_hv_8bpc_ssse3: 112.2
---------------------
mct_8tap_regular_w8_hv_8bpc_c: 2215.1
mct_8tap_regular_w8_hv_8bpc_sse2: 730.2
mct_8tap_regular_w8_hv_8bpc_ssse3: 330.9
---------------------
mct_8tap_regular_w16_hv_8bpc_c: 6075.5
mct_8tap_regular_w16_hv_8bpc_sse2: 2252.1
mct_8tap_regular_w16_hv_8bpc_ssse3: 973.4
---------------------
mct_8tap_regular_w32_hv_8bpc_c: 22182.7
mct_8tap_regular_w32_hv_8bpc_sse2: 7692.6
mct_8tap_regular_w32_hv_8bpc_ssse3: 3599.8
---------------------
mct_8tap_regular_w64_hv_8bpc_c: 50876.8
mct_8tap_regular_w64_hv_8bpc_sse2: 18499.6
mct_8tap_regular_w64_hv_8bpc_ssse3: 8815.6
---------------------
mct_8tap_regular_w128_hv_8bpc_c: 122926.3
mct_8tap_regular_w128_hv_8bpc_sse2: 45120.0
mct_8tap_regular_w128_hv_8bpc_ssse3: 22085.7
-
Victorien Le Couviour--Tuffet authored
x86_64:
------------------------------------------
mct_bilinear_w4_h_8bpc_c: 98.9
mct_bilinear_w4_h_8bpc_sse2: 30.2
mct_bilinear_w4_h_8bpc_ssse3: 11.5
---------------------
mct_bilinear_w8_h_8bpc_c: 175.3
mct_bilinear_w8_h_8bpc_sse2: 57.0
mct_bilinear_w8_h_8bpc_ssse3: 19.7
---------------------
mct_bilinear_w16_h_8bpc_c: 396.2
mct_bilinear_w16_h_8bpc_sse2: 179.3
mct_bilinear_w16_h_8bpc_ssse3: 50.9
---------------------
mct_bilinear_w32_h_8bpc_c: 1311.2
mct_bilinear_w32_h_8bpc_sse2: 718.8
mct_bilinear_w32_h_8bpc_ssse3: 243.9
---------------------
mct_bilinear_w64_h_8bpc_c: 2892.7
mct_bilinear_w64_h_8bpc_sse2: 1746.0
mct_bilinear_w64_h_8bpc_ssse3: 568.0
---------------------
mct_bilinear_w128_h_8bpc_c: 7192.6
mct_bilinear_w128_h_8bpc_sse2: 4339.8
mct_bilinear_w128_h_8bpc_ssse3: 1619.2
------------------------------------------
mct_bilinear_w4_v_8bpc_c: 129.7
mct_bilinear_w4_v_8bpc_sse2: 26.6
mct_bilinear_w4_v_8bpc_ssse3: 16.7
---------------------
mct_bilinear_w8_v_8bpc_c: 233.3
mct_bilinear_w8_v_8bpc_sse2: 55.0
mct_bilinear_w8_v_8bpc_ssse3: 24.7
---------------------
mct_bilinear_w16_v_8bpc_c: 498.9
mct_bilinear_w16_v_8bpc_sse2: 146.0
mct_bilinear_w16_v_8bpc_ssse3: 54.2
---------------------
mct_bilinear_w32_v_8bpc_c: 1562.2
mct_bilinear_w32_v_8bpc_sse2: 560.6
mct_bilinear_w32_v_8bpc_ssse3: 201.0
---------------------
mct_bilinear_w64_v_8bpc_c: 3221.3
mct_bilinear_w64_v_8bpc_sse2: 1380.6
mct_bilinear_w64_v_8bpc_ssse3: 499.3
---------------------
mct_bilinear_w128_v_8bpc_c: 7357.7
mct_bilinear_w128_v_8bpc_sse2: 3439.0
mct_bilinear_w128_v_8bpc_ssse3: 1489.1
------------------------------------------
mct_bilinear_w4_hv_8bpc_c: 185.0
mct_bilinear_w4_hv_8bpc_sse2: 54.5
mct_bilinear_w4_hv_8bpc_ssse3: 22.1
---------------------
mct_bilinear_w8_hv_8bpc_c: 377.8
mct_bilinear_w8_hv_8bpc_sse2: 104.3
mct_bilinear_w8_hv_8bpc_ssse3: 35.8
---------------------
mct_bilinear_w16_hv_8bpc_c: 1159.4
mct_bilinear_w16_hv_8bpc_sse2: 311.0
mct_bilinear_w16_hv_8bpc_ssse3: 106.3
---------------------
mct_bilinear_w32_hv_8bpc_c: 4436.2
mct_bilinear_w32_hv_8bpc_sse2: 1230.7
mct_bilinear_w32_hv_8bpc_ssse3: 400.7
---------------------
mct_bilinear_w64_hv_8bpc_c: 10627.7
mct_bilinear_w64_hv_8bpc_sse2: 2934.2
mct_bilinear_w64_hv_8bpc_ssse3: 957.2
---------------------
mct_bilinear_w128_hv_8bpc_c: 26048.9
mct_bilinear_w128_hv_8bpc_sse2: 7590.3
mct_bilinear_w128_hv_8bpc_ssse3: 2947.0
-
- Jun 10, 2020
-
-
Martin Storsjö authored
This allows skipping half of the first transforms if the input coefficients lie within the upper 4x4 (but checkasm only tests in increments of 8x8 at the moment). With checkasm modified to test in smaller increments, the speedup is like this:

Before:                                  Cortex A53    A72    A73
inv_txfm_add_16x8_dct_dct_1_10bpc_neon:       874.4  709.0  707.3
After:
inv_txfm_add_16x8_dct_dct_1_10bpc_neon:       618.0  479.5  472.9
-
Martin Storsjö authored
-
- Jun 09, 2020
-
-
- Jun 07, 2020
-
-
Blacklisted some files not directly relevant to the codebase (such as tests, tools and debugging functions). The coverage HTML report gets attached as a build artifact, although unfortunately we can't link directly to the `index.html`. We also attach the coverage XML as a cobertura report, although I'm not sure if it does anything.
-
- Jun 04, 2020
-
-
Wan-Teh Chang authored
-
This is not used in dav1d yet, but it's needed by rav1e, which shares this header with dav1d.
-
- Jun 01, 2020
-
-
Victorien Le Couviour--Tuffet authored
mc_scaled_8tap_regular_w2_8bpc_c:          764.4
mc_scaled_8tap_regular_w2_8bpc_avx2:       191.3
mc_scaled_8tap_regular_w2_dy1_8bpc_c:      705.8
mc_scaled_8tap_regular_w2_dy1_8bpc_avx2:    89.5
mc_scaled_8tap_regular_w2_dy2_8bpc_c:      964.0
mc_scaled_8tap_regular_w2_dy2_8bpc_avx2:   120.3
mc_scaled_8tap_regular_w4_8bpc_c:         1355.7
mc_scaled_8tap_regular_w4_8bpc_avx2:       180.9
mc_scaled_8tap_regular_w4_dy1_8bpc_c:     1233.2
mc_scaled_8tap_regular_w4_dy1_8bpc_avx2:   115.3
mc_scaled_8tap_regular_w4_dy2_8bpc_c:     1707.6
mc_scaled_8tap_regular_w4_dy2_8bpc_avx2:   117.9
mc_scaled_8tap_regular_w8_8bpc_c:         2483.2
mc_scaled_8tap_regular_w8_8bpc_avx2:       294.8
mc_scaled_8tap_regular_w8_dy1_8bpc_c:     2166.4
mc_scaled_8tap_regular_w8_dy1_8bpc_avx2:   222.0
mc_scaled_8tap_regular_w8_dy2_8bpc_c:     3133.7
mc_scaled_8tap_regular_w8_dy2_8bpc_avx2:   292.6
mc_scaled_8tap_regular_w16_8bpc_c:        5239.2
mc_scaled_8tap_regular_w16_8bpc_avx2:      729.9
mc_scaled_8tap_regular_w16_dy1_8bpc_c:    5156.5
mc_scaled_8tap_regular_w16_dy1_8bpc_avx2:  602.2
mc_scaled_8tap_regular_w16_dy2_8bpc_c:    8018.4
mc_scaled_8tap_regular_w16_dy2_8bpc_avx2:  783.1
mc_scaled_8tap_regular_w32_8bpc_c:        14745.0
mc_scaled_8tap_regular_w32_8bpc_avx2:      2205.0
mc_scaled_8tap_regular_w32_dy1_8bpc_c:    14862.3
mc_scaled_8tap_regular_w32_dy1_8bpc_avx2:  1721.3
mc_scaled_8tap_regular_w32_dy2_8bpc_c:    23607.6
mc_scaled_8tap_regular_w32_dy2_8bpc_avx2:  2325.7
mc_scaled_8tap_regular_w64_8bpc_c:        54891.7
mc_scaled_8tap_regular_w64_8bpc_avx2:      8351.4
mc_scaled_8tap_regular_w64_dy1_8bpc_c:    50249.0
mc_scaled_8tap_regular_w64_dy1_8bpc_avx2:  5864.4
mc_scaled_8tap_regular_w64_dy2_8bpc_c:    79400.1
mc_scaled_8tap_regular_w64_dy2_8bpc_avx2:  8295.7
mc_scaled_8tap_regular_w128_8bpc_c:       121046.8
mc_scaled_8tap_regular_w128_8bpc_avx2:     21809.1
mc_scaled_8tap_regular_w128_dy1_8bpc_c:   133720.4
mc_scaled_8tap_regular_w128_dy1_8bpc_avx2: 16197.8
mc_scaled_8tap_regular_w128_dy2_8bpc_c:   218774.8
mc_scaled_8tap_regular_w128_dy2_8bpc_avx2: 22993.1
-
- May 28, 2020
-
-
Steve Lhomme authored
posix_memalign is defined as a builtin by GCC in MSYS2, but it's not available when linking against the Universal C Runtime (UCRT). _aligned_malloc is available in the UCRT. This should only affect builds targeting Windows, since _aligned_malloc is a Microsoft-specific function.
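A portable wrapper along those lines might look like the following sketch (wrapper names are illustrative, not dav1d's actual helpers):

```c
#include <stdlib.h>
#ifdef _WIN32
#include <malloc.h>   /* _aligned_malloc / _aligned_free */
#endif

void *alloc_aligned(size_t align, size_t size) {
#ifdef _WIN32
    return _aligned_malloc(size, align);
#else
    void *ptr;
    if (posix_memalign(&ptr, align, size))
        return NULL;
    return ptr;
#endif
}

void free_aligned(void *ptr) {
#ifdef _WIN32
    _aligned_free(ptr);  /* memory from _aligned_malloc needs _aligned_free */
#else
    free(ptr);
#endif
}
```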
-
- May 26, 2020
-
-
Henrik Gramner authored
Eliminate store forwarding stalls. Use shorter instruction encodings where possible. Misc. tweaks.
-
This one correctly sets the subsampling mode based on whether or not the plane is actually subsampled, and also interprets PL_CHROMA_UNKNOWN as PL_CHROMA_TOP_LEFT in such cases.
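A rough sketch of that inference (the helper and its parameters are hypothetical, and the header path is assumed; only the PL_CHROMA_* values come from libplacebo):

```c
#include <stdbool.h>
#include <libplacebo/colorspace.h>  /* enum pl_chroma_location */

/* Only treat a plane as chroma-subsampled when its dimensions are actually
 * smaller than luma, and fall back to PL_CHROMA_TOP_LEFT when the chroma
 * sample position is unsignaled. */
enum pl_chroma_location
pick_chroma_loc(int plane_w, int plane_h, int luma_w, int luma_h,
                enum pl_chroma_location signaled)
{
    bool subsampled = plane_w < luma_w || plane_h < luma_h;
    if (!subsampled)
        return signaled;  /* not subsampled, nothing to infer */
    return signaled == PL_CHROMA_UNKNOWN ? PL_CHROMA_TOP_LEFT : signaled;
}
```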
-
- May 25, 2020
-
-
Niklas Haas authored
libplacebo v66 got helper functions that make preserving the aspect ratio in this case trivial. But we still need to make sure to clear the FBO to black if the image doesn't cover it fully.
-
- May 20, 2020
-
-
Niklas Haas authored
Returning out of this function when pl_render_image() fails is the wrong thing to do, since that leaves the swapchain frame acquired but never submitted. Instead, just clear the target FBO to blank red (to make it clear that something went wrong) and continue on with presentation.
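Sketched below, assuming the libplacebo 2.x-era API in use at the time (function and struct names may differ in newer versions; the wiring of variables is illustrative, not the actual dav1d_play code):

```c
#include <libplacebo/renderer.h>
#include <libplacebo/swapchain.h>

void render_and_present(struct pl_renderer *rr, const struct pl_gpu *gpu,
                        const struct pl_swapchain *sw,
                        const struct pl_image *image,
                        const struct pl_render_target *target,
                        const struct pl_swapchain_frame *frame)
{
    if (!pl_render_image(rr, image, target, &pl_render_default_params)) {
        /* Don't return early: the swapchain frame is already acquired and
         * must still be submitted. Clear to red so the failure is visible. */
        pl_tex_clear(gpu, frame->fbo, (const float[4]){ 1.0f, 0.0f, 0.0f, 1.0f });
    }
    pl_swapchain_submit_frame(sw);
    pl_swapchain_swap_buffers(sw);
}
```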
-
- May 19, 2020
-
-
Jean-Baptiste Kempf authored
-
- May 18, 2020
-
-
Niklas Haas authored
Annoying minor differences in this struct layout mean we can't just memcpy the entire thing. Oh well. Note: technically, PL_API_VER 33 added this API, but PL_API_VER 63 is the minimum version of libplacebo that doesn't have glaring bugs when generating chroma grain, so we require that as a minimum instead. (I tested this version on some 4:2:2 and 4:2:0, 8-bit and 10-bit grain samples I had lying around and made sure the output was identical up to differences in rounding / dithering.)
-
Niklas Haas authored
Generalize the code to set the right pl_image metadata based on the values signaled in the Dav1dPictureParameters / Dav1dSequenceHeader. Some values are not mapped, in which case stdout will be spammed. Whatever. Hopefully somebody sees that error spam and opens a bug report for libplacebo to implement it.
-
Niklas Haas authored
Having the pl_image generation live in upload_planes() rather than render() will make it easier to set the correct pl_image metadata based on the Dav1dPicture headers moving forwards. Rename the function to make more sense, semantically. Reduce some code duplication by turning per-plane fields into arrays wherever appropriate. As an aside, also apply the correct chroma location rather than hard-coding it as PL_CHROMA_LEFT.
-
Niklas Haas authored
This is turned into a const array in upstream libplacebo, which generates warnings due to the implicit cast. Rewrite the code to have the mutable array live inside a separate variable `extensions` and only set `iparams.extensions` to this, rather than directly manipulating it.
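A generic sketch of that pattern (the struct here is a stand-in modeled on the message, not libplacebo's actual definition):

```c
/* Once the API declares the field as `const char * const *`, build the list
 * in a mutable array and only assign the pointer, instead of writing through
 * iparams.extensions (which would now warn or fail). */
struct inst_params {
    const char * const *extensions;  /* upstream made this const */
    int num_extensions;
};

static const char *extensions[2];    /* mutable array, outlives the setup call */

void fill_params(struct inst_params *iparams, const char *platform_ext) {
    int n = 0;
    if (platform_ext)
        extensions[n++] = platform_ext;
    iparams->extensions = extensions;  /* pointer assignment only, no implicit-cast warning */
    iparams->num_extensions = n;
}
```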
-
- May 16, 2020
-
-
Signed-off-by: Marvin Scholz <epirat07@gmail.com>
-