- Jul 04, 2020
Jean-Baptiste Kempf authored
Removes files from top-level
- Jul 02, 2020
Martin Storsjö authored
This matches what is implemented for arm64 so far. Align the dav1d_sm_weights table to allow aligned loads from it.

Relative speedups over C code (vs potentially autovectorized code, built with Clang):

                            Cortex     A7     A8     A9    A53    A72    A73
intra_pred_paeth_w4_8bpc_neon:       4.81   7.61   5.82   5.50   5.61   6.94
intra_pred_paeth_w8_8bpc_neon:       7.83  11.95   9.51  11.05   8.90  10.51
intra_pred_paeth_w16_8bpc_neon:      4.86   4.49   3.90   4.60   3.76   3.54
intra_pred_paeth_w32_8bpc_neon:      4.55   4.03   3.52   4.27   3.30   3.21
intra_pred_paeth_w64_8bpc_neon:      4.38   3.72   3.32   3.95   3.08   3.00
intra_pred_smooth_h_w4_8bpc_neon:    5.74  10.80   5.32   6.79   4.77   6.48
intra_pred_smooth_h_w8_8bpc_neon:   10.59  17.95   9.39  16.03   6.94   8.98
intra_pred_smooth_h_w16_8bpc_neon:   2.81   3.19   2.12   3.70   2.90   3.59
intra_pred_smooth_h_w32_8bpc_neon:   2.63   2.41   1.86   3.44   2.24   2.66
intra_pred_smooth_h_w64_8bpc_neon:   2.42   2.52   1.79   3.24   1.81   2.11
intra_pred_smooth_v_w4_8bpc_neon:    4.15   7.99   3.46   4.63   3.83   4.39
intra_pred_smooth_v_w8_8bpc_neon:    7.31  12.42   7.04  10.00   4.26   6.20
intra_pred_smooth_v_w16_8bpc_neon:   3.70   3.44   2.53   3.33   2.76   3.21
intra_pred_smooth_v_w32_8bpc_neon:   3.91   3.74   2.70   3.51   2.50   2.96
intra_pred_smooth_v_w64_8bpc_neon:   4.03   3.94   2.80   3.64   2.36   2.80
intra_pred_smooth_w4_8bpc_neon:      4.09   7.74   4.54   4.79   3.26   5.10
intra_pred_smooth_w8_8bpc_neon:      5.63   8.93   6.62   8.28   3.73   6.04
intra_pred_smooth_w16_8bpc_neon:     3.97   3.40   3.32   3.74   3.01   3.77
intra_pred_smooth_w32_8bpc_neon:     3.75   3.14   3.07   3.28   2.65   3.17
intra_pred_smooth_w64_8bpc_neon:     3.60   3.04   2.93   2.97   2.35   2.85
intra_pred_filter_w4_8bpc_neon:      5.54   6.43   4.90   7.26   3.44   4.61
intra_pred_filter_w8_8bpc_neon:      7.05   7.15   5.50  10.05   4.29   6.02
intra_pred_filter_w16_8bpc_neon:     7.36   6.46   5.27  11.51   4.75   6.70
intra_pred_filter_w32_8bpc_neon:     7.56   6.32   5.01  12.34   4.47   6.97
pal_pred_w4_8bpc_neon:               5.47   7.76   4.40   5.20   8.32   7.03
pal_pred_w8_8bpc_neon:              11.11  14.12   8.44  13.95  11.88  12.43
pal_pred_w16_8bpc_neon:             14.38  20.95   9.84  17.43  14.77  13.56
pal_pred_w32_8bpc_neon:             12.91  19.85  10.87  19.03  14.63  14.62
pal_pred_w64_8bpc_neon:             14.01  19.23  10.82  19.82  16.23  16.32
cfl_ac_420_w4_8bpc_neon:             8.11  13.41   7.92   9.26  10.55   9.36
cfl_ac_420_w8_8bpc_neon:             7.77  15.71   7.69   8.94   9.76   8.56
cfl_ac_420_w16_8bpc_neon:            7.72  13.71   8.30   9.05   9.81   9.02
cfl_ac_422_w4_8bpc_neon:             8.85  15.80   8.26  10.97  13.04  10.00
cfl_ac_422_w8_8bpc_neon:             8.77  16.96   7.57  10.46  12.16   9.92
cfl_ac_422_w16_8bpc_neon:            8.28  14.91   7.16   9.69  10.57   9.18
cfl_ac_444_w4_8bpc_neon:             7.47  14.13   7.50   9.76  11.11   9.39
cfl_ac_444_w8_8bpc_neon:             6.81  15.46   5.27   9.11  12.09   9.76
cfl_ac_444_w16_8bpc_neon:            6.11  13.68   4.62   8.17  10.78   8.92
cfl_ac_444_w32_8bpc_neon:            5.71  12.11   4.28   7.53   9.53   8.52
cfl_pred_cfl_128_w4_8bpc_neon:       7.46  12.63   8.48   8.03   7.64   9.29
cfl_pred_cfl_128_w8_8bpc_neon:       5.05   5.16   3.79   4.64   5.07   4.42
cfl_pred_cfl_128_w16_8bpc_neon:      4.44   5.17   3.65   4.20   4.41   4.74
cfl_pred_cfl_128_w32_8bpc_neon:      4.51   5.25   3.67   4.29   4.39   4.73
cfl_pred_cfl_left_w4_8bpc_neon:      6.60  11.74   7.75   6.91   7.44   9.14
cfl_pred_cfl_left_w8_8bpc_neon:      4.92   5.15   3.80   4.41   5.44   4.81
cfl_pred_cfl_left_w16_8bpc_neon:     4.40   5.26   3.66   4.10   4.63   4.94
cfl_pred_cfl_left_w32_8bpc_neon:     4.50   5.31   3.68   4.25   4.43   4.82
cfl_pred_cfl_top_w4_8bpc_neon:       7.00  11.88   7.88   7.50   7.43   9.68
cfl_pred_cfl_top_w8_8bpc_neon:       4.96   5.07   3.78   4.51   5.31   4.75
cfl_pred_cfl_top_w16_8bpc_neon:      4.42   5.31   3.69   4.16   4.60   4.93
cfl_pred_cfl_top_w32_8bpc_neon:      4.52   5.36   3.71   4.29   4.47   4.83
cfl_pred_cfl_w4_8bpc_neon:           5.92  10.54   7.25   6.21   6.79   8.33
cfl_pred_cfl_w8_8bpc_neon:           4.67   5.16   3.77   4.14   5.20   4.71
cfl_pred_cfl_w16_8bpc_neon:          4.29   5.29   3.70   3.97   4.53   4.86
cfl_pred_cfl_w32_8bpc_neon:          4.47   5.34   3.72   4.20   4.42   4.83
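As an aside on the alignment note above, a minimal sketch of aligning a lookup table so SIMD code can use aligned loads (illustrative GCC/Clang attribute syntax and example values, not dav1d's actual table or its portability macros):

    #include <stdint.h>

    /* A 16-byte boundary matches the 128-bit NEON register width,
     * so aligned-load instructions can be used on the table. */
    static const uint8_t sm_weights[16] __attribute__((aligned(16))) = {
        /* example values only */
        255, 240, 225, 210, 196, 182, 169, 157,
        145, 133, 122, 111, 101,  92,  83,  74,
    };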
Martin Storsjö authored
Do the horizontal summing in the same way as for the other cases of 32-pixel summing. This doesn't seem to affect the runtime significantly (checkasm benchmarks vary by a couple of cycles), but it is at least 5 instructions shorter.
Martin Storsjö authored
Martin Storsjö authored
This matches the arm64 original. The comment isn't about the condition, but about the state after the conditional branch.
Martin Storsjö authored
These came from matching some parts too closely to the arm64 version (where the summation can be done efficiently with uaddlv by zeroing the upper half of the register).

Before:               Cortex     A7     A8     A9    A53    A72    A73
intra_pred_dc_w4_8bpc_neon:   124.5   65.1   90.2  100.4   48.1   50.4
After:
intra_pred_dc_w4_8bpc_neon:   120.3   60.7   83.6   94.0   44.1   47.9
Martin Storsjö authored
This speeds things up a bit on older cores. Also do a load that duplicates the input over the whole register instead of just loading a single lane in ipred_v_w4. This can be a bit faster on Cortex A8.

Before:                     Cortex     A7     A8     A9    A53    A72    A73
intra_pred_v_w4_8bpc_neon:           54.0   38.4   46.4   47.7   20.4   18.1
intra_pred_h_w4_8bpc_neon:           66.3   43.1   55.0   57.0   27.9   22.2
intra_pred_h_w8_8bpc_neon:           81.0   60.2   76.7   66.5   31.1   30.1
intra_pred_dc_left_w4_8bpc_neon:     91.0   49.0   72.8   77.7   35.4   38.5
intra_pred_dc_left_w8_8bpc_neon:    103.8   73.5   90.2   84.7   42.8   47.1
intra_pred_dc_left_w16_8bpc_neon:   156.1  101.8  186.1  119.4   77.7   92.6
intra_pred_dc_left_w32_8bpc_neon:   270.5  200.5  381.6  191.7  152.6  170.3
intra_pred_dc_left_w64_8bpc_neon:   560.7  439.1  877.0  375.4  333.5  343.6
After:
intra_pred_v_w4_8bpc_neon:           53.9   38.0   46.4   47.7   19.8   19.2
intra_pred_h_w4_8bpc_neon:           66.5   39.2   52.6   57.0   27.7   22.2
intra_pred_h_w8_8bpc_neon:           80.5   55.8   72.9   66.5   31.4   30.1
intra_pred_dc_left_w4_8bpc_neon:     91.0   48.2   71.8   77.7   34.9   38.6
intra_pred_dc_left_w8_8bpc_neon:    103.8   69.6   89.2   84.7   43.2   47.3
intra_pred_dc_left_w16_8bpc_neon:   182.3   99.9  184.9  118.8   77.7   85.8
intra_pred_dc_left_w32_8bpc_neon:   355.4  198.9  380.1  190.6  152.9  161.0
intra_pred_dc_left_w64_8bpc_neon:   517.5  437.4  876.9  375.7  333.3  347.7
Martin Storsjö authored
Relative speedup over C code:

                     Cortex    A53    A72    A73
cfl_ac_444_w4_16bpc_neon:     8.03   9.41  10.48
cfl_ac_444_w8_16bpc_neon:    10.17  10.54  10.38
cfl_ac_444_w16_16bpc_neon:   10.73  10.38   9.73
cfl_ac_444_w32_16bpc_neon:   10.18   9.43   9.77
Martin Storsjö authored
Relative speedup over C code:

                    Cortex    A53    A72    A73
cfl_ac_444_w4_8bpc_neon:     8.72   8.75  10.50
cfl_ac_444_w8_8bpc_neon:    13.10  10.77  11.23
cfl_ac_444_w16_8bpc_neon:   13.08   9.95  10.49
cfl_ac_444_w32_8bpc_neon:   12.58   9.43  10.63
Martin Storsjö authored
The branch target is directly afterwards, so the branch isn't needed.
Martin Storsjö authored
It became unused in 38629906.
Martin Storsjö authored
Before:                    Cortex    A53    A72    A73
intra_pred_filter_w16_8bpc_neon:   540.2  573.8  580.2
intra_pred_filter_w32_8bpc_neon:  1223.1 1364.1 1292.9
After:
intra_pred_filter_w16_8bpc_neon:   531.4  559.8  565.4
intra_pred_filter_w32_8bpc_neon:  1243.0 1308.6 1270.9

This does give a minor slowdown for the w32 case on A53, but helps on w16, and quite notably in all cases on A72 and A73. Doing the same modification on ipred16.S doesn't give quite as clear gains (the gains on A72 and A73 are smaller, and the regression on A53 on w32 is a bit bigger), so the same adjustment isn't done there.
Martin Storsjö authored
Martin Storsjö authored
- Jul 01, 2020
%{:} macro operand ranges were broken in nasm 2.15, which causes errors when compiling, so avoid using those for now. Some new warnings regarding the use of empty macro parameters have also been added; adjust some x86inc code to silence those.
- Jun 29, 2020
Meson does not yet normalise arm64 to aarch64 in the reference table. To work around this, check the cpu field in addition to cpu_family.
Since 46d092ae the demuxer is no longer detected from the file extension but rather by probing.
- Jun 25, 2020
- Jun 24, 2020
- Jun 23, 2020
Broadcasting a memory operand is a binary flag: you either broadcast or you don't, and there's only a single possible element size for any given instruction. The instruction syntax, however, requires the broadcast semantics to be explicitly defined, which is an issue when using macros to template code for multiple register widths. Add some helper defines to alleviate the issue.
Ronald S. Bultje authored
The shift amount can be up to 56, and left-shifting 32-bit integers by values >= 32 is undefined behaviour. Therefore, use 64-bit integers instead. Also slightly rewrite so we only call dav1d_get_bits() once for the combined more|bits value, and mask the relevant portions out instead of reading twice. Lastly, move the overflow check out of the loop (as suggested by @wtc). Fixes #341.
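A minimal sketch of the two fixes described above (names are illustrative, not the actual dav1d code):

    #include <stdint.h>

    /* Left-shifting a 32-bit integer by >= 32 is undefined behaviour,
     * and the shift amount here can reach 56, so widen to 64 bits. */
    static uint64_t shift64(uint32_t bits, unsigned shift) {
        /* return bits << shift;  <- UB once shift >= 32 */
        return (uint64_t)bits << shift;  /* defined for shift <= 63 */
    }

    /* Reading the combined more|bits field once and masking the two
     * parts out, instead of calling the bit reader twice. */
    static void split_more_bits(uint64_t v, unsigned n,
                                uint64_t *more, uint64_t *bits) {
        *more = v >> n;                 /* top "continue" bit */
        *bits = v & ((1ull << n) - 1);  /* low n value bits   */
    }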
- Jun 21, 2020
dav1d_init_get_bits() initializes c->eof to 0, which implies c->ptr < c->ptr_end, or equivalently sz > 0.
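For illustration, a reduced sketch of that invariant, using the field names the message itself refers to (not the full dav1d GetBits state):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        const uint8_t *ptr, *ptr_end;
        int eof;
    } GetBits;

    static void init_get_bits(GetBits *c, const uint8_t *data, size_t sz) {
        c->ptr     = data;
        c->ptr_end = data + sz;
        /* eof == 0 asserts ptr < ptr_end, which only holds for sz > 0,
         * so callers may assume at least one readable byte. */
        c->eof     = 0;
    }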
- Jun 20, 2020
Jean-Baptiste Kempf authored
Luc Trudeau authored
- Jun 19, 2020
Some specific Haswell CPUs have a hardware bug where the popcnt instruction doesn't set the zero flag correctly, which causes the wrong branch to be taken. popcnt also has a 3-cycle latency on Intel CPUs, so branching on the input value instead of the output reduces the amount of time wasted going down the wrong code path in case of branch misprediction.
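The actual change is in hand-written assembly; a rough C sketch of the idea (using the GCC/Clang __builtin_popcount intrinsic):

    #include <stdint.h>

    /* Branching on the popcnt output stalls on its ~3-cycle latency
     * (and, on the affected Haswell parts, on the flags erratum): */
    static int count_then_branch(uint32_t mask) {
        int n = __builtin_popcount(mask);
        if (n == 0)          /* must wait for popcnt to finish */
            return -1;
        return n;
    }

    /* Branching on the input instead issues the test immediately,
     * independent of the popcnt result: */
    static int branch_then_count(uint32_t mask) {
        if (mask == 0)       /* no dependency on popcnt */
            return -1;
        return __builtin_popcount(mask);
    }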
Martin Storsjö authored
Only use this in the cases when NEON can be used unconditionally without runtime detection (when __ARM_NEON is defined). The speedup over the C code is very modest for the smaller functions (and the NEON version actually is a little slower than the C code on Cortex A7 for adapt4), but the speedup is around 2x for adapt16.

                           Cortex     A7     A8     A9    A53    A72    A73
msac_decode_bool_c:                  41.1   43.0   43.0   37.3   26.2   31.3
msac_decode_bool_neon:               40.2   42.0   37.2   32.8   19.9   25.5
msac_decode_bool_adapt_c:            65.1   70.4   58.5   54.3   33.2   40.8
msac_decode_bool_adapt_neon:         56.8   52.4   49.3   42.6   27.1   33.7
msac_decode_bool_equi_c:             36.9   37.2   42.8   32.6   22.7   42.3
msac_decode_bool_equi_neon:          34.9   35.1   36.4   29.7   19.5   36.4
msac_decode_symbol_adapt4_c:        114.2  139.0  111.6   99.9   65.5   83.5
msac_decode_symbol_adapt4_neon:     119.2  128.3   95.7   82.2   58.2   57.5
msac_decode_symbol_adapt8_c:        176.0  207.9  164.0  154.4   88.0  117.0
msac_decode_symbol_adapt8_neon:     128.3  130.3  110.7   85.1   59.9   61.4
msac_decode_symbol_adapt16_c:       292.1  320.5  256.4  246.4  129.1  173.3
msac_decode_symbol_adapt16_neon:    162.2  144.3  129.0  104.2   69.2   69.9

(Omitting msac_decode_hi_tok from the benchmark, as the "C" version measured there uses the NEON version of msac_decode_symbol_adapt4.)
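A minimal sketch of that compile-time gating (placeholder function names; __ARM_NEON is the standard ACLE feature macro):

    /* Stand-ins for the C and NEON implementations. */
    static unsigned decode_c(void)    { return 0; }
    static unsigned decode_neon(void) { return 1; }

    /* When the compiler defines __ARM_NEON, NEON is part of the
     * baseline target, so the NEON path can be selected at compile
     * time without any runtime CPU detection. */
    static unsigned decode(void) {
    #if defined(__ARM_NEON)
        return decode_neon();
    #else
        return decode_c();
    #endif
    }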
- Jun 18, 2020
Martin Storsjö authored
The speedup (over the normal version, which just calls the existing assembly version of symbol_adapt4) is not very impressive on bigger cores, but looks decent on small cores. It's an improvement in any case.

                   Cortex    A53    A72    A73
msac_decode_hi_tok_c:       175.7  136.2  138.1
msac_decode_hi_tok_neon:    146.8  129.4  125.9
Martin Storsjö authored
Martin Storsjö authored
Include the letter prefix when calling the macro, making it slightly less obscure.
By multiplying the performance counter value (within its own time base) by the intended target time base, and only then dividing, we reduce the available numeric range by a factor of the original time base times the new time base. On Windows 10 on ARM64, the performance counter frequency is 19200000 (on x86_64 in a virtual machine, it's 10000000), making the calculation overflow every (1 << 64) / (19200000 * 1000000000) = 960 seconds, i.e. 16 minutes, long before the actual uint64_t nanosecond return value wraps around.
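A sketch of the overflow-prone order of operations versus one safe variant (illustrative helper names, not the actual change):

    #include <stdint.h>

    #define NS_PER_S 1000000000ull

    /* Overflows once ticks * NS_PER_S exceeds UINT64_MAX; at a
     * 19.2 MHz counter that is (1 << 64) / (19200000 * 1000000000),
     * about 960 seconds of elapsed time. */
    static uint64_t ticks_to_ns_naive(uint64_t ticks, uint64_t freq) {
        return ticks * NS_PER_S / freq;
    }

    /* Splitting into whole seconds and a remainder keeps every
     * intermediate product far below the 64-bit limit. */
    static uint64_t ticks_to_ns_safe(uint64_t ticks, uint64_t freq) {
        return ticks / freq * NS_PER_S
             + ticks % freq * NS_PER_S / freq;
    }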
Martin Storsjö authored
Even if we don't want to throttle decoding to realtime, and even if the file itself didn't contain a valid fps value, we may want to call the synchronize function to fetch the current elapsed decoding time, for displaying the fps value.
Jean-Baptiste Kempf authored
Victorien Le Couviour--Tuffet authored
Bilin scaled being very rarely used, add a new table entry to mc_subpel_filters and jump to the put/prep_8tap_scaled code. AVX2 performance is obviously the same as for the 8tap code; the speedup over C is much smaller though, as the C code is a true bilinear codepath that gets auto-vectorized. Still, the AVX2 performance is always better.
Victorien Le Couviour--Tuffet authored
mct_scaled_8tap_regular_w4_8bpc_c:             872.1
mct_scaled_8tap_regular_w4_8bpc_avx2:          125.6
mct_scaled_8tap_regular_w4_dy1_8bpc_c:         886.3
mct_scaled_8tap_regular_w4_dy1_8bpc_avx2:       84.0
mct_scaled_8tap_regular_w4_dy2_8bpc_c:        1189.1
mct_scaled_8tap_regular_w4_dy2_8bpc_avx2:       84.7
mct_scaled_8tap_regular_w8_8bpc_c:            2261.0
mct_scaled_8tap_regular_w8_8bpc_avx2:          306.2
mct_scaled_8tap_regular_w8_dy1_8bpc_c:        2189.9
mct_scaled_8tap_regular_w8_dy1_8bpc_avx2:      233.8
mct_scaled_8tap_regular_w8_dy2_8bpc_c:        3060.3
mct_scaled_8tap_regular_w8_dy2_8bpc_avx2:      282.8
mct_scaled_8tap_regular_w16_8bpc_c:           4335.3
mct_scaled_8tap_regular_w16_8bpc_avx2:         680.7
mct_scaled_8tap_regular_w16_dy1_8bpc_c:       5137.2
mct_scaled_8tap_regular_w16_dy1_8bpc_avx2:     578.6
mct_scaled_8tap_regular_w16_dy2_8bpc_c:       7878.4
mct_scaled_8tap_regular_w16_dy2_8bpc_avx2:     774.6
mct_scaled_8tap_regular_w32_8bpc_c:          17871.9
mct_scaled_8tap_regular_w32_8bpc_avx2:        2954.8
mct_scaled_8tap_regular_w32_dy1_8bpc_c:      18594.7
mct_scaled_8tap_regular_w32_dy1_8bpc_avx2:    2073.9
mct_scaled_8tap_regular_w32_dy2_8bpc_c:      28696.0
mct_scaled_8tap_regular_w32_dy2_8bpc_avx2:    2852.1
mct_scaled_8tap_regular_w64_8bpc_c:          46967.5
mct_scaled_8tap_regular_w64_8bpc_avx2:        7527.5
mct_scaled_8tap_regular_w64_dy1_8bpc_c:      45564.2
mct_scaled_8tap_regular_w64_dy1_8bpc_avx2:    5262.9
mct_scaled_8tap_regular_w64_dy2_8bpc_c:      72793.3
mct_scaled_8tap_regular_w64_dy2_8bpc_avx2:    7535.9
mct_scaled_8tap_regular_w128_8bpc_c:        111190.8
mct_scaled_8tap_regular_w128_8bpc_avx2:      19386.8
mct_scaled_8tap_regular_w128_dy1_8bpc_c:    122625.0
mct_scaled_8tap_regular_w128_dy1_8bpc_avx2:  15376.1
mct_scaled_8tap_regular_w128_dy2_8bpc_c:    197120.6
mct_scaled_8tap_regular_w128_dy2_8bpc_avx2:  21871.0