- Oct 07, 2019
-
Ronald S. Bultje authored
gen_grain_uv_ar0_8bpc_420_c:     30131.8
gen_grain_uv_ar0_8bpc_420_avx2:   6600.4
gen_grain_uv_ar1_8bpc_420_c:     46110.5
gen_grain_uv_ar1_8bpc_420_avx2:  17887.2
gen_grain_uv_ar2_8bpc_420_c:     73593.2
gen_grain_uv_ar2_8bpc_420_avx2:  26918.6
gen_grain_uv_ar3_8bpc_420_c:    114499.3
gen_grain_uv_ar3_8bpc_420_avx2:  29804.6
-
Martin Storsjö authored
Before:                Cortex A53     A72     A73
warp_8x8_8bpc_neon:        1997.3  1170.1  1199.9
warp_8x8t_8bpc_neon:       1982.4  1171.5  1192.6
After:
warp_8x8_8bpc_neon:        1954.6  1159.2  1153.3
warp_8x8t_8bpc_neon:       1938.5  1146.2  1136.7
-
- Oct 03, 2019
-
Prior checks were done at the sbrow level. This now allows dav1d_lr_sbrow and dav1d_lr_copy_lpf to be called only when there's something for them to do.
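A rough sketch of the resulting guard, using hypothetical context/field layouts (only the two function names above are from this change):

    /* Skip loop restoration entirely when no plane has a restoration
     * type enabled; structure and variable names are made up. */
    int restore = 0;
    for (int p = 0; p < 3; p++)
        if (frame_hdr->restoration.type[p] != DAV1D_RESTORATION_NONE)
            restore = 1;

    if (restore) {
        dav1d_lr_copy_lpf(f, src, sby);  /* only when there is work to do */
        dav1d_lr_sbrow(f, dst, sby);
    }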
-
- Oct 02, 2019
-
Martin Storsjö authored
-
Henrik Gramner authored
-
Henrik Gramner authored
The existing code was using 16-bit intermediate precision for certain calculations which is insufficient for some esoteric edge cases.
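As an illustration of the failure mode (not the actual code in question):

    #include <stdint.h>

    /* Accumulating products in int16_t silently wraps, while a 32-bit
     * intermediate handles the same inputs correctly. */
    int32_t dot(const uint8_t *px, const int8_t *coef, int n)
    {
        int32_t sum = 0;             /* 32-bit intermediate: safe */
        for (int i = 0; i < n; i++)
            sum += px[i] * coef[i];  /* up to 255*127 per tap soon exceeds
                                        INT16_MAX if summed in 16 bits */
        return sum;
    }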
-
Henrik Gramner authored
--list-functions now prints a list of all function names. Uses stdout for easy grepping/piping. Can be combined with the --test option to only list functions within a specific test. Also rename --list to --list-tests and make it print to stdout as well for consistency.
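For example (the test name "mc" here is just an assumption):

    checkasm --list-functions | grep warp    # grep for a substring
    checkasm --test=mc --list-functions      # list only one test's functions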
-
- Oct 01, 2019
-
Ronald S. Bultje authored
-
Martin Storsjö authored
Relative speedups over the C code:

                                 Cortex A53    A72    A73
intra_pred_dc_128_w4_8bpc_neon:        2.08   1.47   2.17
intra_pred_dc_128_w8_8bpc_neon:        3.33   2.49   4.03
intra_pred_dc_128_w16_8bpc_neon:       3.93   3.86   3.75
intra_pred_dc_128_w32_8bpc_neon:       3.14   3.79   2.90
intra_pred_dc_128_w64_8bpc_neon:       3.68   1.97   2.42
intra_pred_dc_left_w4_8bpc_neon:       2.41   1.70   2.23
intra_pred_dc_left_w8_8bpc_neon:       3.53   2.41   3.32
intra_pred_dc_left_w16_8bpc_neon:      3.87   3.54   3.34
intra_pred_dc_left_w32_8bpc_neon:      4.10   3.60   2.76
intra_pred_dc_left_w64_8bpc_neon:      3.72   2.00   2.39
intra_pred_dc_top_w4_8bpc_neon:        2.27   1.66   2.07
intra_pred_dc_top_w8_8bpc_neon:        3.83   2.69   3.43
intra_pred_dc_top_w16_8bpc_neon:       3.66   3.60   3.20
intra_pred_dc_top_w32_8bpc_neon:       3.92   3.54   2.66
intra_pred_dc_top_w64_8bpc_neon:       3.60   1.98   2.30
intra_pred_dc_w4_8bpc_neon:            2.29   1.42   2.16
intra_pred_dc_w8_8bpc_neon:            3.56   2.83   3.05
intra_pred_dc_w16_8bpc_neon:           3.46   3.37   3.15
intra_pred_dc_w32_8bpc_neon:           3.79   3.41   2.74
intra_pred_dc_w64_8bpc_neon:           3.52   2.01   2.41
intra_pred_h_w4_8bpc_neon:            10.34   5.74   5.94
intra_pred_h_w8_8bpc_neon:            12.13   6.33   6.43
intra_pred_h_w16_8bpc_neon:           10.66   7.31   5.85
intra_pred_h_w32_8bpc_neon:            6.28   4.18   2.88
intra_pred_h_w64_8bpc_neon:            3.96   1.85   1.75
intra_pred_v_w4_8bpc_neon:            11.44   6.12   7.57
intra_pred_v_w8_8bpc_neon:            14.76   7.58   7.95
intra_pred_v_w16_8bpc_neon:           11.34   6.28   5.88
intra_pred_v_w32_8bpc_neon:            6.56   3.33   3.34
intra_pred_v_w64_8bpc_neon:            4.57   1.24   1.97
-
- Sep 30, 2019
-
Victorien Le Couviour--Tuffet authored
------------------------------------------
x86_64: warp_8x8_8bpc_c:      1773.4
x86_32: warp_8x8_8bpc_c:      1740.4
----------
x86_64: warp_8x8_8bpc_ssse3:   317.5
x86_32: warp_8x8_8bpc_ssse3:   378.4
----------
x86_64: warp_8x8_8bpc_sse4:    303.7
x86_32: warp_8x8_8bpc_sse4:    367.7
----------
x86_64: warp_8x8_8bpc_avx2:    224.9
------------------------------------------
x86_64: warp_8x8t_8bpc_c:     1664.6
x86_32: warp_8x8t_8bpc_c:     1674.0
----------
x86_64: warp_8x8t_8bpc_ssse3:  320.7
x86_32: warp_8x8t_8bpc_ssse3:  379.5
----------
x86_64: warp_8x8t_8bpc_sse4:   304.8
x86_32: warp_8x8t_8bpc_sse4:   369.8
----------
x86_64: warp_8x8t_8bpc_avx2:   228.5
------------------------------------------
-
- Sep 29, 2019
-
Martin Storsjö authored
Don't add two 16-bit coefficients in 16 bits if the result isn't supposed to be clipped. This fixes mismatches for some samples, see issue #299.

Before:                                    Cortex A53      A72      A73
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:            93.0     52.8     49.5
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:           260.0    186.0    196.4
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:        1371.0    953.4   1028.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:        7363.2   4887.5   5135.8
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:       25029.0  17492.3  18404.5
After:
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:           105.0     58.7     55.2
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:           294.0    211.5    209.9
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:        1495.8   1050.4   1070.6
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:        7866.7   5197.8   5321.4
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:       25807.2  18619.3  18526.9
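The pattern being avoided, as a small hedged sketch:

    #include <stdint.h>

    /* Widen before adding so the sum of two int16_t coefficients
     * cannot wrap around when it is used unclipped. */
    static int32_t add_unclipped(int16_t a, int16_t b)
    {
        /* wrong: (int16_t)(a + b) wraps for e.g. 20000 + 20000 */
        return (int32_t)a + b;  /* right: 32-bit addition */
    }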
-
Martin Storsjö authored
The scaled form 2896>>4 shouldn't be necessary with valid bitstreams.
-
Martin Storsjö authored
Even though smull+smlal does two multiplications instead of one, the combination seems to be better handled by actual cores.

Before:                                      Cortex A53     A72     A73
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:           356.0   279.2   278.0
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:        1785.0  1329.5  1308.8
After:
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:           360.0   253.2   269.3
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:        1793.1  1300.9  1254.0

(In this particular case it seems like a minor regression on A53, which is probably due to having to change the ordering of some instructions, since smull+smlal+smull2+smlal2 overwrites the second output register sooner than an addl+addl2 would have. In general, though, smull+smlal seems to be equally good or better than addl+mul on A53 as well.)
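Expressed as NEON intrinsics rather than assembly (an illustration only, with made-up function names), the two patterns compared here compute (a + b) * c with a 32-bit result:

    #include <arm_neon.h>

    int32x4_t via_addl_mul(int16x4_t a, int16x4_t b, int16_t c)
    {
        /* addl + mul: widen first, then a single 32-bit multiply */
        return vmulq_n_s32(vaddl_s16(a, b), c);
    }

    int32x4_t via_smull_smlal(int16x4_t a, int16x4_t b, int16_t c)
    {
        /* smull + smlal: two multiplies, but the multiply-accumulate
         * chain maps better onto the cores' MAC pipelines */
        return vmlal_n_s16(vmull_n_s16(a, c), b, c);
    }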
-
- Sep 27, 2019
-
Right now this just allocates a new buffer for every frame, uses it, then discards it immediately. This is not optimal: either dav1d should start reusing buffers internally, or we need to pool them in dav1dplay. As it stands, this is not really a performance gain. I'll have to investigate why, but my suspicion is that seeing any gains might require reusing buffers somewhere.

Note: thrashing buffers is not as bad as it initially seems. Not only does libplacebo pool and reuse GPU memory and buffer state objects internally, but this also absolves us from having to do any manual polling to figure out when a buffer is reusable again. So creating, using and immediately destroying buffers isn't as bad an approach as it might otherwise seem. It's entirely possible that this is only bad because of lock contention. As said, I'll have to investigate further...
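A minimal sketch of the kind of pooling that could be tried in dav1dplay, with entirely hypothetical names (the real libplacebo integration would look different):

    #include <stdlib.h>

    typedef struct node { void *buf; size_t size; struct node *next; } node;
    static node *pool;  /* a real pool needs a lock, which is exactly
                           where the suspected contention could appear */

    static void *pool_get(size_t size)
    {
        /* reuse a returned buffer of the right size if one exists */
        for (node **p = &pool; *p; p = &(*p)->next)
            if ((*p)->size == size) {
                node *n = *p; *p = n->next;
                void *buf = n->buf;
                free(n);
                return buf;
            }
        return malloc(size);  /* miss: allocate fresh */
    }

    static void pool_put(void *buf, size_t size)
    {
        node *n = malloc(sizeof(*n));
        n->buf = buf; n->size = size; n->next = pool; pool = n;
    }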
-
Useful to test the effects of performance changes to the decoding/rendering loop as a whole.
-
Only meaningful with libplacebo. The defaults are higher quality than SDL so it's an unfair comparison and definitely too much for slow iGPUs at 4K res. Make the defaults fast/dumb processing only, and guard the debanding/dithering/upscaling/etc. behind a new --highquality flag.
-
- Sep 19, 2019
-
Victorien Le Couviour--Tuffet authored
------------------------------------------
x86_64: lpf_h_sb_uv_w4_8bpc_c:       430.6
x86_32: lpf_h_sb_uv_w4_8bpc_c:       788.6
x86_64: lpf_h_sb_uv_w4_8bpc_ssse3:   322.0
x86_32: lpf_h_sb_uv_w4_8bpc_ssse3:   302.4
---------------------
x86_64: lpf_h_sb_uv_w6_8bpc_c:       981.9
x86_32: lpf_h_sb_uv_w6_8bpc_c:      1579.6
x86_64: lpf_h_sb_uv_w6_8bpc_ssse3:   421.5
x86_32: lpf_h_sb_uv_w6_8bpc_ssse3:   431.6
---------------------
x86_64: lpf_h_sb_y_w4_8bpc_c:       3001.7
x86_32: lpf_h_sb_y_w4_8bpc_c:       7021.3
x86_64: lpf_h_sb_y_w4_8bpc_ssse3:    466.3
x86_32: lpf_h_sb_y_w4_8bpc_ssse3:    564.7
---------------------
x86_64: lpf_h_sb_y_w8_8bpc_c:       4457.7
x86_32: lpf_h_sb_y_w8_8bpc_c:       3657.8
x86_64: lpf_h_sb_y_w8_8bpc_ssse3:    818.9
x86_32: lpf_h_sb_y_w8_8bpc_ssse3:    927.9
---------------------
x86_64: lpf_h_sb_y_w16_8bpc_c:      1967.9
x86_32: lpf_h_sb_y_w16_8bpc_c:      3343.5
x86_64: lpf_h_sb_y_w16_8bpc_ssse3:  1836.7
x86_32: lpf_h_sb_y_w16_8bpc_ssse3:  1975.0
---------------------
x86_64: lpf_v_sb_uv_w4_8bpc_c:       369.4
x86_32: lpf_v_sb_uv_w4_8bpc_c:       793.6
x86_64: lpf_v_sb_uv_w4_8bpc_ssse3:   110.9
x86_32: lpf_v_sb_uv_w4_8bpc_ssse3:   133.0
---------------------
x86_64: lpf_v_sb_uv_w6_8bpc_c:       769.6
x86_32: lpf_v_sb_uv_w6_8bpc_c:      1576.7
x86_64: lpf_v_sb_uv_w6_8bpc_ssse3:   222.2
x86_32: lpf_v_sb_uv_w6_8bpc_ssse3:   232.2
---------------------
x86_64: lpf_v_sb_y_w4_8bpc_c:        772.4
x86_32: lpf_v_sb_y_w4_8bpc_c:       2596.5
x86_64: lpf_v_sb_y_w4_8bpc_ssse3:    179.8
x86_32: lpf_v_sb_y_w4_8bpc_ssse3:    234.7
---------------------
x86_64: lpf_v_sb_y_w8_8bpc_c:       1660.2
x86_32: lpf_v_sb_y_w8_8bpc_c:       3979.9
x86_64: lpf_v_sb_y_w8_8bpc_ssse3:    468.3
x86_32: lpf_v_sb_y_w8_8bpc_ssse3:    580.9
---------------------
x86_64: lpf_v_sb_y_w16_8bpc_c:      1889.6
x86_32: lpf_v_sb_y_w16_8bpc_c:      4728.7
x86_64: lpf_v_sb_y_w16_8bpc_ssse3:  1142.0
x86_32: lpf_v_sb_y_w16_8bpc_ssse3:  1174.8
------------------------------------------
-
x86_64:
------------------------------------------
lpf_h_sb_uv_w4_8bpc_c:       430.6
lpf_h_sb_uv_w4_8bpc_ssse3:   322.0
lpf_h_sb_uv_w4_8bpc_avx2:    200.4
---------------------
lpf_h_sb_uv_w6_8bpc_c:       981.9
lpf_h_sb_uv_w6_8bpc_ssse3:   421.5
lpf_h_sb_uv_w6_8bpc_avx2:    270.0
---------------------
lpf_h_sb_y_w4_8bpc_c:       3001.7
lpf_h_sb_y_w4_8bpc_ssse3:    466.3
lpf_h_sb_y_w4_8bpc_avx2:     383.1
---------------------
lpf_h_sb_y_w8_8bpc_c:       4457.7
lpf_h_sb_y_w8_8bpc_ssse3:    818.9
lpf_h_sb_y_w8_8bpc_avx2:     537.0
---------------------
lpf_h_sb_y_w16_8bpc_c:      1967.9
lpf_h_sb_y_w16_8bpc_ssse3:  1836.7
lpf_h_sb_y_w16_8bpc_avx2:   1078.2
---------------------
lpf_v_sb_uv_w4_8bpc_c:       369.4
lpf_v_sb_uv_w4_8bpc_ssse3:   110.9
lpf_v_sb_uv_w4_8bpc_avx2:     58.1
---------------------
lpf_v_sb_uv_w6_8bpc_c:       769.6
lpf_v_sb_uv_w6_8bpc_ssse3:   222.2
lpf_v_sb_uv_w6_8bpc_avx2:    117.8
---------------------
lpf_v_sb_y_w4_8bpc_c:        772.4
lpf_v_sb_y_w4_8bpc_ssse3:    179.8
lpf_v_sb_y_w4_8bpc_avx2:     173.6
---------------------
lpf_v_sb_y_w8_8bpc_c:       1660.2
lpf_v_sb_y_w8_8bpc_ssse3:    468.3
lpf_v_sb_y_w8_8bpc_avx2:     345.8
---------------------
lpf_v_sb_y_w16_8bpc_c:      1889.6
lpf_v_sb_y_w16_8bpc_ssse3:  1142.0
lpf_v_sb_y_w16_8bpc_avx2:    568.1
------------------------------------------
-
- Sep 10, 2019
-
Ronald S. Bultje authored
fguv_32x32xn_8bpc_420_csfl0_c:    8945.4
fguv_32x32xn_8bpc_420_csfl0_avx2: 1001.6
fguv_32x32xn_8bpc_420_csfl1_c:    6363.4
fguv_32x32xn_8bpc_420_csfl1_avx2: 1299.5
-
Ronald S. Bultje authored
This would affect the output in samples with an odd width and horizontal chroma subsampling. The check does not exist in libaom, and might cause mismatches. This causes issues in the sample from #210, which uses super-resolution and has odd width. To work around this, make super-resolution's resize() always write an even number of pixels. This should not interfere with SIMD in the future.
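The workaround amounts to rounding the number of output pixels up to an even count before the resize loop, roughly (hypothetical variable name):

    /* Sketch: always write an even number of pixels so the removed
     * odd-width chroma check is never needed. */
    const int dst_w_even = (dst_w + 1) & ~1;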
-
Ronald S. Bultje authored
fgy_32x32xn_8bpc_c:        16181.8
fgy_32x32xn_8bpc_avx2:      3231.4
gen_grain_y_ar0_8bpc_c:   108857.6
gen_grain_y_ar0_8bpc_avx2: 22826.7
gen_grain_y_ar1_8bpc_c:   168239.8
gen_grain_y_ar1_8bpc_avx2: 72117.2
gen_grain_y_ar2_8bpc_c:   266165.9
gen_grain_y_ar2_8bpc_avx2: 126281.8
gen_grain_y_ar3_8bpc_c:   448139.4
gen_grain_y_ar3_8bpc_avx2: 137047.1
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
-
- Sep 06, 2019
-
James Almer authored
Both values can be independently coded in the bitstream, and are not always equal to frame_width and frame_height.
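Consumed from the public API, this might look as follows; the field names are taken as an assumption about dav1d's headers:

    /* Sketch: a player picking its display size from the frame header's
     * render dimensions rather than the coded frame dimensions. */
    const Dav1dFrameHeader *hdr = pic->frame_hdr;
    const int display_w = hdr->render_width;   /* may differ from pic->p.w */
    const int display_h = hdr->render_height;  /* may differ from pic->p.h */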
-
- Sep 05, 2019
-
Henrik Gramner authored
For some reason the MSVC CRT _wassert() function is not flagged as __declspec(noreturn), so when using those headers the compiler will expect execution to continue after an assertion has been triggered and will therefore complain about the use of uninitialized variables when compiled in debug mode in certain code paths. Reorder some case statements as a workaround.
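The class of warning looks roughly like this (illustrative code, not from dav1d):

    #include <assert.h>

    int pick(int x)
    {
        int v;
        switch (x) {
        case 0: v = 1; break;
        case 1: v = 2; break;
        default: assert(0);  /* without noreturn on the assert handler,
                                the compiler assumes execution continues
                                with v still uninitialized */
        }
        return v;  /* -> bogus "uninitialized variable" warning here */
    }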
-
For w <= 32 we can't process more than two rows per loop iteration. Credit to OSS-Fuzz.
-
- Sep 04, 2019
-
16-bit precision is sufficient for the second pass, but the first pass requires 32-bit precision to correctly handle some esoteric edge cases.
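Schematically (a hedged sketch, not the actual transform code):

    #include <stdint.h>

    /* Pass 1 widens to 32 bits; pass 2 can stay 16-bit. */
    static void butterfly_pass1(int32_t *lo, int32_t *hi, int32_t a, int32_t b)
    {
        *lo = a + b;  /* may exceed the int16_t range for extreme inputs */
        *hi = a - b;
    }

    static void butterfly_pass2(int16_t *lo, int16_t *hi, int16_t a, int16_t b)
    {
        *lo = (int16_t)(a + b);  /* 16 bits suffice once pass 1 has
                                    bounded the intermediate range */
        *hi = (int16_t)(a - b);
    }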
-
Martin Storsjö authored
See issue #295; this fixes it for arm64.

Before:                                      Cortex A53     A72     A73
inv_txfm_add_4x4_adst_adst_1_8bpc_neon:           103.0    63.2    65.2
inv_txfm_add_4x8_adst_adst_1_8bpc_neon:           197.0   145.0   134.2
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:           332.0   248.0   247.1
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:        1676.8  1197.0  1186.8
After:
inv_txfm_add_4x4_adst_adst_1_8bpc_neon:           103.0    76.4    67.0
inv_txfm_add_4x8_adst_adst_1_8bpc_neon:           205.0   155.0   143.8
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:           358.0   269.0   276.2
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:        1785.2  1347.8  1312.1

This would probably only be needed for adst in the first pass, but the additional code complexity of splitting the implementations (as we currently don't have transforms differentiated between first and second pass) isn't necessarily worth it; the speedup over C code is still 8-10x.
-
Henrik Gramner authored
__assume() doesn't work correctly in clang-cl versions prior to 7.0.0 which causes bogus warnings regarding use of uninitialized variables to be printed. Avoid that by using __builtin_unreachable() instead.
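A sketch of such a portability shim, with a hypothetical macro name; clang-cl defines both __clang__ and _MSC_VER, so testing __clang__ first routes it to __builtin_unreachable():

    #if defined(__clang__) || defined(__GNUC__)
    #define unreachable() __builtin_unreachable()
    #elif defined(_MSC_VER)
    #define unreachable() __assume(0)
    #else
    #define unreachable() do {} while (0)
    #endif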
-
Henrik Gramner authored
clang-cl doesn't like function calls in __assume statements, even trivial inline ones.
-
This large constant needs a movw instruction, which newer binutils can figure out on their own, but older versions need it stated explicitly. This fixes #296.
-
- Sep 03, 2019
-
Janne Grunau authored
The chroma part of pal_idx potentially conflicts with edge_{8,16}bpc during intra reconstruction. Fixes out-of-range pixel values caused by invalid palette indices in clusterfuzz-testcase-minimized-dav1d_fuzzer_mt-5076736684851200. Fixes #294. Reported as integer overflows in boxsum5sqr with the undefined behavior sanitizer. Credits to oss-fuzz.
-
- Sep 01, 2019
-
Janne Grunau authored
-
- Aug 30, 2019
-
Ronald S. Bultje authored
Fixes libaom/dav1d mismatch in av1-1-b10-23-film_grain-50.ivf.
-
Ronald S. Bultje authored
-
- calculate chroma grain based on src (not dst) luma pixels;
- division should precede multiplication in the delta calculation.

Together, these fix differences in film grain reconstruction between libaom and dav1d for various generated samples. For the second point, see the sketch below.
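A hedged sketch of why the ordering matters (hypothetical names, not the actual grain code):

    /* Integer division truncates, so the order of operations changes
     * the result; matching libaom requires dividing first. */
    static int grain_delta(int avg, int coeff, int divisor)
    {
        /* old: avg * coeff / divisor  (multiplication first) */
        return avg / divisor * coeff;  /* fixed: division first */
    }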
-
- Aug 29, 2019
-
B Krishnan Iyer authored
-
Martin Storsjö authored
Use the so far unused lr register instead of r10.
-