- Jan 01, 2021
-
-
Janne Grunau authored
-
- Dec 16, 2020
-
-
-
Martin Storsjö authored
Checkasm benchmarks:
Cortex                           A7       A8      A53      A72      A73
emu_edge_w4_16bpc_neon:       375.0    312.6    268.3    159.3    170.0
emu_edge_w8_16bpc_neon:       619.3    425.5    435.5    249.5    291.1
emu_edge_w16_16bpc_neon:      719.1    568.3    506.9    324.2    314.4
emu_edge_w32_16bpc_neon:     2112.2   1677.7   1396.2   1050.5   1009.6
emu_edge_w64_16bpc_neon:     5046.8   4322.5   3693.7   3953.8   2682.8
emu_edge_w128_16bpc_neon:   16311.1  14341.3  12877.8  26183.5   8924.9

Corresponding numbers for arm64, for comparison:
Cortex                          A53      A72      A73
emu_edge_w4_16bpc_neon:       302.5    174.9    159.2
emu_edge_w8_16bpc_neon:       344.6    292.3    273.2
emu_edge_w16_16bpc_neon:      601.0    461.2    316.8
emu_edge_w32_16bpc_neon:      974.2   1274.7    960.5
emu_edge_w64_16bpc_neon:     2853.1   3527.6   2633.5
emu_edge_w128_16bpc_neon:   14633.5  26776.6   7236.0
-
Martin Storsjö authored
Checkasm numbers:
Cortex                             A7       A8      A53      A72      A73
w_mask_420_w4_16bpc_neon:       350.3    216.4    215.4    141.7    134.5
w_mask_420_w8_16bpc_neon:       926.7    590.9    529.1    373.8    354.5
w_mask_420_w16_16bpc_neon:     2956.7   1880.4   1654.8   1186.1   1134.1
w_mask_420_w32_16bpc_neon:    11489.3   7426.4   6314.1   4599.8   4398.6
w_mask_420_w64_16bpc_neon:    28175.9  17898.1  16002.8  11079.0  10551.8
w_mask_420_w128_16bpc_neon:   71599.4  44630.9  40696.9  28057.3  27836.5
w_mask_422_w4_16bpc_neon:       339.0    210.1    206.7    137.3    134.7
w_mask_422_w8_16bpc_neon:       887.2    573.3    499.6    361.6    353.5
w_mask_422_w16_16bpc_neon:     2918.0   1841.6   1593.0   1194.0   1157.9
w_mask_422_w32_16bpc_neon:    11313.8   7238.7   6043.4   4577.1   4469.6
w_mask_422_w64_16bpc_neon:    27746.5  17427.2  15386.9  11082.6  10693.8
w_mask_422_w128_16bpc_neon:   70521.4  43864.9  39209.3  29045.7  28305.5
w_mask_444_w4_16bpc_neon:       325.6    202.9    198.4    135.2    129.3
w_mask_444_w8_16bpc_neon:       860.7    534.9    474.8    358.0    352.2
w_mask_444_w16_16bpc_neon:     2764.3   1714.4   1517.8   1160.6   1133.1
w_mask_444_w32_16bpc_neon:    10719.8   6738.3   5746.7   4458.6   4347.1
w_mask_444_w64_16bpc_neon:    26407.9  16224.1  14783.9  10784.3  10371.4
w_mask_444_w128_16bpc_neon:   67226.1  41060.1  37823.1  41696.1  27722.2

Corresponding numbers for arm64, for comparison:
Cortex                            A53      A72      A73
w_mask_420_w4_16bpc_neon:       173.6    123.6    120.3
w_mask_420_w8_16bpc_neon:       484.0    344.0    329.4
w_mask_420_w16_16bpc_neon:     1436.3   1025.7   1028.7
w_mask_420_w32_16bpc_neon:     5597.0   3994.8   3981.2
w_mask_420_w64_16bpc_neon:    13953.4   9700.8   9579.9
w_mask_420_w128_16bpc_neon:   35833.7  25519.3  24277.8
w_mask_422_w4_16bpc_neon:       159.4    111.7    114.2
w_mask_422_w8_16bpc_neon:       453.4    326.2    326.7
w_mask_422_w16_16bpc_neon:     1398.2   1063.3   1052.6
w_mask_422_w32_16bpc_neon:     5532.7   4143.0   4026.3
w_mask_422_w64_16bpc_neon:    13885.3   9978.0   9689.8
w_mask_422_w128_16bpc_neon:   35763.3  25822.4  24610.9
w_mask_444_w4_16bpc_neon:       152.9    110.0    112.8
w_mask_444_w8_16bpc_neon:       437.2    332.0    325.8
w_mask_444_w16_16bpc_neon:     1399.3   1068.9   1041.7
w_mask_444_w32_16bpc_neon:     5410.9   4139.7   4136.9
w_mask_444_w64_16bpc_neon:    13648.7  10011.8  10004.6
w_mask_444_w128_16bpc_neon:   35639.6  26910.8  25631.0
-
Martin Storsjö authored
Checkasm numbers:
Cortex                          A7       A8      A53      A72      A73
blend_h_w2_16bpc_neon:       190.0    163.0    135.5     67.4     71.2
blend_h_w4_16bpc_neon:       204.4    119.1    140.3     61.2     74.9
blend_h_w8_16bpc_neon:       247.6    126.2    159.5     86.1     88.4
blend_h_w16_16bpc_neon:      391.6    186.5    230.7    134.9    149.4
blend_h_w32_16bpc_neon:      734.9    354.2    454.1    248.1    270.9
blend_h_w64_16bpc_neon:     1290.8    611.7    801.1    456.6    491.3
blend_h_w128_16bpc_neon:    2876.4   1354.2   1788.6   1083.4   1092.0
blend_v_w2_16bpc_neon:       264.4    325.2    206.8    107.6    123.0
blend_v_w4_16bpc_neon:       471.8    358.7    356.9    187.0    229.9
blend_v_w8_16bpc_neon:       616.9    365.3    445.4    218.2    248.5
blend_v_w16_16bpc_neon:      928.3    517.1    629.1    325.0    358.0
blend_v_w32_16bpc_neon:     1771.6    790.1   1106.1    631.2    584.7
blend_w4_16bpc_neon:         128.8     66.6     95.5     33.5     42.0
blend_w8_16bpc_neon:         238.7    118.0    156.8     76.5     84.5
blend_w16_16bpc_neon:        809.7    360.9    482.3    268.5    298.3
blend_w32_16bpc_neon:       2015.7    916.6   1177.0    682.1    730.9

Corresponding numbers for arm64, for comparison:
Cortex                         A53      A72      A73
blend_h_w2_16bpc_neon:       109.3     83.1     56.8
blend_h_w4_16bpc_neon:       114.1     61.1     62.3
blend_h_w8_16bpc_neon:       133.3     80.8     81.0
blend_h_w16_16bpc_neon:      215.6    132.7    149.5
blend_h_w32_16bpc_neon:      390.4    253.9    235.8
blend_h_w64_16bpc_neon:      715.8    455.8    454.0
blend_h_w128_16bpc_neon:    1649.7   1034.7   1066.2
blend_v_w2_16bpc_neon:       185.9    176.3    178.3
blend_v_w4_16bpc_neon:       338.3    184.4    234.3
blend_v_w8_16bpc_neon:       427.0    214.5    252.7
blend_v_w16_16bpc_neon:      680.4    358.1    389.2
blend_v_w32_16bpc_neon:     1100.7    615.5    690.1
blend_w4_16bpc_neon:          76.0     32.3     32.1
blend_w8_16bpc_neon:         134.4     76.3     71.5
blend_w16_16bpc_neon:        476.3    268.8    301.5
blend_w32_16bpc_neon:       1226.8    659.9    782.8
-
Martin Storsjö authored
-
Martin Storsjö authored
-
Martin Storsjö authored
-
Martin Storsjö authored
This is one cycle faster, when the other lanes don't need to be preserved, on some (old) cores.
-
Martin Storsjö authored
-
Martin Storsjö authored
-
Martin Storsjö authored
-
- Dec 15, 2020
-
-
The default 30 second timeout may be insufficient when running under certain sanitizers, especially on slower CPUs.
-
Henrik Gramner authored
-
Martin Storsjö authored
This reverts commit 920079ed. Upstream meson reverted the breaking change (see https://github.com/mesonbuild/meson/issues/7493#issuecomment-729020325), so the forward-compatibility code now only produces warnings in builds with newer meson.
-
- Dec 12, 2020
-
-
Martin Storsjö authored
This operates on 4 pixels at a time, while the arm64 version operated on 8 pixels at a time. As the registers only fit a single 4 pixel wide slice (with a single set of input parameters and mask bits), the high level logic for calculating those input parameters is done with GPRs and scalar instructions instead of SIMD as in the other implementations.
-
Martin Storsjö authored
As the arm64 16 bpc loopfilter operates on an 8 pixel region at a time, inspect 2 bits (corresponding to 4 pixels each) from these registers, as we also shift them down by 2 bits at the end of the loop. This should allow skipping the loopfilter altogether (or using a smaller filter) in more cases.
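The mask handling described above can be illustrated with a small scalar C sketch. It is not the NEON code from the commit: the function name `example_blocks_to_filter` and the single `vmask` argument are hypothetical, and the sketch only assumes what the message states, namely that each mask bit covers 4 pixels and each loop iteration handles 8 pixels.

```c
#include <stdint.h>

/* Hypothetical helper: walk a filter mask in which each bit covers 4 pixels,
 * while each loop iteration handles an 8 pixel block (2 mask bits).
 * Returns a bitmap with bit i set if 8 pixel block i needs filtering. */
static uint32_t example_blocks_to_filter(uint32_t vmask)
{
    uint32_t blocks = 0;
    for (int i = 0; vmask; i++) {
        /* Inspect exactly the 2 bits that belong to this iteration. */
        if (vmask & 3)
            blocks |= UINT32_C(1) << i;
        /* Advance by the same 2 bits that were inspected. */
        vmask >>= 2;
    }
    return blocks;
}
```

Testing exactly as many bits as the loop later shifts out means a block is skipped whenever neither of its own two 4 pixel units is flagged, regardless of what the neighbouring blocks need.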
-
Martin Storsjö authored
-
Martin Storsjö authored
-
Henrik Gramner authored
The previous implementation did two separate passes in the horizontal and vertical directions, with the intermediate values being stored in a buffer on the stack. This caused bad cache thrashing. By interleaving the horizontal and vertical passes in combination with a ring buffer for storing only a few rows at a time the performance is improved by a significant amount. Also split the function into 7-tap and 5-tap versions. The latter is faster and fairly common (always for chroma, sometimes for luma).
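The difference between the two approaches can be sketched in plain C. This is a simplified illustration, not the x86 assembly from the commit: it uses a 3-tap [1 2 1] kernel instead of the 7-tap and 5-tap Wiener filters, and all names (`filter_two_pass`, `filter_ring`, `MAX_W`, `MAX_H`) are made up for the example.

```c
#include <stdint.h>

enum { TAPS = 3, MAX_W = 16, MAX_H = 16 };

/* Horizontal [1 2 1] pass over one row, replicating the edge pixels. */
static void filter_h_row(const uint8_t *src, int w, int32_t *dst)
{
    for (int x = 0; x < w; x++) {
        const int l = src[x > 0 ? x - 1 : 0];
        const int r = src[x < w - 1 ? x + 1 : w - 1];
        dst[x] = l + 2 * src[x] + r;
    }
}

/* Vertical [1 2 1] pass for one pixel, with rounding and normalisation. */
static uint8_t filter_v_px(const int32_t *a, const int32_t *b,
                           const int32_t *c, int x)
{
    return (uint8_t)((a[x] + 2 * b[x] + c[x] + 8) >> 4);
}

/* Two separate passes: the whole intermediate image lives on the stack.
 * Requires w <= MAX_W and h <= MAX_H. */
static void filter_two_pass(const uint8_t *src, int w, int h, uint8_t *dst)
{
    int32_t tmp[MAX_H][MAX_W];
    for (int y = 0; y < h; y++)
        filter_h_row(src + y * w, w, tmp[y]);
    for (int y = 0; y < h; y++) {
        const int top = y > 0 ? y - 1 : 0;
        const int bot = y < h - 1 ? y + 1 : h - 1;
        for (int x = 0; x < w; x++)
            dst[y * w + x] = filter_v_px(tmp[top], tmp[y], tmp[bot], x);
    }
}

/* Interleaved passes: only TAPS filtered rows are kept, in a ring buffer.
 * Requires w <= MAX_W. */
static void filter_ring(const uint8_t *src, int w, int h, uint8_t *dst)
{
    int32_t ring[TAPS][MAX_W];
    int next = 0; /* next source row that still needs the horizontal pass */
    for (int y = 0; y < h; y++) {
        const int top = y > 0 ? y - 1 : 0;
        const int bot = y < h - 1 ? y + 1 : h - 1;
        /* Filter new rows on demand, overwriting the oldest ring slot. */
        while (next <= bot) {
            filter_h_row(src + next * w, w, ring[next % TAPS]);
            next++;
        }
        for (int x = 0; x < w; x++)
            dst[y * w + x] = filter_v_px(ring[top % TAPS], ring[y % TAPS],
                                         ring[bot % TAPS], x);
    }
}
```

Both functions produce the same output; the ring-buffer version just keeps its working set at `TAPS` rows rather than the full block height, which is the property the commit relies on to avoid cache thrashing.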
-
Henrik Gramner authored
It contains both SSE2 and SSSE3 code.
-
Henrik Gramner authored
Combine horizontal and vertical filter pointers into a single parameter when calling the wiener DSP function. Eliminate the +128 filter coefficient handling where possible.
-
Henrik Gramner authored
Reduces memory usage by 96 bytes per sb.
-
Henrik Gramner authored
-
- Dec 11, 2020
-
-
Covers the use case of keeping a reference to a Dav1dPicture after closing the decoder.
-
- Dec 10, 2020
-
-
9057d286 had the side effect of causing references to buffers allocated using memory pools to no longer be valid after closing the decoder. Restore this functionality by making buffer pools reference counted.
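A minimal sketch of what a reference-counted buffer pool could look like follows. It is not dav1d's implementation: every name here (`ExamplePool`, `example_pool_get`, and so on) is hypothetical, the counter is a plain `int` so the sketch is single-threaded only, and `example_pools_destroyed` exists purely to make the lifetime observable.

```c
#include <stdlib.h>

typedef struct ExamplePool ExamplePool;

typedef struct ExampleBuf {
    ExamplePool *pool;       /* pool this buffer belongs to */
    struct ExampleBuf *next; /* free-list link while the buffer is idle */
    unsigned char data[64];
} ExampleBuf;

struct ExamplePool {
    ExampleBuf *free_list;
    int ref_cnt; /* 1 for the owner, plus 1 per buffer handed out */
};

static int example_pools_destroyed; /* demonstration only */

static ExamplePool *example_pool_create(void)
{
    ExamplePool *p = calloc(1, sizeof(*p));
    if (p) p->ref_cnt = 1;
    return p;
}

/* Drop one reference; the last one frees idle buffers and the pool. */
static void example_pool_unref(ExamplePool *p)
{
    if (--p->ref_cnt > 0) return;
    while (p->free_list) {
        ExampleBuf *b = p->free_list;
        p->free_list = b->next;
        free(b);
    }
    free(p);
    example_pools_destroyed++;
}

/* Take a buffer from the pool; it holds a pool reference while in use. */
static ExampleBuf *example_pool_get(ExamplePool *p)
{
    ExampleBuf *b = p->free_list;
    if (b) p->free_list = b->next;
    else if (!(b = malloc(sizeof(*b)))) return NULL;
    b->pool = p;
    b->next = NULL;
    p->ref_cnt++;
    return b;
}

/* Return a buffer; this may release the pool if the owner is already gone. */
static void example_pool_put(ExampleBuf *b)
{
    ExamplePool *p = b->pool;
    b->next = p->free_list;
    p->free_list = b;
    example_pool_unref(p);
}

/* Called by the owner (the decoder in the text above) when it shuts down. */
static void example_pool_close(ExamplePool *p)
{
    example_pool_unref(p);
}
```

With this layout, a buffer obtained before `example_pool_close()` stays valid afterwards, and the pool itself is freed only when that buffer is returned. Real multi-threaded code would need an atomic counter and a lock around the free list.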
-
- Dec 01, 2020
-
-
Martin Storsjö authored
Checkasm numbers:
Cortex                              A7         A8        A53        A72        A73
selfguided_3x3_10bpc_neon:    919127.6   717942.8   565717.8   404748.0   372179.8
selfguided_5x5_10bpc_neon:    640310.8   511873.4   370653.3   273593.7   256403.2
selfguided_mix_10bpc_neon:   1533887.0  1252389.5   922111.1   659033.4   613410.6

Corresponding numbers for arm64, for comparison:
Cortex                             A53        A72        A73
selfguided_3x3_10bpc_neon:    500706.0   367199.2   345261.2
selfguided_5x5_10bpc_neon:    361403.3   270550.0   249955.3
selfguided_mix_10bpc_neon:    846172.4   623590.3   578404.8
-
Martin Storsjö authored
looprestoration_common.S contains functions that can be used as is with one single instantiation of the functions for both 8 and 16 bpc. This file will be built once, regardless of which bitdepths are enabled. looprestoration_tmpl.S contains functions where the source can be shared and templated between 8 and 16 bpc. This will be included by the separate 8/16bpc implementation files.
-
Martin Storsjö authored
A number of other similar comments were updated to say pixels when the 16 bpc code was written originally, but these were missed.
-
Martin Storsjö authored
Make it consistent with the weighted1 function.
-
Martin Storsjö authored
For the existing 8 bpc support, there's no stack argument to load into r8.
-
Martin Storsjö authored
-
Martin Storsjö authored
Instead of calculating squares of pixels once, and shifting and adding the precalculated squares, just do multiply-accumulate of the pixels that are shifted anyway for the non-squared sum. This results in more multiplications in total, but fewer instructions, and multiplications aren't that much more expensive than regular arithmetic operations anyway.

On Cortex A53 and A72, this is a fairly substantial gain; on A73 it's a very marginal gain. The runtimes for the box3/5_h functions themselves are reduced by around 16-20%, and the overall runtime for SGR is reduced by around 2-8%.

Before:
Cortex                             A53        A72        A73
selfguided_3x3_10bpc_neon:    513086.5   385767.7   348774.3
selfguided_5x5_10bpc_neon:    378108.6   291133.5   253251.4
selfguided_mix_10bpc_neon:    876833.1   662801.0   586387.4

After:
Cortex                             A53        A72        A73
selfguided_3x3_10bpc_neon:    502734.0   363754.5   343199.8
selfguided_5x5_10bpc_neon:    361696.4   265848.2   249476.8
selfguided_mix_10bpc_neon:    852683.8   615848.6   577615.0
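As a scalar illustration of the two strategies, here is a sketch of a horizontal 3-pixel box sum written both ways. It is not the NEON code: the function names and the `sq_tmp` scratch buffer are hypothetical, and on scalar C the instruction-count argument from the message does not carry over directly. The point is only that both forms compute the same sums.

```c
#include <stddef.h>
#include <stdint.h>

/* Variant A: square every pixel once, then add neighbouring squares.
 * sq_tmp is a caller-provided scratch buffer of at least w entries. */
static void box3_h_presquared(const uint16_t *src, size_t w, uint32_t *sq_tmp,
                              uint32_t *sum, uint32_t *sumsq)
{
    for (size_t x = 0; x < w; x++)
        sq_tmp[x] = (uint32_t)src[x] * src[x];
    for (size_t x = 0; x + 2 < w; x++) {
        sum[x]   = (uint32_t)src[x] + src[x + 1] + src[x + 2];
        sumsq[x] = sq_tmp[x] + sq_tmp[x + 1] + sq_tmp[x + 2];
    }
}

/* Variant B: reuse the shifted pixels that are loaded for the plain sum and
 * multiply-accumulate them directly. More multiplications, no scratch data. */
static void box3_h_mac(const uint16_t *src, size_t w,
                       uint32_t *sum, uint32_t *sumsq)
{
    for (size_t x = 0; x + 2 < w; x++) {
        const uint32_t a = src[x], b = src[x + 1], c = src[x + 2];
        uint32_t acc = a * a; /* multiply */
        acc += b * b;         /* multiply-accumulate */
        acc += c * c;         /* multiply-accumulate */
        sum[x]   = a + b + c;
        sumsq[x] = acc;
    }
}
```

For 10 bpc input each square is below 2^20, so three of them fit comfortably in a 32-bit accumulator.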
-
- Nov 28, 2020
-
-
- Nov 23, 2020
-
-
Jean-Baptiste Kempf authored
-
- Nov 22, 2020
-
-
Add buffer pools for miscellaneous smaller buffers that are repeatedly being freed and reallocated. Also improve dav1d_ref_create() by consolidating two separate memory allocations into a single one.
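The allocation consolidation mentioned for dav1d_ref_create() can be sketched as follows. This is only an illustration of the general technique of placing the bookkeeping struct and the payload in one allocation; `ExampleRef` and its functions are hypothetical and do not mirror the actual Dav1dRef layout, alignment, or atomics.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reference header; the payload follows it in the same block. */
typedef struct ExampleRef {
    void  *data;    /* points into the same allocation as the header */
    size_t size;    /* payload size in bytes */
    int    ref_cnt; /* plain int: single-threaded sketch only */
} ExampleRef;

/* One malloc() for header and payload instead of two separate allocations. */
static ExampleRef *example_ref_create(size_t size)
{
    /* Round the header up to 16 bytes so the payload stays 16-byte aligned
     * (assumes malloc() returns memory aligned to at least 16 bytes). */
    const size_t hdr = (sizeof(ExampleRef) + 15) & ~(size_t)15;
    if (size > SIZE_MAX - hdr) return NULL; /* overflow guard */

    uint8_t *mem = malloc(hdr + size);
    if (!mem) return NULL;

    ExampleRef *ref = (ExampleRef *)mem;
    ref->data    = mem + hdr;
    ref->size    = size;
    ref->ref_cnt = 1;
    return ref;
}

static void example_ref_inc(ExampleRef *ref)
{
    ref->ref_cnt++;
}

/* Dropping the last reference releases header and payload with one free(). */
static void example_ref_dec(ExampleRef *ref)
{
    if (--ref->ref_cnt == 0)
        free(ref);
}
```

Besides saving one malloc()/free() pair per reference, a single allocation keeps the header and the data adjacent in memory.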
-
- Nov 20, 2020
-
-
Martin Storsjö authored
Checkasm benchmarks:
Cortex                       A7       A8      A53      A72      A73
warp_8x8_16bpc_neon:     4062.6   2109.4   2462.0   1338.9   1391.1
warp_8x8t_16bpc_neon:    3996.3   2102.4   2412.0   1273.8   1368.9

Corresponding numbers for arm64, for comparison:
Cortex                      A53      A72      A73
warp_8x8_16bpc_neon:     2037.0   1148.8   1222.0
warp_8x8t_16bpc_neon:    2008.0   1120.4   1200.9
-
Martin Storsjö authored
Use a shared template file for assembly functions that can be templated into 8 and 16 bpc forms, just like in the arm64 version.

Checkasm benchmarks:
Cortex                             A7       A8      A53      A72      A73
cdef_dir_16bpc_neon:            975.9    853.2    555.2    378.7    386.9
cdef_filter_4x4_16bpc_neon:     746.9    521.7    481.2    333.0    340.8
cdef_filter_4x8_16bpc_neon:    1300.0    885.5    816.3    582.7    599.5
cdef_filter_8x8_16bpc_neon:    2282.5   1415.0   1417.6   1059.0   1076.3

Corresponding numbers for arm64, for comparison:
Cortex                            A53      A72      A73
cdef_dir_16bpc_neon:            418.0    306.7    310.7
cdef_filter_4x4_16bpc_neon:     453.4    282.9    297.4
cdef_filter_4x8_16bpc_neon:     807.5    514.2    533.8
cdef_filter_8x8_16bpc_neon:    1425.2    924.4    942.0
-
Martin Storsjö authored
-