- Feb 18, 2020
-
-
Janne Grunau authored
-
- Feb 17, 2020
-
-
Martin Storsjö authored
This increases the code size by around 3 KB on arm64.

Before:
ARM32:                     Cortex A7     A8      A9    A53    A72    A73
cdef_filter_4x4_8bpc_neon:     807.1  517.0   617.7  506.6  429.9  357.8
cdef_filter_4x8_8bpc_neon:    1407.9  899.3  1054.6  862.3  726.5  628.1
cdef_filter_8x8_8bpc_neon:    2394.9 1456.8  1676.8 1461.2 1084.4 1101.2
ARM64:
cdef_filter_4x4_8bpc_neon:                           460.7  301.8  308.0
cdef_filter_4x8_8bpc_neon:                           831.6  547.0  555.2
cdef_filter_8x8_8bpc_neon:                          1454.6  935.6  960.4

After:
ARM32:
cdef_filter_4x4_8bpc_neon:     669.3  541.3   524.4  424.9  322.7  298.1
cdef_filter_4x8_8bpc_neon:    1159.1  922.9   881.1  709.2  538.3  514.1
cdef_filter_8x8_8bpc_neon:    1888.8 1285.4  1358.5 1152.9  839.3  871.2
ARM64:
cdef_filter_4x4_8bpc_neon:                           383.6  262.1  259.9
cdef_filter_4x8_8bpc_neon:                           684.9  472.2  464.7
cdef_filter_8x8_8bpc_neon:                          1160.0  756.8  788.0

(The checkasm benchmark averages three different cases; the fully edged case is one of those three, and it is the most common case in actual video. The difference is much bigger when benchmarking only that particular case.)

This apparently makes the code slightly slower for the w=4 cases on Cortex A8, while being a significant speedup on all other cores.
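The gist of the change, sketched below in C (the real code is NEON assembly; the CDEF_HAVE_* flags match dav1d's src/cdef.h, while the filter functions are hypothetical stand-ins): when all four edge flags are set, no neighboring pixels are missing, so the filter can skip the edge-padding logic and run unconditional loads.

    #include <stddef.h>
    #include <stdint.h>

    /* Edge flags, as in dav1d's src/cdef.h. */
    enum CdefEdgeFlags {
        CDEF_HAVE_LEFT   = 1 << 0,
        CDEF_HAVE_RIGHT  = 1 << 1,
        CDEF_HAVE_TOP    = 1 << 2,
        CDEF_HAVE_BOTTOM = 1 << 3,
    };

    #define CDEF_HAVE_ALL \
        (CDEF_HAVE_LEFT | CDEF_HAVE_RIGHT | CDEF_HAVE_TOP | CDEF_HAVE_BOTTOM)

    /* Hypothetical stand-in: pad missing edges with a sentinel value,
     * then filter with per-pixel availability handling. */
    static void filter_with_padding(uint8_t *dst, ptrdiff_t stride,
                                    int w, int h, enum CdefEdgeFlags edges)
    {
        (void)dst; (void)stride; (void)w; (void)h; (void)edges;
    }

    /* Hypothetical stand-in: every neighbor exists, so loads can run
     * unconditionally and no padding pass is needed. */
    static void filter_fully_edged(uint8_t *dst, ptrdiff_t stride,
                                   int w, int h)
    {
        (void)dst; (void)stride; (void)w; (void)h;
    }

    void cdef_filter_block(uint8_t *dst, ptrdiff_t stride, int w, int h,
                           enum CdefEdgeFlags edges)
    {
        if (edges == CDEF_HAVE_ALL)
            filter_fully_edged(dst, stride, w, h); /* common case */
        else
            filter_with_padding(dst, stride, w, h, edges);
    }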
-
Martin Storsjö authored
The signedness of elements doesn't matter for vsub; match the vsub.i16 next to it.
-
- Feb 16, 2020
-
-
Henrik Gramner authored
Console output is incredibly slow on Windows, which is aggravated by the lack of line buffering. As a result, a significant percentage of overall runtime is actually spent displaying the decoding progress. Doing the line buffering manually alleviates most of the issue.
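A minimal sketch of the idea in C (the function and format string are illustrative, not dav1d's actual code): format the whole status line into a local buffer and hand it to the console in a single write per update, rather than one write per fragment.

    #include <stdio.h>

    /* Hypothetical progress printer: one write per update instead of many
     * small unbuffered writes, which is what makes the console slow. */
    static void print_progress(const unsigned decoded, const unsigned total,
                               const double fps)
    {
        char buf[128];
        const int n = snprintf(buf, sizeof(buf),
                               "\rDecoded %u/%u frames (%.1f fps)",
                               decoded, total, fps);
        if (n > 0) {
            const size_t len = (size_t)n < sizeof(buf) ? (size_t)n
                                                       : sizeof(buf) - 1;
            fwrite(buf, 1, len, stderr);
        }
    }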
-
- Feb 13, 2020
-
-
Martin Storsjö authored
These were missed in 361a3c8e.
-
- Feb 11, 2020
-
-
Martin Storsjö authored
This only supports 10 bpc, not 12 bpc, as the sum and tmp buffers can be int16_t for 10 bpc, but need to be int32_t for 12 bpc.

Make actual templates out of the functions in looprestoration_tmpl.S, and add box3/5_h to looprestoration16.S.

Extend dav1d_sgr_calc_abX_neon with a mandatory bitdepth_max parameter (which is passed even in 8bpc mode), add a define to bitdepth.h for passing such a parameter in all modes. This makes this function a few instructions slower in 8bpc mode than it was before (overall impact seems to be around 1% of the total runtime of SGR), but allows using the same actual function instantiation for all modes, saving a bit of code size.

Examples of checkasm runtimes:
                           Cortex A53      A72      A73
selfguided_3x3_10bpc_neon:   516755.8 389412.7 349058.7
selfguided_5x5_10bpc_neon:   380699.9 293486.6 254591.6
selfguided_mix_10bpc_neon:   878142.3 667495.9 587844.6

Corresponding 8 bpc numbers for comparison:
selfguided_3x3_8bpc_neon:    491058.1 361473.4 347705.9
selfguided_5x5_8bpc_neon:    352655.0 266423.7 248192.2
selfguided_mix_8bpc_neon:    826094.1 612372.2 581943.1
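The suffix-macro pattern looks roughly like the sketch below — a simplified reconstruction, not copied from bitdepth.h (dav1d does use names like HIGHBD_DECL_SUFFIX, but the macro bodies and the helper functions here are illustrative):

    #include <stdint.h>

    #define BITDEPTH 16            /* set to 8 or 16 per template build */

    #if BITDEPTH == 8
    typedef uint8_t pixel;
    #define HIGHBD_DECL_SUFFIX     /* no trailing parameter in 8bpc mode */
    #define HIGHBD_TAIL_SUFFIX     /* no trailing argument in 8bpc mode */
    #define BITDEPTH_MAX 0xff      /* compile-time constant for 8bpc */
    #else
    typedef uint16_t pixel;
    #define HIGHBD_DECL_SUFFIX , const int bitdepth_max
    #define HIGHBD_TAIL_SUFFIX , bitdepth_max
    #define BITDEPTH_MAX bitdepth_max  /* runtime value for 10/12 bpc */
    #endif

    /* Most DSP functions only take the parameter in high bit depth: */
    void some_filter_neon(pixel *dst, int w, int h HIGHBD_DECL_SUFFIX);

    /* A single-instantiation function shared by all modes takes it
     * unconditionally; callers pass BITDEPTH_MAX, so the 8bpc template
     * hands over the constant 0xff: */
    static void shared_calc(int16_t *buf, int w, int h, int bitdepth_max)
    {
        (void)buf; (void)w; (void)h; (void)bitdepth_max;
    }

    static inline void call_shared(int16_t *buf, int w, int h
                                   HIGHBD_DECL_SUFFIX)
    {
        shared_calc(buf, w, h, BITDEPTH_MAX);
    }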
-
Martin Storsjö authored
looprestoration_common.S contains functions that can be used as is, with a single instantiation for both 8 and 16 bpc. This file will be built once, regardless of which bitdepths are enabled.

looprestoration_tmpl.S contains functions where the source can be shared and templated between 8 and 16 bpc. It will be included by the separate 8/16 bpc implementation files.
-
Martin Storsjö authored
Don't add it to dav1d_sgr_calc_ab1/2_neon and box3/5_v, as those functions can share the same concrete implementations between 8 and 16 bpc.
-
Martin Storsjö authored
This allows using completely different codepaths for 10 and 12 bpc, or just adding SIMD functions for either of them.
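In DSP init terms, this means the runtime bpc value (not just the compile-time BITDEPTH template) selects the function pointers. A hedged sketch with hypothetical function and struct names:

    #include <stdint.h>

    typedef struct LoopRestorationDSP {
        void (*wiener)(uint16_t *dst, int w, int h, int bitdepth_max);
    } LoopRestorationDSP;

    /* Generic C fallback (stub here). */
    static void wiener_c(uint16_t *dst, int w, int h, int bitdepth_max)
    { (void)dst; (void)w; (void)h; (void)bitdepth_max; }

    /* 10 bpc-only NEON implementation (stub here). */
    static void wiener_10bpc_neon(uint16_t *dst, int w, int h,
                                  int bitdepth_max)
    { (void)dst; (void)w; (void)h; (void)bitdepth_max; }

    static void loop_restoration_dsp_init_arm(LoopRestorationDSP *const c,
                                              const int bpc)
    {
        c->wiener = wiener_c;              /* default: C for any bpc */
        if (bpc == 10)
            c->wiener = wiener_10bpc_neon; /* 12 bpc keeps the C path */
    }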
-
Martin Storsjö authored
Set flags further from the branch instructions that use them.
-
Martin Storsjö authored
For 8bpc and 10bpc, int16_t is enough here; for 12bpc, the other intermediate int16_t buffers would also need to be made coef-sized anyway.
-
Martin Storsjö authored
-
- Feb 10, 2020
-
-
Jean-Baptiste Kempf authored
-
Only copy as much as is actually needed/used.
-
It was already done this way for w32/64. Not doing it for w16, as it didn't help there (and instead gave a small slowdown due to the two setup instructions). This gives a small speedup on in-order cores like A53.

Before:            Cortex A53   A72   A73
avg_w4_8bpc_neon:        60.9  25.6  29.0
avg_w8_8bpc_neon:       143.6  52.8  64.0
After:
avg_w4_8bpc_neon:        56.7  26.7  28.5
avg_w8_8bpc_neon:       137.2  54.5  64.4
-
This shortens the source by 40 lines, and gives a significant speedup on A53, a small speedup on A72 and a very minor slowdown for avg/w_avg on A73.

Before:              Cortex A53    A72    A73
avg_w4_8bpc_neon:          67.4   26.1   25.4
avg_w8_8bpc_neon:         158.7   56.3   59.1
avg_w16_8bpc_neon:        382.9  154.1  160.7
w_avg_w4_8bpc_neon:        99.9   43.6   39.4
w_avg_w8_8bpc_neon:       253.2   98.3   99.0
w_avg_w16_8bpc_neon:      543.1  285.0  301.8
mask_w4_8bpc_neon:        110.6   51.4   45.1
mask_w8_8bpc_neon:        295.0  129.9  114.0
mask_w16_8bpc_neon:       654.6  365.8  369.7
After:
avg_w4_8bpc_neon:          60.8   26.3   29.0
avg_w8_8bpc_neon:         142.8   52.9   64.1
avg_w16_8bpc_neon:        378.2  153.4  160.8
w_avg_w4_8bpc_neon:        78.7   41.0   40.9
w_avg_w8_8bpc_neon:       190.6   90.1  105.1
w_avg_w16_8bpc_neon:      531.1  279.3  301.4
mask_w4_8bpc_neon:         86.6   47.2   44.9
mask_w8_8bpc_neon:        222.0  114.3  114.9
mask_w16_8bpc_neon:       639.5  356.0  369.8
-
Jean-Baptiste Kempf authored
-
- Feb 08, 2020
-
-
Martin Storsjö authored
Checkasm benchmark numbers:
                      Cortex A53     A72     A73
warp_8x8_16bpc_neon:      2029.9  1150.5  1225.2
warp_8x8t_16bpc_neon:     2007.6  1129.0  1192.3

Corresponding numbers for 8bpc for comparison:
warp_8x8_8bpc_neon:       1863.8  1052.8  1106.2
warp_8x8t_8bpc_neon:      1847.4  1048.3  1099.8
-
- Feb 07, 2020
-
-
Martin Storsjö authored
As some functions are made for both 8bpc and 16bpc from a shared template, those functions are moved to a separate assembly file which is included. That assembly file (cdef_tmpl.S) isn't intended to be assembled on its own (just like utils.S), but if it is assembled, it should produce an empty object file.

Checkasm benchmarks:
                            Cortex A53     A72     A73
cdef_dir_16bpc_neon:             422.7   305.5   314.0
cdef_filter_4x4_16bpc_neon:      452.9   282.7   296.6
cdef_filter_4x8_16bpc_neon:      800.9   515.3   534.1
cdef_filter_8x8_16bpc_neon:     1417.1   922.7   942.6

Corresponding numbers for 8bpc for comparison:
cdef_dir_8bpc_neon:              394.7   268.8   281.8
cdef_filter_4x4_8bpc_neon:       461.5   300.9   307.7
cdef_filter_4x8_8bpc_neon:       831.6   546.1   555.6
cdef_filter_8x8_8bpc_neon:      1454.6   934.0   960.0
-
Martin Storsjö authored
-
- Feb 06, 2020
-
-
Henrik Gramner authored
-
Henrik Gramner authored
The ar_coeff_shift element needs to have a 16-byte alignment on x86.
-
Martin Storsjö authored
Checkasm benchmarks:
                           Cortex A53      A72      A73
wiener_chroma_16bpc_neon:    190288.4 129369.5 127284.1
wiener_luma_16bpc_neon:      195108.4 129387.8 127042.7

The corresponding numbers for 8 bpc for comparison:
wiener_chroma_8bpc_neon:     150586.9 101647.1  97709.9
wiener_luma_8bpc_neon:       146297.4 101593.2  97670.5
-
Martin Storsjö authored
Use HIGHBD_DECL_SUFFIX and HIGHBD_TAIL_SUFFIX where necessary, add a missing sizeof(pixel).
-
Martin Storsjö authored
-
Martin Storsjö authored
-
Martin Storsjö authored
Examples of checkasm benchmarks:
                                   Cortex A53    A72    A73
mc_8tap_regular_w8_0_16bpc_neon:         96.8   49.6   62.5
mc_8tap_regular_w8_h_16bpc_neon:        570.3  388.0  467.2
mc_8tap_regular_w8_hv_16bpc_neon:      1035.8  776.7  891.3
mc_8tap_regular_w8_v_16bpc_neon:        400.6  285.0  278.3
mc_bilinear_w8_0_16bpc_neon:             90.0   44.8   57.8
mc_bilinear_w8_h_16bpc_neon:            191.2  158.7  156.4
mc_bilinear_w8_hv_16bpc_neon:           295.9  234.6  244.9
mc_bilinear_w8_v_16bpc_neon:            147.2   98.7   89.2
mct_8tap_regular_w8_0_16bpc_neon:       139.4   78.7   84.9
mct_8tap_regular_w8_h_16bpc_neon:       612.5  396.8  479.1
mct_8tap_regular_w8_hv_16bpc_neon:     1112.4  814.6  963.2
mct_8tap_regular_w8_v_16bpc_neon:       461.8  370.8  353.4
mct_bilinear_w8_0_16bpc_neon:           135.6   76.2   80.5
mct_bilinear_w8_h_16bpc_neon:           211.3  159.4  141.7
mct_bilinear_w8_hv_16bpc_neon:          325.7  237.2  227.2
mct_bilinear_w8_v_16bpc_neon:           180.7  135.9  129.5

For comparison, the corresponding numbers for 8 bpc:
mc_8tap_regular_w8_0_8bpc_neon:          78.6   41.0   39.5
mc_8tap_regular_w8_h_8bpc_neon:         371.2  299.6  348.3
mc_8tap_regular_w8_hv_8bpc_neon:        817.1  675.0  726.5
mc_8tap_regular_w8_v_8bpc_neon:         243.7  260.4  253.0
mc_bilinear_w8_0_8bpc_neon:              74.8   35.4   36.1
mc_bilinear_w8_h_8bpc_neon:             179.9   69.9   79.2
mc_bilinear_w8_hv_8bpc_neon:            210.8  132.4  144.8
mc_bilinear_w8_v_8bpc_neon:             141.6   64.9   65.4
mct_8tap_regular_w8_0_8bpc_neon:        101.7   54.4   59.5
mct_8tap_regular_w8_h_8bpc_neon:        391.3  329.1  358.3
mct_8tap_regular_w8_hv_8bpc_neon:       880.4  754.9  829.4
mct_8tap_regular_w8_v_8bpc_neon:        270.8  300.8  277.4
mct_bilinear_w8_0_8bpc_neon:             97.6   54.0   55.4
mct_bilinear_w8_h_8bpc_neon:            173.3   73.5   79.5
mct_bilinear_w8_hv_8bpc_neon:           228.3  163.0  174.0
mct_bilinear_w8_v_8bpc_neon:            128.9   72.5   63.3
-
- Feb 05, 2020
- Feb 04, 2020
-
-
Martin Storsjö authored
                       Cortex A53     A72     A73
avg_w4_16bpc_neon:           78.2    43.2    48.9
avg_w8_16bpc_neon:          199.1   108.7   123.1
avg_w16_16bpc_neon:         615.6   339.9   373.9
avg_w32_16bpc_neon:        2313.0  1390.6  1490.6
avg_w64_16bpc_neon:        5783.6  3119.5  3653.0
avg_w128_16bpc_neon:      15444.6  8168.7  8907.9
w_avg_w4_16bpc_neon:        120.1    87.8    92.4
w_avg_w8_16bpc_neon:        321.6   252.4   263.1
w_avg_w16_16bpc_neon:      1017.5   794.5   831.2
w_avg_w32_16bpc_neon:      3911.4  3154.7  3306.5
w_avg_w64_16bpc_neon:      9977.9  7794.9  8022.3
w_avg_w128_16bpc_neon:    25721.5 19274.6 20041.7
mask_w4_16bpc_neon:         139.5    96.5   104.3
mask_w8_16bpc_neon:         376.0   283.9   300.1
mask_w16_16bpc_neon:       1217.2   906.7   950.0
mask_w32_16bpc_neon:       4811.1  3669.0  3901.3
mask_w64_16bpc_neon:      12036.4  8918.4  9244.8
mask_w128_16bpc_neon:     30888.8 21999.0 23206.7

For comparison, these are the corresponding numbers for 8bpc:
avg_w4_8bpc_neon:            56.7    26.2    28.5
avg_w8_8bpc_neon:           137.2    52.8    64.3
avg_w16_8bpc_neon:          377.9   151.5   161.6
avg_w32_8bpc_neon:         1528.9   614.5   633.9
avg_w64_8bpc_neon:         3792.5  1814.3  1518.3
avg_w128_8bpc_neon:       10685.3  5220.4  3879.9
w_avg_w4_8bpc_neon:          75.2    53.0    41.1
w_avg_w8_8bpc_neon:         186.7   120.1   105.2
w_avg_w16_8bpc_neon:        531.6   314.1   302.1
w_avg_w32_8bpc_neon:       2138.4  1120.4  1171.5
w_avg_w64_8bpc_neon:       5151.9  2910.5  2857.1
w_avg_w128_8bpc_neon:     13945.0  7330.5  7389.1
mask_w4_8bpc_neon:           82.0    47.2    45.1
mask_w8_8bpc_neon:          213.5   115.4   115.8
mask_w16_8bpc_neon:         639.8   356.2   370.1
mask_w32_8bpc_neon:        2566.9  1489.8  1435.0
mask_w64_8bpc_neon:        6727.6  3822.8  3425.2
mask_w128_8bpc_neon:      17893.0  9622.6  9161.3
-
Martin Storsjö authored
-
Martin Storsjö authored
-
- Feb 03, 2020
-
-
On many AMD CPUs, cmov instructions that depend on multiple flags require an additional µop, so prefer cmov variants that only depend on a single flag where possible (e.g. cmovz, which reads only ZF, rather than cmovbe, which reads both CF and ZF).
-
Shave off a few instructions, or save a few bytes, in various places. Also change some instructions to use appropriately sized registers.
-
Using signed comparisons could theoretically cause the wrong result in some obscure corner cases on x86-32 with PAE. x86-64 should be fine with either, but unsigned is technically more correct.
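The failure mode in miniature, with made-up address values (the real code compares registers in assembly, where the distinction is jb/jae versus jl/jge):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Two hypothetical addresses straddling the 2 GB boundary; on
         * x86-32 with PAE-era address-space layouts, both can be valid
         * user-space pointers. */
        const uint32_t lo = 0x7ffffff0u, hi = 0x80000010u;

        /* Unsigned comparison orders them correctly... */
        printf("unsigned: lo < hi -> %d\n", lo < hi);
        /* ...but reinterpreted as signed (on the usual two's-complement
         * targets), hi wraps negative and the comparison flips. */
        printf("signed:   lo < hi -> %d\n", (int32_t)lo < (int32_t)hi);
        return 0;
    }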
-
- Feb 01, 2020
-
-
Henrik Gramner authored
Avoids some pointer chasing and simplifies the DSP code, at the cost of making the initialization a little bit more complicated. Also reduces memory usage by a small amount due to properly sizing the buffers instead of always allocating enough space for 4:4:4.
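The buffer sizing boils down to deriving the chroma subsampling from the pixel layout; a standalone sketch with simplified names (dav1d's enum is Dav1dPixelLayout, and this helper is hypothetical):

    #include <stddef.h>

    enum PixelLayout { LAYOUT_I400, LAYOUT_I420, LAYOUT_I422, LAYOUT_I444 };

    /* Hypothetical sizing helper: allocate each chroma plane according to
     * the actual subsampling instead of reserving the 4:4:4 worst case. */
    static size_t chroma_plane_size(const enum PixelLayout layout,
                                    const int w, const int h,
                                    const size_t bytes_per_pixel)
    {
        if (layout == LAYOUT_I400) return 0;       /* no chroma planes */
        const int ss_hor = layout != LAYOUT_I444;  /* 4:2:0 and 4:2:2 */
        const int ss_ver = layout == LAYOUT_I420;  /* 4:2:0 only */
        return (size_t)((w + ss_hor) >> ss_hor) *  /* round odd dims up */
               (size_t)((h + ss_ver) >> ss_ver) * bytes_per_pixel;
    }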
-
Henrik Gramner authored
-
Henrik Gramner authored
We specify most strides in bytes, but since C defines offsets in multiples of sizeof(type), we use the PXSTRIDE() macro to downshift the strides by one in high bit depth templated files. This, however, means that the compiler is required to mask away the least significant bit, because it could in theory be non-zero. Avoid that by telling the compiler (when compiled in release mode) that the lsb is in fact guaranteed to always be zero.
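A sketch of the technique in C, assuming an NDEBUG-gated release build and a GCC/Clang-style __builtin_unreachable() (the actual macro may differ in detail): in debug builds the invariant is checked, while in release builds it becomes a pure optimizer hint that lets the compiler drop the masking.

    #include <assert.h>
    #include <stddef.h>

    /* Convert a byte stride to a pixel stride for 16 bpc code. The stride
     * is always even here, and this tells the compiler so. */
    static inline ptrdiff_t PXSTRIDE(const ptrdiff_t x)
    {
    #ifdef NDEBUG
        /* Release: if the lsb were set this point would be unreachable,
         * so the compiler may assume (x & 1) == 0 and skip the masking. */
        if (x & 1) __builtin_unreachable();
    #else
        /* Debug: actually verify the invariant. */
        assert(!(x & 1));
    #endif
        return x >> 1;
    }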
-
- Jan 29, 2020
-
-
Henrik Gramner authored
Required for AVX-512.
-