- Feb 11, 2020
  - David Smith authored
  - David Smith authored
  - David Smith authored
  - David Smith authored

- Feb 10, 2020
  - Jean-Baptiste Kempf authored
  - Only copy as much as is actually needed/used.
  - It was already done this way for w32/w64. It is not done for w16, as it didn't help there (and instead gave a small slowdown due to the two setup instructions). This gives a small speedup on in-order cores like the A53.
    Before:                Cortex A53     A72     A73
    avg_w4_8bpc_neon:            60.9    25.6    29.0
    avg_w8_8bpc_neon:           143.6    52.8    64.0
    After:
    avg_w4_8bpc_neon:            56.7    26.7    28.5
    avg_w8_8bpc_neon:           137.2    54.5    64.4
  - This shortens the source by 40 lines and gives a significant speedup on the A53, a small speedup on the A72, and a very minor slowdown for avg/w_avg on the A73.
    Before:                Cortex A53     A72     A73
    avg_w4_8bpc_neon:            67.4    26.1    25.4
    avg_w8_8bpc_neon:           158.7    56.3    59.1
    avg_w16_8bpc_neon:          382.9   154.1   160.7
    w_avg_w4_8bpc_neon:          99.9    43.6    39.4
    w_avg_w8_8bpc_neon:         253.2    98.3    99.0
    w_avg_w16_8bpc_neon:        543.1   285.0   301.8
    mask_w4_8bpc_neon:          110.6    51.4    45.1
    mask_w8_8bpc_neon:          295.0   129.9   114.0
    mask_w16_8bpc_neon:         654.6   365.8   369.7
    After:
    avg_w4_8bpc_neon:            60.8    26.3    29.0
    avg_w8_8bpc_neon:           142.8    52.9    64.1
    avg_w16_8bpc_neon:          378.2   153.4   160.8
    w_avg_w4_8bpc_neon:          78.7    41.0    40.9
    w_avg_w8_8bpc_neon:         190.6    90.1   105.1
    w_avg_w16_8bpc_neon:        531.1   279.3   301.4
    mask_w4_8bpc_neon:           86.6    47.2    44.9
    mask_w8_8bpc_neon:          222.0   114.3   114.9
    mask_w16_8bpc_neon:         639.5   356.0   369.8
  - Jean-Baptiste Kempf authored

- Feb 08, 2020
  - Martin Storsjö authored
    Checkasm benchmark numbers:
                             Cortex A53     A72     A73
    warp_8x8_16bpc_neon:         2029.9  1150.5  1225.2
    warp_8x8t_16bpc_neon:        2007.6  1129.0  1192.3
    Corresponding numbers for 8bpc for comparison:
    warp_8x8_8bpc_neon:          1863.8  1052.8  1106.2
    warp_8x8t_8bpc_neon:         1847.4  1048.3  1099.8

- Feb 07, 2020
  - Martin Storsjö authored
    As some functions are made for both 8bpc and 16bpc from a shared template, those functions are moved to a separate assembly file which is included. That assembly file (cdef_tmpl.S) isn't intended to be assembled on its own (just like utils.S), but if it is assembled, it should produce an empty object file.
    Checkasm benchmarks:
                                    Cortex A53     A72     A73
    cdef_dir_16bpc_neon:                 422.7   305.5   314.0
    cdef_filter_4x4_16bpc_neon:          452.9   282.7   296.6
    cdef_filter_4x8_16bpc_neon:          800.9   515.3   534.1
    cdef_filter_8x8_16bpc_neon:         1417.1   922.7   942.6
    Corresponding numbers for 8bpc for comparison:
    cdef_dir_8bpc_neon:                  394.7   268.8   281.8
    cdef_filter_4x4_8bpc_neon:           461.5   300.9   307.7
    cdef_filter_4x8_8bpc_neon:           831.6   546.1   555.6
    cdef_filter_8x8_8bpc_neon:          1454.6   934.0   960.0
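    The same one-source-two-bitdepths trick can be sketched in C (the macro and function names here are illustrative, not dav1d's): the template only defines a generator macro, so compiling it standalone yields an empty object, and each includer instantiates it for its bitdepth.

    ```c
    #include <stdint.h>

    /* Template analogue: this defines code only when instantiated, so
     * compiling it on its own produces an empty object file, just like
     * cdef_tmpl.S. Macro and function names are made up. */
    #define DECL_CDEF_FILTER(bd, pixel_t)                          \
        void sketch_cdef_filter_##bd##bpc(pixel_t *dst, int n)     \
        {                                                          \
            for (int i = 0; i < n; i++)                            \
                dst[i] = (pixel_t)0; /* stand-in for the filter */ \
        }

    DECL_CDEF_FILTER(8,  uint8_t)   /* pulled in by the 8 bpc file  */
    DECL_CDEF_FILTER(16, uint16_t)  /* pulled in by the 16 bpc file */
    ```
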
  - Martin Storsjö authored

- Feb 06, 2020
  - Henrik Gramner authored
  - Henrik Gramner authored
    The ar_coeff_shift element needs to have a 16-byte alignment on x86.
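    A minimal sketch of what such a requirement looks like in C (the struct and macro are hypothetical, not dav1d's actual definitions): aligned 16-byte SSE loads such as movdqa fault on unaligned addresses, so the member has to sit on a 16-byte boundary.

    ```c
    #include <stdint.h>

    /* Hypothetical illustration: if asm reads ar_coeff_shift with an
     * aligned 16-byte load (e.g. movdqa), the member must be 16-byte
     * aligned. ALIGN_16 mimics the kind of macro a codebase would use. */
    #if defined(_MSC_VER)
    #define ALIGN_16 __declspec(align(16))
    #else
    #define ALIGN_16 __attribute__((aligned(16)))
    #endif

    typedef struct {
        int8_t ar_coeffs_y[24];
        ALIGN_16 uint64_t ar_coeff_shift;
    } FilmGrainDataSketch;
    ```
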
  - Martin Storsjö authored
    Checkasm benchmarks:
                               Cortex A53       A72       A73
    wiener_chroma_16bpc_neon:    190288.4  129369.5  127284.1
    wiener_luma_16bpc_neon:      195108.4  129387.8  127042.7
    The corresponding numbers for 8 bpc for comparison:
    wiener_chroma_8bpc_neon:     150586.9  101647.1   97709.9
    wiener_luma_8bpc_neon:       146297.4  101593.2   97670.5
  - Martin Storsjö authored
    Use HIGHBD_DECL_SUFFIX and HIGHBD_TAIL_SUFFIX where necessary, and add a missing sizeof(pixel).
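    For context, the suffix macros follow this pattern (a sketch assuming the usual bitdepth-templating convention; the prototype, call site, and BITDEPTH pinning are made up for illustration): 16 bpc functions carry an extra bitdepth_max argument that 8 bpc builds simply omit.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    #define BITDEPTH 16 /* pinned here only so the sketch compiles */
    #if BITDEPTH == 16
    typedef uint16_t pixel;
    #define HIGHBD_DECL_SUFFIX , const int bitdepth_max
    #define HIGHBD_TAIL_SUFFIX , bitdepth_max
    #else
    typedef uint8_t pixel;
    #define HIGHBD_DECL_SUFFIX
    #define HIGHBD_TAIL_SUFFIX
    #endif

    /* Made-up prototype and call site showing both macros in use. */
    void blend_sketch(pixel *dst, ptrdiff_t stride HIGHBD_DECL_SUFFIX);

    void call_blend(pixel *dst, ptrdiff_t stride, const int bitdepth_max)
    {
        blend_sketch(dst, stride HIGHBD_TAIL_SUFFIX);
    }
    ```
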
  - Martin Storsjö authored
  - Martin Storsjö authored
  - Martin Storsjö authored
    Examples of checkasm benchmarks:
                                          Cortex A53     A72     A73
    mc_8tap_regular_w8_0_16bpc_neon:            96.8    49.6    62.5
    mc_8tap_regular_w8_h_16bpc_neon:           570.3   388.0   467.2
    mc_8tap_regular_w8_hv_16bpc_neon:         1035.8   776.7   891.3
    mc_8tap_regular_w8_v_16bpc_neon:           400.6   285.0   278.3
    mc_bilinear_w8_0_16bpc_neon:                90.0    44.8    57.8
    mc_bilinear_w8_h_16bpc_neon:               191.2   158.7   156.4
    mc_bilinear_w8_hv_16bpc_neon:              295.9   234.6   244.9
    mc_bilinear_w8_v_16bpc_neon:               147.2    98.7    89.2
    mct_8tap_regular_w8_0_16bpc_neon:          139.4    78.7    84.9
    mct_8tap_regular_w8_h_16bpc_neon:          612.5   396.8   479.1
    mct_8tap_regular_w8_hv_16bpc_neon:        1112.4   814.6   963.2
    mct_8tap_regular_w8_v_16bpc_neon:          461.8   370.8   353.4
    mct_bilinear_w8_0_16bpc_neon:              135.6    76.2    80.5
    mct_bilinear_w8_h_16bpc_neon:              211.3   159.4   141.7
    mct_bilinear_w8_hv_16bpc_neon:             325.7   237.2   227.2
    mct_bilinear_w8_v_16bpc_neon:              180.7   135.9   129.5
    For comparison, the corresponding numbers for 8 bpc:
    mc_8tap_regular_w8_0_8bpc_neon:             78.6    41.0    39.5
    mc_8tap_regular_w8_h_8bpc_neon:            371.2   299.6   348.3
    mc_8tap_regular_w8_hv_8bpc_neon:           817.1   675.0   726.5
    mc_8tap_regular_w8_v_8bpc_neon:            243.7   260.4   253.0
    mc_bilinear_w8_0_8bpc_neon:                 74.8    35.4    36.1
    mc_bilinear_w8_h_8bpc_neon:                179.9    69.9    79.2
    mc_bilinear_w8_hv_8bpc_neon:               210.8   132.4   144.8
    mc_bilinear_w8_v_8bpc_neon:                141.6    64.9    65.4
    mct_8tap_regular_w8_0_8bpc_neon:           101.7    54.4    59.5
    mct_8tap_regular_w8_h_8bpc_neon:           391.3   329.1   358.3
    mct_8tap_regular_w8_hv_8bpc_neon:          880.4   754.9   829.4
    mct_8tap_regular_w8_v_8bpc_neon:           270.8   300.8   277.4
    mct_bilinear_w8_0_8bpc_neon:                97.6    54.0    55.4
    mct_bilinear_w8_h_8bpc_neon:               173.3    73.5    79.5
    mct_bilinear_w8_hv_8bpc_neon:              228.3   163.0   174.0
    mct_bilinear_w8_v_8bpc_neon:               128.9    72.5    63.3

- Feb 05, 2020

- Feb 04, 2020
  - Martin Storsjö authored
                              Cortex A53     A72     A73
    avg_w4_16bpc_neon:              78.2    43.2    48.9
    avg_w8_16bpc_neon:             199.1   108.7   123.1
    avg_w16_16bpc_neon:            615.6   339.9   373.9
    avg_w32_16bpc_neon:           2313.0  1390.6  1490.6
    avg_w64_16bpc_neon:           5783.6  3119.5  3653.0
    avg_w128_16bpc_neon:         15444.6  8168.7  8907.9
    w_avg_w4_16bpc_neon:           120.1    87.8    92.4
    w_avg_w8_16bpc_neon:           321.6   252.4   263.1
    w_avg_w16_16bpc_neon:         1017.5   794.5   831.2
    w_avg_w32_16bpc_neon:         3911.4  3154.7  3306.5
    w_avg_w64_16bpc_neon:         9977.9  7794.9  8022.3
    w_avg_w128_16bpc_neon:       25721.5 19274.6 20041.7
    mask_w4_16bpc_neon:            139.5    96.5   104.3
    mask_w8_16bpc_neon:            376.0   283.9   300.1
    mask_w16_16bpc_neon:          1217.2   906.7   950.0
    mask_w32_16bpc_neon:          4811.1  3669.0  3901.3
    mask_w64_16bpc_neon:         12036.4  8918.4  9244.8
    mask_w128_16bpc_neon:        30888.8 21999.0 23206.7
    For comparison, these are the corresponding numbers for 8bpc:
    avg_w4_8bpc_neon:               56.7    26.2    28.5
    avg_w8_8bpc_neon:              137.2    52.8    64.3
    avg_w16_8bpc_neon:             377.9   151.5   161.6
    avg_w32_8bpc_neon:            1528.9   614.5   633.9
    avg_w64_8bpc_neon:            3792.5  1814.3  1518.3
    avg_w128_8bpc_neon:          10685.3  5220.4  3879.9
    w_avg_w4_8bpc_neon:             75.2    53.0    41.1
    w_avg_w8_8bpc_neon:            186.7   120.1   105.2
    w_avg_w16_8bpc_neon:           531.6   314.1   302.1
    w_avg_w32_8bpc_neon:          2138.4  1120.4  1171.5
    w_avg_w64_8bpc_neon:          5151.9  2910.5  2857.1
    w_avg_w128_8bpc_neon:        13945.0  7330.5  7389.1
    mask_w4_8bpc_neon:              82.0    47.2    45.1
    mask_w8_8bpc_neon:             213.5   115.4   115.8
    mask_w16_8bpc_neon:            639.8   356.2   370.1
    mask_w32_8bpc_neon:           2566.9  1489.8  1435.0
    mask_w64_8bpc_neon:           6727.6  3822.8  3425.2
    mask_w128_8bpc_neon:         17893.0  9622.6  9161.3
  - Martin Storsjö authored
  - Martin Storsjö authored

- Feb 03, 2020
  - On many AMD CPUs, cmov instructions that depend on multiple flags require an additional µop, so prefer cmov variants that depend on only a single flag where possible.
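    An illustrative C-level view of the distinction (whether a compiler emits these exact instructions varies; the functions below are not from the commit): an unsigned x <= limit maps to cmovbe, which reads both CF and ZF, while x < limit + 1 maps to cmovb, which reads CF alone.

    ```c
    #include <stdint.h>

    /* Illustrative: cmovbe/cmova read CF and ZF (an extra µop on many
     * AMD cores); cmovb/cmovae read only CF. When limit + 1 cannot
     * wrap, the two forms are equivalent, and the second permits the
     * single-flag cmovb. */
    uint32_t sel_two_flags(uint32_t x, uint32_t limit, uint32_t other)
    {
        return x <= limit ? x : other;     /* cmovbe-style comparison */
    }

    uint32_t sel_one_flag(uint32_t x, uint32_t limit, uint32_t other)
    {
        return x < limit + 1 ? x : other;  /* cmovb-style comparison */
    }
    ```
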
  - Shave off a few instructions, or save a few bytes, in various places. Also change some instructions to use appropriately sized registers.
  - Using signed comparisons could theoretically cause the wrong result in some obscure corner cases on x86-32 with PAE. x86-64 should be fine with either, but unsigned is technically more correct.
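    A small illustration of the difference (not from the commit): user pointers above 2 GiB are negative when viewed as signed integers, so ordering them must go through an unsigned type.

    ```c
    #include <stdint.h>

    /* Illustrative: on x86-32 with a large user address space, a
     * pointer above 2 GiB is negative as intptr_t, so a signed compare
     * (jl/jge) can order two addresses incorrectly. An unsigned
     * compare (jb/jae) is always correct. */
    int addr_before(const void *a, const void *b)
    {
        return (uintptr_t)a < (uintptr_t)b;
    }
    ```
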
- Feb 01, 2020
  - Henrik Gramner authored
    Avoids some pointer chasing and simplifies the DSP code, at the cost of making the initialization a little bit more complicated. Also reduces memory usage by a small amount due to properly sizing the buffers instead of always allocating enough space for 4:4:4.
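    A sketch of the sizing idea (the helper is hypothetical): chroma plane dimensions follow the layout's subsampling, so 4:2:0 needs only a quarter of the 4:4:4 chroma allocation.

    ```c
    #include <stddef.h>

    /* Hypothetical helper: compute a chroma plane's size from the
     * subsampling factors instead of reserving the 4:4:4 worst case.
     * 4:2:0 -> ss_hor=1, ss_ver=1; 4:2:2 -> 1, 0; 4:4:4 -> 0, 0. */
    size_t chroma_plane_size(size_t width, size_t height,
                             int ss_hor, int ss_ver)
    {
        const size_t cw = (width  + ss_hor) >> ss_hor;
        const size_t ch = (height + ss_ver) >> ss_ver;
        return cw * ch;
    }
    ```
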
  - Henrik Gramner authored
  - Henrik Gramner authored
    We specify most strides in bytes, but since C defines offsets in multiples of sizeof(type), we use the PXSTRIDE() macro to shift the strides right by one in high bit depth templated files. This, however, means that the compiler is required to mask away the least significant bit, because it could in theory be non-zero. Avoid that by telling the compiler (when compiled in release mode) that the lsb is in fact guaranteed to always be zero.
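    A minimal sketch of the idea, assuming GCC/Clang's __builtin_unreachable (dav1d's actual macro may differ): debug builds assert the evenness, release builds promise it to the compiler so the shift needs no masking.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Sketch: 16 bpc byte strides are always even. Debug builds check
     * it; release builds promise it, letting the compiler treat
     * PXSTRIDE(x) as a plain shift with a known-zero low bit. */
    #ifdef NDEBUG
    #define PXSTRIDE(x) \
        (((x) & 1) ? (__builtin_unreachable(), (ptrdiff_t)0) : ((x) >> 1))
    #else
    #define PXSTRIDE(x) (assert(!((x) & 1)), (x) >> 1)
    #endif
    ```
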
- Jan 29, 2020
  - Henrik Gramner authored
    Required for AVX-512.
  - Martin Storsjö authored
    Before:
    ARM32:                      Cortex A7      A8      A9     A53     A72     A73
    cdef_filter_4x4_8bpc_neon:      964.6   599.5   707.9   601.2   465.1   405.2
    cdef_filter_4x8_8bpc_neon:     1726.0  1066.2  1238.7  1041.7   798.6   725.3
    cdef_filter_8x8_8bpc_neon:     2974.4  1671.8  1943.9  1806.1  1229.8  1242.1
    ARM64:                     Cortex A53     A72     A73
    cdef_filter_4x4_8bpc_neon:      569.2   337.8   348.7
    cdef_filter_4x8_8bpc_neon:     1031.1   623.3   633.6
    cdef_filter_8x8_8bpc_neon:     1847.5  1097.7  1117.5
    After:
    ARM32:                      Cortex A7      A8      A9     A53     A72     A73
    cdef_filter_4x4_8bpc_neon:      798.4   524.2   617.3   506.8   432.4   361.1
    cdef_filter_4x8_8bpc_neon:     1394.7   910.4  1054.0   863.6   730.2   632.2
    cdef_filter_8x8_8bpc_neon:     2364.6  1453.8  1675.1  1466.0  1086.4  1107.7
    ARM64:                     Cortex A53     A72     A73
    cdef_filter_4x4_8bpc_neon:      461.7   303.1   308.6
    cdef_filter_4x8_8bpc_neon:      833.0   547.5   556.0
    cdef_filter_8x8_8bpc_neon:     1459.3   934.1   967.9
  - Martin Storsjö authored

- Jan 28, 2020

- Jan 27, 2020
  - The main change is splitting the filter code into three different code paths depending on the strength values. Clipping is only required when both the primary and secondary strengths are non-zero, which is an uncommon case; being able to skip that complexity in the common cases is significantly faster.
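    A sketch of the resulting dispatch (the helper names are made up): the clipped path only runs when both strengths are non-zero.

    ```c
    #include <stdio.h>

    /* Illustrative stubs standing in for the three asm code paths. */
    static void filter_pri_sec_clip(void) { puts("pri+sec, clipped"); }
    static void filter_pri_only(void)     { puts("pri only, no clip"); }
    static void filter_sec_only(void)     { puts("sec only, no clip"); }

    /* Clipping is needed only when both strengths are non-zero, so
     * the common single-strength cases take the cheaper paths. */
    void cdef_filter_block(int pri_strength, int sec_strength)
    {
        if (pri_strength && sec_strength)
            filter_pri_sec_clip();
        else if (pri_strength)
            filter_pri_only();
        else
            filter_sec_only();
    }
    ```
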
- Jan 25, 2020
  - If the primary strengths for both luma and chroma are zero, the direction is always zero. If both the primary and secondary luma strengths are zero, the entire luma filtering process is a no-op.
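    The two early exits can be sketched as predicates (names are illustrative):

    ```c
    #include <stdbool.h>

    /* If no plane has a primary strength, the direction is always
     * zero, so the direction search can be skipped. */
    bool need_dir_search(int y_pri_strength, int uv_pri_strength)
    {
        return y_pri_strength || uv_pri_strength;
    }

    /* With zero primary and secondary luma strengths, luma filtering
     * is a no-op and can be skipped entirely. */
    bool need_luma_filter(int y_pri_strength, int y_sec_strength)
    {
        return y_pri_strength || y_sec_strength;
    }
    ```
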
- Jan 21, 2020
  - Henrik Gramner authored
    Required for the AVX-512 instructions added in Ice Lake.
  - Konstantin Pavlov authored
    The image now includes nasm 2.14.02, which is needed to assemble AVX-512 code.