- Mar 05, 2020
-
-
Jean-Baptiste Kempf authored
-
Checkasm numbers:            Cortex A53      A72      A73
w_mask_420_w4_16bpc_neon:         173.6    123.5    120.3
w_mask_420_w8_16bpc_neon:         484.2    344.1    329.5
w_mask_420_w16_16bpc_neon:       1411.2   1027.4   1035.1
w_mask_420_w32_16bpc_neon:       5561.5   4093.2   3980.1
w_mask_420_w64_16bpc_neon:      13809.6   9856.5   9581.0
w_mask_420_w128_16bpc_neon:     35614.7  25553.8  24284.4
w_mask_422_w4_16bpc_neon:         159.4    112.2    114.2
w_mask_422_w8_16bpc_neon:         453.4    326.1    326.7
w_mask_422_w16_16bpc_neon:       1394.6   1062.3   1050.2
w_mask_422_w32_16bpc_neon:       5485.8   4219.6   4027.3
w_mask_422_w64_16bpc_neon:      13701.2  10079.6   9692.6
w_mask_422_w128_16bpc_neon:     35455.3  25892.5  24625.9
w_mask_444_w4_16bpc_neon:         153.0    112.3    112.7
w_mask_444_w8_16bpc_neon:         437.2    331.8    325.8
w_mask_444_w16_16bpc_neon:       1395.1   1069.1   1041.7
w_mask_444_w32_16bpc_neon:       5370.1   4213.5   4138.1
w_mask_444_w64_16bpc_neon:      13482.6  10190.5  10004.6
w_mask_444_w128_16bpc_neon:     35583.7  26911.2  25638.8

Corresponding numbers for 8 bpc for comparison:

w_mask_420_w4_8bpc_neon:          126.6     79.1     87.7
w_mask_420_w8_8bpc_neon:          343.9    195.0    211.5
w_mask_420_w16_8bpc_neon:         886.3    540.3    577.7
w_mask_420_w32_8bpc_neon:        3558.6   2152.4   2216.7
w_mask_420_w64_8bpc_neon:        8894.9   5161.2   5297.0
w_mask_420_w128_8bpc_neon:      22520.1  13514.5  13887.2
w_mask_422_w4_8bpc_neon:          112.9     68.2     77.0
w_mask_422_w8_8bpc_neon:          314.4    175.5    208.7
w_mask_422_w16_8bpc_neon:         835.5    565.0    608.3
w_mask_422_w32_8bpc_neon:        3381.3   2231.8   2287.6
w_mask_422_w64_8bpc_neon:        8499.4   5343.6   5460.8
w_mask_422_w128_8bpc_neon:      21823.3  14206.5  14249.1
w_mask_444_w4_8bpc_neon:          104.6     65.8     72.7
w_mask_444_w8_8bpc_neon:          290.4    173.7    196.6
w_mask_444_w16_8bpc_neon:         831.4    586.7    591.7
w_mask_444_w32_8bpc_neon:        3320.8   2300.6   2251.0
w_mask_444_w64_8bpc_neon:        8300.0   5480.5   5346.8
w_mask_444_w128_8bpc_neon:      21633.8  15981.3  14384.8
-
Janne Grunau authored
Switches build-debian (for AVX2 checkasm coverage), test-win64, and test-debian-unaligned-stack (for testing asm '%if's). Refs #330, #333
-
- Mar 04, 2020
-
-
-
Martin Storsjö authored
Checkasm numbers:          Cortex A53     A72     A73
blend_h_w2_16bpc_neon:          109.3    83.1    56.7
blend_h_w4_16bpc_neon:          114.1    61.4    62.3
blend_h_w8_16bpc_neon:          133.3    80.8    81.1
blend_h_w16_16bpc_neon:         215.6   132.7   149.5
blend_h_w32_16bpc_neon:         390.4   254.2   235.8
blend_h_w64_16bpc_neon:         719.1   456.3   453.8
blend_h_w128_16bpc_neon:       1646.1  1112.3  1065.9
blend_v_w2_16bpc_neon:          185.9   175.9   180.0
blend_v_w4_16bpc_neon:          338.0   183.4   232.1
blend_v_w8_16bpc_neon:          426.5   213.8   250.6
blend_v_w16_16bpc_neon:         678.2   357.8   382.6
blend_v_w32_16bpc_neon:        1098.3   686.2   695.6
blend_w4_16bpc_neon:             75.7    31.5    32.0
blend_w8_16bpc_neon:            134.0    75.0    75.8
blend_w16_16bpc_neon:           467.9   267.3   310.0
blend_w32_16bpc_neon:          1201.9   658.7   779.7

Corresponding numbers for 8 bpc for comparison:

blend_h_w2_8bpc_neon:           104.1    55.9    60.8
blend_h_w4_8bpc_neon:           108.9    58.7    48.2
blend_h_w8_8bpc_neon:            99.3    64.4    67.4
blend_h_w16_8bpc_neon:          145.2    93.4    85.1
blend_h_w32_8bpc_neon:          262.2   157.5   148.6
blend_h_w64_8bpc_neon:          466.7   278.9   256.6
blend_h_w128_8bpc_neon:        1054.2   624.7   571.0
blend_v_w2_8bpc_neon:           170.5   106.6   113.4
blend_v_w4_8bpc_neon:           333.0   189.9   225.9
blend_v_w8_8bpc_neon:           314.9   199.0   203.5
blend_v_w16_8bpc_neon:          476.9   300.8   241.1
blend_v_w32_8bpc_neon:          766.9   430.4   415.1
blend_w4_8bpc_neon:              66.7    35.4    26.0
blend_w8_8bpc_neon:             110.7    47.9    48.1
blend_w16_8bpc_neon:            299.4   161.8   162.3
blend_w32_8bpc_neon:            725.8   417.0   432.8
-
Martin Storsjö authored
Use a post-increment with a register on the last increment, avoiding a separate increment. Avoid processing the last 8 pixels in the w32 case when we only output 24 pixels.

Before:
ARM32:                 Cortex A7      A8      A9     A53     A72     A73
blend_v_w4_8bpc_neon:      450.4   574.7   538.7   374.6   199.3   260.5
blend_v_w8_8bpc_neon:      559.6   351.3   552.5   357.6   214.8   204.3
blend_v_w16_8bpc_neon:     926.3   511.6   787.9   593.0   271.0   246.8
blend_v_w32_8bpc_neon:    1482.5   917.0  1149.5   991.9   354.0   368.9
ARM64:
blend_v_w4_8bpc_neon:                              351.1   200.0   224.1
blend_v_w8_8bpc_neon:                              333.0   212.4   203.8
blend_v_w16_8bpc_neon:                             495.2   302.0   247.0
blend_v_w32_8bpc_neon:                             840.0   557.8   514.0

After:
ARM32:
blend_v_w4_8bpc_neon:      435.5   575.8   537.6   356.2   198.3   259.5
blend_v_w8_8bpc_neon:      545.2   347.9   553.5   339.1   207.8   204.2
blend_v_w16_8bpc_neon:     913.7   511.0   788.1   573.7   275.4   243.3
blend_v_w32_8bpc_neon:    1445.3   951.2  1079.1   920.4   352.2   361.6
ARM64:
blend_v_w4_8bpc_neon:                              333.0   191.3   225.9
blend_v_w8_8bpc_neon:                              314.9   199.3   203.5
blend_v_w16_8bpc_neon:                             476.9   301.3   241.1
blend_v_w32_8bpc_neon:                             766.9   432.8   416.9
-
Martin Storsjö authored
-
Martin Storsjö authored
-
Martin Storsjö authored
For loads and stores of a full or half register (as opposed to lanewise loads/stores), the lane specification itself doesn't matter, only its size. This doesn't change the generated code, but makes it more readable.
-
- Mar 03, 2020
-
-
Jean-Baptiste Kempf authored
-
Janne Grunau authored
-
- Mar 02, 2020
-
-
Checkasm runtimes:            Cortex A53     A72     A73
lpf_h_sb_uv_w4_16bpc_neon:         919.0   795.0   714.9
lpf_h_sb_uv_w6_16bpc_neon:        1267.7  1116.2  1081.9
lpf_h_sb_y_w4_16bpc_neon:         1500.2  1543.9  1778.5
lpf_h_sb_y_w8_16bpc_neon:         2216.1  2183.0  2568.1
lpf_h_sb_y_w16_16bpc_neon:        2641.8  2630.4  2639.4
lpf_v_sb_uv_w4_16bpc_neon:         836.5   572.7   667.3
lpf_v_sb_uv_w6_16bpc_neon:        1130.8   709.1   955.5
lpf_v_sb_y_w4_16bpc_neon:         1271.6  1434.4  1272.1
lpf_v_sb_y_w8_16bpc_neon:         1818.0  1759.1  1664.6
lpf_v_sb_y_w16_16bpc_neon:        1998.6  2115.8  1586.6

Corresponding numbers for 8 bpc for comparison:

lpf_h_sb_uv_w4_8bpc_neon:          799.4   632.8   695.4
lpf_h_sb_uv_w6_8bpc_neon:         1067.3   613.6   767.5
lpf_h_sb_y_w4_8bpc_neon:          1490.5  1179.1  1018.9
lpf_h_sb_y_w8_8bpc_neon:          1892.9  1382.0  1172.0
lpf_h_sb_y_w16_8bpc_neon:         2117.4  1625.4  1739.0
lpf_v_sb_uv_w4_8bpc_neon:          447.1   447.7   446.0
lpf_v_sb_uv_w6_8bpc_neon:          522.1   529.0   513.1
lpf_v_sb_y_w4_8bpc_neon:          1043.7   785.0   775.9
lpf_v_sb_y_w8_8bpc_neon:          1500.4  1115.9   881.2
lpf_v_sb_y_w16_8bpc_neon:         1493.5  1371.4  1248.5
-
-
-
- Feb 25, 2020
-
-
Requires meson 0.51 for oss-fuzz and 0.49 for the fuzzing binaries in general due to the use of the 'kwargs' keyword argument.
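The 'kwargs' mechanism referred to above might look roughly like this in a fuzzer meson.build (a hedged sketch only — the variable names and targets here are illustrative, not the project's actual build file):

```meson
# Hypothetical sketch: one dictionary of shared keyword arguments, passed to
# several fuzzer targets via the 'kwargs' keyword (meson >= 0.49 for 'kwargs';
# dictionary literals themselves need >= 0.47).
fuzzer_kwargs = {
    'include_directories': dav1d_inc_dirs,
    'link_with': libdav1d,
}

executable('dav1d_fuzzer', 'dav1d_fuzzer.c', kwargs: fuzzer_kwargs)
executable('dav1d_fuzzer_mt', 'dav1d_fuzzer.c', kwargs: fuzzer_kwargs)
```

This avoids repeating the same set of keyword arguments on every fuzzer target.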
-
-
- Feb 24, 2020
-
-
Henrik Gramner authored
-
Henrik Gramner authored
-
Henrik Gramner authored
-
Henrik Gramner authored
-
Victorien Le Couviour--Tuffet authored
Add 2 separate code paths for pri/sec strengths equal to 0. Having both strengths not equal to 0 is uncommon; branching to skip unnecessary computations is therefore beneficial.

before: cdef_filter_4x4_8bpc_avx2:  93.8
after:  cdef_filter_4x4_8bpc_avx2:  71.7

before: cdef_filter_4x8_8bpc_avx2: 161.5
after:  cdef_filter_4x8_8bpc_avx2: 116.3

before: cdef_filter_8x8_8bpc_avx2: 221.8
after:  cdef_filter_8x8_8bpc_avx2: 156.4
-
Victorien Le Couviour--Tuffet authored
Fully edged blocks perf:

before: cdef_filter_4x4_8bpc_avx2:  91.0
after:  cdef_filter_4x4_8bpc_avx2:  75.7

before: cdef_filter_4x8_8bpc_avx2: 154.6
after:  cdef_filter_4x8_8bpc_avx2: 131.8

before: cdef_filter_8x8_8bpc_avx2: 214.1
after:  cdef_filter_8x8_8bpc_avx2: 195.9
-
Change the input buffer randomization algorithm to more readily trigger issues with both under- and overflows in cdef_filter.
-
- Feb 21, 2020
-
-
Luc Trudeau authored
-
Luc Trudeau authored
Muxer and demuxer arrays are now statically initialized
-
- Feb 20, 2020
-
-
Luc Trudeau authored
-
- Feb 18, 2020
-
-
Janne Grunau authored
-
- Feb 17, 2020
-
-
Martin Storsjö authored
This increases the code size by around 3 KB on arm64.

Before:
ARM32:                    Cortex A7      A8      A9     A53     A72     A73
cdef_filter_4x4_8bpc_neon:    807.1   517.0   617.7   506.6   429.9   357.8
cdef_filter_4x8_8bpc_neon:   1407.9   899.3  1054.6   862.3   726.5   628.1
cdef_filter_8x8_8bpc_neon:   2394.9  1456.8  1676.8  1461.2  1084.4  1101.2
ARM64:
cdef_filter_4x4_8bpc_neon:                            460.7   301.8   308.0
cdef_filter_4x8_8bpc_neon:                            831.6   547.0   555.2
cdef_filter_8x8_8bpc_neon:                           1454.6   935.6   960.4

After:
ARM32:
cdef_filter_4x4_8bpc_neon:    669.3   541.3   524.4   424.9   322.7   298.1
cdef_filter_4x8_8bpc_neon:   1159.1   922.9   881.1   709.2   538.3   514.1
cdef_filter_8x8_8bpc_neon:   1888.8  1285.4  1358.5  1152.9   839.3   871.2
ARM64:
cdef_filter_4x4_8bpc_neon:                            383.6   262.1   259.9
cdef_filter_4x8_8bpc_neon:                            684.9   472.2   464.7
cdef_filter_8x8_8bpc_neon:                           1160.0   756.8   788.0

(The checkasm benchmark averages three different cases; the fully edged case is one of those three, while it's the most common case in actual video. The difference is much bigger if only benchmarking that particular case.)

This actually apparently makes the code a little bit slower for the w=4 cases on Cortex A8, while it's a significant speedup on all other cores.
-
Martin Storsjö authored
The signedness of elements doesn't matter for vsub; match the vsub.i16 next to it.
-
- Feb 16, 2020
-
-
Henrik Gramner authored
Console output is incredibly slow on Windows, which is aggravated by the lack of line buffering. As a result, a significant percentage of overall runtime is actually spent displaying the decoding progress. Doing the line buffering manually alleviates most of the issue.
-
- Feb 13, 2020
-
-
Martin Storsjö authored
These were missed in 361a3c8e.
-
- Feb 11, 2020
-
-
Martin Storsjö authored
This only supports 10 bpc, not 12 bpc, as the sum and tmp buffers can be int16_t for 10 bpc, but need to be int32_t for 12 bpc.

Make actual templates out of the functions in looprestoration_tmpl.S, and add box3/5_h to looprestoration16.S.

Extend dav1d_sgr_calc_abX_neon with a mandatory bitdepth_max parameter (which is passed even in 8 bpc mode), and add a define to bitdepth.h for passing such a parameter in all modes. This makes this function a few instructions slower in 8 bpc mode than it was before (overall impact seems to be around 1% of the total runtime of SGR), but allows using the same actual function instantiation for all modes, saving a bit of code size.

Examples of checkasm runtimes:
                            Cortex A53       A72       A73
selfguided_3x3_10bpc_neon:    516755.8  389412.7  349058.7
selfguided_5x5_10bpc_neon:    380699.9  293486.6  254591.6
selfguided_mix_10bpc_neon:    878142.3  667495.9  587844.6

Corresponding 8 bpc numbers for comparison:

selfguided_3x3_8bpc_neon:     491058.1  361473.4  347705.9
selfguided_5x5_8bpc_neon:     352655.0  266423.7  248192.2
selfguided_mix_8bpc_neon:     826094.1  612372.2  581943.1
-
Martin Storsjö authored
looprestoration_common.S contains functions that can be used as is, with a single instantiation of each function serving both 8 and 16 bpc. This file is built once, regardless of which bitdepths are enabled. looprestoration_tmpl.S contains functions where the source can be shared and templated between 8 and 16 bpc. It is included by the separate 8/16 bpc implementation files.
-
Martin Storsjö authored
Don't add it to dav1d_sgr_calc_ab1/2_neon and box3/5_v, as the same concrete function implementations can be shared for both 8 and 16 bpc for those functions.
-
Martin Storsjö authored
This allows using completely different codepaths for 10 and 12 bpc, or just adding SIMD functions for either of them.
-
Martin Storsjö authored
Set flags further from the branch instructions that use them.
-
Martin Storsjö authored
For 8bpc and 10bpc, int16_t is enough here, and for 12bpc, other intermediate int16_t buffers also need to be made of size coef anyway.
-
Martin Storsjö authored
-
- Feb 10, 2020
-
-
Jean-Baptiste Kempf authored
-
Only copy as much as really is needed/used.
-