- Aug 24, 2024
Martin Storsjö authored
Apparently this case never actually gets executed, at least in most checkasm runs, but some tools could complain about the relocation against 160b, which pointed somewhere other than intended.
-
- Aug 23, 2024
Martin Storsjö authored
This applies the same optimizations as 3329f8d1 and 1790e132 to the rest of the code.
-
Martin Storsjö authored
This makes the code behave as intended when filling a rectangle of arbitrary width (filling with the largest power-of-two width until the rectangle is covered); previously, it accidentally fell back to writing 4 pixel wide stripes immediately. No measurable effect on checkasm benchmarks though.
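A minimal scalar sketch of the intended strategy, with an invented function name and an illustrative 32-pixel upper stripe size (the actual code operates on vector registers):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative only: cover an arbitrary width with the widest
     * power-of-two stripes first (32, 16, 8, 4) instead of immediately
     * falling back to 4-pixel stripes.  Assumes w is a multiple of 4. */
    static void fill_rect(uint8_t *const dst, const ptrdiff_t stride,
                          const int w, const int h, const uint8_t val)
    {
        int x = 0;
        for (int step = 32; step >= 4; step >>= 1)
            while (w - x >= step) {
                for (int y = 0; y < h; y++)
                    memset(dst + y * stride + x, val, step);
                x += step;
            }
    }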
-
- Aug 22, 2024
MS armasm64 cannot compile some SVE instructions with immediate operands, e.g.:

    sub z0.h, z0.h, #8192

The proper form is:

    sub z0.h, z0.h, #32, lsl #8

This patch contains the needed fixes.
-
Martin Storsjö authored
Don't include the BTI landing pad instruction in the loops. If built with BTI enabled, AARCH64_VALID_JUMP_TARGET expands to a no-op instruction that indicates that indirect jumps can land there. But there's no need for the loops to include that instruction.
-
Add an Armv9.0-A SVE2 code path for high bitdepth convolutions. Only 2D convolutions have 6-tap specialisations of their vertical passes. All other convolutions are 4- or 8-tap filters which fit well with the 4-element 16-bit SDOT instruction of SVE2. This patch renames HBD prep/put_neon to prep/put_16bpc_neon and exports put_16bpc_neon. Benchmarks show up to 17% FPS increase depending on the input video and the CPU used. This patch will increase the .text by around 8 KiB. Relative performance to the C reference on some Cortex-A/X CPUs:

regular       A715    A720    X3      X4      A510    A520
w4  hv neon:  3.93x   4.10x   5.21x   5.17x   3.57x   5.27x
w4  hv sve2:  4.99x   5.14x   6.00x   6.05x   4.33x   3.99x
w8  hv neon:  1.72x   1.67x   1.98x   2.18x   2.95x   2.94x
w8  hv sve2:  2.12x   2.29x   2.52x   2.62x   2.60x   2.60x
w16 hv neon:  1.59x   1.53x   1.83x   1.89x   2.35x   2.24x
w16 hv sve2:  1.94x   2.12x   2.33x   2.18x   2.06x   2.06x
w32 hv neon:  1.49x   1.50x   1.66x   1.76x   2.10x   2.16x
w32 hv sve2:  1.81x   2.09x   2.11x   2.09x   1.84x   1.87x
w64 hv neon:  1.52x   1.50x   1.55x   1.71x   1.95x   2.05x
w64 hv sve2:  1.84x   2.08x   1.97x   1.98x   1.74x   1.77x
w4  h  neon:  5.35x   5.47x   7.39x   5.78x   3.92x   5.19x
w4  h  sve2:  7.91x   8.35x  11.95x  10.33x   5.81x   5.42x
w8  h  neon:  4.49x   4.43x   6.50x   4.87x   7.18x   6.17x
w8  h  sve2:  6.09x   6.22x   9.59x   7.70x   7.89x   6.83x
w16 h  neon:  2.53x   2.52x   2.34x   1.86x   2.71x   2.75x
w16 h  sve2:  3.41x   3.47x   3.53x   3.25x   2.89x   2.96x
w32 h  neon:  2.07x   2.08x   1.97x   1.56x   2.17x   2.21x
w32 h  sve2:  2.76x   2.84x   2.94x   2.75x   2.24x   2.29x
w64 h  neon:  1.86x   1.86x   1.76x   1.41x   1.87x   1.88x
w64 h  sve2:  2.47x   2.54x   2.65x   2.46x   1.94x   1.94x
w4  v  neon:  5.22x   5.17x   6.36x   5.60x   4.23x   7.30x
w4  v  sve2:  5.86x   5.90x   7.81x   7.16x   4.86x   4.15x
w8  v  neon:  4.83x   4.79x   6.96x   6.45x   4.74x   8.40x
w8  v  sve2:  5.25x   5.23x   7.76x   6.79x   4.84x   4.13x
w16 v  neon:  2.59x   2.60x   2.93x   2.47x   1.80x   4.16x
w16 v  sve2:  2.85x   2.88x   3.36x   2.73x   1.86x   2.00x
w32 v  neon:  2.12x   2.13x   2.33x   2.03x   1.34x   3.11x
w32 v  sve2:  2.36x   2.40x   2.73x   2.32x   1.41x   1.48x
w64 v  neon:  1.94x   1.92x   2.02x   1.78x   1.12x   2.59x
w64 v  sve2:  2.16x   2.15x   2.37x   2.03x   1.17x   1.22x
w4  0  neon:  1.75x   1.71x   1.44x   1.56x   3.18x   2.87x
w4  0  sve2:  4.28x   4.39x   5.72x   6.42x   5.50x   4.68x
w8  0  neon:  3.05x   3.04x   4.44x   4.64x   3.84x   3.52x
w8  0  sve2:  3.85x   3.80x   5.45x   6.01x   4.92x   4.26x
w16 0  neon:  2.92x   2.93x   3.82x   3.23x   4.58x   4.44x
w16 0  sve2:  4.29x   4.27x   4.25x   4.15x   5.58x   5.29x
w32 0  neon:  2.73x   2.76x   3.50x   2.67x   4.44x   4.26x
w32 0  sve2:  4.09x   4.10x   3.75x   3.39x   5.67x   5.22x
w64 0  neon:  2.73x   2.70x   3.27x   3.14x   4.57x   4.68x
w64 0  sve2:  4.06x   3.97x   3.54x   3.18x   6.36x   6.25x

sharp         A715    A720    X3      X4      A510    A520
w4  hv neon:  3.54x   3.64x   4.43x   4.45x   3.03x   4.72x
w4  hv sve2:  4.30x   4.55x   5.38x   5.26x   4.04x   3.76x
w8  hv neon:  1.30x   1.25x   1.51x   1.60x   2.44x   2.43x
w8  hv sve2:  1.86x   2.06x   2.09x   2.18x   2.37x   2.39x
w16 hv neon:  1.19x   1.16x   1.43x   1.36x   1.95x   1.98x
w16 hv sve2:  1.68x   1.91x   1.94x   1.84x   1.89x   1.94x
w32 hv neon:  1.13x   1.12x   1.30x   1.29x   1.75x   1.81x
w32 hv sve2:  1.58x   1.84x   1.75x   1.74x   1.70x   1.76x
w64 hv neon:  1.13x   1.13x   1.21x   1.25x   1.65x   1.69x
w64 hv sve2:  1.57x   1.84x   1.62x   1.67x   1.62x   1.65x
w4  h  neon:  5.38x   5.49x   7.46x   5.74x   3.93x   5.23x
w4  h  sve2:  7.86x   8.37x  11.99x  10.38x   5.81x   5.40x
w8  h  neon:  3.46x   3.49x   5.36x   4.64x   6.40x   5.62x
w8  h  sve2:  5.95x   6.23x   9.61x   7.76x   7.86x   6.89x
w16 h  neon:  1.99x   1.97x   2.07x   1.91x   2.43x   2.51x
w16 h  sve2:  3.42x   3.46x   3.75x   3.23x   2.89x   2.98x
w32 h  neon:  1.67x   1.62x   1.66x   1.63x   1.95x   2.01x
w32 h  sve2:  2.86x   2.84x   2.94x   2.72x   2.21x   2.29x
w64 h  neon:  1.45x   1.45x   1.51x   1.48x   1.69x   1.70x
w64 h  sve2:  2.47x   2.54x   2.64x   2.46x   1.93x   1.95x
w4  v  neon:  4.07x   4.01x   5.15x   4.74x   3.38x   6.56x
w4  v  sve2:  5.88x   5.86x   7.81x   7.15x   4.85x   4.39x
w8  v  neon:  3.64x   3.59x   5.38x   4.92x   3.59x   7.23x
w8  v  sve2:  5.23x   5.19x   7.77x   6.66x   4.81x   4.13x
w16 v  neon:  1.93x   1.95x   2.25x   1.92x   1.35x   3.46x
w16 v  sve2:  2.85x   2.88x   3.36x   2.71x   1.86x   1.94x
w32 v  neon:  1.57x   1.58x   1.78x   1.60x   1.01x   2.67x
w32 v  sve2:  2.36x   2.39x   2.73x   2.35x   1.41x   1.50x
w64 v  neon:  1.44x   1.42x   1.54x   1.43x   0.85x   2.19x
w64 v  sve2:  2.17x   2.15x   2.37x   2.06x   1.18x   1.25x
-
- Aug 21, 2024
Arpad Panyik authored
Add a 6-tap variant of the standard bit-depth horizontal subpel filters using the Armv8.6 I8MM USMMLA matrix multiply instruction. This patch also extends the HV filter with a 6-tap horizontal pass using USMMLA. Benchmarks show up to 6-7% FPS increase depending on the input video and the CPU used. This patch will increase the .text by around 1.2 KiB. Relative runtime of micro benchmarks after this patch on Neoverse and Cortex CPU cores:

regular   V2      V1      X3      A720    A715    A520    A510
w8  hv:   0.860x  0.895x  0.870x  0.896x  0.896x  0.938x  0.936x
w16 hv:   0.829x  0.886x  0.865x  0.908x  0.906x  0.946x  0.944x
w32 hv:   0.837x  0.883x  0.862x  0.914x  0.915x  0.953x  0.949x
w64 hv:   0.840x  0.883x  0.862x  0.914x  0.914x  0.955x  0.952x
w8  h:    0.746x  0.754x  0.747x  0.723x  0.724x  0.874x  0.866x
w16 h:    0.749x  0.764x  0.745x  0.731x  0.731x  0.858x  0.852x
w32 h:    0.739x  0.754x  0.738x  0.729x  0.729x  0.839x  0.837x
w64 h:    0.736x  0.749x  0.733x  0.725x  0.726x  0.847x  0.836x
-
- Aug 12, 2024
Arpad Panyik authored
The macro parameter \xmy of filter_8tap_fn was used incorrectly as a pointer instead of \lsrc. They refer to the same register, but in different contexts.
-
- Aug 04, 2024
Kyle Siefring authored
Performance Impact on Sapphire Rapids: Chimera: 0.46% Faster
-
- Jun 26, 2024
Arpad Panyik authored
The constants used for the subpel filters were placed in the .text section for simplicity and peak performance, but this does not work on systems with execute only .text sections (e.g.: OpenBSD). The performance cost of moving the constants to the .rodata section is small and mostly within the measurable noise.
-
- Jun 25, 2024
Martin Storsjö authored
The ldr instruction can only handle offsets that are a multiple of the element size; most assemblers implicitly produce the ldur instruction when a non-aligned offset is provided. Older versions of MS armasm64, however, error out on this. Since MSVC 2022 17.8, armasm64 can implicitly produce ldur, but 2022 17.7 and earlier require writing the instruction explicitly as ldur. Despite this, even older versions still fail to build the mc_dotprod.S sources, with errors like this:

    src\libdav1d.a.p\mc_dotprod.obj.asm(556) : error A2513: operand 2: Constant value out of range
        mov x10, (((0*15-1)<<7)|(3*15-1))

This happens on MSVC 2022 17.1 and older, while 17.2 and newer accept the negative value expression here. In practice, HAVE_DOTPROD doesn't get enabled by the Meson configure script at the moment, as it uses inline assembly to test for external assembler features.
-
Add run-time CPU feature detection for DotProd and i8mm on AArch64.
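A hedged sketch of one common way to do this on Linux/AArch64, using getauxval(); the flag constants here are invented, and real detection code also needs paths for other operating systems:

    #include <sys/auxv.h>   /* getauxval(), AT_HWCAP, AT_HWCAP2 (Linux) */
    #include <asm/hwcap.h>  /* HWCAP_ASIMDDOTPROD, HWCAP2_I8MM */

    #define MY_CPU_FLAG_DOTPROD (1 << 0)  /* invented flag bits */
    #define MY_CPU_FLAG_I8MM    (1 << 1)

    static unsigned detect_arm_cpu_flags(void)
    {
        unsigned flags = 0;
    #ifdef HWCAP_ASIMDDOTPROD
        if (getauxval(AT_HWCAP) & HWCAP_ASIMDDOTPROD)
            flags |= MY_CPU_FLAG_DOTPROD;
    #endif
    #ifdef HWCAP2_I8MM
        if (getauxval(AT_HWCAP2) & HWCAP2_I8MM)
            flags |= MY_CPU_FLAG_I8MM;
    #endif
        return flags;
    }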
-
Henrik Gramner authored
-
- Jun 17, 2024
Ronald S. Bultje authored
-
- Jun 10, 2024
Nathan E. Egge authored
-
- Jun 05, 2024
Arpad Panyik authored
The DotProd/I8MM horizontal and HV/2D subpel filters use a -4 sampling offset instead of -3 to be better aligned in some cases. This resulted in an out-of-bounds access, which led to crashes. This patch fixes it.
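For context, a scalar sketch of the canonical 8-tap horizontal window (the tap span matches the AV1 filters; the function itself is illustrative): a load that starts at x - 4 instead of x - 3 touches one extra pixel to the left, which must be covered by edge padding or it reads out of bounds at the left border.

    #include <stdint.h>

    /* Illustrative scalar reference: an 8-tap horizontal filter reads
     * src[x - 3] .. src[x + 4]; sampling from src[x - 4] (with the
     * extra tap zeroed) needs one more pixel of left padding. */
    static int filter_h_8tap(const uint8_t *const src, const int x,
                             const int8_t taps[8])
    {
        int sum = 0;
        for (int i = 0; i < 8; i++)
            sum += taps[i] * src[x - 3 + i];
        return sum;
    }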
-
- May 27, 2024
Henrik Gramner authored
-
Henrik Gramner authored
The conditions for when to (re)allocate those buffers are identical, so they can be merged into a single branch. The allocation of the buffers themselves can also be combined to reduce the number of allocation calls.
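A hedged C sketch of the pattern being described; the struct and field names below are invented and the real code differs:

    #include <stdlib.h>

    typedef struct {
        void  *block;          /* one allocation backing both buffers */
        size_t size_a, size_b;
    } BufPair;

    static int ensure_buffers(BufPair *const b,
                              const size_t need_a, const size_t need_b)
    {
        /* The (re)allocation conditions are identical, so a single branch
         * and a single allocation call cover both buffers: buffer A lives
         * at b->block, buffer B at (char *)b->block + need_a. */
        if (b->size_a < need_a || b->size_b < need_b) {
            free(b->block);
            b->block = malloc(need_a + need_b);
            if (!b->block) {
                b->size_a = b->size_b = 0;
                return -1;
            }
            b->size_a = need_a;
            b->size_b = need_b;
        }
        return 0;
    }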
-
Henrik Gramner authored
It's only ever called on data which has already been zero-initialized.
-
Henrik Gramner authored
n_tc is always >= n_fc, so we only need to check the latter.
-
Henrik Gramner authored
-
Henrik Gramner authored
-
Henrik Gramner authored
The amount of macro nesting caused by having to support SSE2 makes the code very difficult to maintain and modify. It is also of questionable value considering that most other asm requires SSSE3.
-
Henrik Gramner authored
-
- May 25, 2024
Jean-Baptiste Kempf authored
-
- May 20, 2024
Use a slightly shorter series of instructions to compute the cdf update rate.
-
Henrik Gramner authored
Error out early instead of producing bogus mismatch errors, for example in the case of an incorrect cpu mask.
-
- May 19, 2024
Martin Storsjö authored
The ldr instruction can take an immediate offset that is a multiple of the loaded element size. If the ldr instruction is given an immediate offset which isn't a multiple of the element size, most assemblers implicitly generate an "ldur" instruction instead. Older versions of MS armasm64.exe don't do this, but instead error out with "error A2518: operand 2: Memory offset must be aligned". (Current versions don't error out, but correctly generate "ldur" implicitly.) Switch this instruction to an explicit "ldur", like we do elsewhere, to fix building with these older tools.
-
- May 18, 2024
NDK 26 dropped support for API versions 19 and 20 (KitKat, Android 4.4). The minimum supported API is now 21 (Lollipop, Android 5.0).
-
- May 14, 2024
Kyle Siefring authored
Changes stem from redesigning the reduction stage of the multisymbol decode function.
* No longer use adapt4 for 5 possible symbol values
* Specialize reduction for 4/8/16 decode functions
* Modify control flow

+------------------------+--------------+--------------+---------------+
|                        | Neoverse V1  | Neoverse N1  | Cortex A72    |
|                        | (Graviton 3) | (Graviton 2) | (Graviton 1)  |
+------------------------+-------+------+-------+------+-------+-------+
|                        |  Old  | New  |  Old  | New  |  Old  |  New  |
+------------------------+-------+------+-------+------+-------+-------+
| decode_bool_neon       | 13.0  | 12.9 | 14.9  | 14.0 | 39.3  | 29.0  |
+------------------------+-------+------+-------+------+-------+-------+
| decode_bool_adapt_neon | 15.4  | 15.6 | 17.5  | 16.8 | 41.6  | 33.5  |
+------------------------+-------+------+-------+------+-------+-------+
| decode_bool_equi_neon  | 11.3  | 12.0 | 14.0  | 12.2 | 35.0  | 26.3  |
+------------------------+-------+------+-------+------+-------+-------+
| decode_hi_tok_c        | 73.7  | 57.8 | 73.4  | 60.5 | 130.1 | 103.9 |
+------------------------+-------+------+-------+------+-------+-------+
| decode_hi_tok_neon     | 63.3  | 48.2 | 65.2  | 51.2 | 119.0 | 105.3 |
+------------------------+-------+------+-------+------+-------+-------+
| decode_symbol_\        | 28.6  | 22.5 | 28.4  | 23.5 | 67.8  | 55.1  |
| adapt4_neon            |       |      |       |      |       |       |
+------------------------+-------+------+-------+------+-------+-------+
| decode_symbol_\        | 29.5  | 26.6 | 29.0  | 28.8 | 76.6  | 74.0  |
| adapt8_neon            |       |      |       |      |       |       |
+------------------------+-------+------+-------+------+-------+-------+
| decode_symbol_\        | 31.6  | 31.2 | 33.3  | 33.0 | 77.5  | 68.1  |
| adapt16_neon           |       |      |       |      |       |       |
+------------------------+-------+------+-------+------+-------+-------+
-
Optimize the widening copy part of subpel filters (the prep_neon function). In this patch we combine widening shifts with widening multiplications in the inner loops to get maximum throughput. The change will increase .text by 36 bytes. Relative performance of micro benchmarks (lower is better):

Cortex-A55:   mct_w4: 0.795x  mct_w8: 0.913x  mct_w16: 0.912x  mct_w32: 0.838x  mct_w64: 1.025x  mct_w128: 1.002x
Cortex-A510:  mct_w4: 0.760x  mct_w8: 0.636x  mct_w16: 0.640x  mct_w32: 0.854x  mct_w64: 0.864x  mct_w128: 0.995x
Cortex-A72:   mct_w4: 0.616x  mct_w8: 0.854x  mct_w16: 0.756x  mct_w32: 1.052x  mct_w64: 1.044x  mct_w128: 0.702x
Cortex-A76:   mct_w4: 0.837x  mct_w8: 0.797x  mct_w16: 0.841x  mct_w32: 0.804x  mct_w64: 0.948x  mct_w128: 0.904x
Cortex-A78:   mct_w16: 0.542x  mct_w32: 0.725x  mct_w64: 0.741x  mct_w128: 0.745x
Cortex-A715:  mct_w16: 0.561x  mct_w32: 0.720x  mct_w64: 0.740x  mct_w128: 0.748x
Cortex-X1:    mct_w32: 0.886x  mct_w64: 0.882x  mct_w128: 0.917x
Cortex-X3:    mct_w32: 0.835x  mct_w64: 0.803x  mct_w128: 0.808x
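A hedged NEON-intrinsics sketch of the instruction mix being described (the real code is hand-written assembly with a different loop structure; this assumes the intermediate value is the 8-bit pixel scaled by 16 and that w is a multiple of 16):

    #include <arm_neon.h>
    #include <stdint.h>

    /* Illustrative only: widen 8-bit pixels to 16 bits scaled by 16,
     * alternating a widening shift (USHLL) with a widening multiply
     * (UMULL) so that both kinds of execution pipes stay busy. */
    static void widen_row(int16_t *const dst, const uint8_t *const src,
                          const int w)
    {
        const uint8x8_t sixteen = vdup_n_u8(16);
        for (int x = 0; x < w; x += 16) {
            const uint8x8_t a = vld1_u8(src + x);
            const uint8x8_t b = vld1_u8(src + x + 8);
            vst1q_s16(dst + x,     vreinterpretq_s16_u16(vshll_n_u8(a, 4)));
            vst1q_s16(dst + x + 8, vreinterpretq_s16_u16(vmull_u8(b, sixteen)));
        }
    }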
-
Save a complex arithmetic instruction in the jump table address calculation of the prep_neon function.
-
Move the BTI landing pads out of the inner loops of the prep_neon function. Only the width=4 and width=8 cases are affected. If BTI is enabled, moving the AARCH64_VALID_JUMP_TARGET out of the inner loops gives better execution speed on Cortex-A510 relative to the original (lower is better):

w4: 0.969x  w8: 0.722x

Out-of-order cores are not affected.
-
- May 13, 2024
Arpad Panyik authored
Optimize the copy part of subpel filters (the put_neon function). For small block sizes (<16) the usage of general purpose registers is usually the best way to do the copy. Relative performance of micro benchmarks (lower is better):

Cortex-A55:   w2: 0.991x  w4: 0.992x  w8: 0.999x  w16: 0.875x  w32: 0.775x  w64: 0.914x  w128: 0.998x
Cortex-A510:  w2: 0.159x  w4: 0.080x  w8: 0.583x  w16: 0.588x  w32: 0.966x  w64: 1.111x  w128: 0.957x
Cortex-A76:   w2: 0.903x  w4: 0.683x  w8: 0.944x  w16: 0.948x  w32: 0.919x  w64: 0.855x  w128: 0.991x
Cortex-A78:   w32: 0.867x  w64: 0.820x  w128: 1.011x
Cortex-A715:  w32: 0.834x  w64: 0.778x  w128: 1.000x
Cortex-X1:    w32: 0.809x  w64: 0.762x  w128: 1.000x
Cortex-X3:    w32: 0.733x  w64: 0.720x  w128: 0.999x
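A hedged scalar illustration of the small-width idea (names invented; the real function is assembly and covers all widths): an 8-pixel-wide 8bpc row fits in a single 64-bit general-purpose register, so a plain ldr/str pair per row can beat the vector path on some in-order cores.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative only: each fixed-size 8-byte memcpy compiles to a
     * single general-purpose load/store of the whole row. */
    static void copy_w8(uint8_t *dst, const uint8_t *src,
                        const ptrdiff_t dst_stride,
                        const ptrdiff_t src_stride, int h)
    {
        do {
            uint64_t row;
            memcpy(&row, src, sizeof(row));
            memcpy(dst, &row, sizeof(row));
            dst += dst_stride;
            src += src_stride;
        } while (--h);
    }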
-
Arpad Panyik authored
Save a complex arithmetic instruction in the jump table address calculation of the put_neon function.
-
Arpad Panyik authored
Move the BTI landing pads out of the inner loops of the put_neon function; the only exception is the width=16 case, where it is already outside of the loops. When BTI is enabled, the relative performance of omitting AARCH64_VALID_JUMP_TARGET from the inner loops on Cortex-A510 is (lower is better):

w2: 0.981x  w4: 0.991x  w8: 0.612x  w32: 0.687x  w64: 0.813x  w128: 0.892x

Out-of-order CPUs are mostly unaffected.
-
Henrik Gramner authored
-
Henrik Gramner authored
Both POSIX and the C standard place several environmental limits on setjmp() invocations: essentially anything beyond comparing the return value with a constant as a simple branch condition is UB. We were previously performing a function call using the setjmp() return value as an argument, which is technically not allowed even though it happened to work correctly in practice. Some systems may loosen those restrictions and allow for more flexible usage, but we shouldn't be relying on that.
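A small C illustration of the constraint (the handler name is made up): the standard only permits the setjmp() result in a few simple forms, such as the entire controlling expression of an if, optionally compared against an integer constant.

    #include <setjmp.h>

    static jmp_buf env;

    void handle_error(int code);   /* hypothetical error handler */

    void run_protected(void)
    {
        /* Not allowed: passing the setjmp() return value to a function
         * is outside the forms permitted by the C standard. */
        /* handle_error(setjmp(env)); */

        /* Allowed: the call is the entire controlling expression of an
         * if statement, compared against an integer constant. */
        if (setjmp(env) != 0) {
            handle_error(1);       /* reached via longjmp(env, 1) */
            return;
        }
        /* ... normal path that may longjmp(env, 1) on error ... */
    }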
-
- May 12, 2024
Removed some unnecessary vector register copies from the initial horizontal filter parts of the HV subpel filters. The performance improvements are better for the smaller filter block sizes. The narrowing shifts at the end of *filter8* were also rewritten, because they were only beneficial for the Cortex-A55 among the DotProd-capable CPU cores; on other out-of-order or newer CPUs the UZP1+SHRN instruction combination is better. Relative performance of micro benchmarks (lower is better):

Cortex-A55:   mct regular w4: 0.980x  mct regular w8: 1.007x  mct regular w16: 1.007x  mct sharp w4: 0.983x  mct sharp w8: 1.012x  mct sharp w16: 1.005x
Cortex-A510:  mct regular w4: 0.935x  mct regular w8: 0.984x  mct regular w16: 0.986x  mct sharp w4: 0.927x  mct sharp w8: 0.983x  mct sharp w16: 0.987x
Cortex-A78:   mct regular w4: 0.974x  mct regular w8: 0.988x  mct regular w16: 0.991x  mct sharp w4: 0.971x  mct sharp w8: 0.987x  mct sharp w16: 0.979x
Cortex-A715:  mct regular w4: 0.958x  mct regular w8: 0.993x  mct regular w16: 0.998x  mct sharp w4: 0.974x  mct sharp w8: 0.991x  mct sharp w16: 0.997x
Cortex-X1:    mct regular w4: 0.983x  mct regular w8: 0.993x  mct regular w16: 0.996x  mct sharp w4: 0.974x  mct sharp w8: 0.990x  mct sharp w16: 0.995x
Cortex-X3:    mct regular w4: 0.953x  mct regular w8: 0.993x  mct regular w16: 0.997x  mct sharp w4: 0.981x  mct sharp w8: 0.993x  mct sharp w16: 0.995x
-