- May 20, 2024
-
-
Henrik Gramner authored
Error out early instead of producing bogus mismatch errors in case of, for example, an incorrect cpu mask.
-
- May 14, 2024
-
-
Kyle Siefring authored
Changes stem from redesigning the reduction stage of the multisymbol decode function.
* No longer use adapt4 for 5 possible symbol values
* Specialize reduction for 4/8/16 decode functions
* Modify control flow

+------------------------+--------------+--------------+---------------+
|                        | Neoverse V1  | Neoverse N1  | Cortex A72    |
|                        | (Graviton 3) | (Graviton 2) | (Graviton 1)  |
+------------------------+-------+------+-------+------+-------+-------+
|                        | Old   | New  | Old   | New  | Old   | New   |
+------------------------+-------+------+-------+------+-------+-------+
| decode_bool_neon       | 13.0  | 12.9 | 14.9  | 14.0 | 39.3  | 29.0  |
+------------------------+-------+------+-------+------+-------+-------+
| decode_bool_adapt_neon | 15.4  | 15.6 | 17.5  | 16.8 | 41.6  | 33.5  |
+------------------------+-------+------+-------+------+-------+-------+
| decode_bool_equi_neon  | 11.3  | 12.0 | 14.0  | 12.2 | 35.0  | 26.3  |
+------------------------+-------+------+-------+------+-------+-------+
| decode_hi_tok_c        | 73.7  | 57.8 | 73.4  | 60.5 | 130.1 | 103.9 |
+------------------------+-------+------+-------+------+-------+-------+
| decode_hi_tok_neon     | 63.3  | 48.2 | 65.2  | 51.2 | 119.0 | 105.3 |
+------------------------+-------+------+-------+------+-------+-------+
| decode_symbol_\        | 28.6  | 22.5 | 28.4  | 23.5 | 67.8  | 55.1  |
| adapt4_neon            |       |      |       |      |       |       |
+------------------------+-------+------+-------+------+-------+-------+
| decode_symbol_\        | 29.5  | 26.6 | 29.0  | 28.8 | 76.6  | 74.0  |
| adapt8_neon            |       |      |       |      |       |       |
+------------------------+-------+------+-------+------+-------+-------+
| decode_symbol_\        | 31.6  | 31.2 | 33.3  | 33.0 | 77.5  | 68.1  |
| adapt16_neon           |       |      |       |      |       |       |
+------------------------+-------+------+-------+------+-------+-------+
-
- May 13, 2024
-
-
Henrik Gramner authored
-
Henrik Gramner authored
Both POSIX and the C standard place several environmental limits on setjmp() invocations, with essentially anything beyond comparing the return value with a constant as a simple branch condition being UB. We were previously performing a function call using the setjmp() return value as an argument, which is technically not allowed even though it happened to work correctly in practice. Some systems may loosen those restrictions and allow for more flexible usage, but we shouldn't be relying on that.
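As a minimal sketch of the restriction described above (not the actual checkasm code), the following shows setjmp() used only as a comparison against a constant in a branch condition. All names here (`recover_point`, `faulty_call`, `well_behaved`, `run_guarded`) are hypothetical and exist only for illustration.

```c
#include <setjmp.h>

/* Hypothetical names, for illustration only. */
jmp_buf recover_point;

/* Stand-in for code under test that bails out through longjmp(). */
void faulty_call(int code) {
    longjmp(recover_point, code);
}

/* Stand-in for code under test that returns normally. */
void well_behaved(int code) {
    (void)code;
}

/* Returns 1 if fn escaped through longjmp(), 0 if it returned normally. */
int run_guarded(void (*fn)(int), int code) {
    /* Conforming: the setjmp() result is compared against a constant and
     * the comparison is the entire controlling expression of the if.
     * By contrast, something like report(setjmp(recover_point)) would pass
     * the result as a function argument, which is outside the contexts
     * the C standard allows. */
    if (setjmp(recover_point) != 0)
        return 1;
    fn(code);
    return 0;
}
```

Calling `run_guarded(faulty_call, 7)` takes the longjmp() path, while `run_guarded(well_behaved, 7)` falls through and returns normally.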
-
- May 10, 2024
-
-
Luca Barbato authored
Will be used to gate code using vec_absd and other useful instructions.
-
- Apr 15, 2024
-
-
Port improvements from the hi token functions to the rest of the symbol adaption functions. These weren't originally ported since they didn't work with arbitrary padding. In practice, zero padding is already used and only the tests need to be updated.

Results - Neoverse N1

Old:
msac_decode_symbol_adapt4_c: 41.4 ( 1.00x)
msac_decode_symbol_adapt4_neon: 31.0 ( 1.34x)
msac_decode_symbol_adapt8_c: 54.5 ( 1.00x)
msac_decode_symbol_adapt8_neon: 32.2 ( 1.69x)
msac_decode_symbol_adapt16_c: 85.6 ( 1.00x)
msac_decode_symbol_adapt16_neon: 37.5 ( 2.28x)

New:
msac_decode_symbol_adapt4_c: 41.5 ( 1.00x)
msac_decode_symbol_adapt4_neon: 27.7 ( 1.50x)
msac_decode_symbol_adapt8_c: 55.7 ( 1.00x)
msac_decode_symbol_adapt8_neon: 30.1 ( 1.85x)
msac_decode_symbol_adapt16_c: 82.4 ( 1.00x)
msac_decode_symbol_adapt16_neon: 35.2 ( 2.34x)
-
- Apr 08, 2024
-
-
Henrik Gramner authored
It was originally disabled due to older meson versions mixing the output of 'meson test -v' from different tests, which made the log difficult to read. Newer versions, however, cache the output from each test as it runs and print it in one contiguous block, so that's no longer an issue.
-
- Apr 02, 2024
-
-
Martin Storsjö authored
On AArch64, the performance counter registers usually are restricted and not accessible from user space. On macOS, we currently use mach_absolute_time() as timer on aarch64. This measures wallclock time but with a very coarse resolution.

There is a private API, kperf, that one can use for getting high precision timers though. Unfortunately, it requires running the checkasm binary as root (e.g. with sudo). Also, as it is a private, undocumented API, it can potentially change at any time.

This is handled by adding a new meson build option, for switching to this timer. If the timer source in checkasm could be changed at runtime with an option, this wouldn't need to be a build time option.

This allows getting benchmarks like this:

mc_8tap_regular_w16_hv_8bpc_c: 1522.1 ( 1.00x)
mc_8tap_regular_w16_hv_8bpc_neon: 331.8 ( 4.59x)

Instead of this:

mc_8tap_regular_w16_hv_8bpc_c: 9.0 ( 1.00x)
mc_8tap...
-
- Mar 04, 2024
-
-
Martin Storsjö authored
-
- Feb 29, 2024
-
-
Henrik Gramner authored
Prints a list of cpuflags available for the current architecture. Flags which are supported on the current system will be printed in green, and flags which are unsupported in red with a ~ prefix.
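A rough sketch of the output format described above, assuming the standard VT100 SGR codes for green (32) and red (31). The names `CpuFlagName`, `format_flag` and `print_flags` are hypothetical, and the exact escape sequences and separators used by checkasm may differ.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical table entry: a printable flag name and its bit in a mask. */
typedef struct {
    const char *name;
    unsigned flag;
} CpuFlagName;

/* Formats one flag into buf: green if supported, red with a '~' prefix
 * if unsupported. Returns the snprintf() result. */
int format_flag(char *buf, size_t size, const char *name, int supported) {
    /* "\x1b[32m" selects green, "\x1b[31m" red, "\x1b[0m" resets. */
    return supported
        ? snprintf(buf, size, "\x1b[32m%s\x1b[0m", name)
        : snprintf(buf, size, "\x1b[31m~%s\x1b[0m", name);
}

/* Prints all flags known for the architecture on one line. */
void print_flags(FILE *f, const CpuFlagName *flags, size_t n,
                 unsigned supported_mask)
{
    for (size_t i = 0; i < n; i++) {
        char buf[64];
        format_flag(buf, sizeof(buf), flags[i].name,
                    (supported_mask & flags[i].flag) != 0);
        fprintf(f, "%s%s", i ? ", " : "", buf);
    }
    fputc('\n', f);
}
```

Keeping the formatting in `format_flag` separate from the printing loop makes the colored output easy to check without capturing a stream.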
-
- Feb 28, 2024
-
-
Add run-time CPU feature detection for DotProd, i8mm, SVE and SVE2. SVE and SVE2 are AArch64-only features.
-
- Feb 26, 2024
-
-
Nathan E. Egge authored
-
- Feb 22, 2024
-
-
Arpad Panyik authored
The 6-tap sub-pel filter specialisation uses different code paths for sharp (8-tap) and regular/smooth (6-tap) filtering kernels. This patch enables benchmarking for the different code paths.
-
- Feb 21, 2024
-
-
Henrik Gramner authored
* Process the entire buffer to get better coverage of eob handling.
* Use a more reasonable buffer size.
* Ignore trailing dif bits to allow for more implementation flexibility.
-
- Feb 18, 2024
-
-
Default to using the number of logical cores divided by thread count.
-
- Jan 31, 2024
-
-
Nathan E. Egge authored
-
Nathan E. Egge authored
-
Nathan E. Egge authored
-
Nathan E. Egge authored
-
Nathan E. Egge authored
inv_txfm_add_4x4_identity_identity_0_8bpc_c: 534.6 ( 1.00x)
inv_txfm_add_4x4_identity_identity_0_8bpc_rvv: 72.2 ( 7.40x)
inv_txfm_add_4x4_identity_identity_1_8bpc_c: 534.7 ( 1.00x)
inv_txfm_add_4x4_identity_identity_1_8bpc_rvv: 72.3 ( 7.40x)
-
- Jan 30, 2024
-
-
Nathan E. Egge authored
-
- Jan 24, 2024
-
-
Henrik Gramner authored
-
Henrik Gramner authored
Only print the paths relative to the argon directory. This avoids excessive terminal line wrapping due to long path names which otherwise interferes with the '\r' usage for progress reporting.
-
- Jan 23, 2024
-
-
Allows explicitly enabling/disabling seek-stress tests.
-
- Jan 21, 2024
-
-
Relative speedup over C code:
msac_decode_bool_c: 0.5 ( 1.00x)
msac_decode_bool_lsx: 0.5 ( 1.09x)
msac_decode_bool_adapt_c: 0.7 ( 1.00x)
msac_decode_bool_adapt_lsx: 0.6 ( 1.20x)
msac_decode_symbol_adapt4_c: 1.3 ( 1.00x)
msac_decode_symbol_adapt4_lsx: 1.0 ( 1.30x)
msac_decode_symbol_adapt8_c: 2.1 ( 1.00x)
msac_decode_symbol_adapt8_lsx: 1.0 ( 2.05x)
msac_decode_symbol_adapt16_c: 3.7 ( 1.00x)
msac_decode_symbol_adapt16_lsx: 0.8 ( 4.77x)
-
Hecai Yuan authored
-
- Jan 11, 2024
-
-
Henrik Gramner authored
Also prefer re-setting the signal handler upon intercept in combination with SA_RESETHAND over re-raising exceptions with the SIG_DFL handler.
-
Henrik Gramner authored
-
- Dec 19, 2023
-
-
- Dec 15, 2023
-
-
Martin Storsjö authored
This was missed in 2ef970a8. Also print this text for EXCEPTION_IN_PAGE_ERROR on Windows.
-
- Nov 12, 2023
-
-
The max_width/max_height values can exceed 16-bit range.
-
- Nov 01, 2023
-
-
Martin Storsjö authored
longjmp on Windows uses SEH to unwind on ARM/ARM64 too, just like on x86_64, thus use RtlCaptureContext/RtlRestoreContext instead of setjmp/longjmp on those architectures as well.
-
-
Some functionality is only available on WINAPI_PARTITION_DESKTOP systems.
-
This allows for the use of standard VT100 escape codes for text coloring, which simplifies things by eliminating a bunch of Windows-specific code. This is only supported since Windows 10. Things will still run on older systems, just without colored text output.
-
-
- Jul 12, 2023
-
-
Matthias Dressel authored
Integrates --bench-c into --bench to simplify benchmarks.
-
- Jul 07, 2023
-
-
Matthias Dressel authored
-
- Jul 06, 2023
-
-
-
Pack two indices into each byte instead of storing them separately. Reduces memory usage by up to 16 kB per sb128 in streams that use screen content tools when frame-threading is enabled, at the cost of some additional computational overhead for packing/unpacking.
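The packing idea can be sketched as follows, assuming each index fits in 4 bits and that the even-numbered index goes in the low nibble. Both the nibble order and the function names `pack_indices` and `unpack_indices` are assumptions made for illustration, not the layout the decoder necessarily uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packs n 4-bit indices from src into (n + 1) / 2 bytes at dst.
 * Even-numbered indices go in the low nibble, odd ones in the high. */
void pack_indices(uint8_t *dst, const uint8_t *src, size_t n) {
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        assert(src[i] < 16 && src[i + 1] < 16);
        dst[i >> 1] = (uint8_t)(src[i] | (src[i + 1] << 4));
    }
    if (i < n) { /* odd count: last byte holds a single index */
        assert(src[i] < 16);
        dst[i >> 1] = src[i];
    }
}

/* Expands n indices from the packed buffer src back into one per byte. */
void unpack_indices(uint8_t *dst, const uint8_t *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        const uint8_t byte = src[i >> 1];
        dst[i] = (uint8_t)((i & 1) ? (byte >> 4) : (byte & 15));
    }
}
```

The packed buffer is half the size of the unpacked one, and the shift and mask on each access is the extra computational overhead the message mentions.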
-