- May 16, 2020
-
-
Jean-Baptiste Kempf authored
-
- May 15, 2020
-
-
Henrik Gramner authored
Add code to check that a function doesn't accidentally overwrite anything in the area located just above the current stack frame.
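The same canary idea can be sketched in plain C on an ordinary buffer. This is only an analogue: the check described above concerns the real stack and cannot be written portably in C. All names here (`guard_intact_after`, `GUARD_BYTE`, the two sample functions) are hypothetical and not taken from checkasm.

```c
#include <stddef.h>
#include <string.h>

#define WORK_MAX   256   /* largest work area this sketch supports */
#define GUARD_SIZE 16    /* bytes of canary placed right after the work area */
#define GUARD_BYTE 0xA5  /* arbitrary pattern, assumed unlikely to be written by accident */

typedef void (*work_fn)(unsigned char *buf, size_t len);

/* Runs fn on a work area of len bytes followed by a guard region.
 * Returns 1 if the guard is untouched afterwards, 0 otherwise. */
static int guard_intact_after(work_fn fn, size_t len) {
    unsigned char area[WORK_MAX + GUARD_SIZE];
    if (len > WORK_MAX)
        return 0;
    memset(area, 0, len);
    memset(area + len, GUARD_BYTE, GUARD_SIZE);
    fn(area, len);
    for (size_t i = 0; i < GUARD_SIZE; i++)
        if (area[len + i] != GUARD_BYTE)
            return 0;
    return 1;
}

/* Stays inside its area. */
static void well_behaved(unsigned char *buf, size_t len) {
    memset(buf, 1, len);
}

/* Deliberately writes 4 bytes too many (still inside the backing array). */
static void overruns(unsigned char *buf, size_t len) {
    memset(buf, 1, len + 4);
}
```

Calling `guard_intact_after(well_behaved, 64)` reports an intact guard, while `guard_intact_after(overruns, 64)` reports a clobbered one.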
-
Marvin Scholz authored
-
Marvin Scholz authored
This allows selecting at runtime if placebo should use OpenGL or Vulkan for rendering.
-
Marvin Scholz authored
-
Marvin Scholz authored
-
Marvin Scholz authored
-
Marvin Scholz authored
-
- May 14, 2020
-
-
Marvin Scholz authored
-
Marvin Scholz authored
-
Marvin Scholz authored
To un-clutter the main dav1dplay.c, move the fifo to its own file and header.
-
Martin Storsjö authored
If the maximum number of arguments (currently 15) were changed to an even number, and a function actually took the full number of arguments, the checked spot on the stack would be at the same place where we store an inverted copy of it. We already allocate enough space for two values, though (for stack alignment purposes, 16 bytes on arm64 and 8 bytes on arm32), so by storing the reference in the upper half of this space, the lower half works as a canary and isn't overwritten.
-
Martin Storsjö authored
-
Martin Storsjö authored
-
- May 13, 2020
-
-
Use 'unsigned' instead of 'unsigned int' for consistency. Add 'const' to a few variables. Make proper use of C99 features.
-
Also skip the AVX warmup.
-
If functions return a float value, this value is stored in this register.
-
Martin Storsjö authored
We should just use a normal bl here, and the linker will add the 'x' bit if necessary. This fixes calling checkasm_fail_func on Windows, where the code is built in thumb mode (and the linker doesn't clear the 'x' bit in the blx instruction).
-
- May 12, 2020
-
-
-
-
* The build from 'build-debian' is reused. 'logging' is not disabled since that would trigger an almost full rebuild.
* All ASM tests are merged into one job which is expected to fail only rarely; ease of debugging is therefore traded for efficiency.
-
-
-
Martin Storsjö authored
When benchmarking, the functions are called with a fixed width of 64x32 or 32x16, while the test itself is run with a random size in the range up to 128x32. In 16 bpc mode, the source pixels must be within the valid range, because they otherwise cause accesses out of bounds in the scaling array.
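One way to keep randomly generated source pixels inside the valid range is to mask them with the maximum value for the bit depth. The sketch below assumes that approach; the helper names and the small linear congruential generator are hypothetical and are not taken from the test code.

```c
#include <stdint.h>

/* Tiny LCG, used here only to make the sketch self-contained. */
static uint32_t next_random(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Fills count pixels with random values limited to [0, (1 << bitdepth) - 1]. */
static void fill_pixels(uint16_t *dst, int count, int bitdepth, uint32_t *state) {
    const unsigned bitdepth_max = (1u << bitdepth) - 1;
    for (int i = 0; i < count; i++)
        dst[i] = (uint16_t)(next_random(state) & bitdepth_max);
}

/* Returns 1 if every pixel is a valid value for the given bit depth. */
static int all_pixels_valid(const uint16_t *src, int count, int bitdepth) {
    const unsigned bitdepth_max = (1u << bitdepth) - 1;
    for (int i = 0; i < count; i++)
        if (src[i] > bitdepth_max)
            return 0;
    return 1;
}
```

With a buffer filled this way, a pixel value used as an index into a table sized for the bit depth stays in bounds regardless of the block size used for benchmarking.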
-
- May 11, 2020
-
-
Also avoid integer overflows by using 64-bit intermediate precision.
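As a generic illustration of the technique (the function below is hypothetical and not the code changed by this commit), widening one operand to 64 bits before multiplying keeps the intermediate product from overflowing a 32-bit integer:

```c
#include <stdint.h>

/* Scales value by num/den. The product is computed in 64 bits, so it cannot
 * overflow even when value * num exceeds the 32-bit range.
 * Assumes den != 0 and that the final result fits in 32 bits. */
static int32_t scale_32(int32_t value, int32_t num, int32_t den) {
    return (int32_t)(((int64_t)value * num) / den);
}
```

For instance, `scale_32(100000, 100000, 100000)` involves an intermediate product of 10^10, which does not fit in 32 bits, yet the result is the expected 100000.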
-
- May 10, 2020
-
-
Allows for macro-op fusion.
-
Eliminates the x86-64 check from the meson configuration file to be consistent with how other x86-64-exclusive code is handled.
-
Allows for constant propagation and tail call elimination in the msac initialization, which is performed in each tile.
-
Utilize the unsigned representation of a signed integer to skip the refill code if the count was already negative to begin with, which saves a few clock cycles at the end of each tile.
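The trick can be sketched as follows, assuming a bit counter `cnt` from which `d` bits are consumed per step and a refill that is only wanted when the counter crosses from non-negative to negative. Both function names are hypothetical, and `d` is assumed to be non-negative.

```c
/* Signed check: true whenever the count is negative after consuming d bits,
 * including the case where it was already negative beforehand. */
static int needs_refill_signed(int cnt, int d) {
    return cnt - d < 0;
}

/* Unsigned check: a negative cnt converts to a very large unsigned value,
 * so the comparison is only true when 0 <= cnt < d, i.e. when this
 * subtraction is the one that makes the count negative. */
static int needs_refill_unsigned(int cnt, int d) {
    return (unsigned)cnt < (unsigned)d;
}
```

Both checks agree while the count is non-negative; they differ only once it has already gone negative, where the unsigned form skips the refill with a single comparison.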
-
Martin Storsjö authored
Add an element size specifier to the existing individual transform functions for 8 bpc, naming them e.g. inv_dct_8h_x8_neon, to clarify that they operate on input vectors of 8h, and make the symbols public, to let the 10 bpc case call them from a different object file. The same convention is used in the new itx16.S, like inv_dct_4s_x8_neon.

Make the existing itx.S compiled regardless of whether 8 bpc support is enabled. For builds with 8 bpc support disabled, this does include the unused frontend functions, but this is hopefully tolerable to avoid having to split the file into a sharable file for transforms and a separate one for frontends.

This only implements the 10 bpc case, as that case can use transforms operating on 16 bit coefficients in the second pass.

Relative speedup vs C for a few functions:

                                  Cortex     A53     A72     A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:      4.14    4.06    4.49
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:      6.51    6.49    6.42
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:      5.02    4.63    6.23
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:      8.54    7.13   11.96
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:    5.52    6.60    8.03
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:   11.27    9.62   12.22
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:    9.60    6.97    8.59
inv_txfm_add_32x32_dct_dct_0_10bpc_neon:    2.60    3.48    3.19
inv_txfm_add_32x32_dct_dct_1_10bpc_neon:   14.65   12.64   16.86
inv_txfm_add_32x32_dct_dct_2_10bpc_neon:   11.57    8.80   12.68
inv_txfm_add_32x32_dct_dct_3_10bpc_neon:    8.79    8.00    9.21
inv_txfm_add_32x32_dct_dct_4_10bpc_neon:    7.58    6.21    7.80
inv_txfm_add_64x64_dct_dct_0_10bpc_neon:    2.41    2.85    2.75
inv_txfm_add_64x64_dct_dct_1_10bpc_neon:   12.91   10.27   12.24
inv_txfm_add_64x64_dct_dct_2_10bpc_neon:   10.96    7.97   10.31
inv_txfm_add_64x64_dct_dct_3_10bpc_neon:    8.95    7.42    9.55
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:    7.97    6.12    7.82
-
Martin Storsjö authored
This matches what is done in C by -fvisibility=hidden. This avoids issues with relocations against other symbols exported from another assembly file.
-
Martin Storsjö authored
-
Martin Storsjö authored
-
Martin Storsjö authored
-
Martin Storsjö authored
Before this, we never did the early exit from the first pass.

Before:                           Cortex      A53      A72      A73
inv_txfm_add_64x16_dct_dct_1_8bpc_neon:    7275.7   5198.3   5250.9
inv_txfm_add_64x16_dct_dct_2_8bpc_neon:    7276.1   5197.0   5251.3
inv_txfm_add_64x16_dct_dct_3_8bpc_neon:    7275.8   5196.2   5254.5
inv_txfm_add_64x16_dct_dct_4_8bpc_neon:    7273.6   5198.8   5254.2
After:
inv_txfm_add_64x16_dct_dct_1_8bpc_neon:    5187.8   3763.8   3735.0
inv_txfm_add_64x16_dct_dct_2_8bpc_neon:    7280.6   5185.6   5256.3
inv_txfm_add_64x16_dct_dct_3_8bpc_neon:    7270.7   5179.8   5250.3
inv_txfm_add_64x16_dct_dct_4_8bpc_neon:    7271.7   5212.4   5256.4

The other related variants didn't have this bug and properly exited early when possible.
-
Martin Storsjö authored
Unify some loads and stores, avoiding some extra pointer moving.
-
Martin Storsjö authored
This gives a speedup of a couple of cycles.
-
Martin Storsjö authored
-
Martin Storsjö authored
This isn't used for a sqrdmulh in its current form here. The one left in idct_coeffs[1] isn't used within the idct itself, but inv_txfm_horz_scale_dct_32x8 relies on it being left there for use with sqrdmulh scaling later.
-
Martin Storsjö authored
These cases were removed from x86 to save space and simplify the code in e0b88bd2, as those cases were essentially unused in real world bitstreams.
-