- Sep 30, 2022
-
-
-
Henrik Gramner authored
'-fvisibility=hidden' only applies to definitions, not declarations, so the compiler has to be conservative about how references to global data symbols are performed. Explicitly specifying the visibility allows for better code generation.
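As a rough illustration of the idea above, a declaration can carry the visibility attribute itself, so the compiler knows at every reference that the symbol is local to the library (for example, when building position-independent code it may then address the data directly instead of going through the GOT). The macro and table names below are hypothetical and not taken from the commit; this is only a sketch for GCC/Clang-style compilers.

```c
#include <stdint.h>

/* Hypothetical macro name. Attaches hidden visibility to a *declaration*
 * on GCC/Clang (non-Windows); falls back to plain extern elsewhere. */
#if defined(__GNUC__) && !defined(_WIN32)
#define EXTERN_HIDDEN extern __attribute__((visibility("hidden")))
#else
#define EXTERN_HIDDEN extern
#endif

/* Declaration, as it might appear in a shared header. */
EXTERN_HIDDEN const uint8_t example_table[4];

/* Definition, as it might appear in exactly one .c file. */
const uint8_t example_table[4] = { 1, 2, 4, 8 };

/* Any translation unit that sees the declaration above knows the symbol
 * cannot be preempted by another module. */
uint8_t example_lookup(const unsigned idx) {
    return example_table[idx & 3];
}
```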
-
- Sep 28, 2022
-
-
Whitespace is added to the result when compiling with MSVC using /std:c11, which breaks various things. Adding strip() fixes the problem.
-
-
Use explicit parameter type detection and manually clobber the upper bits instead of relying on internal compiler behavior.
-
- Sep 26, 2022
-
-
The 32-bit width parameter was used directly as a pointer offset, but the upper half is undefined. Fix it by replacing 'cmp' with 'sub' to explicitly zero those bits.
-
- Sep 19, 2022
-
-
Martin Storsjö authored
This fixes conformance with the argon test samples, in particular with these samples:
  profile0_core/streams/test10100_579_8614.obu
  profile0_core/streams/test10218_6914.obu

This gives a pretty notable slowdown to these transforms - some examples:

Before:                                  Cortex A53      A72      A73 Apple M1
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:        365.7    290.2    299.8      0.3
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:     1865.2   1384.1   1457.5      2.6
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:    33976.3  26817.0  24864.2     40.4
After:
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:        397.7    322.2    335.1      0.4
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:     2121.9   1336.7   1664.6      2.6
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:    38569.4  27622.6  28176.0     51.0

Thus, for the transforms alone, it makes them around 10-13% slower (the Apple M1 measurements are too noisy to be conclusive here). Measured on actual full decoding, it makes decoding of 10 bpc Chimera around maybe 1% slower on an Apple M1 - close to measurement noise anyway.
-
Henrik Gramner authored
-
Henrik Gramner authored
Using smaller immediates also results in a small code size reduction in some cases, so apply those changes to the (10bpc-only) SSE code as well.
-
Henrik Gramner authored
Certain clips were incorrectly performed on negated values, which caused things to be off-by-one in both directions. Correct this by negating such values prior to clipping instead of afterwards.
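A small sketch of why the order of negation and clipping matters. The signed 16-bit range used here is an assumption chosen for illustration (the commit does not state the actual clip range), and the function names are hypothetical. Because the range is asymmetric, negating after the clip lands one step outside it on one side and one step short on the other.

```c
#include <stdint.h>

/* Clip to the signed 16-bit range (assumed range, for illustration only). */
static int clip_int16(const int v) {
    return v < INT16_MIN ? INT16_MIN : v > INT16_MAX ? INT16_MAX : v;
}

/* Problematic order: clip first, negate afterwards.
 * The result range becomes [-32767, 32768]. */
static int negate_after_clip(const int v) {
    return -clip_int16(v);
}

/* Corrected order: negate first, then clip.
 * The result range is the intended [-32768, 32767]. */
static int negate_before_clip(const int v) {
    return clip_int16(-v);
}
```

For an input of -40000 the first variant yields 32768 and the second 32767; for 40000 they yield -32767 and -32768 respectively.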
-
- Sep 15, 2022
-
-
Martin Storsjö authored
Since meson 0.58.0 (released in May 2021), meson accepts adding '.S' assembly files as source files to the clang-cl compiler. If using an older version of meson, keep using gas-preprocessor just like for MSVC builds.
-
David Conrad authored
Section 7.9.2 returns 0 "If RefMiRows[ srcIdx ] is not equal to MiRows, RefMiCols[ srcIdx ] is not equal to MiCols". dav1d was comparing pixel width/height, not block width/height, so conform with the spec.
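To illustrate the difference, the sketch below compares two frame widths both in pixels and in 4x4 block units. It assumes the AV1 spec's derivation MiCols = 2 * ((frame_width + 7) >> 3); the function names are hypothetical and are not dav1d's.

```c
/* Number of 4x4 block units covering `pixels`, rounded up to a multiple of
 * 8 pixels (assumed to match the spec's MiCols/MiRows derivation). */
static int mi_units(const int pixels) {
    return 2 * ((pixels + 7) >> 3);
}

/* Comparison in pixels: differs for e.g. 1913 vs 1920. */
static int same_size_in_pixels(const int a, const int b) {
    return a == b;
}

/* Comparison in block units, as the quoted spec text requires. */
static int same_size_in_blocks(const int a, const int b) {
    return mi_units(a) == mi_units(b);
}
```

Widths of 1913 and 1920 pixels differ as pixel counts but both map to 480 block units, so the two comparisons disagree for such a pair.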
-
David Conrad authored
Individual OBMC lapped predictions have a max width of 64 pixels for the top lap and a max height of 64 pixels for the left laps. This is 7.11.3.9. Overlapped motion compensation process: step4 = Clip3( 2, 16, Num_4x4_Blocks_Wide[ candSz ] ). dav1d wasn't clipping this as needed, which means that with scaled MC, the interpolation of the 2nd half of a 128 block was incorrect, since mx/my for subpel filter selection need to be reset at the 64 pixel boundary.
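The clipping step quoted above can be sketched as follows. The helper names are hypothetical; the input is the candidate block width in 4x4 units, so a 128-pixel-wide block corresponds to 32 units.

```c
/* Clip3(lo, hi, x) as used in the spec: clamp x to [lo, hi]. */
static int clip3(const int lo, const int hi, const int x) {
    return x < lo ? lo : x > hi ? hi : x;
}

/* step4 = Clip3(2, 16, Num_4x4_Blocks_Wide[candSz]).
 * The upper bound of 16 units is 64 pixels, so a 128-wide candidate is
 * predicted as two separate 64-pixel laps. */
static int obmc_step4(const int num_4x4_blocks_wide) {
    return clip3(2, 16, num_4x4_blocks_wide);
}
```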
-
David Conrad authored
This is the parameter combination: num_y_points == 0 && num_cb_points == 0 && num_cr_points == 0 && chroma_scaling_from_luma == 1 && clip_to_restricted_range == 1. Film grain application has two effects: adding noise, and optionally clipping to video range. For luma, the spec skips film grain application if there's no noise (num_y_points == 0), but for chroma, it's only skipped if there's no chroma noise *and* chroma_scaling_from_luma is false. This means it's possible for there to be no noise (num_*_points = 0), but if clip_to_restricted_range is true then chroma pixels can be clipped to video range, if chroma_scaling_from_luma is true. Luma pixels, however, aren't clipped to video range unless there's noise to apply. dav1d currently skips applying film grain entirely if there is no noise, regardless of the secondary clipping.
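The skip conditions described above could be expressed roughly like this. The struct and function names are hypothetical and only mirror the parameters named in the commit message; this is not dav1d's actual data structure.

```c
/* Hypothetical subset of the film grain parameters. */
typedef struct {
    int num_y_points;
    int num_uv_points[2];          /* [0] = cb, [1] = cr */
    int chroma_scaling_from_luma;
    int clip_to_restricted_range;
} ExampleFilmGrain;

/* Luma is processed only if there is luma noise. */
static int example_luma_applies(const ExampleFilmGrain *const fg) {
    return fg->num_y_points > 0;
}

/* Chroma is skipped only if there is no chroma noise *and*
 * chroma_scaling_from_luma is false. */
static int example_chroma_applies(const ExampleFilmGrain *const fg,
                                  const int pl)
{
    return fg->num_uv_points[pl] > 0 || fg->chroma_scaling_from_luma;
}

/* Film grain can be skipped entirely only if no plane is processed. */
static int example_can_skip(const ExampleFilmGrain *const fg) {
    return !example_luma_applies(fg) &&
           !example_chroma_applies(fg, 0) &&
           !example_chroma_applies(fg, 1);
}
```

With the parameter combination from the message, luma is skipped but both chroma planes are still processed, so clipping to restricted range can still change chroma pixels.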
-
David Conrad authored
The syntax of itu_t_t35_payload_bytes is not defined in the AV1 specification, but it does state that decoders should ignore the entire OBU if they do not understand it.
-
David Conrad authored
In section 5.11.34, txSz is always defined to be TX_4X4 if Lossless is true. The chroma deblock filter size calculation needs to use this overridden txSz when lossless is enabled.
-
David Conrad authored
The spec divides err by two, rounding towards 0, instead of using >>1, which rounds towards negative infinity.
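The two roundings only differ for negative odd values, which a short sketch makes visible. The function names are hypothetical. Note that right-shifting a negative signed integer is implementation-defined in C; the sketch assumes the arithmetic shift that mainstream compilers provide.

```c
/* Division by two: C integer division truncates towards zero. */
static int half_towards_zero(const int err) {
    return err / 2;
}

/* Shift by one: rounds towards negative infinity, assuming an arithmetic
 * right shift for negative values (implementation-defined in C). */
static int half_towards_neg_inf(const int err) {
    return err >> 1;
}
```

For err = -3 the division gives -1 while the shift gives -2; for non-negative or even inputs both agree.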
-
David Conrad authored
It's possible to encode a large coefficient that becomes 0 after the clipping in dequant (Abs( dq ) & 0xFFFFFF), e.g. 0x1000000. After that & 0xFFFFFF, coeffs are saturated to the range [-(1 << (bitdepth+7)), 1 << (bitdepth+7)). dav1d implements this saturation via umin(dq - sign, cf_max), then applies the sign afterwards via xor. However, for dq = 0 and sign = 1, this step evaluates to umin(UINT_MAX, cf_max) == cf_max instead of the expected 0. So instead, do the unsigned saturation as umin(dq, cf_max + sign), then apply the sign via (sign ? -dq : dq). On arm this is the same number of instructions, since cneg exists and is used. On x86 this requires an additional instruction, but this isn't a latency-critical path.
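The difference between the two saturation schemes can be sketched in plain C. This is a simplified illustration rather than dav1d's actual code: the function names are hypothetical, and cf_max is assumed to be (1 << (bitdepth + 7)) - 1, which matches the range given above. The unsigned-to-int conversions assume a two's complement target.

```c
static unsigned umin(const unsigned a, const unsigned b) {
    return a < b ? a : b;
}

/* Previous scheme: saturate (dq - sign), then apply the sign via xor.
 * For dq == 0 and sign == 1, dq - sign wraps around to UINT_MAX. */
static int saturate_old(const unsigned dq, const unsigned sign,
                        const unsigned cf_max)
{
    return (int)(umin(dq - sign, cf_max) ^ -sign);
}

/* Corrected scheme: saturate dq against (cf_max + sign), then negate
 * conditionally. No wraparound can occur for dq == 0. */
static int saturate_new(const unsigned dq, const unsigned sign,
                        const unsigned cf_max)
{
    const unsigned sat = umin(dq, cf_max + sign);
    return sign ? -(int)sat : (int)sat;
}
```

With cf_max = 32767 (8 bpc under the assumption above), both variants agree for nonzero dq, but for dq = 0 and sign = 1 the old variant returns -32768 while the new one returns 0.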
-
David Conrad authored
In 8-bit adst, it's possible that the final Round2(x[0], 12) can exceed 16-bits signed. Specifically, in 7.13.2.6. Inverse ADST4 process, the precision requirement is: "It is a requirement of bitstream conformance that all values stored in the s and x arrays by this process are representable by a signed integer using r + 12 bits of precision." For 8 bits, r is 16 for both row and column, so x[] can be 28-bit signed. For values [134215680, 134217727] (within 2047 of the maximum 28-bit value), the final Round2(x[0], 12) evaluates to 32768, exceeding 16-bits signed. So switch to using sqrshrn, which saturates to 16-bits signed. This is a continuation of: Commit b53ff29d arm: itx: Do clipping in all narrowing downshifts
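The overflow case can be reproduced with scalar arithmetic. The sketch below models a rounding right shift with and without saturation to 16 bits; it is a scalar stand-in for the behaviour described, not the NEON code itself, and the function names are hypothetical.

```c
#include <stdint.h>

/* Round2(x, n) for n >= 1: add half of the divisor, then shift right. */
static int32_t round2(const int32_t x, const int n) {
    return (x + (1 << (n - 1))) >> n;
}

/* Rounding shift followed by saturation to the signed 16-bit range,
 * modelling what a saturating narrowing shift does per lane. */
static int16_t round2_sat16(const int32_t x, const int n) {
    const int32_t r = round2(x, n);
    return (int16_t)(r > INT16_MAX ? INT16_MAX :
                     r < INT16_MIN ? INT16_MIN : r);
}
```

For x = 134215680 the plain rounding shift by 12 gives 32768, which does not fit in 16 bits signed, while the saturating variant gives 32767.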
-
- Sep 14, 2022
-
-
Martin Storsjö authored
Previously, they could be allocated with any random alignment matching the end of the MuxerContext/DemuxerContext. The priv structs themselves can have members that require specific alignment, or at least the default alignment of malloc()/calloc() (which is sufficient for native types such as uint64_t and doubles). This fixes crashes in some arm builds, where GCC (correctly) wants to use 64 bit aligned stores to write to MD5Context.
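One way to give a trailing private struct a well-defined alignment is to round its offset up before placing it after the context. The sketch below is hypothetical (the struct, function names and the choice of max_align_t as the target alignment are assumptions, not the actual tools code), and it relies on calloc() returning memory suitably aligned for any fundamental type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical context whose size is not a multiple of the alignment. */
typedef struct {
    char tag[5];
} ExampleContext;

/* Round n up to a multiple of a, where a is a power of two. */
static size_t align_up(const size_t n, const size_t a) {
    assert(a && !(a & (a - 1)));
    return (n + a - 1) & ~(a - 1);
}

/* Allocate the context and its private data in one block, with the private
 * data starting at an offset aligned like a malloc()ed pointer. */
static ExampleContext *example_alloc(const size_t priv_size, void **const priv)
{
    const size_t off = align_up(sizeof(ExampleContext),
                                _Alignof(max_align_t));
    ExampleContext *const c = calloc(1, off + priv_size);
    if (!c) return NULL;
    *priv = (uint8_t *)c + off;
    return c;
}
```

The caller frees the whole block through the returned context pointer; the private pointer must not be freed separately.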
-
- Sep 12, 2022
-
-
Henrik Gramner authored
-
- Sep 10, 2022
-
-
Henrik Gramner authored
-
- Sep 09, 2022
-
-
Henrik Gramner authored
Increase the probing size, and change the logic to assume a stream is valid even if no conclusive decision could be made within the probing window as long as a sequence header was detected.
-
Matthias Dressel authored
Allow checkasm to run.
-
Matthias Dressel authored
It is now handled by the gitlab runner. Ref: 7d859f9c
-
Matthias Dressel authored
-
Matthias Dressel authored
* Android armv7: target API 19 since it's the lowest directly provided by the new NDK.
* Newer NDK has generic tools for ar, strip, etc.
* Remove windres as it's only relevant for Windows targets.
-
Matthias Dressel authored
Remove experimental since gcc12, clang14, mold are now in unstable.
-
- Sep 08, 2022
-
-
Victorien Le Couviour--Tuffet authored
Store the used size instead of the allocated size. The used size can be smaller than the allocated size, which results in a wrong computation of the linear progress from the frame_progress bitfield.
-
Henrik Gramner authored
The width parameter is used directly as a pointer offset, so ensure that it has an appropriately sized data type. This has been done previously for luma, but chroma was overlooked.
-
- Sep 07, 2022
-
-
-
We don't have a separate 8-bit AVX-512 5-tap Wiener filter so the 7-tap function is used for chroma as well, and in some esoteric edge cases chroma dst pointers may only have a 32-byte alignment despite having a width larger than 32, so use an unaligned store as a workaround.
-
- Sep 02, 2022
-
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
The pattern matching feature has been improved and is now performed under the new --function parameter, rendering this one obsolete.
-
Victorien Le Couviour--Tuffet authored
Allows running checkasm only for functions matching a given pattern.
-
- Aug 30, 2022
-
-
Victorien Le Couviour--Tuffet authored
The copy_lpf_progress bitfield might not be fully cleared when size goes down. Credit to Oss-Fuzz.
-
- Aug 19, 2022
-
-
James Almer authored
Fixes a regression since commit 3d3c51a0.
-
- Jul 25, 2022
-
-
Henrik Gramner authored
The code size increase of inlining every call to certain functions isn't a worthwhile trade-off, and most compilers actually end up overriding those particular inlining hints anyway. In some cases it's also better to split the function into separate luma and chroma functions.
-
- Jul 19, 2022
-
-
In 0aca76c3, sequences of pand/pandn/por were replaced by pblendvb, but one instruction (which now acts as a no-op) was accidentally left in.
-