Commits · master · Sebastian Dröge / dav1d

Sep 30, 2022
- x86: Add 10-bit 8x8/8x16/16x8/16x16 itx AVX-512 (Ice Lake) asm · cac76e4b
  Henrik Gramner authored 2 years ago and Henrik Gramner committed 2 years ago
  
  cac76e4b
- Specify hidden visibility for global data symbol declarations · e4c4af02
  Henrik Gramner authored 2 years ago
```
'-fvisibility=hidden' only applies to definitions, not declarations,
so the compiler has to be conservative about how references to global
data symbols are performed.

Explicitly specifying the visibility allows for better code generation.
```
  e4c4af02
Sep 28, 2022
- build: strip() the result of cc.get_define() · 58c856b7
  Henrik Gramner authored 2 years ago and Henrik Gramner committed 2 years ago
```
Whitespace is added to the result if compiling with MSVC using /std:c11
which breaks various things. Adding strip() fixes the problem.
```
  58c856b7
- checkasm: Move printf format string to .rodata on x86 · 0b0b5fbf
  Henrik Gramner authored 2 years ago and Henrik Gramner committed 2 years ago
  
  0b0b5fbf
- checkasm: Improve 32-bit parameter clobbering on x86-64 · 6fefa6a5
  Henrik Gramner authored 2 years ago and Henrik Gramner committed 2 years ago
```
Use explicit parameter type detection and manually clobber the
upper bits instead of relying on internal compiler behavior.
```
  6fefa6a5
Sep 26, 2022

x86: Fix incorrect 32-bit parameter usage in high bit-depth AVX-512 mc · 8349845c

Henrik Gramner authored 2 years ago and Henrik Gramner committed 2 years ago

The 32-bit width parameter was used directly as a pointer offset, but
the upper half is undefined. Fix it by replacing 'cmp' with 'sub' to
explicitly zero those bits.

8349845c

Sep 19, 2022

arm: itx: Add clipping to row_clip_min/max in the 10 bpc codepaths · 345127a7

Martin Storsjö authored 2 years ago

This fixes conformance with the argon test samples, in particular
with these samples:
    profile0_core/streams/test10100_579_8614.obu
    profile0_core/streams/test10218_6914.obu

This gives a pretty notable slowdown to these transforms - some
examples:

Before:                                 Cortex A53       A72       A73    Apple M1
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:       365.7     290.2     299.8    0.3
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:    1865.2    1384.1    1457.5    2.6
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:   33976.3   26817.0   24864.2   40.4
After:
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:       397.7     322.2     335.1    0.4
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:    2121.9    1336.7    1664.6    2.6
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:   38569.4   27622.6   28176.0   51.0

Thus, for the transforms alone, it makes them around 10-13% slower
(the Apple M1 measurements are too noisy to be conclusive here).

Measured on actual full decoding, it makes decoding of 10 bpc
Chimera roughly 1% slower on an Apple M1, which is close to
measurement noise anyway.

345127a7

x86: Fix overflows in 12bpc AVX2 IDCT/IADST · 9c74a9b0
Henrik Gramner authored 2 years ago

9c74a9b0

x86: Fix overflows in 12bpc AVX2 DC-only IDCT · 49b1c3c5

Henrik Gramner authored 2 years ago

Using smaller immediates also results in a small code size reduction in
some cases, so apply those changes to the (10bpc-only) SSE code as well.

49b1c3c5

x86: Fix clipping in high bit-depth AVX2 4x16 IDCT · 0c8a3461

Henrik Gramner authored 2 years ago

Certain clips were incorrectly performed on negated values, which
caused things to be off-by-one in both directions. Correct this by
negating such values prior to clipping instead of afterwards.

0c8a3461
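The off-by-one described in this commit can be illustrated with a small sketch. The 16-bit clip range and the helper names are assumptions chosen for illustration and are not taken from the dav1d asm. The point is only that a signed two's-complement range is asymmetric, so clipping a negated value and negating the result is not the same as negating first and then clipping.

```python
# Hypothetical signed clip range; two's-complement ranges are asymmetric.
CLIP_MIN, CLIP_MAX = -32768, 32767

def clip(v):
    """Clamp v into [CLIP_MIN, CLIP_MAX]."""
    return max(CLIP_MIN, min(CLIP_MAX, v))

def negate_after_clip(neg_v):
    # Problematic order: clip the stored (negated) value, then negate.
    # The result lies in [-CLIP_MAX, -CLIP_MIN], one off at both ends.
    return -clip(neg_v)

def negate_before_clip(neg_v):
    # Corrected order: recover the real value first, then clip it.
    return clip(-neg_v)
```

With a stored value of -40000, `negate_after_clip` yields 32768 (one above the maximum), while `negate_before_clip` yields 32767.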

Sep 15, 2022

Don't use gas-preprocessor with clang-cl for arm targets · cc9651f5

Martin Storsjö authored 3 years ago

Since meson 0.58.0 (released in May 2021), meson accepts adding '.S'
assembly files as source files to the clang-cl compiler.

If using an older version of meson, keep using gas-preprocessor
just like for MSVC builds.

cc9651f5

Fix checking the reference dimensions for the projection process · d4a2b75d

David Conrad authored 2 years ago

Section 7.9.2 returns 0 "If RefMiRows[ srcIdx ] is not equal to MiRows,
RefMiCols[ srcIdx ] is not equal to MiCols"

dav1d was comparing pixel width/height rather than block width/height,
so compare block dimensions to conform with the spec

d4a2b75d
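The distinction matters because two frames can differ in pixel size yet have the same size in 4x4 block units. A minimal sketch, assuming the spec derives MiCols as `2 * ((FrameWidth + 7) >> 3)` (and MiRows likewise from the height); the function names are hypothetical and do not mirror dav1d's code.

```python
def mi_units(pixels):
    """Size in 4x4 block units, after rounding up to a multiple of 8 pixels."""
    return 2 * ((pixels + 7) >> 3)

def ref_matches_in_mi_units(ref_w, ref_h, cur_w, cur_h):
    # Compare block dimensions (MiCols/MiRows), not pixel dimensions.
    return (mi_units(ref_w) == mi_units(cur_w) and
            mi_units(ref_h) == mi_units(cur_h))
```

For instance, widths of 1913 and 1920 pixels both map to 480 block units, so a pixel comparison would reject a reference that the block comparison accepts.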

Fix calculation of OBMC lap dimensions · eb25f00c

David Conrad authored 2 years ago

Individual OBMC lapped predictions have a max width of 64 pixels
for the top lap and have a max height of 64 for the left laps

This is 7.11.3.9. Overlapped motion compensation process
step4 = Clip3( 2, 16, Num_4x4_Blocks_Wide[ candSz ] )

dav1d wasn't clipping this as needed, which means that with scaled MC, the
interpolation of the 2nd half of a 128 block was incorrect, since mx/my
for subpel filter selection need to be reset at the 64 pixel boundary

eb25f00c

Support film grain application whose only effect is clipping to video range · 10f5ce54

David Conrad authored 2 years ago

This is the parameter combination:
num_y_points == 0 && num_cb_points == 0 && num_cr_points == 0 &&
chroma_scaling_from_luma == 1 && clip_to_restricted_range == 1

Film grain application has two effects: adding noise, and optionally
clipping to video range

For luma, the spec skips film grain application if there's no noise
(num_y_points == 0), but for chroma, it's only skipped if there's no
chroma noise *and* chroma_scaling_from_luma is false

This means it's possible for there to be no noise (num_*_points = 0), but
if clip_to_restricted_range is true then chroma pixels can be clipped to
video range, if chroma_scaling_from_luma is true. Luma pixels, however,
aren't clipped to video range unless there's noise to apply.
dav1d currently skips applying film grain entirely if there is no noise,
regardless of the secondary clipping.

10f5ce54
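The skip rule described in this commit message can be sketched as a pair of predicates. This is an illustration of the rule as stated above, not dav1d's implementation; the function name and the tuple return value are assumptions.

```python
def planes_to_process(num_y_points, num_cb_points, num_cr_points,
                      chroma_scaling_from_luma):
    """Return (luma, cb, cr) flags: True if film grain application runs."""
    # Luma is skipped whenever there is no luma noise.
    luma = num_y_points > 0
    # Chroma is skipped only if there is no chroma noise *and*
    # chroma_scaling_from_luma is false.
    cb = num_cb_points > 0 or bool(chroma_scaling_from_luma)
    cr = num_cr_points > 0 or bool(chroma_scaling_from_luma)
    return luma, cb, cr
```

For the parameter combination in question (all point counts 0, `chroma_scaling_from_luma == 1`), this gives `(False, True, True)`: chroma still goes through film grain application, where `clip_to_restricted_range` can take effect, while luma is left untouched.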

Ignore T.35 metadata if the OBU contains no payload · 673ee248

David Conrad authored 2 years ago

The syntax of itu_t_t35_payload_bytes is not defined in the AV1
specification, but it does state that decoders should ignore the
entire OBU if they do not understand it.

673ee248

Fix chroma deblock filter size calculation for lossless · 2152826b

David Conrad authored 2 years ago

In section 5.11.34 txSz is always defined to TX_4X4 if Lossless is true

Chroma deblock filter size calculation needs to use this overridden txSz
when lossless is enabled

2152826b

Fix rounding in the calculation of initialSubpelX · e202fa08
David Conrad authored 2 years ago
```
The spec divides err by two, rounding to 0, instead of >>1,
which rounds towards negative infinity
```
e202fa08
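The two ways of halving only differ for negative odd values. A minimal sketch with hypothetical helper names:

```python
def half_toward_zero(err):
    """Divide by two, rounding toward zero (the behaviour the spec calls for)."""
    return -((-err) // 2) if err < 0 else err // 2

def half_floor(err):
    """Arithmetic shift right by one, which rounds toward negative infinity."""
    return err >> 1
```

For `err = -3`, `half_toward_zero` gives -1 whereas `half_floor` gives -2; for non-negative or even inputs the two agree.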

Fix overflow when saturating dequantized coefficients clipped to 0 · ee98592b

David Conrad authored 2 years ago

It's possible to encode a large coefficient that becomes 0 after
the clipping in dequant (Abs( dq ) & 0xFFFFFF), e.g. 0x1000000
After that &0xFFFFFF, coeffs are saturated in the range of
[-(1 << (bitdepth+7)), 1 << (bitdepth+7))

dav1d implements this saturation via umin(dq - sign, cf_max), then applies
the sign afterwards via xor. However, for dq = 0 and sign = 1, this step
evaluates to umin(UINT_MAX, cf_max) == cf_max instead of the expected 0.

So instead, do unsigned saturate as umin(dq, cf_max + sign),
then apply sign via (sign ? -dq : dq)
On arm this is the same number of instructions, since cneg exists and is used
On x86 this requires an additional instruction, but this isn't a
latency-critical path

ee98592b
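The failure case can be reproduced with a small model of the two saturation schemes. This is a sketch that emulates 32-bit unsigned arithmetic with a mask; the function names are hypothetical, and `cf_max` is assumed to be `(1 << (bitdepth + 7)) - 1`.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit unsigned value as signed."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def saturate_old(dq, sign, cf_max):
    # umin(dq - sign, cf_max), then apply the sign via xor with -sign.
    x = min((dq - sign) & MASK32, cf_max)   # dq=0, sign=1 wraps to UINT_MAX
    return to_signed32(x ^ (-sign & MASK32))

def saturate_new(dq, sign, cf_max):
    # umin(dq, cf_max + sign), then apply the sign via (sign ? -dq : dq).
    x = min(dq, cf_max + sign)
    return -x if sign else x
```

With `cf_max = 32767` (8-bit), `saturate_old(0, 1, 32767)` returns -32768, while `saturate_new(0, 1, 32767)` returns 0. For nonzero `dq` the two schemes agree.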

Fix overflow in 8-bit NEON ADST · 1bdb776c

David Conrad authored 2 years ago

In 8-bit adst, it's possible that the final Round2(x[0], 12) can exceed
16-bits signed

Specifically, in 7.13.2.6. Inverse ADST4 process, the precision requirement is:
"It is a requirement of bitstream conformance that all values stored in the
s and x arrays by this process are representable by a signed integer using
r + 12 bits of precision."

For 8 bits, r is 16 for both row and column, so x[] can be 28-bit signed.
For values [134215680, 134217727] (within 2047 of the maximum 28-bit value),
the final Round2(x[0], 12) evaluates to 32768, exceeding 16-bits signed.

So switch to using sqrshrn, which saturates to 16-bits signed

This is a continuation of: Commit b53ff29d
arm: itx: Do clipping in all narrowing downshifts

1bdb776c
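The arithmetic in this commit message can be checked directly. A minimal sketch, assuming Round2(x, n) is `(x + (1 << (n - 1))) >> n`; the helper names are hypothetical, and `saturate_s16` only models the saturating behaviour of a narrowing shift such as sqrshrn, not the instruction itself.

```python
def round2(x, n):
    """Rounding right shift by n bits."""
    return (x + (1 << (n - 1))) >> n

def wrap_s16(v):
    """What a non-saturating narrow to 16 bits would keep."""
    return ((v + 32768) & 0xFFFF) - 32768

def saturate_s16(v):
    """Clamp to the signed 16-bit range."""
    return max(-32768, min(32767, v))
```

The lower bound of the problematic range, 134215680, equals `2**27 - 2048`, so `round2(134215680, 12)` is 32768. A plain narrow would wrap that to -32768, whereas saturation yields 32767.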

Sep 14, 2022

tools: Allocate the priv structs with proper alignment · 08c70801

Martin Storsjö authored 2 years ago

Previously, they could be allocated with any random alignment
matching the end of the MuxerContext/DemuxerContext. The
priv structs themselves can have members that require specific
alignment, or at least the default alignment of malloc()/calloc()
(which is sufficient for native types such as uint64_t and
doubles).

This fixes crashes in some arm builds, where GCC (correctly) wants
to use 64 bit aligned stores to write to MD5Context.

08c70801

Sep 12, 2022
- x86: Fix clipping in 10bpc SSE4.1 IDCT asm · 128a0d89
  Henrik Gramner authored 2 years ago
  
  128a0d89
Sep 10, 2022
- build: Improve Windows linking options · 178681e5
  Henrik Gramner authored 2 years ago
  
  178681e5
Sep 09, 2022

tools: Improve demuxer probing · 52473197

Henrik Gramner authored 2 years ago

Increase the probing size, and change the logic to assume a stream is
valid even if no conclusive decision could be made within the probing
window as long as a sequence header was detected.

52473197

CI: Disable trimming on some tests · 934713e4
Matthias Dressel authored 2 years ago
```
Allow checkasm to run.
```
934713e4
CI: Remove git 'safe.directory' config · 3920bd9d
Matthias Dressel authored 2 years ago
```
It is now handled by the gitlab runner.

Ref: 7d859f9c
```
3920bd9d
gcovr: Ignore parsing errors · ddb3189c
Matthias Dressel authored 2 years ago

ddb3189c

crossfiles: Update Android toolchains · aa3fda78

Matthias Dressel authored 2 years ago

* Android armv7: target API 19 since it's the lowest directly provided
  by the new NDK.
* Newer NDK has generic tools for ar, strip, etc.
* Remove windres as it's only relevant for Windows targets.

aa3fda78

CI: Update images · d92594bd
Matthias Dressel authored 2 years ago
```
Remove experimental since gcc12, clang14, mold are now in unstable.
```
d92594bd

Sep 08, 2022

threading: Limit the progress bitfields to the used size · 6680d26f

Victorien Le Couviour--Tuffet authored 2 years ago

Store the used size instead of the allocated size.

The used size can be smaller than the allocated size, which results in
a wrong computation of the linear progress from the frame_progress
bitfield.

6680d26f

x86: Fix rare crash in chroma film grain asm · fab6427e

Henrik Gramner authored 2 years ago

The width parameter is used directly as a pointer offset, so ensure
that it has an appropriately sized data type.

This has been done previously for luma, but chroma was overlooked.

fab6427e

Sep 07, 2022
- x86: Fix overflows in 12bpc AVX2 identity itx asm · 677129c2
  Henrik Gramner authored 2 years ago and Henrik Gramner committed 2 years ago
  
  677129c2
- x86: Fix an alignment issue in 8-bit AVX-512 loop restoration · 58b15237
  Henrik Gramner authored 2 years ago and Henrik Gramner committed 2 years ago
```
We don't have a separate 8-bit AVX-512 5-tap Wiener filter so the 7-tap
function is used for chroma as well, and in some esoteric edge cases
chroma dst pointers may only have a 32-byte alignment despite having
a width larger than 32, so use an unaligned store as a workaround.
```
  58b15237
Sep 02, 2022
- checkasm: Add short options · 895fed08
  Victorien Le Couviour--Tuffet authored 2 years ago
  
  895fed08
- checkasm: Add pattern matching to --test · 713a4f4e
  Victorien Le Couviour--Tuffet authored 2 years ago
  
  713a4f4e
- checkasm: Remove pattern matching from --bench · a63a7c96
  Victorien Le Couviour--Tuffet authored 2 years ago
```
The pattern matching feature has been improved and is now performed
under the new --function parameter, rendering this one obsolete.
```
  a63a7c96
- checkasm: Add a --function option · d5d37926
  Victorien Le Couviour--Tuffet authored 2 years ago
```
Allows running checkasm only for functions matching a given pattern.
```
  d5d37926
Aug 30, 2022
- threading: Fix copy_lpf_progress initialization · a3a55b18
  Victorien Le Couviour--Tuffet authored 2 years ago
```
The copy_lpf_progress bitfield might not be fully cleared when size goes
down.

Credit to Oss-Fuzz.
```
  a3a55b18
Aug 19, 2022
- data: don't overwrite the Dav1dDataProps size value · cd5e4152
  James Almer authored 2 years ago
```
Fixes a regression since commit 3d3c51a0.
```
  cd5e4152
Jul 25, 2022

Adjust inlining attributes on some functions · a029d689

Henrik Gramner authored 2 years ago

The code size increase of inlining every call to certain functions
isn't a worthwhile trade-off, and most compilers actually end up
overriding those particular inlining hints anyway.

In some cases it's also better to split the function into separate
luma and chroma functions.

a029d689

Jul 19, 2022
- x86: Remove leftover instruction in loopfilter AVX2 asm · 0b7a0a2e
  Henrik Gramner authored 2 years ago and Henrik Gramner committed 2 years ago
```
In 0aca76c3 sequences of pand/pandn/por were replaced by pblendvb, but
one instruction (which now acts as a no-op) was accidentally left in.
```
  0b7a0a2e