- Jan 29, 2020
-
Henrik Gramner authored
Required for AVX-512.
-
Martin Storsjö authored
Before:                     Cortex A7      A8      A9     A53     A72     A73
ARM32:
cdef_filter_4x4_8bpc_neon:      964.6   599.5   707.9   601.2   465.1   405.2
cdef_filter_4x8_8bpc_neon:     1726.0  1066.2  1238.7  1041.7   798.6   725.3
cdef_filter_8x8_8bpc_neon:     2974.4  1671.8  1943.9  1806.1  1229.8  1242.1
ARM64:
cdef_filter_4x4_8bpc_neon:                              569.2   337.8   348.7
cdef_filter_4x8_8bpc_neon:                             1031.1   623.3   633.6
cdef_filter_8x8_8bpc_neon:                             1847.5  1097.7  1117.5
After:
ARM32:
cdef_filter_4x4_8bpc_neon:      798.4   524.2   617.3   506.8   432.4   361.1
cdef_filter_4x8_8bpc_neon:     1394.7   910.4  1054.0   863.6   730.2   632.2
cdef_filter_8x8_8bpc_neon:     2364.6  1453.8  1675.1  1466.0  1086.4  1107.7
ARM64:
cdef_filter_4x4_8bpc_neon:                              461.7   303.1   308.6
cdef_filter_4x8_8bpc_neon:                              833.0   547.5   556.0
cdef_filter_8x8_8bpc_neon:                             1459.3   934.1   967.9
-
Martin Storsjö authored
-
- Jan 28, 2020
-
- Jan 27, 2020
-
The main feature is splitting the main filter code into three different code paths depending on the strength values. Clipping is only required when both the primary and secondary strengths are non-zero, which is an uncommon case. Being able to skip that complexity in the common cases is significantly faster.
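A minimal C sketch of that dispatch, with hypothetical helper names (this is not dav1d's actual internal API):

    /* Each strength combination gets its own code path; only the case where
     * both strengths are non-zero pays for clipping. */
    static void filter_pri_sec_clip(int pri, int sec) { (void)pri; (void)sec; }
    static void filter_pri_only(int pri)              { (void)pri; }
    static void filter_sec_only(int sec)              { (void)sec; }

    static void cdef_filter_block(int pri_strength, int sec_strength)
    {
        if (pri_strength && sec_strength)
            filter_pri_sec_clip(pri_strength, sec_strength); /* uncommon case */
        else if (pri_strength)
            filter_pri_only(pri_strength);  /* no clipping needed */
        else
            filter_sec_only(sec_strength);  /* no clipping needed */
    }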
-
- Jan 25, 2020
-
If the primary strengths for both luma and chroma are zero the direction is always zero. If both the primary and secondary luma strengths are zero the entire luma filtering process is a no-op.
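Sketched in C with hypothetical names (dav1d's real code is structured differently):

    static int find_direction(void) { return 3; } /* stand-in for the real search */

    static int cdef_direction(int y_pri_strength, int uv_pri_strength)
    {
        if (!y_pri_strength && !uv_pri_strength)
            return 0;             /* direction is provably zero, skip the search */
        return find_direction();
    }

    static void cdef_filter_luma(int y_pri_strength, int y_sec_strength)
    {
        if (!y_pri_strength && !y_sec_strength)
            return;               /* the whole luma filter is a no-op */
        /* ... filter luma ... */
    }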
-
- Jan 21, 2020
-
Henrik Gramner authored
Required for the AVX-512 instructions added in Ice Lake.
-
Konstantin Pavlov authored
The image now includes nasm 2.14.02, which is needed to assemble AVX-512 code.
-
- Jan 20, 2020
-
Victorien Le Couviour--Tuffet authored
-
Victorien Le Couviour--Tuffet authored
It's a little easier for the CPU to simply overwrite a 32-bit register than to write its low 8 bits while preserving bits 8 to 31. To do the latter, it has to fetch those bits, merge them with the new low byte into a 32-bit value, and write that back to the 32-bit GPR. As those upper bits are always cleared here, perform a zero-extending mov to a dword instead.
-
Victorien Le Couviour--Tuffet authored
This allows AVX-512 code to issue vzeroupper automatically in RET when it is not specified in cglobal but later on, along with WIN64_SPILL_XMM.
-
- Jan 15, 2020
-
Martin Storsjö authored
Don't assume we can do a clipped negation in 16 bit before the multiplication (as it might affect the end result), but do the multiplication first and negate in 32 bit, just like in the reference.
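A C sketch of the corrected order of operations (a hypothetical helper, not the NEON code itself):

    #include <stdint.h>

    /* Multiply first in 32-bit precision, then negate; a saturated 16-bit
     * negation before the multiply could change the end result. */
    static int32_t mul_then_negate(int16_t a, int16_t mul)
    {
        const int32_t v = (int32_t)a * mul;
        return -v;
    }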
-
- Jan 14, 2020
-
- Jan 10, 2020
-
Ronald S. Bultje authored
gen_grain_y_ar0_8bpc_c:              84853.3
gen_grain_y_ar0_8bpc_ssse3:          23528.0
gen_grain_y_ar1_8bpc_c:             140775.5
gen_grain_y_ar1_8bpc_ssse3:          70410.2
gen_grain_y_ar2_8bpc_c:             251311.3
gen_grain_y_ar2_8bpc_ssse3:          95222.2
gen_grain_y_ar3_8bpc_c:             394763.0
gen_grain_y_ar3_8bpc_ssse3:         103541.9
gen_grain_uv_ar0_8bpc_420_c:         29773.7
gen_grain_uv_ar0_8bpc_420_ssse3:      7068.9
gen_grain_uv_ar1_8bpc_420_c:         46113.2
gen_grain_uv_ar1_8bpc_420_ssse3:     22148.1
gen_grain_uv_ar2_8bpc_420_c:         70061.4
gen_grain_uv_ar2_8bpc_420_ssse3:     25479.0
gen_grain_uv_ar3_8bpc_420_c:        113826.0
gen_grain_uv_ar3_8bpc_420_ssse3:     30004.9
fguv_32x32xn_8bpc_420_csfl0_c:        8148.9
fguv_32x32xn_8bpc_420_csfl0_ssse3:    1371.3
fguv_32x32xn_8bpc_420_csfl1_c:        6391.9
fguv_32x32xn_8bpc_420_csfl1_ssse3:    1034.8
fgy_32x32xn_8bpc_c:                  14201.3
fgy_32x32xn_8bpc_ssse3:               3443.0
-
Dale Curtis authored
dav1d_open() is part of the public API and should be sanitized; limit the sanitizer exemption to just the problematic dlsym() call.
-
CFI will SIGILL when calling a function pointer obtained through dlsym(), regardless of whether or not the signature is correct. See https://bugs.llvm.org/show_bug.cgi?id=44500
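A sketch of how such an exemption can be scoped (the symbol name is made up; the no_sanitize attribute is Clang's):

    #include <dlfcn.h>

    /* Only the indirect call through the dlsym() pointer is exempted from
     * CFI, rather than disabling the sanitizer for all of dav1d_open(). */
    __attribute__((no_sanitize("cfi")))
    static int call_via_dlsym(void *handle)
    {
        int (*fn)(void) = (int (*)(void))dlsym(handle, "some_symbol");
        return fn ? fn() : -1; /* this call would SIGILL under CFI otherwise */
    }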
-
- Jan 09, 2020
-
------------------------------------------
mct_bilinear_w4_0_8bpc_avx2:          3.8
mct_bilinear_w4_0_8bpc_avx512icl:     3.7
---------------------
mct_bilinear_w8_0_8bpc_avx2:          5.0
mct_bilinear_w8_0_8bpc_avx512icl:     4.8
---------------------
mct_bilinear_w16_0_8bpc_avx2:         8.5
mct_bilinear_w16_0_8bpc_avx512icl:    7.1
---------------------
mct_bilinear_w32_0_8bpc_avx2:        29.5
mct_bilinear_w32_0_8bpc_avx512icl:   17.1
---------------------
mct_bilinear_w64_0_8bpc_avx2:        68.1
mct_bilinear_w64_0_8bpc_avx512icl:   34.7
---------------------
mct_bilinear_w128_0_8bpc_avx2:      180.5
mct_bilinear_w128_0_8bpc_avx512icl: 138.0
------------------------------------------
mct_bilinear_w4_h_8bpc_avx2:          4.0
mct_bilinear_w4_h_8bpc_avx512icl:     3.9
---------------------
mct_bilinear_w8_h_8bpc_avx2:          5.3
mct_bilinear_w8_h_8bpc_avx512icl:     5.0
---------------------
mct_bilinear_w16_h_8bpc_avx2:        11.7
mct_bilinear_w16_h_8bpc_avx512icl:    7.5
---------------------
mct_bilinear_w32_h_8bpc_avx2:        41.8
mct_bilinear_w32_h_8bpc_avx512icl:   20.3
---------------------
mct_bilinear_w64_h_8bpc_avx2:        94.9
mct_bilinear_w64_h_8bpc_avx512icl:   35.0
---------------------
mct_bilinear_w128_h_8bpc_avx2:      240.1
mct_bilinear_w128_h_8bpc_avx512icl: 143.8
------------------------------------------
mct_bilinear_w4_v_8bpc_avx2:          4.1
mct_bilinear_w4_v_8bpc_avx512icl:     4.0
---------------------
mct_bilinear_w8_v_8bpc_avx2:          6.0
mct_bilinear_w8_v_8bpc_avx512icl:     5.4
---------------------
mct_bilinear_w16_v_8bpc_avx2:        10.3
mct_bilinear_w16_v_8bpc_avx512icl:    8.9
---------------------
mct_bilinear_w32_v_8bpc_avx2:        29.5
mct_bilinear_w32_v_8bpc_avx512icl:   25.9
---------------------
mct_bilinear_w64_v_8bpc_avx2:        64.3
mct_bilinear_w64_v_8bpc_avx512icl:   41.3
---------------------
mct_bilinear_w128_v_8bpc_avx2:      198.2
mct_bilinear_w128_v_8bpc_avx512icl: 139.6
------------------------------------------
mct_bilinear_w4_hv_8bpc_avx2:         5.6
mct_bilinear_w4_hv_8bpc_avx512icl:    5.2
---------------------
mct_bilinear_w8_hv_8bpc_avx2:         8.3
mct_bilinear_w8_hv_8bpc_avx512icl:    7.0
---------------------
mct_bilinear_w16_hv_8bpc_avx2:       19.4
mct_bilinear_w16_hv_8bpc_avx512icl:  12.1
---------------------
mct_bilinear_w32_hv_8bpc_avx2:       69.1
mct_bilinear_w32_hv_8bpc_avx512icl:  32.5
---------------------
mct_bilinear_w64_hv_8bpc_avx2:      164.4
mct_bilinear_w64_hv_8bpc_avx512icl:  71.1
---------------------
mct_bilinear_w128_hv_8bpc_avx2:     405.2
mct_bilinear_w128_hv_8bpc_avx512icl: 193.1
------------------------------------------
-
YMM and ZMM registers on x86 are turned off to save power when they haven't been used for some period of time. When they are used again, there is a "warmup" period during which performance is reduced and inconsistent, which is problematic when trying to benchmark individual functions. Periodically issue "dummy" instructions that use those registers to prevent them from being powered down. The end result is more consistent benchmark results. Credit to Henrik Gramner's commit 1878c7f2af0a9c73e291488209109782c428cfcf from x264.
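The idea, sketched with intrinsics (an illustration, not the actual checkasm change):

    #include <immintrin.h>

    /* Touching a YMM register between timed runs keeps the wide vector
     * units from powering down. Requires an AVX2 target. */
    static void simd_warmup(void)
    {
        volatile __m256i dummy = _mm256_setzero_si256();
        (void)dummy;
    }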
-
- Jan 08, 2020
-
The spec requires that the final residual values of a compliant bitstream must be representable in bitdepth+1 bits, which is not possible if intermediate values are large enough to be clipped.
-
The coefficients after the first (8-bit) 1D identity transform may require more than 16 bits of precision before downshifting in some cases, and they may also need to be clipped to int16_t after downshifting.
-
- When building dav1d as a dependency fallback, the parent project needs to know the variable name of a dependency object to use; dav1d_dep was added for that.
- SOURCE_ROOT is the root of the main project's source tree; use current_source_dir() instead.
-
- Jan 07, 2020
-
Needed to use UINT_MAX.
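For reference, UINT_MAX comes from the standard C header:

    #include <limits.h>  /* defines UINT_MAX */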
-
- Jan 05, 2020
-
Martin Storsjö authored
This gives small gains on A72 and A73, and on A53 on symbol_adapt16.

Before:                           Cortex A53    A72    A73
msac_decode_symbol_adapt4_neon:         63.2   52.8   53.3
msac_decode_symbol_adapt8_neon:         68.5   57.9   55.7
msac_decode_symbol_adapt16_neon:        92.8   59.7   62.8
After:
msac_decode_symbol_adapt4_neon:         63.3   48.3   50.0
msac_decode_symbol_adapt8_neon:         68.7   55.5   54.0
msac_decode_symbol_adapt16_neon:        88.6   58.8   60.0
-
- Jan 02, 2020
-
Martin Storsjö authored
-
Martin Storsjö authored
Make sure not to clip to a 16-bit range before the downshift is done. Add clipping to the 16-bit range in all other identity transforms, where there is no downshift.

4x4, 8x4 and 4x8 don't have any downshift, thus the existing code structure works fine. The identity transforms of size 32 are already special-cased with the downshift folded in where possible; clamping properly in them should be enough, as any out-of-range values will be clamped to pixel range in the end anyway.

Therefore we only need a special-cased identity in the first pass (to keep intermediates in 32 bit until downshifting) for 8x8 and all the size 16 variants.
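A C sketch of that first-pass handling (helper names hypothetical):

    #include <stdint.h>

    static int32_t iclip(int32_t v, int32_t lo, int32_t hi)
    {
        return v < lo ? lo : v > hi ? hi : v;
    }

    /* Keep the intermediate in 32 bit until after the downshift, then clamp
     * to the int16_t range that the second pass expects. */
    static int16_t idtx_first_pass(int32_t coef, int shift)
    {
        const int32_t t = (coef + (1 << (shift - 1))) >> shift;
        return (int16_t)iclip(t, INT16_MIN, INT16_MAX);
    }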
-
Martin Storsjö authored
Don't use the \() token concatenation operator in the .irp loops; if the function definition is enclosed in a .macro, we can't use \() in the loop as it is expanded already when the macro is expanded, before the loop is expanded.
-
Ronald S. Bultje authored
The second loop depends on the output of the first loop. Fixes #326.
-
- Jan 01, 2020
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
Fixes #327.
-
Ronald S. Bultje authored
For chroma coefficients that are masked (&= 0xfffff) to no value, the context ends up in a weird state where it has no magnitude (ctx & 0x3f == 0) but does have a sign (ctx & 0xc0 != 0x40). Our old code checked just the magnitude part of the context to set the skip context of neighbouring blocks, but libaom uses both sign and magnitude for this purpose. Therefore, adjust our code so it does the same thing. Luma code only checks magnitude for this purpose and is thus not affected by this peculiarity. Fixes #325.
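Using the masks quoted above, the adjusted check amounts to something like this (helper name hypothetical):

    /* A block only counts as all-zero for the neighbouring skip context when
     * both fields say so. Old check: (ctx & 0x3f) == 0 (magnitude only). */
    static int coef_ctx_is_zero(unsigned ctx)
    {
        return (ctx & 0x3f) == 0 &&   /* no magnitude */
               (ctx & 0xc0) == 0x40;  /* and no sign, matching libaom */
    }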
-
- Dec 31, 2019
-
- Dec 29, 2019
-
Ronald S. Bultje authored
Fixes #320. The problem here is that qidx=0 is (in libaom) a shortcut for lossless, which normally becomes WHT_WHT, but under some obscure conditions, it can also be non-lossless, in which case qidx=0 implies DCT_DCT for luma. For chroma, apparently we should use the default inference pattern, which becomes DCT_DCT for inter also, but requires the standard lookup table for intra.
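The luma rule described above, as a sketch (the enum values stand in for the real transform types):

    enum TxType { DCT_DCT, WHT_WHT };

    static enum TxType luma_txtype_at_qidx0(int lossless)
    {
        return lossless ? WHT_WHT : DCT_DCT; /* the obscure non-lossless case */
    }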
-
This is consistent with libaom's av1_superres_scaled(). Fixes #322.
-
- Dec 28, 2019
-
Ronald S. Bultje authored
Fixes the C part of #321.
-