- Feb 21, 2021
-
Jean-Baptiste Kempf authored
-
- Feb 19, 2021
-
Martin Storsjö authored
Relative speedup vs C for a few functions:

                                          Cortex A7    A8    A9   A53   A72   A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:         2.79  5.08  2.99  2.83  3.49  4.44
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:         5.74  9.43  5.72  7.19  6.73  6.92
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:         3.13  3.68  2.79  3.25  3.21  3.33
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:         7.09 10.41  7.00 10.55  8.06  9.02
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:       5.01  6.76  4.56  5.58  5.52  2.97
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:       8.62 12.48 13.71 11.75 15.94 16.86
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:       6.05  8.81  6.13  8.18  7.90 12.27
inv_txfm_add_32x32_dct_dct_0_10bpc_neon:       2.90  3.90  2.16  2.63  3.56  2.74
inv_txfm_add_32x32_dct_dct_1_10bpc_neon:      13.57 17.00 13.30 13.76 14.54 17.08
inv_txfm_add_32x32_dct_dct_2_10bpc_neon:       8.29 10.54  8.05 10.68 12.75 14.36
inv_txfm_add_32x32_dct_dct_3_10bpc_neon:       6.78  8.40  7.60 10.12  8.97 12.96
inv_txfm_add_32x32_dct_dct_4_10bpc_neon:       6.48  6.74  6.00  7.38  7.67  9.70
inv_txfm_add_64x64_dct_dct_0_10bpc_neon:       3.02  4.59  2.21  2.65  3.36  2.47
inv_txfm_add_64x64_dct_dct_1_10bpc_neon:       9.86 11.30  9.14 13.80 12.46 14.83
inv_txfm_add_64x64_dct_dct_2_10bpc_neon:       8.65  9.76  7.60 12.05 10.55 12.62
inv_txfm_add_64x64_dct_dct_3_10bpc_neon:       7.78  8.65  6.98 10.63  9.15 11.73
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:       6.61  7.01  5.52  8.41  8.33  9.69
-
Martin Storsjö authored
While these might not be needed in practice, add them for consistency.
-
Martin Storsjö authored
-
Martin Storsjö authored
This makes these instances consistent with the rest of similar cases.
-
Martin Storsjö authored
-
Martin Storsjö authored
In these cases, the function wrote a 64-pixel-wide output, regardless of the actual width.
-
Martin Storsjö authored
-
Nathan E. Egge authored
Relative speed-ups over C code (compared with gcc-9.3.0):

                           C      AVX2
wiener_5tap_10bpc:  194892.0   14831.9  13.14x
wiener_5tap_12bpc:  194295.4   14828.9  13.10x
wiener_7tap_10bpc:  194391.7   19461.4   9.99x
wiener_7tap_12bpc:  194136.1   19418.7  10.00x
-
Nathan E. Egge authored
-
- Feb 17, 2021
-
Jean-Baptiste Kempf authored
-
Relative speed-ups over C code (compared with gcc-9.3.0):

                                    C    ASM
cdef_dir_16bpc_avx2:            534.2   72.5  7.36x
cdef_dir_16bpc_ssse3:           534.2  104.8  5.10x
cdef_dir_16bpc_ssse3 (x86-32):  854.1  116.2  7.35x
-
- Feb 16, 2021
-
- Feb 15, 2021
-
Jean-Baptiste Kempf authored
-
It's supposed to warn about const-correctness issues, but it doesn't handle arrays of pointers correctly and will cause false positive warnings, for example when using memset() to zero such arrays.
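A minimal illustration of the kind of code that trips the warning (a hypothetical example, not an actual dav1d source file):

```c
#include <stddef.h>
#include <string.h>

/* An array of pointers to const data. Zeroing it with memset() does
 * not modify any const-qualified object, but warning passes that
 * mishandle arrays of pointers can flag this call as a
 * const-correctness violation. */
static const char *names[4] = { "a", "b", "c", "d" };

void reset_names(void) {
    memset(names, 0, sizeof(names));
}
```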
-
Not having a quantizer matrix is the most common case, so it's worth having a separate code path for it that eliminates some calculations and table lookups. Without a qm, not only can we skip calculating dq * qm, but only Exp-Golomb-coded coefficients will have the potential to overflow, so we can also skip clipping for the vast majority of coefficients.
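The idea can be sketched as follows; the function name, the scaling, and the clip range are illustrative, not dav1d's actual internals:

```c
#include <stddef.h>
#include <stdint.h>

#define COEF_MIN (-(1 << 20))      /* illustrative clip range */
#define COEF_MAX ((1 << 20) - 1)

static inline int iclip(int v, int lo, int hi) {
    return v < lo ? lo : v > hi ? hi : v;
}

/* Hypothetical dequant helper: without a qm, the matrix lookup and
 * multiply disappear, and clipping is only needed for the large
 * Exp-Golomb-coded coefficients. */
static int dequant(int coef, int dq, const uint8_t *qm, int qm_idx, int is_eg)
{
    if (qm)  /* general path: scale dq by the matrix entry, always clip */
        return iclip(coef * ((dq * qm[qm_idx] + 16) >> 5), COEF_MIN, COEF_MAX);
    const int v = coef * dq;        /* fast path: a single multiply */
    return is_eg ? iclip(v, COEF_MIN, COEF_MAX) : v;
}
```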
-
Cache indices of non-zero coefficients during the AC token decoding loop in order to speed up the sign decoding/dequant loop later.
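A sketch of the two-pass structure (names and the stand-in dequant step are hypothetical):

```c
#include <stdint.h>

/* First pass: while decoding AC tokens, remember where the non-zero
 * coefficients are. Second pass: walk only those cached indices for
 * sign decoding/dequant instead of rescanning the whole block. */
static int split_passes(int *coefs, int n, uint16_t *nz_idx)
{
    int nz = 0;
    for (int i = 0; i < n; i++)     /* AC token decoding loop */
        if (coefs[i])
            nz_idx[nz++] = (uint16_t)i;
    for (int j = 0; j < nz; j++)    /* sign decoding/dequant loop */
        coefs[nz_idx[j]] *= 2;      /* stand-in for the real dequant */
    return nz;
}
```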
-
- Feb 13, 2021
-
Henrik Gramner authored
Looprestoration SIMD code may overread the input buffers by a small amount. Pad the buffer to make sure this memory is valid to access.
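The general pattern looks like this; the padding constant and function name are illustrative, not the actual dav1d code:

```c
#include <stdlib.h>
#include <string.h>

#define SIMD_PADDING 64  /* assumed worst-case overread, in bytes */

/* Allocate a buffer with trailing padding so vector loads that read
 * a little past the logical end still touch valid memory. */
static void *alloc_padded(size_t size)
{
    void *buf = malloc(size + SIMD_PADDING);
    if (buf)
        memset((char *)buf + size, 0, SIMD_PADDING);
    return buf;
}
```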
-
- Feb 12, 2021
-
Marvin Scholz authored
-
Martin Storsjö authored
On Darwin, 32-bit parameters that are passed on the stack rather than in registers are packed tightly, instead of each occupying an 8-byte slot.
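The difference can be expressed as the offset of the i-th 32-bit stack argument; this is a simplified model that ignores alignment effects of mixed-size argument lists:

```c
#include <stddef.h>

/* AAPCS64: every stack argument is widened to an 8-byte slot.
 * Darwin arm64: 32-bit stack arguments are packed at their natural
 * 4-byte alignment instead. */
static size_t stack_arg_offset(int i, int darwin_abi)
{
    return darwin_abi ? (size_t)i * 4 : (size_t)i * 8;
}
```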
-
- Feb 11, 2021
-
Henrik Gramner authored
The previous implementation did multiple passes in the horizontal and vertical directions, with the intermediate values being stored in buffers on the stack. This caused bad cache thrashing. By interleaving all the different passes in combination with a ring buffer that stores only a few rows at a time, the performance is improved by a significant amount. Also slightly speed up neighbor calculations by packing the a and b values into a single 32-bit unsigned integer, which allows calculations on both values simultaneously.
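The packing trick can be illustrated in scalar C (a simplified model; the real code operates on SIMD vectors):

```c
#include <stdint.h>

/* Pack the a and b neighbour values into one 32-bit word. As long as
 * the partial sums in the low half stay below 2^16, a single 32-bit
 * addition updates both halves at once. */
static uint32_t pack_ab(uint16_t a, uint16_t b)
{
    return (uint32_t)a << 16 | b;
}

static uint32_t sum3(uint32_t x, uint32_t y, uint32_t z)
{
    return x + y + z;  /* adds the a halves and the b halves together */
}
```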
-
Henrik Gramner authored
Split the 5x5, 3x3, and mix cases into separate functions. Shrink some tables. Move some scalar calculations out of the DSP function. Make Wiener and SGR share the same function prototype to eliminate a branch in lr_stripe().
-
Henrik Gramner authored
Large stack allocations on Windows need to use stack probing in order to guarantee that all stack memory is committed before accessing it. This is done by ensuring that the guard page(s) at the end of the currently committed pages are touched prior to any pages beyond that.
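A sketch of what stack probing does, written as a plain loop over an ordinary buffer (the page size is an assumption; real probing is emitted by the compiler or done in asm against the actual stack pointer):

```c
#include <stddef.h>

#define PAGE_SIZE 4096  /* assumed page size */

/* Touch each page of a large region in order. Touching the guard
 * page at the end of the committed area makes the OS commit the next
 * page, so no later access lands beyond an uncommitted page. */
static void probe_region(volatile unsigned char *buf, size_t size)
{
    for (size_t off = 0; off < size; off += PAGE_SIZE)
        buf[off] = 0;
    if (size)
        buf[size - 1] = 0;
}
```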
-
Emmanuel Gil Peyrot authored
Neither --buildtype=plain nor --buildtype=debug set -ffast-math, so llround() is kept as a function call and isn’t optimised out into cvttsd2siq (on amd64), thus requiring the math lib to be linked. Note that even with -ffast-math, it isn’t guaranteed that a call to llround() will always be omitted (I have reproduced this on PowerPC), so this fix is correct even if we ever decide to enable -ffast-math in other build types.
-
- Feb 10, 2021
-
Martin Storsjö authored
Before:                                  Cortex A53    A72    A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:         40.7   23.0   24.0
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:        116.0   71.5   78.2
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:         85.7   50.7   53.8
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:        287.0  203.5  215.2
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:      255.7  129.1  140.4
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:     1401.4 1026.7 1039.2
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:     1913.2 1407.3 1479.6
After:
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:         38.7   21.5   22.2
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:        116.0   71.3   77.2
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:         76.7   44.7   43.5
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:        278.0  203.0  203.9
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:      236.9  106.2  116.2
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:     1368.7  999.7 1008.4
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:     1880.5 1381.2 1459.4
-
- Feb 09, 2021
-
Make them operate in a more cache-friendly manner, interleaving horizontal and vertical filtering (reducing the amount of stack used from 51 KB to 4 KB), similar to what was done for x86 in 78d27b7d. This also adds separate 5-tap versions of the filters and unrolls the vertical filter a bit more (which maybe could have been done without doing the rewrite). This does, however, increase the compiled code size by around 3.5 KB.

Before:                     Cortex A53      A72      A73
wiener_5tap_8bpc_neon:        136855.6  91446.2  87363.6
wiener_7tap_8bpc_neon:        136861.6  91454.9  87374.5
wiener_5tap_10bpc_neon:       167685.3 114720.3 116522.1
wiener_5tap_12bpc_neon:       167677.5 114724.7 116511.9
wiener_7tap_10bpc_neon:       167681.6 114738.5 116567.0
wiener_7tap_12bpc_neon:       167673.8 114720.8 116515.4
After:
wiener_5tap_8bpc_neon:         87102.1  60460.6  66803.8
wiener_7tap_8bpc_neon:        110831.7  78489.0  82015.9
wiener_5tap_10bpc_neon:       109999.2  90259.0  89238.0
wiener_5tap_12bpc_neon:       109978.3  90255.7  89220.7
wiener_7tap_10bpc_neon:       137877.6 107578.5 103435.6
wiener_7tap_12bpc_neon:       137868.8 107568.9 103390.4
-
Change the order of multiply-accumulates to allow in-order cores to forward the results.
-
- Feb 08, 2021
-
Additionally reschedule the load instructions to reduce stalls on in-order cores. This applies the changes from a3b8157e to the arm32 version.

Before:                 Cortex A7     A8     A9    A53    A72    A73
warp_8x8_8bpc_neon:        3659.3 1746.0 1931.9 2128.8 1173.7 1188.9
warp_8x8t_8bpc_neon:       3650.8 1724.6 1919.8 2105.0 1147.7 1206.9
warp_8x8_16bpc_neon:       4039.4 2111.9 2337.1 2462.5 1334.6 1396.5
warp_8x8t_16bpc_neon:      3973.9 2137.1 2299.6 2413.2 1282.8 1369.6
After:
warp_8x8_8bpc_neon:        2920.8 1269.8 1410.3 1767.3  860.2 1004.8
warp_8x8t_8bpc_neon:       2904.9 1283.9 1397.5 1743.7  863.6 1024.7
warp_8x8_16bpc_neon:       3895.5 2060.7 2339.8 2376.6 1331.1 1394.0
warp_8x8t_16bpc_neon:      3822.7 2026.7 2298.7 2325.4 1278.1 1360.8
-
We currently run 'git describe --match' to obtain the current version, but meson doesn't properly quote/escape the pattern string on Windows. As a result, "fatal: Not a valid object name .ninja_log" is printed when compiling on Windows systems. Compilation still works, but the warning is annoying and misleading. Currently we don't actually need the pattern matching functionality (which is why things still work), so simply remove it as a workaround.
-
Martin Storsjö authored
This silences the following warning:

tools/output/xxhash.c(127): warning C4244: '=': conversion from 'unsigned long' to 'unsigned char', possible loss of data
-
Martin Storsjö authored
This fixes bus errors due to missing alignment, when built with GCC 9 for arm32 with -mfpu=neon.
-
The required 'xxhash.h' header can either be in the system include directory or can be copied to 'tools/output'. The xxh3_128bits based muxer shows no significant slowdown compared to the null muxer.

Decoding times for Chimera-AV1-8bit-1920x1080-6736kbps.ivf with 4 frame and 4 tile threads on a Core i7-8550U (turbo boost disabled):
null:  72.5 s
md5:   99.8 s
xxh3:  73.8 s

Decoding Chimera-AV1-10bit-1920x1080-6191kbps.ivf with 6 frame and 4 tile threads on an M1 Mac mini:
null:  27.8 s
md5:  105.9 s
xxh3:  28.3 s
-
Matthias Dressel authored
Verification should not succeed if the given string is too short to be a real hash. Fixes videolan/dav1d#361
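A minimal sketch of the fix; the function name and digest length are illustrative:

```c
#include <string.h>

#define HASH_HEX_LEN 32  /* e.g. an MD5 digest in hex */

/* Reject candidate strings that are shorter than a full digest, so a
 * truncated prefix can no longer "verify" successfully. */
static int hash_matches(const char *expected, const char *computed)
{
    if (strlen(expected) < HASH_HEX_LEN)
        return 0;
    return strncmp(expected, computed, HASH_HEX_LEN) == 0;
}
```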
-
- Feb 06, 2021
-
Janne Grunau authored
-
Henrik Gramner authored
-