mc: Reduce stack use in {put,prep}_scaled_{bilin,8tap}
For the bilin cases, this seems to make things marginally faster (measured on x86_64; 7-25% faster with compiler autovectorization). For 8tap, it doesn't make much of a difference at all. Before: GCC Clang mc_scaled_8tap_regular_w128_8bpc_c: 115155.5 98549.3 mc_scaled_8tap_regular_w128_8bpc_ssse3: 17936.0 18411.1 mc_scaled_bilinear_w128_8bpc_c: 40290.0 51812.9 mc_scaled_bilinear_w128_8bpc_ssse3: 18243.9 18177.0 After: mc_scaled_8tap_regular_w128_8bpc_c: 116304.3 99453.2 mc_scaled_8tap_regular_w128_8bpc_ssse3: 18387.0 18077.3 mc_scaled_bilinear_w128_8bpc_c: 37381.4 41145.0 mc_scaled_bilinear_w128_8bpc_ssse3: 18423.8 18031.6 (Benchmarked with the seed 0; the total runtime for the scaled benchmarks are significantly affected by the random seed.) This reduces the stack usage of these functions from around 65 KB each, to less than 1 KB for bilin, and around 2 KB for 8tap. With this in place, the required stack space for dav1d should be mostly identical across configurations; on x86_64 (both with and without assembly), it can run with 62 KB of stack, and on arm and aarch64, it can run with 58 KB of stack.
Loading