WIP - cdef_filter_block_c optimizations for emscripten
Explicitly unrolling the 2-pass loop in cdef_filter_block_c improves overall decode time by ~2% on WebAssembly running in node 11. It doesn't seem to affect native as much, but it doesn't seem to regress either.
Unrolling in the padding function, where things always come in 2-pixel increments, shaves another half percent or so off decode time in node 11 / WebAssembly.
The unrolling uses a clang-specific pragma, gated on #ifdef EMSCRIPTEN, to minimize disruption to the source.
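For illustration, the gating could look roughly like the sketch below. This is not the actual patch: the UNROLL_FULL macro, the sum_two_passes function, and its arguments are hypothetical stand-ins, and the choice of clang's "loop unroll(full)" hint is an assumption about which pragma is meant. It assumes the toolchain defines EMSCRIPTEN or __EMSCRIPTEN__.

```c
#include <stdint.h>

/* Hypothetical helper macro (not from the dav1d source): expands to a
 * clang loop-unroll hint on Emscripten builds and to nothing elsewhere,
 * so native compilers never see the clang-specific pragma. */
#if defined(EMSCRIPTEN) || defined(__EMSCRIPTEN__)
#define UNROLL_FULL _Pragma("clang loop unroll(full)")
#else
#define UNROLL_FULL
#endif

/* Illustrative stand-in for a fixed two-pass inner loop: accumulates
 * two weighted taps read at small offsets from px. */
static int sum_two_passes(const uint8_t *px, const int8_t off[2],
                          const int w[2])
{
    int sum = 0;
    UNROLL_FULL
    for (int k = 0; k < 2; k++)
        sum += w[k] * px[off[k]];
    return sum;
}
```

Keeping the hint behind a macro means the loop body itself is unchanged, which is the "minimal disruption" trade-off: the native build compiles exactly the same code as before.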
Testing with 360p test file: https://media-streaming.wmflabs.org/clean/av1/caminandes-llamigos.webm.640x360.av1.webm in the benchmark test harness for ogv.js, vs dav1d native builds on macOS.
This brings wasm decode time on my MacBook Pro down from ~64.0s to ~61.5s, while native non-SIMD decode time stays around 23.5s with no difference I can measure. (WebAssembly has a bounds-check penalty on memory access, making a ~2x slowdown typical vs native for memory-heavy work.)
(Note that this is building with -O3, which includes -funroll-loops, and adding it manually doesn't help, so I'm not sure why the compiler is not unrolling these loops already. I think there may be something funky in emscripten's llvm/clang usage, which I'll try to track down later.)