x86: Make AVX2 SGR gatherless

Henrik Gramner requested to merge gramner/dav1d:sgr_avx2_gatherless into master

Instead of using gathers, we can calculate the value of sgr_x_by_x[min(z, 255)] by computing 256 / (z + 1) in floating-point, with some clipping for z == 0 and z >= 255.

As the required precision of the division is fairly low, it can be performed using an approximate reciprocal, which is significantly faster than a regular division.
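As a rough scalar illustration of the idea (not the actual AVX2 code from this merge request), the lookup could be modeled along these lines. The helper name `sgr_x_approx` is hypothetical, and the rounding mode and the exact clip values (255 for z == 0, 0 for z >= 255) are assumptions about the table contents rather than something stated above. A plain division stands in for the approximate reciprocal to keep the sketch portable.

```c
#include <stdint.h>

/* Hypothetical helper, not taken from dav1d: models
 * sgr_x_by_x[min(z, 255)] as round(256 / (z + 1)) computed in
 * single precision, with the two edge cases clipped. */
static inline uint8_t sgr_x_approx(uint32_t z)
{
    if (z >= 255)
        return 0;                       /* assumed clip for z >= 255 */

    /* The SIMD version would use an approximate reciprocal here;
     * an exact division is used in this sketch for portability. */
    float x = 256.0f / (float)(z + 1);

    uint32_t r = (uint32_t)(x + 0.5f);  /* round to nearest */

    /* z == 0 yields 256, which does not fit in 8 bits (assumed clip to 255) */
    return r > 255 ? 255 : (uint8_t)r;
}
```

Under these assumptions, `sgr_x_approx(1)` gives 128 and `sgr_x_approx(2)` gives 85, with no memory access involved, which is what allows the gather to be dropped.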

Gather instructions are slow on all AMD CPUs, and on most Intel CPUs ever since µcode updates were issued as a workaround for the Gather Data Sampling (GDS) side-channel vulnerability.

Checkasm numbers:

                         Old       New
Zen 4:
sgr_3x3_8bpc_avx2:     26103.6   19346.5
sgr_3x3_10bpc_avx2:    27151.4   20147.3
sgr_5x5_8bpc_avx2:     17706.9   14045.3
sgr_5x5_10bpc_avx2:    18093.6   14498.4
sgr_mix_8bpc_avx2:     39354.4   30353.7
sgr_mix_10bpc_avx2:    43539.9   33032.1

Rocket Lake (with GDS mitigations):
sgr_3x3_8bpc_avx2:     72664.3   39553.6
sgr_3x3_10bpc_avx2:    73240.0   42907.6
sgr_5x5_8bpc_avx2:     43513.9   29035.4
sgr_5x5_10bpc_avx2:    43374.3   32218.1
sgr_mix_8bpc_avx2:    114492.5   63391.3
sgr_mix_10bpc_avx2:   119367.7   71170.8
