x86: Make AVX2 SGR gatherless

Instead of using gathers, we can calculate the value of
sgr_x_by_x[min(z, 255)] by computing 256 / (z + 1) in floating-point,
with some clipping for z == 0 and z >= 255.
As the required precision of the division is fairly low, it can be
performed using an approximate reciprocal, which is significantly
faster than a regular division.
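
As a rough illustration of the calculation described above, here is a scalar C sketch. It is not the AVX2 code from this change: the function name is hypothetical, it uses an exact division instead of an approximate reciprocal, and the rounding mode and the clipped values (255 for z == 0, 0 for z >= 255) are assumptions, since the text only says that some clipping is applied.

```c
#include <stdint.h>

/* Hypothetical scalar model of a table-free sgr_x_by_x[min(z, 255)].
 * Assumptions (not stated in the commit text): the quotient is rounded
 * to nearest, z == 0 clips to 255 and z >= 255 clips to 0. */
static inline uint8_t sgr_x_by_x_sketch(uint32_t z)
{
    /* Assumed clipping for z >= 255. */
    if (z >= 255)
        return 0;

    /* 256 / (z + 1) in floating-point, rounded to nearest. */
    const float q = 256.0f / (float)(z + 1);
    const int x = (int)(q + 0.5f);

    /* Assumed clipping for z == 0: 256 does not fit in 8 bits. */
    return (uint8_t)(x > 255 ? 255 : x);
}
```

For example, under these assumptions sgr_x_by_x_sketch(1) yields 128 and sgr_x_by_x_sketch(3) yields 64. A vectorized version would replace the division with a reciprocal estimate followed by a multiply, which is only acceptable because the result is quantized to 8 bits.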

Gather instructions are slow on all AMD CPUs, and on most Intel CPUs
ever since µcode updates were issued as a workaround for the Gather
Data Sampling side channel vulnerability.

Checkasm numbers:
                             Old        New
Zen 4:
    sgr_3x3_8bpc_avx2:     26103.6    19346.5
    sgr_3x3_10bpc_avx2:    27151.4    20147.3
    sgr_5x5_8bpc_avx2:     17706.9    14045.3
    sgr_5x5_10bpc_avx2:    18093.6    14498.4
    sgr_mix_8bpc_avx2:     39354.4    30353.7
    sgr_mix_10bpc_avx2:    43539.9    33032.1

Rocket Lake (with GDS mitigations):
    sgr_3x3_8bpc_avx2:     72664.3    39553.6
    sgr_3x3_10bpc_avx2:    73240.0    42907.6
    sgr_5x5_8bpc_avx2:     43513.9    29035.4
    sgr_5x5_10bpc_avx2:    43374.3    32218.1
    sgr_mix_8bpc_avx2:    114492.5    63391.3
    sgr_mix_10bpc_avx2:   119367.7    71170.8