x86: Improve film grain AVX2 asm

Roughly 3-8% faster at 8 bpc. The gain at high bit-depth turned out to
be much smaller, about 0-3%.

Checkasm numbers on Skylake:

                                       old     new
fguv_32x32xn_8bpc_420_csfl0_avx2:    610.9   591.0
fguv_32x32xn_8bpc_420_csfl1_avx2:    530.4   505.2
fguv_32x32xn_8bpc_422_csfl0_avx2:    613.5   591.5
fguv_32x32xn_8bpc_422_csfl1_avx2:    524.7   502.2
fguv_32x32xn_8bpc_444_csfl0_avx2:    452.8   436.7
fguv_32x32xn_8bpc_444_csfl1_avx2:    412.1   379.5
fgy_32x32xn_8bpc_avx2:              1501.7  1424.8
fguv_32x32xn_16bpc_420_csfl0_avx2:   626.7   620.8
fguv_32x32xn_16bpc_420_csfl1_avx2:   547.1   540.8
fguv_32x32xn_16bpc_422_csfl0_avx2:   642.7   630.6
fguv_32x32xn_16bpc_422_csfl1_avx2:   554.1   545.8
fguv_32x32xn_16bpc_444_csfl0_avx2:   509.2   508.3
fguv_32x32xn_16bpc_444_csfl1_avx2:   441.8   437.0
fgy_32x32xn_16bpc_avx2:             1593.0  1552.4