x86/deblock: make hbd/ssse3 implementations 32bit-compatible
Potential improvements / future directions:
- wd=16 modifies only 6 pixels per side, so 6x2x2=24 bytes can be done in 1.5 transposes instead of 2 (i.e. one transpose8x8w + 2 transpose4x4w, instead of 2 transpose8x8w), plus the accompanying writes (movu+[movq or movhps] instead of 2xmova). This would also save a small amount of stack space, since we would no longer need to save p7/q7. The disadvantage is that the writes would be unaligned.
- I've modified flat6/8 to use `psubw new, old; pand new, mask; paddw new, old` instead of the original `pandn old, mask; pand new, mask; por new, old` versions. This made writing the 32bit code easier and also turned out to be slightly faster. I did not (yet) do this for flat16.
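For reviewers who want to convince themselves that the two flat6/8 blend sequences select the same pixels, here is a minimal scalar sketch in C. It models a single 16-bit lane and assumes the mask is all-ones or all-zeros per lane. The function names `blend_logic`, `blend_arith` and `blend_mismatches` are hypothetical and do not appear in the patch; the real code operates on whole XMM registers.

```c
#include <stdint.h>

/* Original form: (old AND NOT mask) OR (new AND mask).
 * Models the pandn/pand/por sequence for one 16-bit lane. */
static uint16_t blend_logic(uint16_t old, uint16_t new_, uint16_t mask)
{
    return (uint16_t)((old & (uint16_t)~mask) | (new_ & mask));
}

/* Replacement form: ((new - old) AND mask) + old.
 * Models psubw/pand/paddw; the 16-bit wraparound of the subtraction
 * is undone by the addition, so the result is exact. */
static uint16_t blend_arith(uint16_t old, uint16_t new_, uint16_t mask)
{
    uint16_t diff = (uint16_t)(new_ - old); /* psubw new, old  */
    diff &= mask;                           /* pand  new, mask */
    return (uint16_t)(diff + old);          /* paddw new, old  */
}

/* Counts lanes where the two forms disagree over a sweep of values,
 * for both possible per-lane mask states. */
static unsigned blend_mismatches(void)
{
    static const uint16_t masks[2] = { 0x0000, 0xFFFF };
    unsigned bad = 0;
    for (unsigned o = 0; o < 0x10000; o += 257)
        for (unsigned n = 0; n < 0x10000; n += 263)
            for (int m = 0; m < 2; m++)
                bad += blend_logic((uint16_t)o, (uint16_t)n, masks[m]) !=
                       blend_arith((uint16_t)o, (uint16_t)n, masks[m]);
    return bad;
}
```

Note that in the arithmetic form only `new` is written, while `old` and `mask` are used purely as source operands. That is one plausible reason it is easier to fit into the 8 XMM registers available in 32bit mode, though the description above does not spell out the exact reason.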
Edited by Ronald S. Bultje