x86: Add 6-tap variants of 8bpc mc SSSE3 functions
This also removes the SSE2 mc code, as navigating the maze of nested macros and %if
statements was a bit too much for an ISA extension that's essentially irrelevant.
Improves overall SSSE3 decoding performance on a Bosphorus sample by around 30-35%.
Checkasm numbers on Zen 4:
              8-tap    6-tap
w2_v:          18.5     16.4
w2_hv:         31.0     27.7
w4_v:          17.5     15.3
w4_hv:         68.3     36.3
w8_h:          31.4     23.3
w8_v:          24.0     20.0
w8_hv:        144.3     67.6
w16_h:         83.1     59.8
w16_v:         60.9     49.7
w16_hv:       381.8    173.6
w32_h:        257.0    184.9
w32_v:        179.7    147.4
w32_hv:      1129.7    515.1
w64_h:        910.7    654.9
w64_v:        620.1    510.2
w64_hv:      3835.2   1765.6
w128_h:      2582.5   1853.9
w128_v:      1763.6   1464.8
w128_hv:    10651.3   4938.9
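
For readers unfamiliar with the idea, here is a minimal scalar sketch of why a 6-tap path can replace an 8-tap one. It assumes (this is not stated above) that the 6-tap variants are used for filters whose two outermost coefficients are zero. The names filter_8tap, filter_6tap and demo_filter are hypothetical, the coefficients are illustrative only, and none of this is the actual SSSE3 implementation.

```c
#include <stdint.h>

/* Illustrative coefficients only (hypothetical, sum = 128).
 * The two outermost taps are zero. */
static const int8_t demo_filter[8] = { 0, 2, -14, 76, 76, -14, 2, 0 };

/* Full 8-tap sum over src[-3] .. src[4]; result is not normalized. */
static int filter_8tap(const uint8_t *src, const int8_t f[8])
{
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += f[i] * src[i - 3];
    return sum;
}

/* 6-tap sum over src[-2] .. src[3]; skips f[0] and f[7], so it
 * matches filter_8tap only when both of those taps are zero. */
static int filter_6tap(const uint8_t *src, const int8_t f[8])
{
    int sum = 0;
    for (int i = 1; i < 7; i++)
        sum += f[i] * src[i - 3];
    return sum;
}
```

With zero outer taps both functions return the same sum, while the 6-tap version does two fewer multiply-adds per output pixel and reads two fewer input pixels. That is consistent with the gains above being largest for the hv cases, where the filter is applied in both directions.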
Edited by Henrik Gramner