x86: Add cdef_filter SSE optimizations

                               ---- 32-bit ----   ---- 64-bit ----
                                 old      new       old      new
cdef_filter_4x4_8bpc_sse2:     205.8    130.5     189.1    128.5
cdef_filter_4x4_8bpc_ssse3:    163.3    103.7     142.5    103.3
cdef_filter_4x4_8bpc_sse4:     150.3     99.5     130.6     98.8
cdef_filter_4x8_8bpc_sse2:     377.2    222.8     336.7    222.1
cdef_filter_4x8_8bpc_ssse3:    291.6    171.4     245.7    164.6
cdef_filter_4x8_8bpc_sse4:     264.7    163.2     218.7    157.2
cdef_filter_8x8_8bpc_sse2:     668.5    369.9     567.4    365.0
cdef_filter_8x8_8bpc_ssse3:    509.5    271.8     399.6    250.6
cdef_filter_8x8_8bpc_sse4:     461.6    258.5     341.0    234.3

Most of the performance gain comes from having separate code paths for
!pri_strength and !sec_strength, but there are also various small
optimizations throughout.
The 32-bit PIC handling is also cleaned up and simplified.
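
For illustration only, the idea behind the separate code paths can be
sketched in scalar C as below. This is not the assembly from this commit:
tap(), the filter_*() helpers and the single-tap arithmetic are
hypothetical simplifications, and only the pri_strength and sec_strength
names come from the text above. The point is that a pass whose strength is
zero contributes nothing, so a specialized path can skip it entirely.

```c
/* Hypothetical stand-in for one filter tap: clamp the difference between a
 * neighbour and the centre pixel to [-strength, strength]. A strength of 0
 * always yields 0. This is a simplification, not the real CDEF math. */
static int tap(int diff, int strength) {
    if (diff > strength) return strength;
    if (diff < -strength) return -strength;
    return diff;
}

/* General path: both the primary and the secondary tap are evaluated. */
static int filter_generic(int px, int p, int s,
                          int pri_strength, int sec_strength) {
    return px + tap(p - px, pri_strength) + tap(s - px, sec_strength);
}

/* Specialized path for !sec_strength: the secondary tap is never computed. */
static int filter_pri_only(int px, int p, int pri_strength) {
    return px + tap(p - px, pri_strength);
}

/* Specialized path for !pri_strength: the primary tap is never computed. */
static int filter_sec_only(int px, int s, int sec_strength) {
    return px + tap(s - px, sec_strength);
}

/* Pick the cheapest path once per call instead of doing work whose result
 * is known to be zero. */
static int filter_dispatch(int px, int p, int s,
                           int pri_strength, int sec_strength) {
    if (!sec_strength) return filter_pri_only(px, p, pri_strength);
    if (!pri_strength) return filter_sec_only(px, s, sec_strength);
    return filter_generic(px, p, s, pri_strength, sec_strength);
}
```

In this sketch the specialized paths return the same value as the general
one whenever the corresponding strength is zero, which is what makes the
dispatch safe.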