arm: cdef: Do an 8 bit implementation for cases with all edges present
This increases the code size by around 3.4 KB.
Before:
ARM32:                   Cortex A7      A8      A9     A53     A72     A73
cdef_filter_4x4_8bpc_neon:   807.1   517.0   617.7   506.6   429.9   357.8
cdef_filter_4x8_8bpc_neon:  1407.9   899.3  1054.6   862.3   726.5   628.1
cdef_filter_8x8_8bpc_neon:  2394.9  1456.8  1676.8  1461.2  1084.4  1101.2
ARM64:
cdef_filter_4x4_8bpc_neon:                           460.7   301.8   308.0
cdef_filter_4x8_8bpc_neon:                           831.6   547.0   555.2
cdef_filter_8x8_8bpc_neon:                          1454.6   935.6   960.4
After:
ARM32:                   Cortex A7      A8      A9     A53     A72     A73
cdef_filter_4x4_8bpc_neon:   669.3   541.3   524.4   424.9   322.7   298.1
cdef_filter_4x8_8bpc_neon:  1159.1   922.9   881.1   709.2   538.3   514.1
cdef_filter_8x8_8bpc_neon:  1888.8  1285.4  1358.5  1152.9   839.3   871.2
ARM64:
cdef_filter_4x4_8bpc_neon:                           383.6   262.1   259.9
cdef_filter_4x8_8bpc_neon:                           684.9   472.2   464.7
cdef_filter_8x8_8bpc_neon:                          1160.0   756.8   788.0
(The checkasm benchmark averages three different cases. The case with all edges present is only one of those three, although it's the most common case in actual video. The difference is much bigger when benchmarking only that particular case.)
This apparently makes the code slightly slower for the w=4 cases (4x4 and 4x8) on Cortex A8, while it's a significant speedup on all other cores.
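
To make the per-core comparison easier to read, the ARM32 numbers above can be turned into before/after ratios. The following is only a small sketch for inspecting the figures quoted in this message; the names `CORES`, `BEFORE`, `AFTER` and `speedups` are made up for illustration and are not part of checkasm or the codebase. Timings are copied verbatim from the tables, and a ratio above 1.0 means the new code is faster.

```python
# ARM32 checkasm timings copied from the tables above (lower is faster).
# All names in this script are hypothetical helpers, not project APIs.
CORES = ["A7", "A8", "A9", "A53", "A72", "A73"]

BEFORE = {
    "cdef_filter_4x4_8bpc_neon": [807.1, 517.0, 617.7, 506.6, 429.9, 357.8],
    "cdef_filter_4x8_8bpc_neon": [1407.9, 899.3, 1054.6, 862.3, 726.5, 628.1],
    "cdef_filter_8x8_8bpc_neon": [2394.9, 1456.8, 1676.8, 1461.2, 1084.4, 1101.2],
}

AFTER = {
    "cdef_filter_4x4_8bpc_neon": [669.3, 541.3, 524.4, 424.9, 322.7, 298.1],
    "cdef_filter_4x8_8bpc_neon": [1159.1, 922.9, 881.1, 709.2, 538.3, 514.1],
    "cdef_filter_8x8_8bpc_neon": [1888.8, 1285.4, 1358.5, 1152.9, 839.3, 871.2],
}


def speedups(before, after):
    """Return {function: [before / after per core]}; above 1.0 means faster."""
    return {
        name: [b / a for b, a in zip(before[name], after[name])]
        for name in before
    }


if __name__ == "__main__":
    ratios = speedups(BEFORE, AFTER)
    print("function".ljust(28) + "".join(core.rjust(7) for core in CORES))
    for name, row in ratios.items():
        print(name.ljust(28) + "".join(f"{r:7.2f}" for r in row))
```

Running it prints one row of ratios per function. The only entries below 1.0 are the two w=4 functions in the Cortex A8 column, which matches the slowdown noted above.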
Edited by Jean-Baptiste Kempf