AArch64: Optimize 2D i8mm subpel filters
Replace the accumulator initializations in the horizontal part of the 2D filters with zero register fills. This can improve performance on out-of-order CPUs that can zero vector registers with zero latency. Because a zeroed accumulator carries no rounding bias, the filters now have to end with rounding shifts.
The only exception is the very short *hv_filter4*, where the longer latency of the rounding shift could decrease performance.
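The trade-off described above can be modelled in scalar C. This is only a sketch, not the assembly from this change: the names dot8, rounding_shift, filter_biased, filter_zeroed and check_variants, the SHIFT value and the coefficients are all hypothetical and chosen for illustration. It assumes the previous code preloaded the accumulator with a rounding bias, and shows that a zeroed accumulator followed by a rounding shift gives the same result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shift amount, chosen only for illustration. */
enum { SHIFT = 2 };

/* Scalar stand-in for one accumulator lane: adds an 8-tap dot product of
 * unsigned pixels and signed coefficients to the initial value acc. */
static int32_t dot8(int32_t acc, const uint8_t *px, const int8_t *coef)
{
    for (int i = 0; i < 8; i++)
        acc += (int32_t)px[i] * coef[i];
    return acc;
}

/* Rounding right shift: add half of the divisor, then shift.
 * Assumes >> on negative values is an arithmetic shift (true for GCC/Clang). */
static int32_t rounding_shift(int32_t v, int shift)
{
    return (v + (1 << (shift - 1))) >> shift;
}

/* Variant A: accumulator preloaded with the rounding bias, plain shift. */
static int32_t filter_biased(const uint8_t *px, const int8_t *coef)
{
    return dot8(1 << (SHIFT - 1), px, coef) >> SHIFT;
}

/* Variant B: zeroed accumulator, rounding shift at the end. */
static int32_t filter_zeroed(const uint8_t *px, const int8_t *coef)
{
    return rounding_shift(dot8(0, px, coef), SHIFT);
}

/* Checks that both variants agree on a range of made-up inputs. */
static void check_variants(void)
{
    static const int8_t coef[8] = { -1, 3, -7, 59, 67, -9, 4, -1 }; /* made up */
    uint8_t px[8];

    for (int seed = 0; seed < 256; seed++) {
        for (int i = 0; i < 8; i++)
            px[i] = (uint8_t)(seed * 31 + i * 57);
        assert(filter_biased(px, coef) == filter_zeroed(px, coef));
    }
}
```

Both variants do the same arithmetic. The difference is only where the bias is applied: variant A needs a constant loaded into the accumulator up front, while variant B starts from a cheap zero fill and pays for the rounding in the final shift.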
Relative performance of micro benchmarks (lower is better):
Cortex-X3:
mct_8tap_regular_w16_hv_8bpc_i8mm: 0.982x
mct_8tap_sharp_w16_hv_8bpc_i8mm: 0.979x
mct_8tap_regular_w8_hv_8bpc_i8mm: 0.972x
mct_8tap_sharp_w8_hv_8bpc_i8mm: 0.969x
mct_8tap_regular_w4_hv_8bpc_i8mm: 0.942x
mct_8tap_sharp_w4_hv_8bpc_i8mm: 0.935x
mc_8tap_regular_w16_hv_8bpc_i8mm: 0.988x
mc_8tap_sharp_w16_hv_8bpc_i8mm: 0.982x
mc_8tap_regular_w8_hv_8bpc_i8mm: 0.981x
mc_8tap_sharp_w8_hv_8bpc_i8mm: 0.975x
mc_8tap_regular_w4_hv_8bpc_i8mm: 0.998x
mc_8tap_sharp_w4_hv_8bpc_i8mm: 0.996x
mc_8tap_regular_w2_hv_8bpc_i8mm: 1.006x
mc_8tap_sharp_w2_hv_8bpc_i8mm: 0.993x
Cortex-A715:
mct_8tap_regular_w16_hv_8bpc_i8mm: 0.883x
mct_8tap_sharp_w16_hv_8bpc_i8mm: 0.931x
mct_8tap_regular_w8_hv_8bpc_i8mm: 0.882x
mct_8tap_sharp_w8_hv_8bpc_i8mm: 0.928x
mct_8tap_regular_w4_hv_8bpc_i8mm: 0.969x
mct_8tap_sharp_w4_hv_8bpc_i8mm: 0.934x
mc_8tap_regular_w16_hv_8bpc_i8mm: 0.881x
mc_8tap_sharp_w16_hv_8bpc_i8mm: 0.925x
mc_8tap_regular_w8_hv_8bpc_i8mm: 0.879x
mc_8tap_sharp_w8_hv_8bpc_i8mm: 0.925x
mc_8tap_regular_w4_hv_8bpc_i8mm: 0.917x
mc_8tap_sharp_w4_hv_8bpc_i8mm: 0.976x
mc_8tap_regular_w2_hv_8bpc_i8mm: 0.915x
mc_8tap_sharp_w2_hv_8bpc_i8mm: 0.972x
Cortex-A510:
mct_8tap_regular_w16_hv_8bpc_i8mm: 0.994x
mct_8tap_sharp_w16_hv_8bpc_i8mm: 0.949x
mct_8tap_regular_w8_hv_8bpc_i8mm: 0.987x
mct_8tap_sharp_w8_hv_8bpc_i8mm: 0.947x
mct_8tap_regular_w4_hv_8bpc_i8mm: 1.002x
mct_8tap_sharp_w4_hv_8bpc_i8mm: 0.999x
mc_8tap_regular_w16_hv_8bpc_i8mm: 0.989x
mc_8tap_sharp_w16_hv_8bpc_i8mm: 1.003x
mc_8tap_regular_w8_hv_8bpc_i8mm: 0.986x
mc_8tap_sharp_w8_hv_8bpc_i8mm: 1.000x
mc_8tap_regular_w4_hv_8bpc_i8mm: 1.007x
mc_8tap_sharp_w4_hv_8bpc_i8mm: 1.000x
mc_8tap_regular_w2_hv_8bpc_i8mm: 1.005x
mc_8tap_sharp_w2_hv_8bpc_i8mm: 1.000x