arm64: mc: Use addp instead of addv+trn1 in warp

Before:               Cortex A53     A72     A73
warp_8x8_8bpc_neon:       1952.8  1161.3  1151.1
warp_8x8t_8bpc_neon:      1937.1  1147.5  1139.0
After:
warp_8x8_8bpc_neon:       1860.8  1068.6  1105.8
warp_8x8t_8bpc_neon:      1846.9  1056.4  1099.8