arm32: mc: NEON implementation of put/prep 8tap/bilin for 16 bpc

Examples of checkasm benchmarks:
Cortex                                  A7      A8      A9     A53     A72     A73
mc_8tap_regular_w8_0_16bpc_neon:     158.7   106.2   167.0   127.9    55.0    77.2
mc_8tap_regular_w8_h_16bpc_neon:    1000.8   557.5   749.2   609.2   401.4   485.4
mc_8tap_regular_w8_hv_16bpc_neon:   2278.9  1255.4  1352.5  1277.2   867.8   915.9
mc_8tap_regular_w8_v_16bpc_neon:    1060.0   393.6   485.5   448.3   298.0   298.2
mc_bilinear_w8_0_16bpc_neon:         159.7    96.6   161.1   123.7    55.4    74.7
mc_bilinear_w8_h_16bpc_neon:         342.3   250.8   352.9   239.0   158.4   165.1
mc_bilinear_w8_hv_16bpc_neon:        587.7   373.8   469.0   339.8   244.4   247.5
mc_bilinear_w8_v_16bpc_neon:         285.8   189.3   284.9   180.4   103.4   100.9
mct_8tap_regular_w8_0_16bpc_neon:    233.0   136.6   229.3   169.3    86.2    98.3
mct_8tap_regular_w8_h_16bpc_neon:   1106.8   588.3   817.9   654.1   406.4   489.8
mct_8tap_regular_w8_hv_16bpc_neon:  2473.3  1326.3  1428.2  1373.7   903.3   951.1
mct_8tap_regular_w8_v_16bpc_neon:   1266.0   474.1   581.3   505.9   382.0   373.4
mct_bilinear_w8_0_16bpc_neon:        232.9   126.2   225.0   166.3    86.2    91.7
mct_bilinear_w8_h_16bpc_neon:        380.6   270.6   386.0   259.7   154.1   151.9
mct_bilinear_w8_hv_16bpc_neon:       631.4   409.2   509.4   372.1   243.1   244.1
mct_bilinear_w8_v_16bpc_neon:        349.5   233.5   347.9   212.4   138.7   138.4

For comparison, the corresponding numbers for the existing arm64 implementation:
Cortex                                 A53     A72     A73
mc_8tap_regular_w8_0_16bpc_neon:      94.1    48.9    62.3
mc_8tap_regular_w8_h_16bpc_neon:     570.4   388.1   467.3
mc_8tap_regular_w8_hv_16bpc_neon:   1035.8   775.0   891.2
mc_8tap_regular_w8_v_16bpc_neon:     399.8   284.5   278.2
mc_bilinear_w8_0_16bpc_neon:          90.0    44.3    57.4
mc_bilinear_w8_h_16bpc_neon:         191.7   158.7   156.4
mc_bilinear_w8_hv_16bpc_neon:        295.6   235.0   244.9
mc_bilinear_w8_v_16bpc_neon:         147.2    99.0    88.8
mct_8tap_regular_w8_0_16bpc_neon:    139.4    78.4    84.9
mct_8tap_regular_w8_h_16bpc_neon:    612.3   395.9   478.6
mct_8tap_regular_w8_hv_16bpc_neon:  1113.0   804.3   963.5
mct_8tap_regular_w8_v_16bpc_neon:    462.1   370.8   353.3
mct_bilinear_w8_0_16bpc_neon:        135.6    77.0    80.5
mct_bilinear_w8_h_16bpc_neon:        210.8   159.2   141.7
mct_bilinear_w8_hv_16bpc_neon:       325.7   238.4   227.3
mct_bilinear_w8_v_16bpc_neon:        180.7   136.7   129.5