arm64: itx: Add NEON optimized inverse transforms
The speedup for most non-dc-only dct functions is around 9-12x over the C code generated by GCC 7.3.
Relative speedups vs C for a few functions:
                                     Cortex A53     A72     A73
inv_txfm_add_4x4_dct_dct_0_8bpc_neon:      3.90    4.16    5.65
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:      7.20    8.05   11.19
inv_txfm_add_8x8_dct_dct_0_8bpc_neon:      5.09    6.73    6.45
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:     12.18   10.80   13.05
inv_txfm_add_16x16_dct_dct_0_8bpc_neon:    7.31    9.35   11.17
inv_txfm_add_16x16_dct_dct_1_8bpc_neon:   14.36   13.06   15.93
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:   11.00   10.09   12.05
inv_txfm_add_32x32_dct_dct_0_8bpc_neon:    4.41    5.40    5.77
inv_txfm_add_32x32_dct_dct_1_8bpc_neon:   13.84   13.81   18.04
inv_txfm_add_32x32_dct_dct_2_8bpc_neon:   11.75   11.87   15.22
inv_txfm_add_32x32_dct_dct_3_8bpc_neon:   10.20   10.40   13.13
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:    9.01    9.21   11.56
inv_txfm_add_64x64_dct_dct_0_8bpc_neon:    3.84    4.82    5.28
inv_txfm_add_64x64_dct_dct_1_8bpc_neon:   14.40   12.69   16.71
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:   10.91    9.63   12.67
Some of the special-cased identity_identity transforms for 32x32 give huge speedups over the generic C code:
inv_txfm_add_32x32_identity_identity_0_8bpc_neon:  225.26  238.11  247.07
inv_txfm_add_32x32_identity_identity_1_8bpc_neon:  225.33  238.53  247.69
inv_txfm_add_32x32_identity_identity_2_8bpc_neon:   59.60   61.94   64.63
inv_txfm_add_32x32_identity_identity_3_8bpc_neon:   26.98   27.99   29.21
inv_txfm_add_32x32_identity_identity_4_8bpc_neon:   15.08   15.93   16.56
Overall decoding speed for one particular test with Chimera (-i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o n.null --framethreads 4 --tilethreads 4 --skip 120 --limit 1000,
on a Snapdragon 835) improves from 109 fps to 118 fps.
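To put the end-to-end figure next to the per-function ratios above, the fps change can be expressed as a relative gain. This is a minimal sketch; the helper name fps_gain_percent is hypothetical and not part of the codebase.

```c
/* Hypothetical helper, for illustration only: relative throughput gain
 * in percent when decoding speed goes from fps_before to fps_after. */
static double fps_gain_percent(double fps_before, double fps_after)
{
    return (fps_after / fps_before - 1.0) * 100.0;
}
```

With the numbers above, fps_gain_percent(109.0, 118.0) comes out at roughly 8.3%. That is far below the per-function speedups, as expected, since the inverse transforms are only one part of the total decoding work.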
Also mentioning @janne for potential reviews.