arm64: itx: Do the final calculation of adst8/adst16 in 32 bit to avoid too narrow clipping
See issue #295 (closed); this fixes it for arm64.
Before:                                     Cortex A53     A72     A73
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:          332.0   248.0   247.1
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:       1676.8  1197.0  1186.8
After:
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:          358.0   269.0   276.2
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:       1785.2  1347.8  1312.1
This is probably only needed for adst in the first pass. However, we currently don't have transforms differentiated between first and second pass, and the additional code complexity from splitting the implementations isn't necessarily worth it: the speedup over the C code is still 8-10x.
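
To illustrate the kind of clipping the title refers to, here is a minimal C sketch. It is not the actual NEON code. It assumes the final adst step has the form (a + b) * c >> 12 with c = 2896 (roughly 4096/sqrt(2)); the helper names `sat16`, `final_narrow` and `final_wide` are hypothetical and exist only for this example. If the sum is saturated to 16 bit before scaling, the result can never exceed about 32767/sqrt(2), even though the correctly computed result still fits in 16 bit.

```c
#include <stdint.h>

/* Hypothetical helper: saturate a 32 bit value to the int16_t range. */
static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Narrow variant (assumed shape of the old behaviour): the sum is
 * saturated to 16 bit first, so the scaled output is clipped too early. */
static int16_t final_narrow(int16_t a, int16_t b) {
    int16_t sum = sat16((int32_t)a + b);
    return sat16(((int32_t)sum * 2896 + 2048) >> 12);
}

/* Wide variant (assumed shape of the fix): the sum and the product are
 * kept in 32 bit, and only the final rounded result is narrowed. */
static int16_t final_wide(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + b;
    return sat16((sum * 2896 + 2048) >> 12);
}
```

With a = b = 20000, `final_narrow` returns 23167 while `final_wide` returns 28281; for small inputs such as a = 100, b = 50, both return 106. The wide variant needs widening multiplies and a narrowing shift, which is in line with the modest slowdown in the numbers above.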
Also notifying @gramner
Edited by Martin Storsjö