arm64: itx: Do the final calculation of adst8/adst16 in 32 bit to avoid too narrow clipping (!783) · Merge requests · VideoLAN / dav1d

Martin Storsjö requested to merge mstorsjo/dav1d:arm64-itx-32bit into master Sep 03, 2019

See issue #295 (closed); this fixes it for arm64.

Before:                                 Cortex A53      A72      A73
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:      332.0    248.0    247.1
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:   1676.8   1197.0   1186.8
After:
inv_txfm_add_8x8_adst_adst_1_8bpc_neon:      358.0    269.0    276.2
inv_txfm_add_16x16_adst_adst_2_8bpc_neon:   1785.2   1347.8   1312.1

The 32 bit calculation is probably only needed for adst in the first pass. However, we currently don't have transforms differentiated between first and second pass, and the additional code complexity from splitting the implementations isn't necessarily worth it (the speedup over C code is still 8-10x).
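To illustrate what "too narrow clipping" means here, the following is a minimal scalar C sketch, not the actual NEON code from this merge request. It assumes the final adst step has the form "add two 16 bit values, then scale by roughly 1/sqrt(2)", with 2896/4096 as the fixed-point factor. The helper names (`sat16`, `scale_round`, `final_step_16bit`, `final_step_32bit`) are hypothetical and exist only for this example.

```c
#include <stdint.h>

/* Saturate a 32 bit value to the int16_t range. */
static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical helper: multiply by 2896/4096 (about 1/sqrt(2)) with rounding.
 * Assumes >> on negative values is an arithmetic shift, as on common compilers. */
static int32_t scale_round(int32_t v) {
    return (v * 2896 + 2048) >> 12;
}

/* Narrow variant: the sum is saturated to 16 bit before scaling, so a sum
 * outside the int16_t range is clipped even if the scaled result would fit. */
static int16_t final_step_16bit(int16_t a, int16_t b) {
    int16_t sum = sat16((int32_t)a + b);
    return sat16(scale_round(sum));
}

/* Wide variant: the sum and the scaling are done in 32 bit, and only the
 * final result is saturated to 16 bit. */
static int16_t final_step_32bit(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + b;
    return sat16(scale_round(sum));
}
```

With `a = 30000` and `b = 10000`, the narrow variant returns 23167 because the intermediate sum is clipped to 32767, while the wide variant returns 28281, which still fits in 16 bit. For small inputs the two variants agree, so the difference only shows up for coefficients near the edge of the 16 bit range.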

Also notifying @gramner

Edited Sep 03, 2019 by Martin Storsjö

