arm64: itx: Add NEON optimized inverse transforms (!726) · Merge requests · VideoLAN / dav1d

The speedup for most non-dc-only dct functions is around 9-12x over the C code generated by GCC 7.3.
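For readers unfamiliar with the term: a dc-only block is one where only the DC coefficient is nonzero, so the inverse DCT yields the same value at every position and the whole transform collapses to adding one offset to each pixel. The following is a minimal sketch of that idea in C, not the code from this merge request. The helper names (`clip_u8`, `add_dc_only_8bpc`) are hypothetical, and the scaling and rounding that turn the DC coefficient into `dc_offset` are deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range [0, 255]. */
static inline uint8_t clip_u8(int v) {
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Hypothetical helper (not a dav1d function): add one precomputed DC
 * offset to every pixel of a w x h block. dc_offset is assumed to be
 * already scaled and rounded by the caller. */
static void add_dc_only_8bpc(uint8_t *dst, ptrdiff_t stride,
                             int w, int h, int dc_offset) {
    for (int y = 0; y < h; y++, dst += stride)
        for (int x = 0; x < w; x++)
            dst[x] = clip_u8(dst[x] + dc_offset);
}
```

Because this path has so little arithmetic to vectorize, it is plausible that dc-only cases gain less from NEON than the full transforms do.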

Relative speedups vs C for a few functions:

                                              Cortex A53    A72    A73
inv_txfm_add_4x4_dct_dct_0_8bpc_neon:               3.90   4.16   5.65
inv_txfm_add_4x4_dct_dct_1_8bpc_neon:               7.20   8.05  11.19
inv_txfm_add_8x8_dct_dct_0_8bpc_neon:               5.09   6.73   6.45
inv_txfm_add_8x8_dct_dct_1_8bpc_neon:              12.18  10.80  13.05
inv_txfm_add_16x16_dct_dct_0_8bpc_neon:             7.31   9.35  11.17
inv_txfm_add_16x16_dct_dct_1_8bpc_neon:            14.36  13.06  15.93
inv_txfm_add_16x16_dct_dct_2_8bpc_neon:            11.00  10.09  12.05
inv_txfm_add_32x32_dct_dct_0_8bpc_neon:             4.41   5.40   5.77
inv_txfm_add_32x32_dct_dct_1_8bpc_neon:            13.84  13.81  18.04
inv_txfm_add_32x32_dct_dct_2_8bpc_neon:            11.75  11.87  15.22
inv_txfm_add_32x32_dct_dct_3_8bpc_neon:            10.20  10.40  13.13
inv_txfm_add_32x32_dct_dct_4_8bpc_neon:             9.01   9.21  11.56
inv_txfm_add_64x64_dct_dct_0_8bpc_neon:             3.84   4.82   5.28
inv_txfm_add_64x64_dct_dct_1_8bpc_neon:            14.40  12.69  16.71
inv_txfm_add_64x64_dct_dct_4_8bpc_neon:            10.91   9.63  12.67

Some of the special-cased identity_identity transforms for 32x32 give insane speedups over the generic C code:

inv_txfm_add_32x32_identity_identity_0_8bpc_neon: 225.26 238.11 247.07
inv_txfm_add_32x32_identity_identity_1_8bpc_neon: 225.33 238.53 247.69
inv_txfm_add_32x32_identity_identity_2_8bpc_neon:  59.60  61.94  64.63
inv_txfm_add_32x32_identity_identity_3_8bpc_neon:  26.98  27.99  29.21
inv_txfm_add_32x32_identity_identity_4_8bpc_neon:  15.08  15.93  16.56

For one particular test with Chimera (-i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o n.null --framethreads 4 --tilethreads 4 --skip 120 --limit 1000 on a Snapdragon 835), total decoding speed goes from 109 fps to 118 fps.
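To relate the 109 fps and 118 fps figures to the per-function numbers above, here is a rough back-of-the-envelope sketch in C. The function names are hypothetical. The last helper applies Amdahl's law under the loud assumption that every inverse transform got the same local speedup (for example 10x), which the tables show is only approximately true, so its result is an estimate and not a measurement.

```c
/* Ratio of new to old throughput, e.g. 118 fps over 109 fps. */
static double overall_speedup(double fps_before, double fps_after) {
    return fps_after / fps_before;
}

/* Average decode time saved per frame, in milliseconds. */
static double ms_saved_per_frame(double fps_before, double fps_after) {
    return 1000.0 / fps_before - 1000.0 / fps_after;
}

/* Amdahl's law, overall = 1 / ((1 - p) + p / s), solved for p:
 * the share of the original runtime spent in the accelerated code,
 * ASSUMING one uniform local speedup s for all of that code. */
static double implied_fraction(double overall, double local_speedup) {
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / local_speedup);
}
```

With the figures quoted here, `overall_speedup(109, 118)` is about 1.083 (roughly an 8% gain) and `ms_saved_per_frame(109, 118)` is about 0.7 ms. Under the uniform 10x assumption, `implied_fraction(118.0 / 109.0, 10.0)` comes out near 0.085, which would mean inverse transforms took somewhere around 8 to 9% of decoding time before this change.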

Also mentioning @janne for potential reviews.
