arm: itx: Add clipping to row_clip_min/max in the 10 bpc codepaths
This fixes conformance with the argon test samples, in particular these two:
profile0_core/streams/test10100_579_8614.obu
profile0_core/streams/test10218_6914.obu
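For reference, the clipping amounts to clamping the 32-bit intermediate row
coefficients to the row_clip_min/max range. A rough C/NEON-intrinsics sketch
of the idea follows - this is not the actual assembly; the standalone helper,
its name and the loop shape are illustrative only, and the bounds assume the
reference-style row_clip_min = ~bitdepth_max << 7 definition:

#include <arm_neon.h>
#include <stdint.h>

/* Illustrative sketch only: clamp a run of 32-bit intermediate coefficients
 * to the 10 bpc row range. Helper name and loop shape are hypothetical. */
static void clip_rows_10bpc(int32_t *coef, int n) {
    const int bitdepth_max = (1 << 10) - 1;                        /* 1023 */
    const int32_t min = (int32_t) ((unsigned) ~bitdepth_max << 7); /* -131072 */
    const int32_t max = ~min;                                      /*  131071 */
    const int32x4_t vmin = vdupq_n_s32(min);
    const int32x4_t vmax = vdupq_n_s32(max);
    for (int i = 0; i < n; i += 4) {            /* n assumed multiple of 4 */
        int32x4_t v = vld1q_s32(&coef[i]);
        v = vmaxq_s32(v, vmin);                 /* clamp from below */
        v = vminq_s32(v, vmax);                 /* clamp from above */
        vst1q_s32(&coef[i], v);
    }
}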
The added clipping gives a pretty notable slowdown to these transforms - some examples:
Before:                                  Cortex A53      A72      A73  Apple M1
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:        365.7    291.4    299.2       0.5
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:     1864.8   1408.2   1458.2       2.6
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:    31019.8  25440.7  24892.5      42.8
After:
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:        401.7    322.5    343.4       0.6
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:     2154.4   1614.3   1704.9       2.7
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:    38220.0  28423.7  28172.6      51.6
Thus, for the transforms alone, the clipping makes them around 10-20% slower.
Measured on actual full decoding, it makes decoding of 10 bpc Chimera around 2% slower on an Apple M1 (from 164 to 160 fps).