aarch64: Use rounded right shifts in dequant
Don't manually add in the rounding constant (via a fused multiply-add instruction) when we can just do a plain rounded right shift.
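The equivalence this change relies on can be sketched in scalar C. This is only a model under stated assumptions, not the NEON code from this commit: the function names are hypothetical, the "before" path imitates seeding an accumulator with the rounding constant and multiply-adding into it, and the "after" path imitates a plain multiply followed by a rounding right shift. It also assumes that >> on a negative signed value is an arithmetic shift, which C leaves implementation-defined but mainstream compilers provide.

```c
#include <assert.h>
#include <stdint.h>

/* Before (model): preload the rounding constant, multiply-add into it,
 * then do a plain arithmetic right shift. Names are hypothetical. */
static int32_t dequant_mla_shift(int32_t coef, int32_t mf, int shift)
{
    int32_t acc = 1 << (shift - 1);   /* rounding constant 2^(shift-1) */
    acc += coef * mf;                 /* models the multiply-add step */
    return acc >> shift;              /* plain (truncating) right shift */
}

/* Scalar model of a rounding right shift: the rounding term is added in
 * 64 bits so the addition itself cannot overflow. */
static int32_t rounding_shift_right(int32_t x, int shift)
{
    return (int32_t)(((int64_t)x + ((int64_t)1 << (shift - 1))) >> shift);
}

/* After (model): plain multiply, then one rounding right shift. */
static int32_t dequant_rshift(int32_t coef, int32_t mf, int shift)
{
    return rounding_shift_right(coef * mf, shift);
}

/* Count inputs in a small, overflow-free range where the two disagree. */
int count_mismatches(void)
{
    int mismatches = 0;
    for (int shift = 1; shift <= 6; shift++)
        for (int32_t coef = -256; coef <= 256; coef++)
            for (int32_t mf = 1; mf <= 64; mf++)
                if (dequant_mla_shift(coef, mf, shift) !=
                    dequant_rshift(coef, mf, shift))
                    mismatches++;
    return mismatches;
}

/* Aborts if the two formulations ever differ on the tested range. */
void check_equivalence(void)
{
    assert(count_mismatches() == 0);
}
```

Calling check_equivalence() from a small test program exercises both paths over the sampled range. The point of the sketch is only that the rounding term does not need its own register or accumulate step when the shift itself rounds.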
                          Cortex A53    A72    A73
8bpc:
Before:
dequant_4x4_cqm_neon:            515    246    267
dequant_4x4_dc_cqm_neon:         410    265    266
dequant_4x4_dc_flat_neon:        413    271    271
dequant_4x4_flat_neon:           519    254    274
dequant_8x8_cqm_neon:           1555    980   1002
dequant_8x8_flat_neon:          1562    994   1014
After:
dequant_4x4_cqm_neon:            499    246    255
dequant_4x4_dc_cqm_neon:         376    265    255
dequant_4x4_dc_flat_neon:        378    271    260
dequant_4x4_flat_neon:           500    254    262
dequant_8x8_cqm_neon:           1489    900    925
dequant_8x8_flat_neon:          1493    915    938

10bpc:
Before:
dequant_4x4_cqm_neon:            483    275    275
dequant_4x4_dc_cqm_neon:         429    256    261
dequant_4x4_dc_flat_neon:        435    267    267
dequant_4x4_flat_neon:           487    283    288
dequant_8x8_cqm_neon:           1511   1112   1076
dequant_8x8_flat_neon:          1518   1139   1089
After:
dequant_4x4_cqm_neon:            472    255    239
dequant_4x4_dc_cqm_neon:         404    256    232
dequant_4x4_dc_flat_neon:        406    267    234
dequant_4x4_flat_neon:           472    255    239
dequant_8x8_cqm_neon:           1462    922    978
dequant_8x8_flat_neon:          1462    922    978
This makes it around 3% faster on the Cortex A53, around 8% faster for 8bpc on the Cortex A72/A73, and around 10-20% faster for 10bpc on the Cortex A72/A73.