- Aug 24, 2024
Martin Storsjö authored
Apparently this case never actually gets executed, at least in most checkasm runs, but some tools could complain about the relocation against 160b, which pointed somewhere other than intended.
-
- Aug 23, 2024
Martin Storsjö authored
This applies the same optimizations as 3329f8d1 and 1790e132 to the rest of the code.
-
Martin Storsjö authored
This makes the code behave as intended when filling a rectangle of arbitrary width (filling with the largest power-of-two width until the rectangle is covered); previously, it accidentally fell back to writing 4 pixel wide stripes immediately. No measurable effect on checkasm benchmarks though.
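A minimal scalar sketch of the intended strategy, with an invented function name and an illustrative 32-pixel upper stripe size (the actual code operates on vector registers):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative only: cover an arbitrary width with the widest
     * power-of-two stripes first (32, 16, 8, 4) instead of immediately
     * falling back to 4-pixel stripes.  Assumes w is a multiple of 4. */
    static void fill_rect(uint8_t *const dst, const ptrdiff_t stride,
                          const int w, const int h, const uint8_t val)
    {
        int x = 0;
        for (int step = 32; step >= 4; step >>= 1)
            while (w - x >= step) {
                for (int y = 0; y < h; y++)
                    memset(dst + y * stride + x, val, step);
                x += step;
            }
    }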
-
- Aug 22, 2024
MS armasm64 cannot compile some SVE instructions with immediate operands, e.g.:

    sub z0.h, z0.h, #8192

The proper form is:

    sub z0.h, z0.h, #32, lsl #8

This patch contains the needed fixes.
-
Martin Storsjö authored
Don't include the BTI landing pad instruction in the loops. If built with BTI enabled, AARCH64_VALID_JUMP_TARGET expands to a no-op instruction that indicates that indirect jumps can land there. But there's no need for the loops to include that instruction.
-
Add an Armv9.0-A SVE2 code path for high bitdepth convolutions. Only 2D convolutions have 6-tap specialisations of their vertical passes. All other convolutions are 4- or 8-tap filters which fit well with the 4-element 16-bit SDOT instruction of SVE2. This patch renames HBD prep/put_neon to prep/put_16bpc_neon and exports put_16bpc_neon. Benchmarks show up to 17% FPS increase depending on the input video and the CPU used. This patch will increase the .text by around 8 KiB. Relative performance to the C reference on some Cortex-A/X CPUs:

regular       A715    A720    X3      X4      A510    A520
w4  hv neon:  3.93x   4.10x   5.21x   5.17x   3.57x   5.27x
w4  hv sve2:  4.99x   5.14x   6.00x   6.05x   4.33x   3.99x
w8  hv neon:  1.72x   1.67x   1.98x   2.18x   2.95x   2.94x
w8  hv sve2:  2.12x   2.29x   2.52x   2.62x   2.60x   2.60x
w16 hv neon:  1.59x   1.53x   1.83x   1.89x   2.35x   2.24x
w16 hv sve2:  1.94x   2.12x   2.33x   2.18x   2.06x   2.06x
w32 hv neon:  1.49x   1.50x   1.66x   1.76x   2.10x   2.16x
w32 hv sve2:  1.81x   2.09x   2.11x   2.09x   1.84x   1.87x
w64 hv neon:  1.52x   1.50x   1.55x   1.71x   1.95x   2.05x
w64 hv sve2:  1.84x   2.08x   1.97x   1.98x   1.74x   1.77x
w4  h  neon:  5.35x   5.47x   7.39x   5.78x   3.92x   5.19x
w4  h  sve2:  7.91x   8.35x  11.95x  10.33x   5.81x   5.42x
w8  h  neon:  4.49x   4.43x   6.50x   4.87x   7.18x   6.17x
w8  h  sve2:  6.09x   6.22x   9.59x   7.70x   7.89x   6.83x
w16 h  neon:  2.53x   2.52x   2.34x   1.86x   2.71x   2.75x
w16 h  sve2:  3.41x   3.47x   3.53x   3.25x   2.89x   2.96x
w32 h  neon:  2.07x   2.08x   1.97x   1.56x   2.17x   2.21x
w32 h  sve2:  2.76x   2.84x   2.94x   2.75x   2.24x   2.29x
w64 h  neon:  1.86x   1.86x   1.76x   1.41x   1.87x   1.88x
w64 h  sve2:  2.47x   2.54x   2.65x   2.46x   1.94x   1.94x
w4  v  neon:  5.22x   5.17x   6.36x   5.60x   4.23x   7.30x
w4  v  sve2:  5.86x   5.90x   7.81x   7.16x   4.86x   4.15x
w8  v  neon:  4.83x   4.79x   6.96x   6.45x   4.74x   8.40x
w8  v  sve2:  5.25x   5.23x   7.76x   6.79x   4.84x   4.13x
w16 v  neon:  2.59x   2.60x   2.93x   2.47x   1.80x   4.16x
w16 v  sve2:  2.85x   2.88x   3.36x   2.73x   1.86x   2.00x
w32 v  neon:  2.12x   2.13x   2.33x   2.03x   1.34x   3.11x
w32 v  sve2:  2.36x   2.40x   2.73x   2.32x   1.41x   1.48x
w64 v  neon:  1.94x   1.92x   2.02x   1.78x   1.12x   2.59x
w64 v  sve2:  2.16x   2.15x   2.37x   2.03x   1.17x   1.22x
w4  0  neon:  1.75x   1.71x   1.44x   1.56x   3.18x   2.87x
w4  0  sve2:  4.28x   4.39x   5.72x   6.42x   5.50x   4.68x
w8  0  neon:  3.05x   3.04x   4.44x   4.64x   3.84x   3.52x
w8  0  sve2:  3.85x   3.80x   5.45x   6.01x   4.92x   4.26x
w16 0  neon:  2.92x   2.93x   3.82x   3.23x   4.58x   4.44x
w16 0  sve2:  4.29x   4.27x   4.25x   4.15x   5.58x   5.29x
w32 0  neon:  2.73x   2.76x   3.50x   2.67x   4.44x   4.26x
w32 0  sve2:  4.09x   4.10x   3.75x   3.39x   5.67x   5.22x
w64 0  neon:  2.73x   2.70x   3.27x   3.14x   4.57x   4.68x
w64 0  sve2:  4.06x   3.97x   3.54x   3.18x   6.36x   6.25x

sharp         A715    A720    X3      X4      A510    A520
w4  hv neon:  3.54x   3.64x   4.43x   4.45x   3.03x   4.72x
w4  hv sve2:  4.30x   4.55x   5.38x   5.26x   4.04x   3.76x
w8  hv neon:  1.30x   1.25x   1.51x   1.60x   2.44x   2.43x
w8  hv sve2:  1.86x   2.06x   2.09x   2.18x   2.37x   2.39x
w16 hv neon:  1.19x   1.16x   1.43x   1.36x   1.95x   1.98x
w16 hv sve2:  1.68x   1.91x   1.94x   1.84x   1.89x   1.94x
w32 hv neon:  1.13x   1.12x   1.30x   1.29x   1.75x   1.81x
w32 hv sve2:  1.58x   1.84x   1.75x   1.74x   1.70x   1.76x
w64 hv neon:  1.13x   1.13x   1.21x   1.25x   1.65x   1.69x
w64 hv sve2:  1.57x   1.84x   1.62x   1.67x   1.62x   1.65x
w4  h  neon:  5.38x   5.49x   7.46x   5.74x   3.93x   5.23x
w4  h  sve2:  7.86x   8.37x  11.99x  10.38x   5.81x   5.40x
w8  h  neon:  3.46x   3.49x   5.36x   4.64x   6.40x   5.62x
w8  h  sve2:  5.95x   6.23x   9.61x   7.76x   7.86x   6.89x
w16 h  neon:  1.99x   1.97x   2.07x   1.91x   2.43x   2.51x
w16 h  sve2:  3.42x   3.46x   3.75x   3.23x   2.89x   2.98x
w32 h  neon:  1.67x   1.62x   1.66x   1.63x   1.95x   2.01x
w32 h  sve2:  2.86x   2.84x   2.94x   2.72x   2.21x   2.29x
w64 h  neon:  1.45x   1.45x   1.51x   1.48x   1.69x   1.70x
w64 h  sve2:  2.47x   2.54x   2.64x   2.46x   1.93x   1.95x
w4  v  neon:  4.07x   4.01x   5.15x   4.74x   3.38x   6.56x
w4  v  sve2:  5.88x   5.86x   7.81x   7.15x   4.85x   4.39x
w8  v  neon:  3.64x   3.59x   5.38x   4.92x   3.59x   7.23x
w8  v  sve2:  5.23x   5.19x   7.77x   6.66x   4.81x   4.13x
w16 v  neon:  1.93x   1.95x   2.25x   1.92x   1.35x   3.46x
w16 v  sve2:  2.85x   2.88x   3.36x   2.71x   1.86x   1.94x
w32 v  neon:  1.57x   1.58x   1.78x   1.60x   1.01x   2.67x
w32 v  sve2:  2.36x   2.39x   2.73x   2.35x   1.41x   1.50x
w64 v  neon:  1.44x   1.42x   1.54x   1.43x   0.85x   2.19x
w64 v  sve2:  2.17x   2.15x   2.37x   2.06x   1.18x   1.25x
-
- Aug 21, 2024
Arpad Panyik authored
Add a 6-tap variant of the standard bit-depth horizontal subpel filters using the Armv8.6 I8MM USMMLA matrix multiply instruction. This patch also extends the HV filter with a 6-tap horizontal pass using USMMLA. Benchmarks show up to 6-7% FPS increase depending on the input video and the CPU used. This patch will increase the .text by around 1.2 KiB. Relative runtime of micro benchmarks after this patch on Neoverse and Cortex CPU cores:

regular   V2      V1      X3      A720    A715    A520    A510
w8  hv:   0.860x  0.895x  0.870x  0.896x  0.896x  0.938x  0.936x
w16 hv:   0.829x  0.886x  0.865x  0.908x  0.906x  0.946x  0.944x
w32 hv:   0.837x  0.883x  0.862x  0.914x  0.915x  0.953x  0.949x
w64 hv:   0.840x  0.883x  0.862x  0.914x  0.914x  0.955x  0.952x
w8  h:    0.746x  0.754x  0.747x  0.723x  0.724x  0.874x  0.866x
w16 h:    0.749x  0.764x  0.745x  0.731x  0.731x  0.858x  0.852x
w32 h:    0.739x  0.754x  0.738x  0.729x  0.729x  0.839x  0.837x
w64 h:    0.736x  0.749x  0.733x  0.725x  0.726x  0.847x  0.836x
-
- Aug 12, 2024
Arpad Panyik authored
The macro parameter \xmy of filter_8tap_fn was used incorrectly as a pointer instead of \lsrc. They refer to the same register, but in different contexts.
-
- Aug 04, 2024
Kyle Siefring authored
Performance Impact on Sapphire Rapids: Chimera: 0.46% Faster
-
- Jun 26, 2024
Arpad Panyik authored
The constants used for the subpel filters were placed in the .text section for simplicity and peak performance, but this does not work on systems with execute only .text sections (e.g.: OpenBSD). The performance cost of moving the constants to the .rodata section is small and mostly within the measurable noise.
-
- Jun 25, 2024
Martin Storsjö authored
The ldr instruction can only handle offsets that are a multiple of the element size; most assemblers implicitly produce the ldur instruction when a non-aligned offset is provided. Older versions of MS armasm64, however, error out on this. Since MSVC 2022 17.8, armasm64 can implicitly produce ldur, but 2022 17.7 and earlier require writing the instruction explicitly as ldur. Despite this, even older versions still fail to build the mc_dotprod.S sources, with errors like this:

    src\libdav1d.a.p\mc_dotprod.obj.asm(556) : error A2513: operand 2: Constant value out of range
        mov x10, (((0*15-1)<<7)|(3*15-1))

This happens on MSVC 2022 17.1 and older, while 17.2 and newer accept the negative value expression here. In practice, HAVE_DOTPROD doesn't get enabled by the Meson configure script at the moment, as it uses inline assembly to test for external assembler features.
-
Add run-time CPU feature detection for DotProd and i8mm on AArch64.
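A hedged sketch of one common way to do this on Linux/AArch64, using getauxval(); the flag constants here are invented, and real detection code also needs paths for other operating systems:

    #include <sys/auxv.h>   /* getauxval(), AT_HWCAP, AT_HWCAP2 (Linux) */
    #include <asm/hwcap.h>  /* HWCAP_ASIMDDOTPROD, HWCAP2_I8MM */

    #define MY_CPU_FLAG_DOTPROD (1 << 0)  /* invented flag bits */
    #define MY_CPU_FLAG_I8MM    (1 << 1)

    static unsigned detect_arm_cpu_flags(void)
    {
        unsigned flags = 0;
    #ifdef HWCAP_ASIMDDOTPROD
        if (getauxval(AT_HWCAP) & HWCAP_ASIMDDOTPROD)
            flags |= MY_CPU_FLAG_DOTPROD;
    #endif
    #ifdef HWCAP2_I8MM
        if (getauxval(AT_HWCAP2) & HWCAP2_I8MM)
            flags |= MY_CPU_FLAG_I8MM;
    #endif
        return flags;
    }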
-
Henrik Gramner authored
-
- Jun 17, 2024
Ronald S. Bultje authored
-
- Jun 10, 2024
Nathan E. Egge authored
-
- Jun 05, 2024
Arpad Panyik authored
The DotProd/I8MM horizontal and HV/2D subpel filters use a -4 sampling offset instead of -3 to be better aligned in some cases. This resulted in an out-of-bounds access, which led to crashes. This patch fixes it.
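For context, a scalar sketch of the canonical 8-tap horizontal window (the tap span matches the AV1 filters; the function itself is illustrative): a load that starts at x - 4 instead of x - 3 touches one extra pixel to the left, which must be covered by edge padding or it reads out of bounds at the left border.

    #include <stdint.h>

    /* Illustrative scalar reference: an 8-tap horizontal filter reads
     * src[x - 3] .. src[x + 4]; sampling from src[x - 4] (with the
     * extra tap zeroed) needs one more pixel of left padding. */
    static int filter_h_8tap(const uint8_t *const src, const int x,
                             const int8_t taps[8])
    {
        int sum = 0;
        for (int i = 0; i < 8; i++)
            sum += taps[i] * src[x - 3 + i];
        return sum;
    }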
-
- May 27, 2024
Henrik Gramner authored
-
Henrik Gramner authored
The conditions for when to (re)allocate those buffers are identical, so they can be merged into a single branch. The allocation of the buffers themselves can also be combined to reduce the number of allocation calls.
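A hedged C sketch of the pattern being described; the struct and field names below are invented and the real code differs:

    #include <stdlib.h>

    typedef struct {
        void  *block;          /* one allocation backing both buffers */
        size_t size_a, size_b;
    } BufPair;

    static int ensure_buffers(BufPair *const b,
                              const size_t need_a, const size_t need_b)
    {
        /* The (re)allocation conditions are identical, so a single branch
         * and a single allocation call cover both buffers: buffer A lives
         * at b->block, buffer B at (char *)b->block + need_a. */
        if (b->size_a < need_a || b->size_b < need_b) {
            free(b->block);
            b->block = malloc(need_a + need_b);
            if (!b->block) {
                b->size_a = b->size_b = 0;
                return -1;
            }
            b->size_a = need_a;
            b->size_b = need_b;
        }
        return 0;
    }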
-
Henrik Gramner authored
It's only ever called on data which has already been zero-initialized.
-
Henrik Gramner authored
n_tc is always >= n_fc, so we only need to check the latter.
-
Henrik Gramner authored
-
Henrik Gramner authored
-
Henrik Gramner authored
The amount of macro nesting caused by having to support SSE2 makes the code very difficult to maintain and modify. It is also of questionable value considering that most other asm requires SSSE3.
-
Henrik Gramner authored
-
- May 25, 2024
Jean-Baptiste Kempf authored
-
- May 20, 2024
Use a slightly shorter series of instructions to compute the cdf update rate.
-
Henrik Gramner authored
Error out early instead of producing bogus mismatch errors, for example in the case of an incorrect cpu mask.
-
- May 19, 2024
Martin Storsjö authored
The ldr instruction can take an immediate offset that is a multiple of the loaded element size. If the ldr instruction is given an immediate offset which isn't a multiple of the element size, most assemblers implicitly generate an "ldur" instruction instead. Older versions of MS armasm64.exe don't do this, but instead error out with "error A2518: operand 2: Memory offset must be aligned". (Current versions don't error out, but correctly generate "ldur" implicitly.) Switch this instruction to an explicit "ldur", like we do elsewhere, to fix building with these older tools.
-
- May 18, 2024
NDK 26 dropped support for API versions 19 and 20 (KitKat, Android 4.4). The minimum supported API is now 21 (Lollipop, Android 5.0).
-
- May 14, 2024
Kyle Siefring authored
Changes stem from redesigning the reduction stage of the multisymbol decode function.
* No longer use adapt4 for 5 possible symbol values
* Specialize reduction for 4/8/16 decode functions
* Modify control flow

+------------------------+--------------+--------------+---------------+
|                        | Neoverse V1  | Neoverse N1  | Cortex A72    |
|                        | (Graviton 3) | (Graviton 2) | (Graviton 1)  |
+------------------------+-------+------+-------+------+-------+-------+
|                        |  Old  | New  |  Old  | New  |  Old  |  New  |
+------------------------+-------+------+-------+------+-------+-------+
| decode_bool_neon       | 13.0  | 12.9 | 14.9  | 14.0 | 39.3  | 29.0  |
+------------------------+-------+------+-------+------+-------+-------+
| decode_bool_adapt_neon | 15.4  | 15.6 | 17.5  | 16.8 | 41.6  | 33.5  |
+------------------------+-------+------+-------+------+-------+-------+
| decode_bool_equi_neon  | 11.3  | 12.0 | 14.0  | 12.2 | 35.0  | 26.3  |
+------------------------+-------+------+-------+------+-------+-------+
| decode_hi_tok_c        | 73.7  | 57.8 | 73.4  | 60.5 | 130.1 | 103.9 |
+------------------------+-------+------+-------+------+-------+-------+
| decode_hi_tok_neon     | 63.3  | 48.2 | 65.2  | 51.2 | 119.0 | 105.3 |
+------------------------+-------+------+-------+------+-------+-------+
| decode_symbol_\        | 28.6  | 22.5 | 28.4  | 23.5 | 67.8  | 55.1  |
| adapt4_neon            |       |      |       |      |       |       |
+------------------------+-------+------+-------+------+-------+-------+
| decode_symbol_\        | 29.5  | 26.6 | 29.0  | 28.8 | 76.6  | 74.0  |
| adapt8_neon            |       |      |       |      |       |       |
+------------------------+-------+------+-------+------+-------+-------+
| decode_symbol_\        | 31.6  | 31.2 | 33.3  | 33.0 | 77.5  | 68.1  |
| adapt16_neon           |       |      |       |      |       |       |
+------------------------+-------+------+-------+------+-------+-------+
-
Optimize the widening copy part of subpel filters (the prep_neon function). In this patch we combine widening shifts with widening multiplications in the inner loops to get maximum throughput. The change will increase .text by 36 bytes. Relative performance of micro benchmarks (lower is better):

Cortex-A55:   mct_w4: 0.795x  mct_w8: 0.913x  mct_w16: 0.912x  mct_w32: 0.838x  mct_w64: 1.025x  mct_w128: 1.002x
Cortex-A510:  mct_w4: 0.760x  mct_w8: 0.636x  mct_w16: 0.640x  mct_w32: 0.854x  mct_w64: 0.864x  mct_w128: 0.995x
Cortex-A72:   mct_w4: 0.616x  mct_w8: 0.854x  mct_w16: 0.756x  mct_w32: 1.052x  mct_w64: 1.044x  mct_w128: 0.702x
Cortex-A76:   mct_w4: 0.837x  mct_w8: 0.797x  mct_w16: 0.841x  mct_w32: 0.804x  mct_w64: 0.948x  mct_w128: 0.904x
Cortex-A78:   mct_w16: 0.542x  mct_w32: 0.725x  mct_w64: 0.741x  mct_w128: 0.745x
Cortex-A715:  mct_w16: 0.561x  mct_w32: 0.720x  mct_w64: 0.740x  mct_w128: 0.748x
Cortex-X1:    mct_w32: 0.886x  mct_w64: 0.882x  mct_w128: 0.917x
Cortex-X3:    mct_w32: 0.835x  mct_w64: 0.803x  mct_w128: 0.808x
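A hedged NEON-intrinsics sketch of the instruction mix being described (the real code is hand-written assembly with a different loop structure; this assumes the intermediate value is the 8-bit pixel scaled by 16 and that w is a multiple of 16):

    #include <arm_neon.h>
    #include <stdint.h>

    /* Illustrative only: widen 8-bit pixels to 16 bits scaled by 16,
     * alternating a widening shift (USHLL) with a widening multiply
     * (UMULL) so that both kinds of execution pipes stay busy. */
    static void widen_row(int16_t *const dst, const uint8_t *const src,
                          const int w)
    {
        const uint8x8_t sixteen = vdup_n_u8(16);
        for (int x = 0; x < w; x += 16) {
            const uint8x8_t a = vld1_u8(src + x);
            const uint8x8_t b = vld1_u8(src + x + 8);
            vst1q_s16(dst + x,     vreinterpretq_s16_u16(vshll_n_u8(a, 4)));
            vst1q_s16(dst + x + 8, vreinterpretq_s16_u16(vmull_u8(b, sixteen)));
        }
    }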
-
Save a complex arithmetic instruction in the jump table address calculation of the prep_neon function.
-
Move the BTI landing pads out of the inner loops of the prep_neon function. Only the width=4 and width=8 cases are affected. If BTI is enabled, moving the AARCH64_VALID_JUMP_TARGET out of the inner loops gives better execution speed on Cortex-A510 relative to the original (lower is better):

w4: 0.969x  w8: 0.722x

Out-of-order cores are not affected.
-
- May 13, 2024
Arpad Panyik authored
Optimize the copy part of subpel filters (the put_neon function). For small block sizes (<16) the usage of general purpose registers is usually the best way to do the copy. Relative performance of micro benchmarks (lower is better):

Cortex-A55:   w2: 0.991x  w4: 0.992x  w8: 0.999x  w16: 0.875x  w32: 0.775x  w64: 0.914x  w128: 0.998x
Cortex-A510:  w2: 0.159x  w4: 0.080x  w8: 0.583x  w16: 0.588x  w32: 0.966x  w64: 1.111x  w128: 0.957x
Cortex-A76:   w2: 0.903x  w4: 0.683x  w8: 0.944x  w16: 0.948x  w32: 0.919x  w64: 0.855x  w128: 0.991x
Cortex-A78:   w32: 0.867x  w64: 0.820x  w128: 1.011x
Cortex-A715:  w32: 0.834x  w64: 0.778x  w128: 1.000x
Cortex-X1:    w32: 0.809x  w64: 0.762x  w128: 1.000x
Cortex-X3:    w32: 0.733x  w64: 0.720x  w128: 0.999x
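A hedged scalar illustration of the small-width idea (names invented; the real function is assembly and covers all widths): an 8-pixel-wide 8bpc row fits in a single 64-bit general-purpose register, so a plain ldr/str pair per row can beat the vector path on some in-order cores.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative only: each fixed-size 8-byte memcpy compiles to a
     * single general-purpose load/store of the whole row. */
    static void copy_w8(uint8_t *dst, const uint8_t *src,
                        const ptrdiff_t dst_stride,
                        const ptrdiff_t src_stride, int h)
    {
        do {
            uint64_t row;
            memcpy(&row, src, sizeof(row));
            memcpy(dst, &row, sizeof(row));
            dst += dst_stride;
            src += src_stride;
        } while (--h);
    }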
-
Arpad Panyik authored
Save a complex arithmetic instruction in the jump table address calculation of the put_neon function.
-
Arpad Panyik authored
Move the BTI landing pads out of the inner loops of the put_neon function; the only exception is the width=16 case, where it is already outside of the loops. When BTI is enabled, the relative performance of omitting AARCH64_VALID_JUMP_TARGET from the inner loops on Cortex-A510 is (lower is better):

w2: 0.981x  w4: 0.991x  w8: 0.612x  w32: 0.687x  w64: 0.813x  w128: 0.892x

Out-of-order CPUs are mostly unaffected.
-
Henrik Gramner authored
-
Henrik Gramner authored
Both POSIX and the C standard place several environmental limits on setjmp() invocations: essentially anything beyond comparing the return value with a constant as a simple branch condition is UB. We were previously performing a function call using the setjmp() return value as an argument, which is technically not allowed even though it happened to work correctly in practice. Some systems may loosen those restrictions and allow for more flexible usage, but we shouldn't be relying on that.
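A small C illustration of the constraint (the handler name is made up): the standard only permits the setjmp() result in a few simple forms, such as the entire controlling expression of an if, optionally compared against an integer constant.

    #include <setjmp.h>

    static jmp_buf env;

    void handle_error(int code);   /* hypothetical error handler */

    void run_protected(void)
    {
        /* Not allowed: passing the setjmp() return value to a function
         * is outside the forms permitted by the C standard. */
        /* handle_error(setjmp(env)); */

        /* Allowed: the call is the entire controlling expression of an
         * if statement, compared against an integer constant. */
        if (setjmp(env) != 0) {
            handle_error(1);       /* reached via longjmp(env, 1) */
            return;
        }
        /* ... normal path that may longjmp(env, 1) on error ... */
    }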
-
- May 12, 2024
Removed some unnecessary vector register copies from the initial horizontal filter parts of the HV subpel filters. The performance improvements are better for the smaller filter block sizes. The narrowing shifts at the end of *filter8* were also rewritten, because they were only beneficial for the Cortex-A55 among the DotProd-capable CPU cores; on other out-of-order or newer CPUs the UZP1+SHRN instruction combination is better. Relative performance of micro benchmarks (lower is better):

Cortex-A55:   mct regular w4: 0.980x  mct regular w8: 1.007x  mct regular w16: 1.007x  mct sharp w4: 0.983x  mct sharp w8: 1.012x  mct sharp w16: 1.005x
Cortex-A510:  mct regular w4: 0.935x  mct regular w8: 0.984x  mct regular w16: 0.986x  mct sharp w4: 0.927x  mct sharp w8: 0.983x  mct sharp w16: 0.987x
Cortex-A78:   mct regular w4: 0.974x  mct regular w8: 0.988x  mct regular w16: 0.991x  mct sharp w4: 0.971x  mct sharp w8: 0.987x  mct sharp w16: 0.979x
Cortex-A715:  mct regular w4: 0.958x  mct regular w8: 0.993x  mct regular w16: 0.998x  mct sharp w4: 0.974x  mct sharp w8: 0.991x  mct sharp w16: 0.997x
Cortex-X1:    mct regular w4: 0.983x  mct regular w8: 0.993x  mct regular w16: 0.996x  mct sharp w4: 0.974x  mct sharp w8: 0.990x  mct sharp w16: 0.995x
Cortex-X3:    mct regular w4: 0.953x  mct regular w8: 0.993x  mct regular w16: 0.997x  mct sharp w4: 0.981x  mct sharp w8: 0.993x  mct sharp w16: 0.995x
-