Commits · master · Diego Biurrun / dav1d

Nov 15, 2019

arm: ipred: NEON implementation of dc/h/v prediction functions · 91d324eb

B Krishnan Iyer authored 5 years ago and Martin Storsjö committed 5 years ago

                                       A73              A53
                                  Earlier    Now   Earlier    Now
intra_pred_dc_top_w64_8bpc_neon:    344.4  344.6     253.4  252.3

91d324eb

Nov 12, 2019

arm: 64: loopfilter: Avoid nested ifdefs where easily possible · dcbbf775
Martin Storsjö authored 5 years ago
```
This was requested in the review of the arm32 version of the same.
```
dcbbf775
arm: 64: loopfilter: Fix a typo in a macro parameter condition · 564482b6
Martin Storsjö authored 5 years ago
```
This removes one redundant instruction for loop filters smaller
than 16.
```
564482b6
arm64: loopfilter: Reorder instructions and tweak register use to match the arm32 port · 3069ab94
Martin Storsjö authored 5 years ago
```
This doesn't change performance measurably, but eases potential
future maintenance of the code.
```
3069ab94
arm64: loopfilter: Remove a stray double newline · abd07c67
Martin Storsjö authored 5 years ago

abd07c67

arm: 32: Port the arm64 NEON loopfilter to arm32 · 9a100261

Martin Storsjö authored 5 years ago

The code is a fairly exact 1:1 port of the ARM64 code, but operating
on 8 pixels at a time, instead of 16.

Relative speedup over C code according to checkasm:
                       Cortex A7     A8     A9    A53    A72    A73
lpf_h_sb_uv_w4_8bpc_neon:   1.36   1.40   1.25   1.71   1.55   1.59
lpf_h_sb_uv_w6_8bpc_neon:   2.18   2.11   1.74   2.65   2.32   2.34
lpf_h_sb_y_w4_8bpc_neon:    1.48   1.43   1.20   1.91   1.49   1.64
lpf_h_sb_y_w8_8bpc_neon:    2.34   2.05   1.78   2.84   2.35   2.69
lpf_h_sb_y_w16_8bpc_neon:   2.13   1.83   1.63   2.51   2.10   2.35
lpf_v_sb_uv_w4_8bpc_neon:   1.69   1.66   1.60   2.16   2.24   2.24
lpf_v_sb_uv_w6_8bpc_neon:   2.68   2.43   2.22   3.53   3.44   3.35
lpf_v_sb_y_w4_8bpc_neon:    1.74   1.74   1.43   2.34   2.14   2.18
lpf_v_sb_y_w8_8bpc_neon:    2.92   2.47   2.19   3.55   3.22   3.54
lpf_v_sb_y_w16_8bpc_neon:   2.62   2.19   1.98   3.25   2.80   3.10

Comparison to the original ARM64 assembly:
ARM64:                        A53     A72     A73
lpf_h_sb_uv_w4_8bpc_neon:   702.5   518.2   529.1
lpf_h_sb_uv_w6_8bpc_neon:  1007.3   672.6   736.6
lpf_h_sb_y_w4_8bpc_neon:   1652.8  1261.2  1276.5
lpf_h_sb_y_w8_8bpc_neon:   2144.7  1559.8  1638.7
lpf_h_sb_y_w16_8bpc_neon:  2318.3  1757.2  1792.8
lpf_v_sb_uv_w4_8bpc_neon:   447.1   302.0   292.4
lpf_v_sb_uv_w6_8bpc_neon:   600.0   397.7   406.9
lpf_v_sb_y_w4_8bpc_neon:   1212.6   840.1   818.4
lpf_v_sb_y_w8_8bpc_neon:   1623.3  1167.4  1156.7
lpf_v_sb_y_w16_8bpc_neon:  1694.9  1237.9  1182.3
ARM32:
lpf_h_sb_uv_w4_8bpc_neon:   821.2   501.1   500.8
lpf_h_sb_uv_w6_8bpc_neon:  1232.0   715.7   746.6
lpf_h_sb_y_w4_8bpc_neon:   2208.1  1373.2  1414.7
lpf_h_sb_y_w8_8bpc_neon:   3138.3  1843.1  1915.2
lpf_h_sb_y_w16_8bpc_neon:  3293.1  1842.5  1975.9
lpf_v_sb_uv_w4_8bpc_neon:   619.9   326.7   324.9
lpf_v_sb_uv_w6_8bpc_neon:   855.9   446.7   468.2
lpf_v_sb_y_w4_8bpc_neon:   1737.6   935.5  1007.0
lpf_v_sb_y_w8_8bpc_neon:   2346.7  1232.8  1298.3
lpf_v_sb_y_w16_8bpc_neon:  2353.4  1283.4  1379.9

9a100261
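
The ARM64 and ARM32 tables above can be compared directly per function. The following is a minimal sketch, not part of the commit: it copies four of the Cortex-A53 figures and computes the ARM32/ARM64 ratio. The helper name `slowdown` and the shortened dictionary keys are hypothetical.

```python
# Cortex-A53 checkasm figures copied from the tables above (lower is faster).
arm64_a53 = {
    "lpf_h_sb_uv_w4": 702.5,
    "lpf_h_sb_y_w8": 2144.7,
    "lpf_v_sb_uv_w4": 447.1,
    "lpf_v_sb_y_w16": 1694.9,
}
arm32_a53 = {
    "lpf_h_sb_uv_w4": 821.2,
    "lpf_h_sb_y_w8": 3138.3,
    "lpf_v_sb_uv_w4": 619.9,
    "lpf_v_sb_y_w16": 2353.4,
}


def slowdown(arm32, arm64):
    """Return the ARM32/ARM64 ratio per function (1.0 means equal cost)."""
    return {name: arm32[name] / arm64[name] for name in arm64}


ratios = slowdown(arm32_a53, arm64_a53)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

For these four rows the ratio stays below 2, even though the ARM32 code handles 8 pixels per pass instead of 16.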

Nov 10, 2019
- arm: 32: Use more unique temporary labels within movrel_local · c02ec6cf
  Martin Storsjö authored 5 years ago
```
Otherwise the macro would interfere with local labels 1 and 2
in the context where the macro is expanded.
```
  c02ec6cf
Nov 01, 2019
- Tiny improvements to generate_grain_uv_420 · a1647a59
  Ronald S. Bultje authored 5 years ago
```
Before:
gen_grain_uv_ar2_8bpc_420_avx2: 29176.2
After:
gen_grain_uv_ar2_8bpc_420_avx2: 26794.0
```
  a1647a59
Oct 28, 2019
- Update README.md section on Roadmap · 07dab8cb
  Jean-Baptiste Kempf authored 5 years ago
  
  07dab8cb
Oct 25, 2019
- build: Add a workaround for Xcode 11 -fstack-check bug · bb160f09
  Henrik Gramner authored 5 years ago
  
  0.5.1
  
  bb160f09
- Update NEWS for 0.5.1 · fc54119c
  Jean-Baptiste Kempf authored 5 years ago
  
  fc54119c
Oct 24, 2019

x86: Fix overflows in inverse identity SSSE3 transforms · 103cd220
Henrik Gramner authored 5 years ago and Henrik Gramner committed 5 years ago

103cd220
x86: Fix overflows in inverse identity AVX2 transforms · a20b5757
Henrik Gramner authored 5 years ago and Henrik Gramner committed 5 years ago

a20b5757

x86: adapt SSSE3 wiener filter to SSE2 · 36d615d1

Victorien Le Couviour--Tuffet authored 5 years ago

Also slightly optimizes the 32-bit SSSE3 version, mainly by removing
an XMM store/load.

---------------------
x86_64:
------------------------------------------
wiener_chroma_8bpc_c: 193155.1
wiener_chroma_8bpc_sse2: 48973.4
wiener_chroma_8bpc_ssse3: 31486.3
---------------------
wiener_luma_8bpc_c: 192787.5
wiener_luma_8bpc_sse2: 48674.9
wiener_luma_8bpc_ssse3: 30446.3
------------------------------------------

---------------------
x86_32:
------------------------------------------
wiener_chroma_8bpc_c: 309861.0
wiener_chroma_8bpc_sse2: 52345.9
wiener_chroma_8bpc_ssse3: 32983.2
---------------------
wiener_luma_8bpc_c: 317909.1
wiener_luma_8bpc_sse2: 52522.1
wiener_luma_8bpc_ssse3: 33323.1
------------------------------------------

36d615d1
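
Other commits on this page report a "relative speedup over C code" rather than raw figures. As a rough sketch of how such a factor follows from raw checkasm numbers like the ones above, the C timing is divided by the SIMD timing. The helper name `speedup_over_c` is hypothetical; the figures are the x86_64 chroma rows from this commit.

```python
# checkasm figures for x86_64, copied from the commit message above.
wiener_chroma = {"c": 193155.1, "sse2": 48973.4, "ssse3": 31486.3}


def speedup_over_c(timings):
    """Divide the C timing by each SIMD timing; larger means faster."""
    baseline = timings["c"]
    return {impl: baseline / t for impl, t in timings.items() if impl != "c"}


speedups = speedup_over_c(wiener_chroma)
for impl, factor in speedups.items():
    print(f"wiener_chroma_8bpc_{impl}: {factor:.2f}x")
```

With these inputs the SSE2 version comes out at roughly 3.9 times the speed of the C code and the SSSE3 version at roughly 6.1 times.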

x86: adapt SSSE3 warp_affine_8x8{,t} to SSE2 · 4866abab

Victorien Le Couviour--Tuffet authored 5 years ago

---------------------
x86_64:
------------------------------------------
warp_8x8_8bpc_c: 1761.5
warp_8x8_8bpc_sse2: 583.0
warp_8x8_8bpc_ssse3: 329.3
---------------------
warp_8x8t_8bpc_c: 1694.3
warp_8x8t_8bpc_sse2: 577.6
warp_8x8t_8bpc_ssse3: 334.1
------------------------------------------

---------------------
x86_32:
------------------------------------------
warp_8x8_8bpc_c: 1842.6
warp_8x8_8bpc_sse2: 677.1
warp_8x8_8bpc_ssse3: 394.9
---------------------
warp_8x8t_8bpc_c: 1741.1
warp_8x8t_8bpc_sse2: 648.5
warp_8x8t_8bpc_ssse3: 372.6
------------------------------------------

4866abab

arm: looprestoration: Fix register names in a comment · 0526e1ea
Martin Storsjö authored 5 years ago and Jean-Baptiste Kempf committed 5 years ago

0526e1ea
arm64: looprestoration: Minimal scheduling improvements · 06ca5744
Martin Storsjö authored 5 years ago and Jean-Baptiste Kempf committed 5 years ago

06ca5744
arm64: looprestoration: Fix a typo · 8b3985fd
Martin Storsjö authored 5 years ago and Jean-Baptiste Kempf committed 5 years ago
```
The actual register used for this operand doesn't matter, but make
it consistent with the others.
```
8b3985fd
arm64: looprestoration: Fix register references in comments · cf9146c3
Martin Storsjö authored 5 years ago and Jean-Baptiste Kempf committed 5 years ago

cf9146c3
arm64: looprestoration: Use ld2r instead of ld1+dup+dup · a3641268
Martin Storsjö authored 5 years ago and Jean-Baptiste Kempf committed 5 years ago

a3641268
arm64: looprestoration: Pass a correct height parameter to sgr_box3_h_neon for the top slice · 2eaabafc
Martin Storsjö authored 5 years ago and Jean-Baptiste Kempf committed 5 years ago
```
As the neon function processes two rows at a time, this has passed
unnoticed.
```
2eaabafc
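
One way to picture why a wrong height can go unnoticed is sketched below. This is an illustrative model only, not the NEON code: it assumes a loop that always handles two rows per pass, and the function name and the example heights are hypothetical (the commit does not say which heights were involved).

```python
def rows_processed(h, rows_per_iter=2):
    """Rows touched by a loop that always handles `rows_per_iter` rows per pass."""
    passes = -(-h // rows_per_iter)  # ceiling division
    return passes * rows_per_iter


# Hypothetical heights: an odd value and the next even value do the same work.
for h in (3, 4):
    print(h, rows_processed(h))
```

Under this assumption, a height that is off by one within the same pair of rows produces identical output, so tests comparing results would not catch it.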

arm: looprestoration: Port the ARM64 SGR NEON assembly to 32 bit arm · 14d4edcd

Martin Storsjö authored 5 years ago and Jean-Baptiste Kempf committed 5 years ago

The code is mostly a 1:1 port of the ARM64 code, with slightly worse
scheduling due to fewer temporary registers available. The
sgr_finish_filter1_neon function (used in the 3x3 and mix cases)
processes 4 pixels at a time while the ARM64 version processes 8,
due to not having enough registers available.

Relative speedup over C code:
                       Cortex A7     A8     A9    A53    A72    A73
selfguided_3x3_8bpc_neon:   2.12   2.89   1.79   2.61   2.03   3.87
selfguided_5x5_8bpc_neon:   2.50   3.41   2.16   3.14   2.74   4.64
selfguided_mix_8bpc_neon:   2.24   2.98   1.94   2.82   2.28   4.14

Comparison to the original ARM64 assembly:
ARM64:                    Cortex A53        A72        A73
selfguided_3x3_8bpc_neon:   486215.5   359445.6   341317.7
selfguided_5x5_8bpc_neon:   351210.8   267427.2   243399.3
selfguided_mix_8bpc_neon:   820489.1   610909.8   569946.6
ARM32:
selfguided_3x3_8bpc_neon:   542958.8   379448.8   353229.1
selfguided_5x5_8bpc_neon:   351299.6   263685.2   242415.9
selfguided_mix_8bpc_neon:   881587.6   629934.0   580121.2

14d4edcd

x86: Add minor ipred_z AVX2 optimizations · 3b33c52d
Henrik Gramner authored 5 years ago and Henrik Gramner committed 5 years ago

3b33c52d

Oct 22, 2019
- Shrink some stack buffers in the C versions of ipred_z · 6c81623e
  Henrik Gramner authored 5 years ago and Henrik Gramner committed 5 years ago
```
Only the upsampling code requires a size of width/height * 2, but
upsampling is never performed when width or height is above 8.
```
  6c81623e
- Don't backup pixel if next block not "CDEFed" · 55951027
  Luc Trudeau authored 5 years ago and Jean-Baptiste Kempf committed 5 years ago
  
  55951027
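
The reasoning behind the ipred_z stack-buffer change (6c81623e) above can be sketched numerically. This is a simplified model, not the dav1d code: it treats a single edge length `n`, assumes block dimensions from 4 to 64, and uses hypothetical names. The real buffers may be sized from other quantities.

```python
EDGE_SIZES = (4, 8, 16, 32, 64)  # assumed block dimensions, for illustration only


def needed_len(n):
    """Samples needed for an edge of length n: doubled only if upsampled."""
    upsampled = n <= 8  # per the commit message, never above 8
    return n * 2 if upsampled else n


worst_case = max(needed_len(n) for n in EDGE_SIZES)
naive = max(EDGE_SIZES) * 2  # doubling the largest size unconditionally
print(worst_case, naive)
```

In this model the doubled case only ever applies to small edges, so the worst case is set by the largest non-upsampled edge rather than by twice that size.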
Oct 21, 2019

x86inc: fix LOAD_MM_PERMUTATION for AVX512 · 47790541

Victorien Le Couviour--Tuffet authored 5 years ago and Henrik Gramner committed 5 years ago

Pre-permuting the registers in INIT_*MM avx512 (AVX512_MM_PERMUTATION)
is redundant. It causes the register mapping to be the same as without
the initial AVX512_MM_PERMUTATION, with the user SWAPs applied.

For example...

INIT_YMM avx512
SWAP m0, m16
SAVE_MM_PERMUTATION
; do whatever
LOAD_MM_PERMUTATION

... would result in m0 mapping to ymm16 instead of ymm0 and m1 to ymm1
instead of ymm17.

47790541

Oct 18, 2019

x86: adapt SSSE3 cdef_filter_{4x4,4x8,8x8} to SSE2 · 3e9f9676

Victorien Le Couviour--Tuffet authored 5 years ago

---------------------
x86_64:
------------------------------------------
cdef_filter_4x4_8bpc_c: 1376.0
cdef_filter_4x4_8bpc_sse2: 177.6
cdef_filter_4x4_8bpc_ssse3: 132.5
---------------------
cdef_filter_4x8_8bpc_c: 2725.0
cdef_filter_4x8_8bpc_sse2: 327.6
cdef_filter_4x8_8bpc_ssse3: 234.9
---------------------
cdef_filter_8x8_8bpc_c: 5938.8
cdef_filter_8x8_8bpc_sse2: 556.8
cdef_filter_8x8_8bpc_ssse3: 388.1
------------------------------------------

---------------------
x86_32:
------------------------------------------
cdef_filter_4x4_8bpc_c: 1569.5
cdef_filter_4x4_8bpc_sse2: 201.9
cdef_filter_4x4_8bpc_ssse3: 162.3
---------------------
cdef_filter_4x8_8bpc_c: 3141.6
cdef_filter_4x8_8bpc_sse2: 368.3
cdef_filter_4x8_8bpc_ssse3: 283.4
---------------------
cdef_filter_8x8_8bpc_c: 6534.5
cdef_filter_8x8_8bpc_sse2: 666.7
cdef_filter_8x8_8bpc_ssse3: 503.5
------------------------------------------

3e9f9676

Oct 16, 2019
- tools: fix SSE2 cpu masking · 11b72506
  Victorien Le Couviour--Tuffet authored 5 years ago
  
  11b72506
Oct 11, 2019
- ci: Try switching two GCC based arm/aarch64 build configurations to debugoptimized · 62fcd0cb
  Martin Storsjö authored 5 years ago
```
This should expose issues like #300, where there are issues with
alignment of labels, referenced from debug info.
```
  62fcd0cb
- arm64: ipred: Make sure all symbols are aligned · a6228f47
  Martin Storsjö authored 5 years ago
```
If building with debug information enabled, binutils errors out with
"unaligned opcodes detected in executable segment", if there are
symbols (even local ones that don't end up in the symbol table)
that point to unaligned addresses in the text section.

This fixes issue #300.
```
  a6228f47
- Update news for 0.5.0: z2-avx2, ipred-neon and wiener-vsx · 5f86e719
  Jean-Baptiste Kempf authored 5 years ago
  
  0.5.0
  
  5f86e719
- arm: util: Split movrel into movrel and movrel_local · 5d014b41
  Martin Storsjö authored 5 years ago
  
  5d014b41
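
For the symbol-alignment fix (a6228f47) above, the condition that binutils complains about can be sketched as a simple check. AArch64 instructions are 4 bytes wide, so a symbol in the text section whose address is not a multiple of 4 is unaligned. The symbol names and addresses below are hypothetical, and this is not how binutils implements the check.

```python
INSN_SIZE = 4  # AArch64 instructions are 4 bytes wide


def unaligned_symbols(symbols, align=INSN_SIZE):
    """Return the names of symbols whose address is not a multiple of `align`."""
    return [name for name, addr in symbols.items() if addr % align]


# Hypothetical symbol table: name -> address in the text section.
symbols = {
    "hypothetical_fn": 0x1000,
    "hypothetical_tbl": 0x1016,
    "hypothetical_loop": 0x1020,
}
print(unaligned_symbols(symbols))
```

A check of this kind applies to local labels as well, which matches the commit's note that symbols absent from the final symbol table can still trigger the error.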
Oct 10, 2019

Check loopfilter levels prior to calling lf_mask · b7d7c8ce
Luc Trudeau authored 5 years ago

b7d7c8ce

arm64: ipred: NEON implementation of the cfl_ac functions · 57dd0aae

Martin Storsjö authored 5 years ago and Janne Grunau committed 5 years ago

Relative speedup over the C code:
                      Cortex A53    A72    A73
cfl_ac_420_w4_8bpc_neon:    7.73   6.48   9.22
cfl_ac_420_w8_8bpc_neon:    6.70   5.56   6.95
cfl_ac_420_w16_8bpc_neon:   6.51   6.93   6.67
cfl_ac_422_w4_8bpc_neon:    9.25   7.70   9.75
cfl_ac_422_w8_8bpc_neon:    8.53   5.95   7.13
cfl_ac_422_w16_8bpc_neon:   7.08   6.87   6.06

57dd0aae

arm64: ipred: NEON implementation of the cfl_pred functions · c7693386

Martin Storsjö authored 5 years ago and Janne Grunau committed 5 years ago

Relative speedup over the C code:
                             Cortex A53    A72    A73
cfl_pred_cfl_128_w4_8bpc_neon:    10.81   7.90   9.80
cfl_pred_cfl_128_w8_8bpc_neon:    18.38  11.15  13.24
cfl_pred_cfl_128_w16_8bpc_neon:   16.52  10.83  16.00
cfl_pred_cfl_128_w32_8bpc_neon:    3.27   3.60   3.70
cfl_pred_cfl_left_w4_8bpc_neon:    9.82   7.38   8.76
cfl_pred_cfl_left_w8_8bpc_neon:   17.22  10.63  11.97
cfl_pred_cfl_left_w16_8bpc_neon:  16.03  10.49  15.66
cfl_pred_cfl_left_w32_8bpc_neon:   3.28   3.61   3.72
cfl_pred_cfl_top_w4_8bpc_neon:     9.74   7.39   9.29
cfl_pred_cfl_top_w8_8bpc_neon:    17.48  10.89  12.58
cfl_pred_cfl_top_w16_8bpc_neon:   16.01  10.62  15.31
cfl_pred_cfl_top_w32_8bpc_neon:    3.25   3.62   3.75
cfl_pred_cfl_w4_8bpc_neon:         8.39   6.34   8.04
cfl_pred_cfl_w8_8bpc_neon:        15.99  10.12  12.42
cfl_pred_cfl_w16_8bpc_neon:       15.25  10.40  15.12
cfl_pred_cfl_w32_8bpc_neon:        3.23   3.58   3.71

The C code gets autovectorized for w >= 32, which is why the
relative speedup looks strange (but the performance of the NEON
functions is completely as expected).

c7693386

arm64: ipred: NEON implementation of the filter function · d322d451

Martin Storsjö authored 5 years ago and Janne Grunau committed 5 years ago

Use a different layout of the filter_intra_taps depending on
architecture; the current one is optimized for the x86 SIMD
implementation.

Relative speedups over the C code:
                             Cortex A53    A72    A73
intra_pred_filter_w4_8bpc_neon:    6.38   2.81   4.43
intra_pred_filter_w8_8bpc_neon:    9.30   3.62   5.71
intra_pred_filter_w16_8bpc_neon:   9.85   3.98   6.42
intra_pred_filter_w32_8bpc_neon:  10.77   4.08   7.09

d322d451

arm64: ipred: NEON implementation of palette prediction · 4f14573c

Martin Storsjö authored 5 years ago and Janne Grunau committed 5 years ago

Relative speedups over the C code:
                    Cortex A53    A72    A73
pal_pred_w4_8bpc_neon:    8.75   6.15   7.60
pal_pred_w8_8bpc_neon:   19.93  11.79  10.98
pal_pred_w16_8bpc_neon:  24.68  13.28  16.06
pal_pred_w32_8bpc_neon:  23.56  11.81  16.74
pal_pred_w64_8bpc_neon:  23.16  12.19  17.60

4f14573c

arm64: ipred: NEON implementation of smooth prediction · 4318600e

Martin Storsjö authored 5 years ago and Janne Grunau committed 5 years ago

Relative speedups over the C code:
                               Cortex A53    A72    A73
intra_pred_smooth_h_w4_8bpc_neon:    8.02   4.53   7.09
intra_pred_smooth_h_w8_8bpc_neon:   16.59   5.91   9.32
intra_pred_smooth_h_w16_8bpc_neon:  18.80   5.54  10.10
intra_pred_smooth_h_w32_8bpc_neon:   5.07   4.43   4.60
intra_pred_smooth_h_w64_8bpc_neon:   5.03   4.26   4.34
intra_pred_smooth_v_w4_8bpc_neon:    9.11   5.51   7.75
intra_pred_smooth_v_w8_8bpc_neon:   17.07   6.86  10.55
intra_pred_smooth_v_w16_8bpc_neon:  17.98   6.38  11.52
intra_pred_smooth_v_w32_8bpc_neon:  11.69   5.66   8.09
intra_pred_smooth_v_w64_8bpc_neon:   8.44   4.34   5.72
intra_pred_smooth_w4_8bpc_neon:      9.81   4.85   6.93
intra_pred_smooth_w8_8bpc_neon:     16.05   5.60   9.26
intra_pred_smooth_w16_8bpc_neon:    14.01   5.02   8.96
intra_pred_smooth_w32_8bpc_neon:     9.29   5.02   7.25
intra_pred_smooth_w64_8bpc_neon:     6.53   3.94   5.26

4318600e

arm64: ipred: NEON implementation of paeth prediction · 8ab69afb

Martin Storsjö authored 5 years ago and Janne Grunau committed 5 years ago

Relative speedups over the C code:
                            Cortex A53    A72    A73
intra_pred_paeth_w4_8bpc_neon:    8.36   6.55   7.27
intra_pred_paeth_w8_8bpc_neon:   15.24  11.36  11.34
intra_pred_paeth_w16_8bpc_neon:  16.63  13.20  14.17
intra_pred_paeth_w32_8bpc_neon:  10.83   9.21   9.87
intra_pred_paeth_w64_8bpc_neon:   8.37   7.07   7.45

8ab69afb

x86: Add ipred_z2 AVX2 asm · ea9fc9d9
Henrik Gramner authored 5 years ago and Jean-Baptiste Kempf committed 5 years ago

ea9fc9d9