ADS kernel implementation for SVE

Open Matthias Langer requested to merge nekobasu/x264:ads_for_sve into master

This is the first in a series of changes we developed earlier this year while optimizing x264 for NVIDIA Grace.

Contains:

  • Implementations of ADS kernels for SVE (any vector size)
  • Implementations of ADS kernels for SVE2 (any vector size)

checkasm8

x264: using random seed 3804690644
x264: ARMv8
 - intra pred :          [OK]
 - coeff_last :          [OK]
 - coeff_level_run :     [OK]
x264: NEON
 - pixel sad :           [OK]
 - pixel sad_aligned :   [OK]
 - pixel ssd :           [OK]
 - pixel satd :          [OK]
 - pixel sa8d :          [OK]
 - pixel sa8d_satd :     [OK]
 - pixel sad_x3 :        [OK]
 - pixel sad_x4 :        [OK]
 - pixel var :           [OK]
 - pixel var2 :          [OK]
 - pixel hadamard_ac :   [OK]
 - pixel vsad :          [OK]
 - pixel asd :           [OK]
 - intra satd_x3 :       [OK]
 - intra sad_x3 :        [OK]
 - ssd_nv12 :            [OK]
 - ssim :                [OK]
 - sub_dct4 :            [OK]
 - sub_dct8 :            [OK]
 - add_idct4 :           [OK]
 - add_idct8 :           [OK]
 - dct4x4dc :            [OK]
 - idct4x4dc :           [OK]
 - zigzag_interleave :   [OK]
 - zigzag_frame :        [OK]
 - zigzag_field :        [OK]
 - mc luma :             [OK]
 - mc chroma :           [OK]
 - mc wpredb :           [OK]
 - mc weight :           [OK]
 - mc offsetadd :        [OK]
 - mc offsetsub :        [OK]
 - store_interleave :    [OK]
 - plane_copy :          [OK]
 - hpel filter :         [OK]
 - lowres init :         [OK]
 - integral init :       [OK]
 - mbtree :              [OK]
 - memcpy aligned :      [OK]
 - memzero aligned :     [OK]
 - intra pred :          [OK]
 - deblock :             [OK]
 - quant :               [OK]
 - dequant :             [OK]
 - denoise dct :         [OK]
 - decimate_score :      [OK]
 - coeff_last :          [OK]
 - coeff_level_run :     [OK]
 - nal escape:           [OK]
x264: SVE (128 bits)
 - pixel ssd :           [OK]
 - pixel sa8d :          [OK]
 - pixel var :           [OK]
 - pixel hadamard_ac :   [OK]
 - esa ads:              [OK]
 - sub_dct4 :            [OK]
 - zigzag_interleave :   [OK]
 - mc wpredb :           [OK]
 - deblock :             [OK]
x264: SVE2 (128 bits)
 - add_idct4 :           [OK]
x264: All tests passed Yeah :)

checkasm10

I have no name!@a64c740923aa:/x264$ ./checkasm10
x264: using random seed 3721534389
x264: ARMv8
x264: NEON
 - pixel sad :           [OK]
 - pixel ssd :           [OK]
 - pixel satd :          [OK]
 - pixel sa8d :          [OK]
 - pixel sa8d_satd :     [OK]
 - pixel sad_x3 :        [OK]
 - pixel sad_x4 :        [OK]
 - pixel var :           [OK]
 - pixel var2 :          [OK]
 - pixel hadamard_ac :   [OK]
 - pixel vsad :          [OK]
 - pixel asd :           [OK]
 - ssd_nv12 :            [OK]
 - ssim :                [OK]
 - mc luma :             [OK]
 - mc chroma :           [OK]
 - mc wpredb :           [OK]
 - mc weight :           [OK]
 - mc offsetadd :        [OK]
 - mc offsetsub :        [OK]
 - store_interleave :    [OK]
 - plane_copy :          [OK]
 - hpel filter :         [OK]
 - lowres init :         [OK]
 - integral init :       [OK]
 - mbtree :              [OK]
 - memcpy aligned :      [OK]
 - memzero aligned :     [OK]
 - quant :               [OK]
 - dequant :             [OK]
 - denoise dct :         [OK]
 - decimate_score :      [OK]
 - coeff_last :          [OK]
 - coeff_level_run :     [OK]
 - nal escape:           [OK]
x264: SVE (128 bits)
 - pixel ssd :           [OK]
 - esa ads:              [OK]
x264: SVE2 (128 bits)
x264: All tests passed Yeah :)
Edited by Matthias Langer

Activity

  • Benchmarks: SVE2 implementation

    8-bit

    esa_ads_8x8_c: 183
    esa_ads_8x8_sve: 96
    esa_ads_8x16_c: 243
    esa_ads_8x16_sve: 107
    esa_ads_16x8_c: 241
    esa_ads_16x8_sve: 107
    esa_ads_16x16_c: 352
    esa_ads_16x16_sve: 118

    10-bit

    esa_ads_8x8_c: 146
    esa_ads_8x8_sve: 95
    esa_ads_8x16_c: 212
    esa_ads_8x16_sve: 105
    esa_ads_16x8_c: 212
    esa_ads_16x8_sve: 105
    esa_ads_16x16_c: 352
    esa_ads_16x16_sve: 117

    Benchmarks: SVE implementation

    8-bit

    esa_ads_8x8_c: 180
    esa_ads_8x8_sve: 99
    esa_ads_8x16_c: 251
    esa_ads_8x16_sve: 114
    esa_ads_16x8_c: 247
    esa_ads_16x8_sve: 113
    esa_ads_16x16_c: 363
    esa_ads_16x16_sve: 127

    10-bit

    esa_ads_8x8_c: 149
    esa_ads_8x8_sve: 101
    esa_ads_8x16_c: 211
    esa_ads_8x16_sve: 113
    esa_ads_16x8_c: 213
    esa_ads_16x8_sve: 113
    esa_ads_16x16_c: 349
    esa_ads_16x16_sve: 127
    Edited by Matthias Langer
  • Matthias Langer changed the description

  • The illegal instruction is not reproducible on my AArch64 test systems. Maybe a problem with your CI? I used registry.videolan.org/x264-debian-unstable-aarch64:20211206141032 to test.

  • added 1 commit

    • e8cb6757 - Implement ESA ADS kernels using SVE and SVE2 assembly

  • added 1 commit

    • d8022829 - Remove last remnant of delted C implementation.

  • Matthias Langer resolved all threads

  • 8 bits per pixel:

    make -j checkasm
    ./checkasm8 --bench=esa_
    ...
    x264: All tests passed Yeah :)
    nop: 197
    esa_ads_8x8_c: 182
    esa_ads_8x8_sve: 74
    esa_ads_8x8_sve2: 77
    esa_ads_8x16_c: 210
    esa_ads_8x16_sve: 85
    esa_ads_8x16_sve2: 78
    esa_ads_16x8_c: 211
    esa_ads_16x8_sve: 84
    esa_ads_16x8_sve2: 78
    esa_ads_16x16_c: 361
    esa_ads_16x16_sve: 111
    esa_ads_16x16_sve2: 93

    10 bits per pixel:

    ./checkasm10 --bench=esa_
    x264: All tests passed Yeah :)
    nop: 196
    esa_ads_8x8_c: 170
    esa_ads_8x8_sve: 73
    esa_ads_8x8_sve2: 76
    esa_ads_8x16_c: 209
    esa_ads_8x16_sve: 84
    esa_ads_8x16_sve2: 77
    esa_ads_16x8_c: 209
    esa_ads_16x8_sve: 84
    esa_ads_16x8_sve2: 77
    esa_ads_16x16_c: 362
    esa_ads_16x16_sve: 111
    esa_ads_16x16_sve2: 92
  • Martin Storsjö
  • I had a look at the actual assembly implementation here now as well - it looks mostly good, thanks! Just a couple comments.

    If you want to, you could also add a bit more comments in the implementation of the if( ads < thresh ) mvs[nmv++] = i; part. It's understandable when stopping and taking the time to follow it closely, but a few comments could speed up understanding of it for future readers.
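
For readers who have not met the pattern quoted above, the following is a minimal scalar sketch of an ADS-style candidate filter, not the code from this merge request. It assumes a single sum term per candidate, and the function name `ads1_filter_ref` and all parameter names are hypothetical. The kernels under review may combine several such terms and do the compaction with vector instructions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar sketch of a one-term ADS candidate filter.
 * All names are illustrative assumptions, not x264 symbols. */
static int ads1_filter_ref(int enc_dc, const uint16_t *sums,
                           const uint16_t *cost_mvx, int16_t *mvs,
                           int width, int thresh)
{
    assert(width >= 0);
    int nmv = 0; /* number of candidates that passed so far */
    for (int i = 0; i < width; i++) {
        /* absolute difference of block sums, plus the cost of this MV */
        int ads = abs(enc_dc - sums[i]) + cost_mvx[i];
        if (ads < thresh)
            mvs[nmv++] = (int16_t)i; /* store passing indices contiguously */
    }
    return nmv; /* caller only inspects mvs[0..nmv-1] */
}
```

The part that benefits from comments in a vectorized version is the last step: each passing index must land at the next free slot of `mvs`, so lanes that fail the threshold test have to be squeezed out before the store.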

  • Matthias Langer added 24 commits

  • Matthias Langer added 1 commit

    • ce8a529e - Separate and optimize ADS kernel implementations.
