Draft: WIP prototype of chunked motion compensation (mc)
Could dav1d decode 128x128 super blocks faster?
When analyzing the impact of 128x128 superblocks on dav1d and gav1 over the Netflix 1080p test sequences, I get the following results:
Decoder | sb | Average decode time | Delta % |
---|---|---|---|
Dav1d | 64 | 3.0729 | |
Dav1d | 128 | 3.0066 | -2.16% |
Gav1 | 64 | 4.3452 | |
Gav1 | 128 | 4.1528 | -4.43% |
Results from an AWS c5.xlarge instance
One way to interpret this data is that 128x128 superblocks benefit gav1 more than they benefit dav1d. This is similar to a previous finding related to loop restoration, where gav1 was found to be faster than dav1d; that finding led to substantial improvements to dav1d's loop restoration code. The objective of this work is to determine whether the same is true for 128x128 superblocks.
Prediction Buffers vs. In-frame prediction
One difference between dav1d and gav1 that could be related to this data is how inter predictions are stored to memory. Gav1 uses scratch buffers for prediction, while dav1d does the prediction directly inside the frame buffer.
In-frame prediction could result in more cache misses for bigger blocks (i.e. 128x128 superblocks). For example, a 128x128 block only requires a scratch prediction buffer of 128x128 elements, whereas in a 1080p frame the memory touched in the frame buffer spans a 1920x128 surface. This could cause more cache misses and more traffic to main memory.
A simple test
AV1 is very complicated. A simple way to verify this is to disable all inter coding tools except single-reference motion compensation (i.e. old-school mc: no compound prediction, no OBMC, no inter-intra, no global motion, no warped motion).
Again using the Netflix 1080p test sequences, the relative speed-up from 128x128 superblocks is almost unchanged on gav1 (-4.43% -> -4.28%) and slightly smaller on dav1d (-2.16% -> -1.76%).
Decoder | sb | Average decode time | Delta % |
---|---|---|---|
Dav1d | 64 | 2.9002 | |
Dav1d | 128 | 2.8490 | -1.76% |
Gav1 | 64 | 4.2104 | |
Gav1 | 128 | 4.0303 | -4.28% |
Results from an AWS c5.xlarge instance
Chunked Inter Prediction
While dav1d does use prediction buffers for compound inter prediction, it's mostly for the warping; the actual merging of the two predictors is done in-frame. That said, replacing in-frame prediction with prediction buffers in dav1d would require a considerable amount of work and might not even be desirable. One disadvantage of prediction buffers is that they require an extra copy for skip blocks (memcpy is murder).
Another test
To validate that what we are seeing is really caused by cache misses on 128x128 superblocks, a simple experiment is to perform chunked 64x64 inter prediction. Instead of predicting the entire 128x128 superblock at once, we predict the first 64x64 chunk, reconstruct it, and then move on to the next 64x64 chunk. This isn't a perfect solution, but it should considerably reduce cache misses.
Decoder | sb | Average decode time | Delta % |
---|---|---|---|
Dav1d | 64 | 2.9008 | |
Dav1d | 128 | 2.8676 | -1.15% |
Gav1 | 64 | 4.2109 | |
Gav1 | 128 | 4.0376 | -4.12% |
Results from an AWS c5.xlarge instance
Inter Prediction Only
Time spent in inter prediction, broken down by block size (Delta % is dav1d relative to gav1; negative means dav1d is faster):
Inter Prediction (sb128) | Dav1d | Gav1 | Delta % |
---|---|---|---|
128x128 | 0.218526 | 0.255277 | -14.40% |
64x64 | 0.196064 | 0.22852 | -14.20% |
32x32 | 0.242264 | 0.287912 | -15.85% |
16x16 | 0.208807 | 0.268232 | -22.15% |
8x8 | 0.100511 | 0.148878 | -32.49% |
4x4 | 0.017288 | 0.025214 | -31.43% |

Inter Prediction (sb64) | Dav1d | Gav1 | Delta % |
---|---|---|---|
64x64 | 0.432062 | 0.530057 | -18.49% |
32x32 | 0.240078 | 0.287298 | -16.44% |
16x16 | 0.209979 | 0.269536 | -22.10% |
8x8 | 0.10139 | 0.149446 | -32.16% |
4x4 | 0.018133 | 0.025904 | -30.00% |