PoC 0latency
VLC is designed to process packets and frames according to their timestamp (PTS). This implies that it needs to wait a certain duration (until a computed date) before demuxing, decoding and displaying. The purpose is to preserve the interval between frames as much as possible, so as to avoid stuttering when watching a movie, for example.
Real-time mirroring
Before making any change, we must be able to test glass-to-glass latency easily. For that purpose, we can mirror an Android device screen to VLC.
Download the latest server file from scrcpy, plug in an Android device, and execute:
adb push scrcpy-server-v1.25 /data/local/tmp/vlc-scrcpy-server.jar
adb forward tcp:1234 localabstract:scrcpy
adb shell CLASSPATH=/data/local/tmp/vlc-scrcpy-server.jar \
app_process / com.genymobile.scrcpy.Server 1.25 \
tunnel_forward=true control=false cleanup=false \
max_size=1920 raw_video_stream=true
(Adapt max_size=1920 to use another resolution; this impacts the latency.)
As soon as a client connects to localhost:1234 via TCP, mirroring starts and the device sends a raw H.264 video stream. It can be played with:
./vlc -Idummy --demux=h264 --network-caching=0 tcp://localhost:1234
By playing a test-pattern (mire) video on the device, and taking a picture (with a good camera) of the device next to the VLC window, we can measure the glass-to-glass delay.
Note that this delay includes the encoding time on the mobile device, which may be longer than on the target hardware.
master
On VLC 4 master, without any change, the result is catastrophic (VLC is not designed to handle this use case):
The video is 30 fps, and each increment represents 1 frame, so 30 frames represent 1 second. At the end of this short capture, there is almost a 10-second delay.
PoC
To mirror and control a remote device in real-time, the critical objective is to minimize latency. Therefore, any unnecessary wait is a bug.
Concretely, all waits based on a timestamp must be removed. Therefore, in 0latency mode, clocks become useless and timestamps are irrelevant. Also, buffering must be removed as much as possible.
To that end, this PoC changes several parts of the VLC pipeline.
--0latency option
The first commit adds a new global option, --0latency, which will be read by several VLC components. By default, it is disabled (of course).
To enable it, pass --0latency:
./vlc -Idummy --0latency --demux=h264 --network-caching=0 tcp://localhost:1234
Picture buffer
In VLC, when a picture is decoded, it is pushed by the decoder to a fifo queue, which is consumed by the video output.
For 0latency, at any time, we want to display the latest possible frame, so we don't want any fifo queue.
This PoC introduces a "picture buffer" (vlc_pic_buf; yes, this is a poor name), which is a buffer of exactly 1 picture:
- the producer can push a new picture (overwriting any previous pending picture);
- the consumer can pop the latest picture, which is a blocking operation if no pending picture is available.
The producer is the decoder. The consumer is the video output.
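This single-slot semantics can be sketched with a mutex and a condition variable. The following is a hypothetical standalone sketch, assuming pthreads; the names and types are illustrative, not the actual vlc_pic_buf API:

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative 1-slot "picture buffer" sketch (not VLC's vlc_pic_buf). */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    void *pic;            /* pending picture, NULL if none */
} pic_buf;

static void pic_buf_init(pic_buf *b)
{
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->cond, NULL);
    b->pic = NULL;
}

/* Producer: overwrite any pending picture and wake the consumer.
 * Returns the previous pending picture (if any) so that the caller can
 * release it. */
static void *pic_buf_push(pic_buf *b, void *pic)
{
    pthread_mutex_lock(&b->lock);
    void *old = b->pic;
    b->pic = pic;
    pthread_cond_signal(&b->cond);
    pthread_mutex_unlock(&b->lock);
    return old;
}

/* Consumer: block until a picture is pending, then take it. */
static void *pic_buf_pop(pic_buf *b)
{
    pthread_mutex_lock(&b->lock);
    while (b->pic == NULL)
        pthread_cond_wait(&b->cond, &b->lock);
    void *pic = b->pic;
    b->pic = NULL;
    pthread_mutex_unlock(&b->lock);
    return pic;
}
```

Pushing twice before a pop overwrites the first picture, which is exactly the desired behavior: the consumer always gets the latest frame.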
Video output
In VLC, the video output attempts to display a picture at the expected date, so it waits for a specific timestamp. This is exactly what we want to avoid for 0latency.
If 0latency is enabled, this PoC replaces the vout thread function, which does a lot of complicated things, by a very simple loop (Thread0Latency()):
- pop a picture from the picture buffer;
- call vout prepare();
- call vout display().
The function vout_PutPicture() is also adapted to push the frame to our new picture buffer instead of the existing picture fifo.
Note that in this PoC, the picture is not redrawn on resize, so the content will be black until the next frame on resize. That could be improved later.
Input/demux
In VLC, the input MainLoop() calls the demuxer to demux when necessary, but explicitly waits for a deadline between successive calls. We don't want to wait.
Therefore, this PoC provides an alternative, MainLoop0Latency(), which is called if 0latency is enabled. This function basically calls demux->pf_demux() in a loop, without ever waiting.
Some code in the es_out (on control ES_OUT_SET_PCR) based on the clock (for handling jitter) is also totally bypassed.
Decoder
When the decoder implementation has a frame, it submits it to the vout via decoder_QueueVideo(). The queue implementation is provided by the decoder owner in the core, which handles preroll and may wait.
This PoC replaces this implementation by a simple call to vout_PutPicture(), to directly push the picture to our new picture buffer in the vout. If the vout was waiting for a picture, it is unblocked and will immediately prepare() and display().
On the module side, the avcodec decoder was adapted to disable dropping frames based on the clock (when a frame is "late"), and to enable the same options as if --low-delay was passed.
H.264 AnnexB 1-frame latency
The input is a raw H.264 stream in AnnexB format (this is what Android MediaCodec produces). This raw H.264 is sent over TCP.
The format is:
(00 00 00 01 NALU) | ( 00 00 00 01 NALU) | …
The length of each NAL unit is not present in the stream. Therefore, on the receiving side, the parser detects the end of a NAL unit when it encounters the next start code 00 00 00 01.
However, this start code is sent as the prefix of the next frame, so the received packet will not be submitted to the decoder before the next frame is received, which adds 1 frame of latency.
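The scanning step can be sketched as follows (an illustrative standalone function, not VLC's actual parser; it only handles the 4-byte start code variant, whereas Annex B also allows 3-byte start codes):

```c
#include <stddef.h>
#include <stdint.h>

/* Return the offset of the next 4-byte Annex B start code (00 00 00 01)
 * in buf, or -1 if none is found. Illustrative sketch only. */
static ptrdiff_t find_start_code(const uint8_t *buf, size_t len)
{
    if (len < 4)
        return -1;
    for (size_t i = 0; i + 4 <= len; i++) {
        if (buf[i] == 0 && buf[i + 1] == 0
                && buf[i + 2] == 0 && buf[i + 3] == 1)
            return (ptrdiff_t) i;
    }
    return -1;
}
```

Since the end of NAL unit n is only detected when the start code of NAL unit n+1 arrives, the parser is necessarily one frame behind the stream.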
Fortunately, the length of each packet is known in advance on the device side. Therefore, a simple solution is to prefix each packet with its length (see Reduce latency by 1 frame).
For simplicity, for now I reused the scrcpy format, by requesting the server to send frame meta:
adb forward tcp:1234 localabstract:scrcpy
adb shell CLASSPATH=/data/local/tmp/vlc-scrcpy-server.jar \
app_process / com.genymobile.scrcpy.Server 1.25 \
tunnel_forward=true control=false cleanup=false max_size=1920 \
send_device_meta=false send_frame_meta=true send_dummy_byte=false
# ^^^^^^^^^^^^^^^^^^^^
I wrote a specific demuxer to handle it: h264_0latency. To use it, replace the --demux= argument:
./vlc -Idummy --0latency --demux=h264_0latency --network-caching=0 tcp://localhost:1234
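With send_frame_meta=true, scrcpy prefixes each packet with a 12-byte header: a big-endian 64-bit PTS field followed by a big-endian 32-bit packet length. This is the layout as I understand it for scrcpy 1.25 (the topmost PTS bits may carry flags, e.g. for codec config packets; check the scrcpy source for the exact semantics). A parsing sketch:

```c
#include <stdint.h>

/* Parse the 12-byte scrcpy "frame meta" header:
 * 8 bytes big-endian PTS, then 4 bytes big-endian packet length.
 * (Assumed layout for scrcpy 1.25; double-check against the server
 * version you use.) */
static void parse_frame_meta(const uint8_t h[12],
                             uint64_t *pts, uint32_t *len)
{
    uint64_t p = 0;
    for (int i = 0; i < 8; i++)
        p = (p << 8) | h[i];
    uint32_t l = 0;
    for (int i = 8; i < 12; i++)
        l = (l << 8) | h[i];
    *pts = p;
    *len = l;
}
```

Knowing the packet length upfront, the demuxer can submit each frame to the decoder as soon as its last byte arrives, without waiting for the next start code.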
To make the difference obvious, I suggest playing a 1-fps video.
With all these changes, the latency is reduced to 1~2 frames (30 fps) glass-to-glass:
(the device is on the left, VLC is in the middle, scrcpy is on the right)
Protocol discussions
For this PoC, the video stream is received over TCP from an Android device connected via USB (or via wifi on a local network), using a custom protocol.
Packet loss is non-existent over USB and very low on a good local wifi network. However, over the Internet, packet loss would add unacceptable latency with a protocol that handles retransmission (like TCP).
What follows are some random thoughts.
Ideally, I think that:
- we want to never decode a non-I-frame packet when the previous (referenced) packets have not been received/decoded (this would produce corrupted frames);
- we want to skip any previous packets (possibly lost) whenever an I-frame arrives.
Concretely, the device sends:
[I] P P P P P P P P P P P P P P P [I] P P P P P P P P P P P …
If a packet is not received:
[I] P P P P P _ P P P P P P P P P [I] P P P P P P P P P P P …
^
lost
then one possible solution:
- the receiver does not decode further P-frames until the missing packet is received;
- if a more recent I-frame is received, it starts decoding it immediately and forgets/ignores all previous packets.
As a drawback, this forces the use of small GOPs (i.e. frequent I-frames).
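The receiver policy above can be sketched as a small decision function. This is purely hypothetical: it assumes each packet carries a sequence number and an I-frame flag, which the current protocol does not provide:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical receiver policy (not part of VLC or scrcpy). */
typedef enum {
    ACT_DECODE,  /* contiguous packet: safe to decode */
    ACT_WAIT,    /* a reference is missing: do not decode */
    ACT_RESYNC   /* newer I-frame: decode it, drop everything older */
} action;

static action on_packet(uint32_t last_decoded, uint32_t seq, bool is_idr)
{
    if (is_idr && seq > last_decoded)
        return ACT_RESYNC;           /* forget the gap, restart from here */
    if (seq == last_decoded + 1)
        return ACT_DECODE;           /* directly follows the last frame */
    return ACT_WAIT;                 /* missing reference packet */
}
```

In other words, P-frames are only decoded in order, and an I-frame acts as a synchronization point that bounds how long a loss can stall the stream (hence the need for frequent I-frames).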
To be continued…