WIP: RFC: Avoid TLS to implement mutexes
This MR is definitely not meant to be merged as is, but rather to start a discussion on our mutex implementations.
Also, full disclosure, I don't claim to fully understand how TLS is implemented in various architectures, but most specifically ELF & glibc as is the case for the rest of this description.
A bit of context: while profiling the media library I realized that we spend a lot of time in vlc_mutex_held
. As far as I can see, most of the time is spent obtaining the thread_self
address, which is declared as static _Thread_local char thread_self[1];
From there, it seems that we spend most of our time in __tls_get_addr_slow
, which (again, as far as I understand) is used to rebuilt the dynamic thread vector each time the generation changes.
What I don't understand is why the DTV changes that often, after the initial module cache is populated, we almost never invoke dlopen/dlclose, which I thought was responsible for a bump in the generation. However if that was the case, we'd see a performance stabilisation after some time, which we don't, as seen in this graph:
The number of run in the X axis is simply the number of full analysis of my music samples folder with the medialibrary tool to analyse a specific folder from CLI. As you can guess, the "-patched" version is the one with the submitted patch applied.
Using pthread_self
instead of a thread_local
variable is definitely faster since all we care about is getting a thread identifier to compare with the mutex owner, we don't actually care about what's in there. pthread_self
is, at least on x86/elf, implemented by loading the value in register fs
, while a TLS access will cause a call to __tls_get_addr
in dynamic builds. Not relying on TLS also mean we get not to care about a new shared library being loaded and causing a DTV generation bump.
Some profiling samples with/without this patch:
without | with |
---|---|
Not relying on TLS means that we go from ≃ 3.1% of IRs spent in vlc_mutex_held
to ≃ 1.34%. This also drastically reduces the amount of occurences of __tls_get_addr_slow
but I'm not sure if that's something we need to care about.
Now, this patch is wrong and kludgy, but I think there's a valuable improvement to be dug further upon. I'm not sure what the correct way would be though. I feel like it's a bad idea to rely on various platform specific thread functions in misc/threads.c as it's supposed to be the other way around. Relying on vlc_thread_id
in a portable way is not an option AFAICS since some implementations might return -1
which would be less thread specific.
Opinion welcome, and if somebody knows more about this and feels like explaining, I'm definitely interested!