io_uring: interaction with user-space threads
For each thread that submits I/O requests through io_uring, an in-kernel work queue is created to process those requests. Not every request goes through the work queue, however: requests are punted to it only under special conditions, for example when certain flags are set on the request, or when running the opcode associated with the request does not complete on the first attempt. Each time an io_uring work queue is created, a node is added to the uring context that requested its creation.
Generally, when io_uring is used by an application running multiple threads, my expectation would be that we have one uring context object and multiple threads. This would mean that the likelihood of creating a work queue is about the same as the likelihood of adding a work node to a uring context.
In cases where an application runs only one thread, it still doesn’t make much sense to expect the application to create multiple uring queues (contexts), as this would be equivalent to opening and closing the same file over and over again during the lifetime of the application.
A well-designed application would definitely not be doing this.
Background
Take a look at the code below; it is the gist of this post, because it seems to imply that an io_uring user is unlikely to be creating a work queue but quite likely to be creating new io_uring contexts.
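The snippet in question is the kernel path that runs the first time a task submits to a ring. To the best of my reading it corresponds to __io_uring_add_tctx_node() in io_uring/tctx.c; the version below is an abridged sketch rather than a verbatim copy of the kernel source, so treat the exact structure as illustrative.

int __io_uring_add_tctx_node(struct io_ring_ctx *ctx)
{
	struct io_uring_task *tctx = current->io_uring;
	struct io_tctx_node *node;
	int ret;

	/* First branch: this task has never used io_uring before, so its
	 * per-task context (and, with it, the io-wq work queue) is created. */
	if (unlikely(!tctx)) {
		ret = io_uring_alloc_task_context(current, ctx);
		if (unlikely(ret))
			return ret;
		tctx = current->io_uring;
	}

	/* Second branch: this task has not submitted to this particular ring
	 * (ctx) yet, so a node is added to associate the ring with the task. */
	if (!xa_load(&tctx->xa, (unsigned long)ctx)) {
		node = kmalloc(sizeof(*node), GFP_KERNEL);
		if (!node)
			return -ENOMEM;
		node->ctx = ctx;
		node->task = current;

		ret = xa_err(xa_store(&tctx->xa, (unsigned long)ctx,
				      node, GFP_KERNEL));
		if (ret) {
			kfree(node);
			return ret;
		}

		mutex_lock(&ctx->uring_lock);
		list_add(&node->ctx_node, &ctx->tctx_list);
		mutex_unlock(&ctx->uring_lock);
	}
	return 0;
}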
The ‘likely’ and ‘unlikely’ macros are compiler hints that guide the compiler into generating more performant code by telling it that a branch is likely or unlikely to be taken, respectively.
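For reference, in the kernel sources these macros are thin wrappers around GCC’s __builtin_expect() (simplified from include/linux/compiler.h):

/* Tell the compiler which outcome of the condition to optimize for. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)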
These macros also let someone reading the code see that its authors expect a certain branch to be taken more often than not. In our particular case, the branch in question is, supposedly, the less likely one.
The code above creates a new work queue and associates it with the requesting io_uring context. Sometimes the work queue already exists, in which case it is not created; we simply proceed to associate the io_uring context with the pre-existing work queue.
The question we’re trying to answer is: how often does the work queue already exist, and how often does the second branch execute without the first branch executing? It would seem to me that the second branch is about as likely to execute as the first. Or, rather: what is the likelihood that the first branch executes, compared to the likelihood that the second branch executes while the first one does not?
The author of this code seems to imply that the second branch is more likely to execute than the first, hence tagging the first branch with the ‘unlikely’ macro and leaving the second branch untagged.
How likely is it that we’re creating a new work queue, compared to how likely it is that we have just created a new io_uring context which now needs to be associated with a user-space thread?
I could be the one who is wrong, but it seems the first branch, which creates a work queue, is more likely to execute than the scenario of multiple io_uring contexts submitting work. Why is this? Because an io_uring context is simply a file, and having multiple io_uring contexts is comparable to having one file open multiple times.
Each io_uring context is a file and behaves identically to any other such context the process has opened, so why have several? Why would an application keep the same file open multiple times?
If we don’t create a new work queue, yet we still go on to associate a work queue with an io_uring context, it means that the work queue already exists and is already associated with an io_uring context.
Each work queue is owned by a user-space thread, and that would in turn mean that the user-space thread has multiple io_uring contexts active.
How do I use a single io_uring context across all my threads
First, it is probably worth noting that liburing doesn’t support this approach natively. With liburing, if a process owns an io_uring context and then forks, both processes now share the io_uring context, but liburing provides none of the synchronization that is needed in order to use the context from both processes.
In other words, if you submit a request from, say, the child process, it updates the submission queue tail and possibly the completion queue head. These changes are not reflected in the parent process, and consequently the parent can no longer submit requests through the same io_uring context. Simply put, the parent’s local io_uring context is now corrupted, or at least stale, and can no longer be used to submit requests.
When we talk about a context in the above, we’re talking about the object a process receives from calls like the one sketched below (a minimal example; the queue depth of 256 is arbitrary):
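#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	struct io_uring ring;

	/* 256 is an arbitrary example queue depth. liburing calls
	 * io_uring_setup(), mmaps the SQ/CQ rings and stores the
	 * user-space bookkeeping in 'ring'. */
	int ret = io_uring_queue_init(256, &ring, 0);
	if (ret < 0) {
		fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
		return 1;
	}

	/* ... submit and reap requests through 'ring' ... */

	io_uring_queue_exit(&ring);
	return 0;
}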
io_uring_queue_init() is the most common way of initializing a uring queue. It creates a context both in the kernel and in user space. The user-space uring context is stashed in the ‘ring’ argument and contains state that the process needs when submitting requests. This ‘ring’ variable is the object that gets corrupted or goes stale when used from multiple processes/threads, and liburing doesn’t provide a way around it.
The above is in line with the original hypothesis: the io_uring developers don’t expect a kernel io_uring context to be shared across multiple threads, and possibly even wrongly mark it as something that is unlikely to happen, while the creation of new io_uring contexts is facilitated at great lengths.
To get around these liburing limitations, one has to bring in shared memory and have the liburing object mmapped so that changes to it are visible to the different threads or processes involved.
Similar code has to be written even when accessing io_uring directly rather than through liburing: to use one kernel io_uring object across multiple threads or processes, changes to items like the completion queue head/tail indices have to be visible to all the threads that are using the queue. There are other possible solutions, but mmapping the objects into shared memory solves the problem easily enough.
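As a rough illustration of that idea, here is a minimal sketch that shares one liburing ring between a parent and a forked child by placing the struct io_uring in a MAP_SHARED mapping. The layout is mine, not taken from any of the projects discussed here, and a real program would also wrap submissions in a process-shared lock, which is omitted for brevity:

#include <liburing.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* Put the liburing bookkeeping object in shared memory so that the
	 * parent and the child see the same cached SQ/CQ indices. */
	struct io_uring *ring = mmap(NULL, sizeof(*ring),
				     PROT_READ | PROT_WRITE,
				     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (ring == MAP_FAILED || io_uring_queue_init(8, ring, 0) < 0)
		return 1;

	pid_t pid = fork();
	if (pid == 0) {
		/* Child: submit a NOP through the shared ring. A real
		 * program must serialize this against the parent. */
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
		io_uring_prep_nop(sqe);
		io_uring_submit(ring);
		_exit(0);
	}

	waitpid(pid, NULL, 0);

	/* Parent: because 'ring' lives in shared memory, its state reflects
	 * the child's submission, so the parent can reap the completion and
	 * keep using the same context. */
	struct io_uring_cqe *cqe;
	if (io_uring_wait_cqe(ring, &cqe) == 0)
		io_uring_cqe_seen(ring, cqe);

	io_uring_queue_exit(ring);
	return 0;
}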
How are other applications using io_uring
A number of applications have already added support for io_uring, and this article was written largely with an eye to how Wine could incorporate io_uring support for possible performance improvements. Wine runs as a number of threads and will probably spawn new threads when running applications. This makes the interaction of io_uring with user-space threads a priority topic.
Other applications such as Netty and Qemu have also already added io_uring support. Let’s take a look at how they do it, though these processes may not be as multithreaded as Wine.
Qemu
Qemu is a PC emulator which is routinely used alongside hypervisors like KVM and Xen to allow running of virtual machines.
Qemu seems to create a global io_uring object which, presumably, is meant to be used across multiple threads. This approach can be seen in the Qemu source.
A global object is created, and since this code is written with multi-threading in mind, the object is meant to be used by multiple threads. Looking back at the gist of this article, this would essentially mean that the path marked ‘unlikely’, which creates a work queue, executes with increased likelihood. At the same time, the code that is not tagged ‘unlikely’ seems to execute only about as often as the work queue creation path. Wouldn’t this mean that the work queue creation path shouldn’t be tagged with ‘unlikely’?
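For illustration only (this is not Qemu’s actual code), a process-global ring shared by several threads could look roughly like the sketch below. Since threads share an address space, the struct io_uring itself doesn’t need to be mmapped; what’s needed is a lock around the shared bookkeeping:

#include <liburing.h>
#include <pthread.h>

static struct io_uring global_ring;	/* one ring for the whole process */
static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;

/* Every worker thread funnels its submissions through the global ring;
 * the mutex serializes access to the shared user-space state. */
static void *worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&ring_lock);
	struct io_uring_sqe *sqe = io_uring_get_sqe(&global_ring);
	if (sqe) {
		io_uring_prep_nop(sqe);
		io_uring_submit(&global_ring);
	}
	pthread_mutex_unlock(&ring_lock);
	return NULL;
}

int main(void)
{
	if (io_uring_queue_init(64, &global_ring, 0) < 0)
		return 1;

	pthread_t t[2];
	for (int i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < 2; i++)
		pthread_join(t[i], NULL);

	io_uring_queue_exit(&global_ring);
	return 0;
}

Note that in a layout like this, each worker thread’s first submission takes the branch marked ‘unlikely’ (its task context and work queue get created), while the untagged branch runs just as often, once per thread.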
Netty
It is not immediately clear whether Netty is a multithreaded application or a single-threaded one. However, it also only creates a single io_uring context/setup and uses it throughout the lifetime of the application. Based on its io_uring initialization code, it is almost obvious that Netty only creates a single in-kernel io_uring context. If there happen to be multiple threads, then they’re presumably supposed to submit from this one context and synchronize around it. I hate to keep repeating this, but, once again, liburing.h doesn’t support this kind of design, and this kind of design is also downplayed in the kernel and marked as ‘unlikely’.
Postgresql
Postgresql recently added io_uring support to its code. It uses a shared-memory approach which seems similar to the solution described above for sharing user-space uring contexts across threads, and, for the first time among these applications, we see one that creates more than one kernel-space uring context.
The above can be observed mostly from the relevant parts of the Postgresql source. However, these io_uring contexts each seem to belong to a different user-space thread, and consequently this doesn’t break our original hypothesis.
Notably, and perhaps unsurprisingly, all three of these applications, and probably others, do not make use of liburing. This is probably because they want fine-grained control over their use of io_uring, and possibly also because liburing seems to cater mostly to people who don’t wish to understand the details of how io_uring operates. That might be good for saving time when writing an application, but it’s definitely a bad idea, since you need a good understanding of how io_uring operates in order to take full advantage of it.
Conclusion
io_uring should make provision for sharing uring contexts across threads and not downplay such an application design, as the relevant Linux kernel code seems to do.