Monday, August 22, 2011

My trace in HelenOS Camp 2011

The coding camp is over and it's time to look back a little and describe the things that went into the mainline. For my part, the most interesting and also the most time consuming bits were related to fixing ticket #259.

The gist of the ticket was to change the way open files are passed from one task to another, typically when spawning a new task and passing it its standard input, output and error files. Previously, this was achieved by exporting VFS-internal information - the VFS triplet composed of the file system handle, device handle and file system index - to the donor task. The donor task then sent this triplet via IPC to the acceptor task, which in turn used it to open the respective files.

There were several problems with this approach, the most serious being that I didn't quite like it. But other than that, it was implemented as an afterthought in the absence of better alternatives. Reopening the file is not exactly the same thing as passing its open handle, because the file handle is associated with a position, an append-on-write flag and a mode (not currently supported in HelenOS). If the donor closed its handle or did not even have the file open, the mechanism was susceptible to races.

In ticket #259 I called for a first-class VFS operation for passing file handles. The result of the operation should be that VFS duplicates a donor's file handle in the acceptor's table of open files and informs the acceptor about the new handle. Now, the real problem of course is how VFS learns about the intentions of its two clients and how it locates their data. We also want to prevent VFS from injecting a file handle against the will of the respective acceptor and, vice versa, from importing a file handle against the will of the respective donor.

I started drafting the authorization protocol about a month before the camp during one of the rather frequent electricity blackouts. At that time I thought of introducing a specialized system IPC message for passing any sort of a handle in general between two clients of the same server. The message, something like IPC_M_PASS_HANDLE_AUTHORIZE, would be understood by the kernel (just like any other system IPC message). In this case, the kernel would be the trustworthy entity certifying that the donor task agreed with the acceptor task on passing a handle. I planned to store such a certificate, along with identification of both parties and the handle, in the donor's kernel address space, from where it would be later picked up by the donor sending another system message, for example IPC_M_PASS_HANDLE_REALIZE, to VFS. It was this second message which should make VFS perform the transfer. The only problematic question was how to organize such certifications within the donor. Should it be limited to a use-once right or rather should each task have a container for storing and managing certifications? Or should the kernel send the second message itself on behalf of the donor?

With the basic idea and some unanswered questions, I joined the camp (technically the camp started while I was already there sitting at my dining table).

I discussed my idea with Jiri and Martin to get some feedback. Jiri was traditionally skeptical and was frowning at me. Apart from having an alternative idea of turning HelenOS into a capability-based system (sigh), he also advised me to make the whole mechanism more generic. Instead of passing some sort of a handle, why not let the two sides agree on a protocol-specific change of externally kept state? I liked his idea, so I moved from IPC_M_PASS_HANDLE_AUTHORIZE to IPC_M_STATE_CHANGE_AUTHORIZE and reserved the first three arguments of the message for the protocol-specific definition of the desired change of state. Martin was influential in solving the question of how the acceptor learns about the new state. I originally intended that the donor would tell the acceptor, but there was a purely hypothetical risk of the donor intentionally misleading the acceptor by providing false information. I am not so sure why the donor would bother with such a mystification, but anyway, I came to the realization that this final phase of the exchange is in fact an optional, protocol-specific synchronization between the acceptor and the server.



At this point (or slightly before that), I started coding. I first implemented the IPC_M_STATE_CHANGE_AUTHORIZE message and its kernel and async framework support. The donor sends this message to the acceptor, which may be (and usually is) the child task. It packs the protocol-specific identification of the desired change of state into the message. Here comes one important detail: the donor also needs to identify the server process by presenting an open phone to it in the message. When the message is sent, the kernel remembers the server phone and passes the message to the acceptor. The acceptor now has a chance to reject the offer or agree to it. If the acceptor agrees, it too needs to present an open phone to the same server in the answer. The kernel now checks that both the donor and the acceptor mean the same server task by looking at the two phones. If there is a match, it sends a per-task kernel event notification EVENT_TASK_STATE_CHANGE to the server task. Note that the use of a kernel notification here rids us of the problem of storing the certification in the donor task.
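The kernel-side check described above can be sketched in a few lines of C. This is only an illustration under simplifying assumptions, not the HelenOS kernel code: the types phone_t, state_change_t and task_event_t and the function state_change_authorize() are hypothetical names made up for this sketch, and the "notification" is just a filled-in structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified types; not the real HelenOS kernel structures. */
typedef uint64_t task_id_t;

typedef struct {
	bool connected;     /* is the phone open? */
	task_id_t callee;   /* task at the other end of the phone */
} phone_t;

typedef struct {
	task_id_t donor;
	task_id_t acceptor;
	const phone_t *donor_phone;  /* remembered when the request is sent */
	uint64_t args[3];            /* protocol-specific change of state */
} state_change_t;

/* Stand-in for the per-task event delivered to the server. */
typedef struct {
	bool pending;
	task_id_t server;
	task_id_t donor;
	task_id_t acceptor;
	uint64_t args[3];
} task_event_t;

/*
 * Called when the acceptor answers positively and presents its own phone.
 * The event is raised only if both phones are open and lead to the same
 * server task.
 */
bool state_change_authorize(const state_change_t *req,
    const phone_t *acceptor_phone, task_event_t *event)
{
	assert(req != NULL && event != NULL);

	event->pending = false;

	if (req->donor_phone == NULL || acceptor_phone == NULL)
		return false;
	if (!req->donor_phone->connected || !acceptor_phone->connected)
		return false;
	if (req->donor_phone->callee != acceptor_phone->callee)
		return false;   /* the two sides mean different servers */

	event->pending = true;
	event->server = acceptor_phone->callee;
	event->donor = req->donor;
	event->acceptor = req->acceptor;
	for (size_t i = 0; i < 3; i++)
		event->args[i] = req->args[i];

	return true;
}
```

The point of the sketch is that neither client alone can trigger the event: the donor supplies one phone with the request, the acceptor supplies the other with its answer, and only the kernel compares them.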

For this to be possible, I had to implement support for per-task kernel event notifications, as HelenOS only supported global kernel event notifications. The difference is that in the case of a global kernel notification only one task can subscribe to it (i.e. one task receiving e.g. the EVENT_KLOG notification), while it is technically possible for each task to subscribe to any of its own per-task events. In this regard, per-task kernel events resemble POSIX signals.
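To make the difference concrete, here is a minimal sketch of the two subscription models. All names in it (global_event_t, task_t, TASK_EVENT_STATE_CHANGE, the two subscribe functions) are hypothetical and chosen for illustration; the real kernel data structures look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task event types for this sketch. */
typedef enum {
	TASK_EVENT_STATE_CHANGE,
	TASK_EVENT_COUNT
} task_event_type_t;

/* Each task carries its own subscription slots. */
typedef struct task {
	int id;
	bool subscribed[TASK_EVENT_COUNT];
} task_t;

/* A global event has exactly one subscriber slot for the whole system. */
typedef struct {
	task_t *subscriber;
} global_event_t;

/* Succeeds only for the first task that asks. */
bool global_event_subscribe(global_event_t *ev, task_t *task)
{
	assert(ev != NULL && task != NULL);
	if (ev->subscriber != NULL)
		return false;
	ev->subscriber = task;
	return true;
}

/* Succeeds once per task, independently of what other tasks do. */
bool task_event_subscribe(task_t *task, task_event_type_t type)
{
	assert(task != NULL);
	if (task->subscribed[type])
		return false;
	task->subscribed[type] = true;
	return true;
}
```

With two tasks, the second call to global_event_subscribe() on the same event fails, while both tasks can call task_event_subscribe() successfully for their own copy of the event.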

Back to the server task receiving the EVENT_TASK_STATE_CHANGE notification. The notification carries information necessary for the server to locate both the donor's and acceptor's client data and the description of the desired change of state. In connection with accessing some task's client data from the notification context, it was necessary to add functions for managing external references to the client connection tracking structures.

The VFS server handles the EVENT_TASK_STATE_CHANGE notification by looking up the donor's and acceptor's client data, which contain the respective tables of open files. It performs several sanity checks to verify that the file handle exists in the donor's name space and that there is a free file descriptor in the acceptor's name space. If everything looks good, it clones the vfs_file_t structure and points it to the underlying VFS node. Finally, it enqueues the new file handle in the acceptor's client data and broadcasts its condition variable to wake up any clients that may be waiting for it.
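The handle duplication itself can be sketched as follows. This is a simplified model, not the actual VFS server code: node_t, file_t, client_t, MAX_OPEN_FILES and pass_handle() are hypothetical stand-ins (file_t plays the role of vfs_file_t), and the enqueueing and condition variable broadcast are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_OPEN_FILES 4   /* tiny table, for illustration only */

/* Hypothetical stand-ins for the VFS node and open file structures. */
typedef struct {
	unsigned refcnt;
} node_t;

typedef struct {
	node_t *node;   /* underlying VFS node, shared by all clones */
	long pos;       /* current position */
	bool append;    /* append-on-write flag */
} file_t;

typedef struct {
	file_t *files[MAX_OPEN_FILES];   /* per-client table of open files */
} client_t;

/*
 * Duplicate the donor's open file donor_fd into the acceptor's table.
 * Returns the new descriptor in the acceptor's name space, or -1.
 */
int pass_handle(client_t *donor, int donor_fd, client_t *acceptor)
{
	assert(donor != NULL && acceptor != NULL);

	/* Sanity check: the handle must exist in the donor's name space. */
	if (donor_fd < 0 || donor_fd >= MAX_OPEN_FILES ||
	    donor->files[donor_fd] == NULL)
		return -1;

	/* Sanity check: the acceptor must have a free descriptor. */
	int fd = -1;
	for (int i = 0; i < MAX_OPEN_FILES; i++) {
		if (acceptor->files[i] == NULL) {
			fd = i;
			break;
		}
	}
	if (fd < 0)
		return -1;

	/* Clone the open file, keeping position and flags. */
	file_t *clone = malloc(sizeof(*clone));
	if (clone == NULL)
		return -1;
	*clone = *donor->files[donor_fd];
	clone->node->refcnt++;   /* one more open file points to the node */

	acceptor->files[fd] = clone;
	return fd;
}
```

Note how this differs from the old triplet approach: the acceptor's copy starts with the donor's position and append flag instead of with a freshly opened file.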

This brings us to the details of the acceptor-server synchronization. For this purpose, VFS implements a new interface VFS_IN_WAIT_HANDLE and libc makes it available to clients via the fd_wait() function. Clients invoking fd_wait() will simply block on their VFS client data condition variable until the new file descriptor appears. VFS clients call fd_wait() after the reply to IPC_M_STATE_CHANGE_AUTHORIZE and before replying to the leading message such as LOADER_SET_FILES. Without a synchronization like this, the donor could exit before the server has a chance to duplicate its handle, or it may be too early for the acceptor to query its new state.
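The blocking behaviour on the server side follows the usual condition variable pattern. The sketch below uses POSIX threads as a stand-in for the HelenOS fibril synchronization primitives, and a single-slot "queue" instead of a real list. The names client_data_t, enqueue_handle() and wait_handle() are hypothetical and exist only in this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NO_HANDLE (-1)

/* Hypothetical per-client data; a one-slot queue keeps the sketch short. */
typedef struct {
	pthread_mutex_t lock;
	pthread_cond_t cv;
	int passed_fd;   /* NO_HANDLE until the server enqueues a handle */
} client_data_t;

void client_data_init(client_data_t *c)
{
	assert(c != NULL);
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cv, NULL);
	c->passed_fd = NO_HANDLE;
}

/* Server side: called after the handle was duplicated for this client. */
void enqueue_handle(client_data_t *c, int fd)
{
	pthread_mutex_lock(&c->lock);
	c->passed_fd = fd;
	pthread_cond_broadcast(&c->cv);   /* wake up any waiters */
	pthread_mutex_unlock(&c->lock);
}

/* Handler of the wait request: block until a handle has arrived. */
int wait_handle(client_data_t *c)
{
	pthread_mutex_lock(&c->lock);
	while (c->passed_fd == NO_HANDLE)
		pthread_cond_wait(&c->cv, &c->lock);
	int fd = c->passed_fd;
	c->passed_fd = NO_HANDLE;   /* consume the handle */
	pthread_mutex_unlock(&c->lock);
	return fd;
}
```

Because the predicate is rechecked in a loop, it does not matter whether the handle is enqueued before or after the client starts waiting; in both cases wait_handle() returns the passed descriptor.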

I tried to illustrate the whole communication in a picture above. As a whole, it may seem more complicated than the previous solution that simply passed VFS triplets around, but it is definitely cleaner. Moreover, HelenOS now features a universal mechanism for achieving a change of externally kept state between two clients of an arbitrary server.

Friday, August 12, 2011

Family background of synchronization primitives

If you want to have a good Friday read about synchronization primitives and enjoy a couple of fitting analogies, make sure to read this LWN article. Maybe you will want to learn more about Seqlocks, just as I did after reading the article.