Thursday, November 5, 2009

Reincarnation server for HelenOS?

Some operating systems have the capability to restart a server process after it unexpectedly turns up its toes. In MINIX 3, there is the reincarnation server (RS), which restarts dead or hung services. In Solaris 10, the Service Management Facility (SMF) can do basically the same thing (see 6533008 for more on the planned feature for detecting unresponsive services). Now, when asked whether a similar feature should be implemented in HelenOS, I have to say: "No, thank you!"

I believe these reliability features may lead to system software of poorer quality, because its authors will simply rely on the restart. Beyond that, I have seen too many bugs that had the potential to render the system unusable precisely because of the restarting.

In my case, these were all bugs in the start-up phase of the fault management daemon (fmd), which I happen to be sustaining as part of my job. You can find similar reports here by searching for "fmd full". The problem with fmd is that when it starts, it tries to process a backlog of unprocessed events, which is a non-trivial task. As with any complex task, there are inevitably bugs. So when fmd hits a bug in this phase, it dumps core and SMF restarts it. Then fmd starts to process the same unprocessed events again and dumps core again. This goes on and on until the core files fill up the /var file system and the customer calls Sun.
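
To make the loop concrete, here is a toy simulation in Python. It is only a sketch of the failure mode described above, not how fmd or SMF actually work: the function name, the "bad" event marker, and the sizes are all made up for illustration. The point it tries to show is that the backlog survives every restart, so the same crash repeats until there is no room left for another core file.

```python
def simulate_restart_loop(backlog, free_space, core_size):
    """Toy model of a supervisor that blindly restarts a crashing service.

    backlog    -- list of pending events; the hypothetical marker "bad"
                  stands for an event that triggers a start-up bug
    free_space -- space left on the file system holding core files
    core_size  -- space consumed by each core dump

    Returns (restarts, free_space_left, outcome).
    """
    restarts = 0
    while True:
        # The service starts and replays the whole backlog. Nothing
        # removes the offending event, so every start sees it again.
        crashed = "bad" in backlog
        if not crashed:
            return restarts, free_space, "running"

        # The crash wants to leave a core file behind.
        if free_space < core_size:
            return restarts, free_space, "disk full"
        free_space -= core_size

        # The supervisor restarts the service unconditionally.
        restarts += 1


if __name__ == "__main__":
    restarts, left, outcome = simulate_restart_loop(
        backlog=["ok", "bad", "ok"], free_space=100, core_size=30
    )
    print(restarts, left, outcome)
```

With a clean backlog the function returns immediately with zero restarts. With a single bad event in the backlog, the number of restarts is bounded only by how many core files fit on the file system.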