Monday, February 11, 2013

Horizontal or vertical scalability?

I have just finished reading a networking paper from the MINIX group called Keep Net Working - On a Dependable and Fast Networking Stack. Before getting to the real meat of the paper, I must first grumble a little, because the paper completely omits the fact that HelenOS was the first multiserver microkernel-based system to have a fully decomposed networking stack, as described in Lukáš Mejdrech's 2009 master thesis Networking and TCP/IP stack for HelenOS system and briefly presented by me at FOSDEM 2012. True, the first HelenOS networking stack was not a big success and we have since reimplemented most of it, but HelenOS still has a fully decomposed networking stack. I suspect the authors of the paper must have suffered from temporary amnesia when writing the following words in 2012:
By chopping up the networking stack into many more components than in any other system we know...
To put an end to my grumbling, I should add that the HelenOS networking stack was decomposed neither for the sake of saturating multi-gigabit network interfaces nor for self-healing capabilities, which are the two areas in which the paper does attempt to innovate.

What is more interesting about the paper is its assumption that processor cores are becoming an increasingly cheap resource, which leads it to suggest dedicating individual cores to networking stack components. So far so good: this approach most likely improves the performance of the completely non-scalable legacy MINIX 3 networking stack a great deal. The problem, as I see it, is that the improvement looks like a dead end: there is hardly a way to keep the same design and still deliver further performance gains. The reason I believe this is the evolution of the Solaris 10 networking stack, as described in the FireEngine whitepaper. Judging by the MINIX group paper and the Solaris whitepaper, the two stacks have gone in completely opposite directions.

The Solaris networking stack abandoned horizontal networking stack scalability, in which every component is serviced by a couple of kernel threads that can each run on a different processor, in favor of a model in which the networking stack is multithreaded and each processor de facto has its own local instance of it. This vertical setup allows a packet to be processed entirely on one processor, greatly improving cache utilization. The approach also scales with the number of processors, because the Solaris stack can process as many packets in parallel as there are CPUs without suffering from poor data locality.
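To make the contrast concrete, here is a minimal user-space sketch of the vertical model in C. It assumes POSIX threads plus the Linux-specific pthread_setaffinity_np(); the names (stack_instance, process_packet and so on) are mine and do not correspond to the actual Solaris FireEngine code:

    /*
     * Vertical scaling sketch: one networking stack instance per CPU.
     * All names are illustrative; build with: gcc -pthread vertical.c
     */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NCPUS 4

    struct packet {
        int flow;               /* connection identifier */
        char data[64];
    };

    /* Per-CPU stack instance with its own private protocol state. */
    struct stack_instance {
        int cpu;
        /* ... per-CPU TCP/IP state, timers, counters ... */
    };

    /*
     * The whole stack runs here: IP input, TCP input and socket
     * delivery all execute on the same CPU, so the packet's data
     * never leaves that CPU's cache.
     */
    static void process_packet(struct stack_instance *si, struct packet *p)
    {
        printf("flow %d handled entirely on CPU %d\n", p->flow, si->cpu);
    }

    static void *stack_worker(void *arg)
    {
        struct stack_instance *si = arg;

        /* Pin this stack instance to its CPU. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(si->cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /*
         * A real stack would block on a per-CPU receive queue; we
         * fake a single packet whose flow hashed to this CPU.
         */
        struct packet p = { .flow = si->cpu * 17 };
        process_packet(si, &p);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NCPUS];
        struct stack_instance si[NCPUS];

        /*
         * The NIC (or a software classifier) hashes every connection
         * to one CPU, so all packets of a flow meet the same stack
         * instance and no cross-CPU locking is needed.
         */
        for (int i = 0; i < NCPUS; i++) {
            si[i].cpu = i;
            pthread_create(&tid[i], NULL, stack_worker, &si[i]);
        }
        for (int i = 0; i < NCPUS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

Because a flow always hashes to the same instance, the instances never contend with each other and a packet's data never has to cross cache boundaries; adding a CPU simply adds another independent copy of the stack.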

The MINIX group paper, on the other hand, suggests wiring each networking stack component to a dedicated processor, so that a packet travels horizontally across several processors as it is being processed by the stack. The paper praises this design's cache utilization, because the stack components can apparently run without kernel intervention most of the time, leaving the caches populated with each component's code and data, but it completely ignores the effects of the packet itself constantly migrating between processors and their caches.
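The same disclaimer applies to this sketch of the horizontal model: the stage names and one-slot handoff queues below are my illustration, assuming POSIX threads, and are not the actual components or IPC mechanism of the paper:

    /*
     * Horizontal scaling sketch: a pipeline of stack components, each
     * pinned to its own dedicated core, handing packets from core to
     * core. Build with: gcc -pthread horizontal.c
     */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    struct packet {
        int id;
        char payload[64];
    };

    /* One-slot handoff queue between adjacent pipeline stages. */
    struct handoff {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        struct packet *pkt;     /* NULL means empty */
    };

    struct stage {
        const char *name;
        int cpu;                /* dedicated core for this component */
        struct handoff *in;
        struct handoff *out;    /* NULL for the last stage */
    };

    static void put(struct handoff *h, struct packet *p)
    {
        pthread_mutex_lock(&h->lock);
        while (h->pkt != NULL)
            pthread_cond_wait(&h->cond, &h->lock);
        h->pkt = p;
        pthread_cond_broadcast(&h->cond);
        pthread_mutex_unlock(&h->lock);
    }

    static struct packet *get(struct handoff *h)
    {
        pthread_mutex_lock(&h->lock);
        while (h->pkt == NULL)
            pthread_cond_wait(&h->cond, &h->lock);
        struct packet *p = h->pkt;
        h->pkt = NULL;
        pthread_cond_broadcast(&h->cond);
        pthread_mutex_unlock(&h->lock);
        return p;
    }

    static void *stage_main(void *arg)
    {
        struct stage *s = arg;

        /* Pin this component to its dedicated core. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(s->cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /*
         * The component's code and local state stay hot in this
         * core's cache, but every packet it receives was just
         * written by another core.
         */
        struct packet *p = get(s->in);
        printf("packet %d now on CPU %d (%s)\n", p->id, s->cpu, s->name);
        if (s->out != NULL)
            put(s->out, p);
        return NULL;
    }

    int main(void)
    {
        static struct handoff q[3] = {
            { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL },
            { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL },
            { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL },
        };
        struct stage st[3] = {
            { "ip",     1, &q[0], &q[1] },
            { "tcp",    2, &q[1], &q[2] },
            { "socket", 3, &q[2], NULL  },
        };
        pthread_t tid[3];

        for (int i = 0; i < 3; i++)
            pthread_create(&tid[i], NULL, stage_main, &st[i]);

        /* The driver side (core 0) hands a packet to the first
         * component; from there it hops core 1 -> core 2 -> core 3,
         * dragging its data through a cold cache at every step. */
        struct packet p = { .id = 42 };
        put(&q[0], &p);

        for (int i = 0; i < 3; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

Each stage's code and local state indeed stay cache-hot on its dedicated core, exactly as the paper argues, but every packet pays for a core-to-core handoff, and the attendant cache misses, at every hop.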

It may be that the costs and benefits of each approach differ between a monolithic kernel such as Solaris and a multiserver MINIX 3 derivative such as the one described in the paper, but one thing is for sure: the lessons learned during the evolution of the Solaris networking stack would make for a very interesting comparison study.