Wednesday, November 26, 2008

Debugging HelenOS with QEMU and GDB

It is possible to use the combination of QEMU and GDB to debug a live HelenOS system. You need to have the respective cross-GDB ready if you intend to debug the system on a non-native architecture. You start by running QEMU like this:
$ qemu -cdrom image.iso -s -S
The lower-case -s tells QEMU to listen for a GDB connection on TCP port 1234. The capital -S instructs QEMU not to start the simulation until you enter c at the GDB prompt.

Once you have QEMU waiting for a connection at port 1234, do the following:
(gdb) target remote localhost:1234
After that, you should see something like this:
Remote debugging using localhost:1234
0x0000fff0 in ?? ()
(gdb)
GDB is now ready and you can start the simulation, but you still don't have any symbols. GDB allows you to load the symbol information using the symbol command:
(gdb) symbol kernel/kernel.raw
Instead of the kernel symbols, you can load the symbols of any of the userspace ELF binaries, but bear in mind that the symbol command replaces whatever symbols were loaded before, so you can have only one set of symbols loaded at a time. Now you are ready to proceed and can start the simulation:
(gdb) c
Later you can break back to the debugger by pressing Ctrl+C in the GDB window.
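If you repeat this session often, the two command lines can be generated by a small script. The following is only a minimal sketch, not something taken from the HelenOS tree: the function names qemu_argv and gdb_argv are made up for illustration, the binary names qemu and gdb are assumptions (a cross-GDB will be named differently), and symbol-file is the full name of the command abbreviated as symbol above. The script only builds and prints the commands; it does not run them.

```python
import shlex

# Port on which QEMU's -s option opens the GDB stub.
GDB_PORT = 1234


def qemu_argv(image="image.iso", qemu="qemu"):
    """Build the QEMU command line used in this post (hypothetical helper)."""
    # -s opens the GDB stub on TCP port 1234,
    # -S keeps the CPU stopped until GDB continues it.
    return [qemu, "-cdrom", image, "-s", "-S"]


def gdb_argv(symbols="kernel/kernel.raw", gdb="gdb"):
    """Build a GDB command line that connects and loads symbols (hypothetical helper)."""
    # Each -ex option makes GDB execute one command at startup, in order.
    return [
        gdb,
        "-ex", f"target remote localhost:{GDB_PORT}",
        "-ex", f"symbol-file {symbols}",
    ]


if __name__ == "__main__":
    # Print both commands in a form that can be pasted into two terminals.
    print(shlex.join(qemu_argv()))
    print(shlex.join(gdb_argv()))
```

Start the QEMU command first and the GDB command second; once GDB has connected, you still have to type c yourself to start the simulation.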

This method gives you nice debugging features for the QEMU targets on which HelenOS runs. One problem is that the simulator cannot distinguish the execution of the kernel from that of the individual userspace tasks, so if you single-step long enough, there will be some context switches that you won't be able to filter out. In this light, the approach seems to be most suitable for debugging the kernel.

Saturday, November 8, 2008

Thoughts about observability

My English language spell-checker tells me that there is no such word as observability. In the Solaris technical parlance, we use this term to refer to the software features that allow one to have a pretty good idea about what the system was doing at some point in the past, be it at the time of a crash or earlier in the history of the running system, and also what the system is doing right now. There are many observability tools and techniques, some more sophisticated than others.

For instance, Linus Torvalds, like a great many of his fellow Linux kernel developers and Linux users, has said (and this is not an exact quotation) that real men use debug prints and their brain to debug something as complex as the Linux kernel. I last saw someone else use these words yesterday on the Czech Linux site root.cz. Linus has also said, at least once, that Solaris is a piece of crap mainly because it has observability features that go far beyond the debug-print-and-brain technology. According to Linus, the holy grail of debugging is:
So forget about it. The whole model is totally broken. We need to make
bug-reports short and sweet, enough so that random people can
copy-and-paste them into an email or take a digital photo. Anything else
IS TOTALLY INSANE AND USELESS!
Frankly, sometimes the crash happens too early for a crash dump to be generated, so we occasionally also receive digital photos of the panic messages from our customers or from the OpenSolaris community members, but you hardly ever get enough information from a photo. What you really want to have is a crash dump, because in 99.999% of cases it contains all the state that led to the crash.

In Solaris the debugger is called mdb or kmdb, depending on whether you are looking into a coredump, a crashdump, a running system or a running system in single-stepping mode. Solaris has had a kernel debugger for ages and for my job, it is the ultimate tool without which I would be condemned to looking into a crystal sphere^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H digital photos.

A colleague of mine, Vineeth Pillai, is sustaining Sun's SAM QFS product. Besides the Solaris modules, he also has some Linux kernel modules to take care of. Whenever there is a problem with the Linux kernel module, he is out of luck, because he doesn't get the usual Solaris crash goodies and must resort to the printk() workaround. True, there has been a kernel debugger in Linux since 2.6.26, but the customers are not running that version yet.

My last root-caused bug is a good example of the usefulness of crash dumps. Last week, I worked on a bug which was occurring somewhere between OBP and the kernel. The usual observability tools, kmdb and dtrace, were not an option here due to the special restricted context of the OBP's callback_handler environment. Luckily, I was able to use a feature of Solaris debug kernels called TRAPTRACE. With this feature enabled, the kernel records essential data about each trap that occurs on the system into a kernel buffer. When the system later crashed due to CR 6750765, I was able to reconstruct the flow of events that resulted in the crash using that data and mdb's ::ttrace dcmd, which presents it in a human-readable form. There was no way I could have done without the crashdump and the TRAPTRACE data in this case.

I'd definitely like to see similar observability features in HelenOS one day. So far, we have been using the simulator features in place of a kernel debugger, but that is going to be unsustainable in the future, as the project focus shifts more towards real metal. The trend is looking good, though: we already have the trace task, which can be best described as a variant of the strace command on Linux or the truss command on Solaris.

Monday, November 3, 2008

HelenOS Camp '08 Pictures

The first picture was taken in a restaurant/cafe called Alenka. We chose Alenka because they make good coffee and have free wireless Internet access. It was Saturday morning and we were warming up before we'd climb up to the base camp.


The next picture shows the part of the camp with commit access in the base camp. We already have Vineeth amongst us, so it must be Monday afternoon.



This is not a Yeti, but our Itanium guru Jakub.



Martin, most likely dealing with nonpoisonous snakes like Python.



And this is me, trying to put on some weight and FATten a little more.


And here's Vineeth, raising his ARM.

Netbooting HelenOS on Sun Ultra 60

This is the setup for netbooting HelenOS on a Sun Ultra 60 that worked for me. The image server is currently an Ubuntu 8.10 Linux box.

  1. Install and set up tftpd. This is what my /etc/inetd.conf looks like (make sure it remains on one line):
    tftp dgram udp wait nobody /usr/sbin/tcpd /usr/sbin/in.tftpd -s /tftpboot
  2. Install and set up rarpd. I assume that the Linux box will be on network 192.168.1.0 and the HelenOS/sparc64 box will be given the address 192.168.1.2. In my case, the MAC address of the Ultra is 08:00:20:CD:BD:67. My /etc/ethers is as follows:
    08:00:20:CD:BD:67 192.168.1.2
  3. Create a world-readable /tftpboot and copy the sparc64 image.boot there as file C0A80102. The name is the address 192.168.1.2 written in hexadecimal and is what the HelenOS host will be looking for in that directory.
  4. Interconnect the Ultra with the Linux box using a switch or a cross-cable. Then power on the Ultra and when you reach the ok> prompt, type:
    boot net
If something goes wrong, I suggest that you use tcpdump to monitor the network traffic on your NIC.
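The file name C0A80102 used above is simply the four bytes of the client's IPv4 address written as eight upper-case hexadecimal digits. If your network uses a different address, the name can be computed as in this minimal sketch; the function name tftp_boot_filename is made up for illustration.

```python
import ipaddress


def tftp_boot_filename(ip):
    """Return the IPv4 address as eight upper-case hex digits (hypothetical helper)."""
    # IPv4Address validates the dotted-quad string; int() yields its 32-bit value.
    value = int(ipaddress.IPv4Address(ip))
    # Zero-pad to eight digits so that small octets such as 1 become 01.
    return "%08X" % value


if __name__ == "__main__":
    print(tftp_boot_filename("192.168.1.2"))  # prints C0A80102
```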

Saturday, November 1, 2008

Big microkernel survey

I have just started a new survey on this blog so that I can have a better idea of what features people value the most in microkernels. I hope to receive more feedback than for my previous survey which asked if people would consider contributing to HelenOS.

This time, I provide eleven different options. Because these are mere one-, two- or three-word labels, I'd like to give some explanation of what I meant by each. Hopefully, people will feel encouraged to explain their choices in the survey by attaching comments to this entry.

  1. performance - some microkernels (e.g. L4 variants) seem to be focused on maximizing the IPC performance and reducing the number of different kinds of switches; sometimes assembly is used as the implementation language;
  2. robustness - some microkernels (e.g. Minix) have the ability to detect failures of system services. In case of a failure the service is restarted.
  3. small memory footprint - microkernels tend to be smaller than other operating systems' kernels. As such, they take up less of the computer's memory and can be cache-hot most of the time;
  4. real-time features - RTOS are often implemented as microkernels;
  5. observability - the ability to find out what the system is or was doing; e.g. Solaris is very good at this;
  6. component design - separation and isolation of individual functional components such as the file system services in HelenOS. The components can only communicate via a well-defined protocol;
  7. desktop features - some microkernels (e.g. QNX or XNU) can be used as desktop operating systems;
  8. high-end features - things like SMP support, but also hot-swap, hardware fault management etc.
  9. portability - the ability to run on multiple processor architectures and the ease of porting to a new architecture;
  10. security - a smaller trusted code base should be more secure by virtue of having fewer LOC and thus fewer security bugs;
  11. standard compliance - compliance with POSIX or any other operating system API standard