Wednesday, February 6, 2008

Lestes wishlist

This is an incomplete list of things I would like to see in the Lestes project, sorted by achievability (i.e. in the order in which I expect to be able to achieve them):
  • integration of my GAS assembler generator patches
  • retarget the compiler to mips32
  • remove unnecessary dependencies on the backend from the frontend
  • make use of the branch delay slots for useful instructions
  • support for 64-bit long long type
  • retarget the compiler to amd64
  • full support for ia32's and amd64's disp(base, index, scale) addressing in the generated code
  • fix semantic errors (e.g. long long long long type)
  • implement the missing logical operators (i.e. || and &&)
  • implementation of the 'cee' frontend
  • retarget the compiler to other processors, especially those supported by HelenOS
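
For the disp(base, index, scale) item in the list above: on ia32 and amd64 such an operand denotes the address base + index * scale + disp, where scale is 1, 2, 4 or 8. A minimal sketch of that calculation in Python follows. The function name and the 32-bit wraparound default are my own assumptions for illustration and are not taken from Lestes.

```python
def effective_address(disp, base, index, scale, bits=32):
    """Address denoted by an x86 disp(base, index, scale) memory operand.

    Hypothetical helper for illustration only; not part of Lestes.
    """
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    # The sum wraps around at the address width (32 bits on ia32).
    return (base + index * scale + disp) & ((1 << bits) - 1)


if __name__ == "__main__":
    # 8(%ebx, %esi, 4) with ebx = 0x1000 and esi = 3
    print(hex(effective_address(8, 0x1000, 3, 4)))
```

Using this form in generated code lets an access such as a[i] with 4-byte elements become a single memory operand instead of separate shift and add instructions.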

Sunday, February 3, 2008

File system support for HelenOS

I feel we are close to a new HelenOS release which will bring basic file system support. So what has been brewing in the repository for some time already? We basically have two or three components that the release will make available for review. Namely, we have VFS, which is the place where all information about registered file systems and open files lives. VFS also provides a single interface to the rest of the world through which it is possible to communicate with the underlying file system implementations. We also have a prototype implementation, TMPFS, which is a file system that lives completely in virtual memory and has no disk layout. I have also started FAT support, but it is now resting in favor of work that needs to be done on TMPFS. Finally, we have DEVMAP, which will register block devices and provide connection services to them.

So what is already working? We can do simple mounts of TMPFS as a root file system. We can look up files and directories on TMPFS. We can also open(), read(), write(), lseek() and ftrunc() files and opendir(), readdir(), rewinddir(), closedir() and mkdir() directories.
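
The sequence of calls above can be tried out on any POSIX-like host. The sketch below uses Python's os module on a temporary directory rather than HelenOS itself, so it only illustrates the call pattern. Here os.ftruncate() stands in for the ftrunc() mentioned above, os.listdir() stands in for the opendir()/readdir() family, and the file name is made up.

```python
import os
import tempfile


def exercise_file_ops(directory):
    """Create, write, seek, read and truncate a file in 'directory'.

    Returns the bytes read back, the final file size and the directory listing.
    """
    path = os.path.join(directory, "testfile")  # hypothetical file name
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.write(fd, b"hello, tmpfs")
        os.lseek(fd, 7, os.SEEK_SET)   # reposition to byte offset 7
        tail = os.read(fd, 5)          # read the next five bytes
        os.ftruncate(fd, 5)            # shrink the file to five bytes
        size = os.fstat(fd).st_size
    finally:
        os.close(fd)
    return tail, size, sorted(os.listdir(directory))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "subdir"))
        print(exercise_file_ops(tmp))
```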

At this point it is easier to say what functionality is missing. We don't yet support rmdir(), rename(), unlink() or close().

You may say: so what, TMPFS will lose data on the next reboot, so it's like having no file system support at all. The reason for doing TMPFS first is, of course, to develop the VFS and libfs frameworks so that other file systems can be added faster and with more ease. It will also help to blunt the sharp edges of the current implementation.

Fixing tmpfs in Solaris

A year ago, I started to work on a problem one of our customers had with the tmpfs file system. Since the problem has already been fixed in all relevant Solaris and OpenSolaris releases, I feel I can share some of the more interesting technical bits.

The problem, as laid out by the customer, was that the system would hang if an attempt was made to fill a tmpfs filesystem (e.g. /tmp). In other words, a mere:

dd if=/dev/zero of=/tmp/testfile

would slowly hang the system.

During the course of the investigation, we actually found two bugs that contributed to this hang, although the system would usually be only extremely slow and unresponsive rather than completely hung.

The first bug showed up when one process held the majority of all available memory pages in the dirty state and wanted to allocate yet another page for itself. When there were no more available pages, this process got blocked waiting for the pageout thread to page out the dirty pages to swap and make them available again. The pageout daemon walks the pages and when it finds a dirty page, it uses the respective vnode's putpage routine to do the actual pageout. In the case of tmpfs, the putpage routine has a check that verifies that the tn_contents rwlock (i.e. the lock protecting the contents of the tmpnode to which the dirty page belongs) is not being held. If it is, putpage simply gives up on this page and moves on to another one. Now, the problem was that when the dd process went to sleep, it held the tn_contents lock of /tmp/testfile from our example command. Moreover, almost every single page in the system belonged to this file and was dirty. The result? The pageout daemon could not make any forward progress, as it had to give up paging out the majority of pages due to the held tn_contents lock, and the dd process could not unlock the tn_contents lock because it was waiting for another page.
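
The stall can be modelled in a few lines. The following is a toy, single-threaded Python simulation, not Solaris code: threading.Lock stands in for the tn_contents rwlock, and the TmpNode and Page classes and the pageout_pass() function are hypothetical names chosen for illustration. It shows a pageout pass cleaning nothing while the writer holds the lock.

```python
import threading


class TmpNode:
    """Toy stand-in for a tmpnode; 'contents_lock' models tn_contents."""

    def __init__(self, name):
        self.name = name
        self.contents_lock = threading.Lock()


class Page:
    """Toy page that starts out dirty and belongs to one TmpNode."""

    def __init__(self, owner):
        self.owner = owner
        self.dirty = True


def pageout_pass(pages):
    """Try to clean every dirty page; skip pages whose owner is locked.

    Returns the number of pages actually cleaned.
    """
    cleaned = 0
    for page in pages:
        if not page.dirty:
            continue
        lock = page.owner.contents_lock
        # Models the putpage check: never wait for the lock, just give up.
        if not lock.acquire(blocking=False):
            continue
        try:
            page.dirty = False
            cleaned += 1
        finally:
            lock.release()
    return cleaned


if __name__ == "__main__":
    testfile = TmpNode("/tmp/testfile")
    pages = [Page(testfile) for _ in range(8)]
    testfile.contents_lock.acquire()   # the writer sleeps holding the lock
    print("cleaned while locked:", pageout_pass(pages))
    testfile.contents_lock.release()
    print("cleaned after unlock:", pageout_pass(pages))
```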

This bug got fixed by modifying the wrtmp() routine, which is on the write(2) execution call path. The fix simply dropped the tn_contents lock before the thread in wrtmp() would get blocked waiting for a page and reacquired it later. Nevertheless, the problem didn't go away completely and we learned we only cured half of it (for the side effects of this cure, read on), but maybe the more serious half.
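
The shape of that fix can be reduced to a toy Python model. This is not the actual wrtmp() code: the class and method names are hypothetical, and threading.Lock replaces the rwlock. The writer releases the contents lock before waiting for a free page, which lets the pageout pass reclaim its dirty pages, and takes the lock again afterwards.

```python
import threading


class ToyTmpfsFile:
    """Toy model of one tmpfs file on a machine with few pages."""

    def __init__(self, total_pages):
        self.contents_lock = threading.Lock()   # models tn_contents
        self.free_pages = total_pages
        self.dirty_pages = 0
        self.size = 0                            # pages written so far

    def pageout(self):
        """Page out dirty pages unless the contents lock is held."""
        if not self.contents_lock.acquire(blocking=False):
            return 0                             # give up, as putpage does
        try:
            reclaimed = self.dirty_pages
            self.free_pages += reclaimed
            self.dirty_pages = 0
            return reclaimed
        finally:
            self.contents_lock.release()

    def write_page(self, drop_lock_while_waiting):
        """Append one page; return False if no page could be obtained."""
        self.contents_lock.acquire()
        try:
            if self.free_pages == 0:
                if drop_lock_while_waiting:
                    # The fix: do not hold the lock while waiting for memory.
                    self.contents_lock.release()
                    try:
                        self.pageout()           # stands in for the wait
                    finally:
                        self.contents_lock.acquire()
                else:
                    self.pageout()               # cannot make progress
            if self.free_pages == 0:
                return False                     # the real thread would block
            self.free_pages -= 1
            self.dirty_pages += 1
            self.size += 1
            return True
        finally:
            self.contents_lock.release()
```

With drop_lock_while_waiting=False the model stops making progress once the free pages run out; with True it keeps going. Note that anything may change while the lock is dropped, which is where the side effects discussed later come from.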

It turned out that a fix for large ISM segments on systems with ZFS had introduced a change in the reservation of anonymous memory which was a regression for tmpfs, and addressing it was the second half of the required solution. After the addition of ZFS, databases which needed to create large shared mappings started to fail (i.e. were unable to create these large shared mappings) due to ZFS caching memory which would otherwise be available for the shared segment. This was fixed by inserting a call that reaped several different caches in front of the test in anon_resvmem() that checks the amount of reservable memory (i.e. availrmem) and fails if the anon_resvmem() request cannot be satisfied. The problem is that this fix made it much harder for a tmpfs allocation to fail. Additionally, the procedure which reaped the caches was pretty heavyweight and contained a loop, which could delay every anon_resvmem() by as much as 60 seconds! What? I said the theoretical maximum delay per request was 60 seconds. In the lab, I was able to reproduce this and, using a clever dtrace script, I measured a maximum delay of 13 seconds, which is still pretty horrible.

I fixed this by selectively disabling the cache-reap for tmpfs completely so that now, when there is no reservable memory, the dd command above will fail. Users can still guarantee tmpfs reservations by creating disk swap of sufficient size. Everything beyond the disk swap size is not guaranteed, and the reservation may fail. Note that this behaviour is consistent with the documentation for swap.

As it turned out, the fix for the latter bug was sufficient because it prevented the former bug from occurring. But it was difficult to see the latter bug without first fixing the former one. Moreover, the customer didn't make use of the possibility to cap the tmpfs size with the "size" mount option, which would have also solved the problem.

Finally, let me go back to the side effects of dropping the tn_contents lock. Due to the implementation of wrtmp(), the act of increasing the file's size and creating the new portion of its content was no longer atomic as seen from the perspective of a process which writes to a tmpfs file, mmaps its end and tries to read it at the same time. Although documented and forbidden, such behaviour was a nuisance. So I slightly reordered wrtmp() and putback the fix for this last Tuesday.

As of now, I am still not completely done with tmpfs and will come back to this topic later, when there is a little more to add.

New assembler generator for Lestes

During the past week, I've been playing with the Lestes compiler again. I found it pretty unsatisfactory that the compiler could generate .asm files only in the nasm syntax. I decided to try to add the ability to generate gas syntax as well.

Besides being different assemblers, nasm and gas expect different syntaxes: the former uses what is called Intel syntax, while the latter uses AT&T syntax. The main difference between the two is the swapped order of instruction operands and a different way of specifying memory operand sizes.
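
To make the operand swap concrete, here is a toy Python converter for two-operand instructions on 32-bit registers and immediates. It is only an illustration under those assumptions, with made-up function names; Lestes itself does not translate text like this, it generates each syntax from its machine description files.

```python
import re

_REGISTERS_32 = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}


def _att_operand(operand):
    """Translate one Intel-syntax register or immediate to AT&T syntax."""
    operand = operand.strip().lower()
    if operand in _REGISTERS_32:
        return "%" + operand           # registers get a '%' prefix
    if re.fullmatch(r"-?(0x[0-9a-f]+|\d+)", operand):
        return "$" + operand           # immediates get a '$' prefix
    raise ValueError("unsupported operand: " + operand)


def intel_to_att(line):
    """Convert e.g. 'mov eax, 1' (nasm) to 'movl $1, %eax' (gas)."""
    mnemonic, operands = line.strip().split(None, 1)
    dest, src = operands.split(",")
    # AT&T puts the source first and encodes the operand size in a suffix;
    # 'l' (32-bit) is assumed here because only 32-bit registers are handled.
    return "{}l {}, {}".format(
        mnemonic.lower(), _att_operand(src), _att_operand(dest)
    )


if __name__ == "__main__":
    for text in ("mov eax, 1", "add ebx, eax"):
        print(text, "->", intel_to_att(text))
```

Memory operands, which the sketch leaves out, differ as well: nasm's mov eax, dword [ebx+8] corresponds to movl 8(%ebx), %eax in gas, with the size carried by the instruction suffix rather than by a keyword on the operand.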

Fortunately, this adventure took me only several days and required editing just a couple of XML files which describe the target machine. Besides that, I also had to fix several dependencies on nasm which were hardcoded directly in the compiler's source code. My fixes are now ready for code review and will hopefully be integrated into the project's mainline repository sometime soon. It looks like this experience will be useful in my future Lestes endeavours, especially if I decide to retarget the compiler to amd64 or any other platform. BTW, nasm doesn't support anything but ia32 and amd64, so if Lestes can generate code for sparc64 one day, it will have to be assembled by gas and not nasm.

So how do you actually make Lestes generate gas assembler? It's just a matter of replacing the machine description XML files with those that are meant for gas. The following should do:

$ cd target/machine/i686/tm_description/
$ build_md.sh gas
$ cd ../../../..
$ make