Tuesday, August 31, 2010

R.I.P. Sun Microsystems Czech s.r.o.

Tonight, Sun Microsystems Czech s.r.o. will see its last sunset. Tomorrow, it will already be preying in the happy hunting grounds alongside Sun Microsystems Inc.


Mourning and funeral rites will take place at dusk at Legenda.

Wednesday, August 18, 2010

FAT on a diet

Some time ago, we received a ticket complaining that large files took too long to write to a FAT file system. Times of several minutes in a simulator were reported for writing a file occupying just a couple of megabytes. After some pondering, it became apparent that the culprit was the FAT file system itself, more precisely the way it keeps track of the blocks belonging to a file, combined with the fact that HelenOS did not do even the slightest thing to alleviate the situation.

Used blocks of each file on a FAT file system are tracked in a linked list of clusters, where a cluster represents a fixed number of adjacent blocks. The use of a linked list in the on-disk file system format constitutes an interesting problem for the operating system, especially when doing sequential I/O. Whenever someone writes to or reads from some offset within a file, the operating system needs to find the block for that offset. Without any optimizations, this means walking the linked list from its beginning until the cluster and the block with the wanted offset are reached. In the worst case, one needs to walk the entire list until the last block number is found, so looking up the block for a single file offset in a file of n blocks has time complexity O(n). Now imagine that the application does not do this once, but repeats it over and over, as happens during sequential I/O: loading an executable, creating an empty file, copying or printing out an existing one. In these scenarios, the naive implementation traverses the list again and again, each time starting from the beginning. Since each traversal is O(n), repeating it n times in the worst case gives an O(n^2) algorithm, where n is the number of blocks.
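
To make the naive lookup concrete, here is a minimal sketch (hypothetical types and names, not the actual HelenOS code) of resolving a file block number to its cluster by walking the chain from the file's first cluster every single time:

    #include <stdint.h>

    #define FAT_EOC  0x0ffffff8u  /* hypothetical end-of-chain marker */

    typedef struct {
        uint32_t *fat;                /* in-memory copy of the file allocation table */
        unsigned int blocks_per_cluster;
    } fat_fs_t;

    /* Walk the cluster chain from the file's first cluster to the cluster
     * holding file block bn. Costs O(n) link hops for the n-th block. */
    static uint32_t fat_block_to_cluster(fat_fs_t *fs, uint32_t first_cluster,
        uint32_t bn)
    {
        uint32_t clusters_to_skip = bn / fs->blocks_per_cluster;
        uint32_t cur = first_cluster;

        while (clusters_to_skip-- > 0) {
            cur = fs->fat[cur];       /* follow one link of the list */
            if (cur >= FAT_EOC)
                break;                /* offset lies past the end of the file */
        }

        return cur;
    }

Sequential reading of a file of n blocks calls such a lookup for every block, so roughly 1 + 2 + ... + n, i.e. about n^2/2, links get followed in total, which is exactly the quadratic behavior measured below.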

In comparison, most disk file systems make it possible to devise a constant-time algorithm for looking up a single block, thanks to an on-disk index of the blocks used by each file. FAT, as argued above, allows only a linear algorithm, because all it has is the linked list.
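
Purely for contrast (this is an illustration only, not the layout of any particular file system), a per-file index of used blocks loaded as a flat array would make the same lookup a single array access:

    #include <stdint.h>

    /* With an on-disk index of used blocks, loaded here as a flat array,
     * resolving a file block to its on-disk block is O(1). */
    static uint32_t indexed_block_lookup(const uint32_t *block_index,
        uint32_t bn)
    {
        return block_index[bn];
    }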

The test case in the ticket made me realize how unfortunate the situation was (I hereby admit that I had knowingly been underestimating and derogating the issue prior to this experience, as Jiri can attest), so I resolved to fix it at least for the most common cases.

The fixes were actually two quite straightforward optimizations, both merged in changeset:mainline,624. The first optimization remembers the last cluster used by the file, thus making the append operation work in constant time. This comes in handy when, for example, creating an empty file or making a copy of another. The second optimization is similar, but instead of the last cluster it remembers the cluster where the last I/O took place. This is useful for reading the whole file in linear instead of quadratic time with respect to the size of the file. The first optimization works independently of the second, while the second is also very useful for forward-going non-sequential I/O (i.e. when reads and writes are mixed with forward seeks), because the linked list traversal starts at the remembered cluster instead of at the very beginning of the list. Of course, the second optimization will not help that much when multiple clients do sequential I/O at different offsets in the same file, but let's face it (I am now going to derogate this issue again), that is not a common use case for a FAT file system anyway. And although both optimizations remember some cluster, both of them can also forget that information under some circumstances, so bear in mind that these are mere optimizations that will help most of the time, but not always.
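
The following sketch illustrates the idea behind the caching (hypothetical structure and field names; the actual code merged in changeset:mainline,624 may look quite different). The node remembers the cluster of the most recent I/O, and a later lookup resumes the chain walk from there whenever the requested block lies at or past the cached one; the last-cluster cache used for constant-time appends lives in the same structure:

    #include <stdbool.h>
    #include <stdint.h>

    /* fat_fs_t is the type from the first sketch above. */

    typedef struct {
        uint32_t first_cluster;   /* head of the file's cluster chain */

        /* Optimization 1: the file's last cluster, so that an append
         * can link a new cluster to the chain in O(1). */
        bool     lastc_cached;
        uint32_t lastc;

        /* Optimization 2: the cluster where the last I/O took place,
         * so that forward-going I/O can resume the walk from it. */
        bool     currc_cached;
        uint32_t currc_bn;        /* file block number of that cluster */
        uint32_t currc;
    } fat_node_t;

    static uint32_t fat_block_to_cluster_cached(fat_fs_t *fs,
        fat_node_t *node, uint32_t bn)
    {
        uint32_t want = bn / fs->blocks_per_cluster;
        uint32_t cur = node->first_cluster;
        uint32_t pos = 0;

        /* Resume from the cached cluster if it is at or before the
         * wanted one; the list is singly linked, so the walk can only
         * ever go forward. */
        if (node->currc_cached) {
            uint32_t cached = node->currc_bn / fs->blocks_per_cluster;
            if (cached <= want) {
                cur = node->currc;
                pos = cached;
            }
        }

        while (pos < want) {
            cur = fs->fat[cur];
            pos++;
        }

        /* Remember where this I/O ended up; the cache may later be
         * forgotten, e.g. when the cluster chain is truncated. */
        node->currc_cached = true;
        node->currc_bn = bn;
        node->currc = cur;

        return cur;
    }

With this in place, purely sequential I/O advances at most one cluster per lookup, which turns the whole pass over the file back into a linear-time operation.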

Below is a table and a chart that present some data I collected in order to measure and visualize the expected improvement. I basically did a couple of experiments using a 10M FAT file system created with the mkfat HelenOS utility. For file sizes of 1, 3, 5, 7 and 9 MiB, I measured the time to create a file of that size and the time required to read the same file in its entirety. All times are in seconds, and the commands I used to generate the desired workload were something like mkfile --size 5m /data/test and cat /data/test, respectively.


Size    Write A1  Write A2  Write A3  Write B1  Write B2  Write B3  Read A1  Read A2  Read A3  Read B1  Read B2  Read B3
1 MiB          6         6         6         8         8         9        5        5        5        9        9        9
3 MiB         21        22        24        40        41        42       18       20       17       38       37       38
5 MiB         36        35        34        94        90       101       29       28       28      106       99      101
7 MiB         51        52        51       219       202       230       39       39       39        -        -        -
9 MiB         66        69        67         -         -         -       53       53       52        -        -        -

(A "-" marks a combination of file size and operation for which no measurement was taken.)

Note that I did all my experiments in Qemu, using HelenOS mainline revisions 623 (Reads/Writes B1 - B3) and 624 (Reads/Writes A1 - A3). As you can see, for most file sizes I did three repeated measurements, while for some combinations of file size and operation there is no data. The missing data after Write B3 for the 7 MiB file is the result of a reboot forced on me by my wife, who started to threaten me because it was getting late and I was still behind the computer instead of in our bed. I tried to resume the tests the next day, but I could not reproduce the previous day's results, so I stopped the experiment there rather than mix inconsistent data (the new measurements were consistent with each other, as were the old ones, but the two groups did not agree with each other). I suspect the previous day's results could not be reproduced simply because of a different load on the host system running the Qemu simulator. By the way, I never said I was going to do a scientifically correct experiment, ok? And while I am at it, the measured times need to be taken with a grain of salt, as I basically used a slightly jerky stopwatch provided by my Linux box.


Knowing the weaknesses of my measurement method, I still think I got usable and persuasive results, as can be seen in the chart above. It clearly shows the quadratic behavior of both operations in revision 623 and the linear behavior in the fixed revision 624. Note that the lines were added and smoothed by the charting software to better illustrate the trends; the points in the chart correspond to the values in the table above.