Kernel Traffic #191 For 11 Nov 2002

By Zack Brown

Mailing List Stats For This Week

We looked at 1927 posts in 11542K.

There were 490 different contributors. 268 posted more than once. 180 posted last week too.

1. Speeding Up kmalloc() Kernel Function

26 Oct 2002 - 1 Nov 2002 (11 posts) Archive Link: "[PATCH,RFC] faster kmalloc lookup"

Topics: Efficiency

People: Manfred Spraul, Nikita Danilov, Marcus Alanen, Alan Cox

Manfred Spraul found a way to speed up the kmalloc() kernel function, by allowing it to quickly pass over memory regions that were obviously too small. He posted a patch against 2.5.44-mm5 and invited comment. Alan Cox asked about performance comparisons, and Manfred posted some numbers:

I've run my slab microbenchmark over the 3 versions:

The test reports the fastest time for 100 kmalloc calls in a tight loop (Duron 700). Loop/test overhead subtracted.

32-byte alloc:
current:         41 ticks
generic_fls:     56 ticks
bsrl:            54 ticks

4096 byte alloc:
current:         84 ticks
generic_fls:     53 ticks
bsrl:            54 ticks

40 ticks difference for -current between 4096 and 32 bytes - ~4 cycles for each loop. bit scan is 10 ticks slower for 32 byte allocs, 30 ticks faster for 4096 byte allocs.

No difference between generic_fls and bsrl - the branch predictor can easily predict all branches in generic_fls for constant kmalloc calls.

Nikita Danilov suggested, "Most kmalloc calls get constant size argument (usually sizeof(something)). So, if switch() is used instead of loop (and kmalloc made inline), compiler would be able to optimize away cache_sizes[] selection completely. Attached (ugly) patch does this." Marcus Alanen remarked, "Perhaps a compile-time test to check if the argument is a constant, and only in that case call your new kmalloc, otherwise a non-inline kmalloc call? With your current patch, a non-constant size argument to kmalloc means that the function is inlined anyway, leading to unnecessary bloat in the resulting image." Nikita agreed with this, and Manfred said:

I agree, I have an old patch that does that.

Please ignore the part about fixed point division, it's not needed - the 'div' instruction is now outside of the hot path.

The problem is that the -mm tree contains around 10 slab patches, I want to see them in Linus' tree before I add further patches.
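
To make the trade-offs in this thread concrete, here is a minimal userspace sketch of the three lookup strategies that were discussed: a linear scan over the size table, a bit-scan ("find last set") lookup, and a compile-time dispatch on constant arguments. It is not Manfred's or Nikita's patch. The size classes are assumed to be plain powers of two, all function and macro names are hypothetical, and `__builtin_constant_p` is a GCC/Clang extension rather than standard C.

```c
#include <stddef.h>

/* Hypothetical power-of-two size classes, 32 bytes up to 128 KiB.
 * The kernel's real cache_sizes[] table differs; illustration only. */
#define MIN_SHIFT   5
#define NUM_CLASSES 13

/* Linear scan: the cost grows with the requested size. */
static int class_by_loop(size_t size)
{
    for (int i = 0; i < NUM_CLASSES; i++)
        if (size <= ((size_t)1 << (MIN_SHIFT + i)))
            return i;
    return -1; /* larger than the biggest class */
}

/* Portable stand-in for "find last set": 1-based index of the highest
 * set bit, 0 for x == 0. Real code would use a bit-scan instruction. */
static int fls_sketch(size_t x)
{
    int bit = 0;
    while (x) {
        bit++;
        x >>= 1;
    }
    return bit;
}

/* Bit-scan lookup: the class index comes from the position of the
 * highest set bit of (size - 1), independent of the table length. */
static int class_by_fls(size_t size)
{
    int idx;

    if (size == 0)
        return 0;
    idx = fls_sketch(size - 1) - MIN_SHIFT;
    if (idx < 0)
        idx = 0;
    return idx < NUM_CLASSES ? idx : -1;
}

/* Compile-time dispatch in the spirit of Marcus Alanen's suggestion:
 * constant sizes take a path the compiler can fold away, everything
 * else goes through the generic lookup. GCC/Clang extension. */
#define size_class(size) \
    (__builtin_constant_p(size) ? class_by_loop(size) : class_by_fls(size))
```

With optimization enabled, a call such as `size_class(sizeof(struct foo))` can reduce to a constant, while a run-time size still costs only one bit scan.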

2. Dynamically Growing ext2 And ext3 Filesystems

29 Oct 2002 - 2 Nov 2002 (6 posts) Archive Link: "[PATCH] 2/11 Ext2/3 Updates: Extended attributes, ACL, etc."

Topics: FS: ext2, FS: ext3

People: Theodore Y. Ts'o, Jeff Garzik, Alexander Viro, Stephen C. Tweedie, Andreas Dilger

Theodore Y. Ts'o posted a patch and explained, "This patch allows forward compatibility with future filesystems which are dynamically grown by using an alternate algorithm for storing the block group descriptors. It's also a bit more efficient, in that it uses just a little bit less disk space. Currently, the ext2 filesystem format requires either relocating the inode table, or reserving space before doing the on-line resize." He gave a link to a USENIX article called Planned Extensions to the Ext2/3 Filesystem, by himself and Stephen C. Tweedie. Jeff Garzik asked, "Is the interface for this going to be ext2meta? Al and sct seemed to agree that that was the best way to act upon the filesystem metadata while it's online... I'll probably be updating that for 2.5.x VFS changes in a few weeks, that will provide safe online defrag and a good interface for other metadata interaction." And Theodore replied:

I'm not sure ext2meta will be sufficient. It's not just a matter of modifying the on-disk metadata, as would be needed for defrag, but I would also need to modify some of the in-core data structures in the ext2/3 filesystem data structures. For example, when you resize the filesystem, you need to increase the number of group descriptors, which means you need to kmalloc, copy, and then kfree sbi->group_desc out from under the mounted filesystem.

No doubt ext2meta could be modified so it could "reach out and touch" internal ext2/3 filesystem data structures in core. But the locking issues involved get really messy.

My original plan was to adapt Andreas Dilger's on-line resizing patch to use the new block group layout, which would obviate the need to take the filesystem off-line and run ext2prepare first. I'm not opposed to trying to do it via ext2meta, but it seems like it might get complicated and hairy quite quickly.

Alexander Viro remarked, "For all practical purposes, ext2meta is part of ext2 - same driver, two filesystem types. Locking isn't that scary, BTW - I'd looked into that some time ago and it looked feasible." Jeff agreed with this assessment, and said he'd post a patch in the next couple weeks.
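
The "kmalloc, copy, and then kfree" step Theodore described can be sketched in userspace C as follows. This is only an illustration under stated assumptions: `struct sb_info_sketch` and `grow_group_desc()` are hypothetical names, `calloc()`/`free()` stand in for the kernel allocator, and no locking is shown, which is exactly the part the thread considered hard.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the in-core superblock info; not the
 * real ext2 structure. */
struct sb_info_sketch {
    void **group_desc;   /* array of pointers to descriptor blocks */
    size_t desc_count;   /* number of entries in group_desc */
};

/* Grow the table to new_count entries: allocate, copy, swap, free.
 * Returns 0 on success, -1 on failure (the table is left untouched). */
static int grow_group_desc(struct sb_info_sketch *sbi, size_t new_count)
{
    void **old = sbi->group_desc;
    void **bigger;

    if (new_count <= sbi->desc_count)
        return -1;
    bigger = calloc(new_count, sizeof(*bigger));
    if (!bigger)
        return -1;
    if (sbi->desc_count)
        memcpy(bigger, old, sbi->desc_count * sizeof(*bigger));

    /* On a mounted filesystem this swap is the dangerous step: any
     * reader still holding the old pointer would touch freed memory. */
    sbi->group_desc = bigger;
    sbi->desc_count = new_count;
    free(old);
    return 0;
}
```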

3. New SquashFS Highly Compressed Filesystem

29 Oct 2002 - 2 Nov 2002 (16 posts) Archive Link: "ANNOUNCEMENT: Squashfs released (a highly compressed filesystem)"

Topics: Compression, FS: SquashFS, FS: ramfs, Samba, Small Systems

People: Phillip Lougher, Rob Landley, Samuel Flory, Jeff Garzik

Phillip Lougher announced:

First release of squashfs. Squashfs is a highly compressed read-only filesystem for Linux (kernel 2.4.x). It uses zlib compression to compress both files, inodes and directories. Inodes in the system are very small and all blocks are packed to minimise data overhead. Block sizes greater than 4K are supported up to a maximum of 32K.

Squashfs is intended for general read-only filesystem use, for archival use, and in embedded systems where low overhead is needed.

Squashfs is available from

The patch file is currently against 2.4.19. There is further info on the filesystem design etc. in the README.

Samuel Flory asked how squashfs compared with the existing cramfs project, and Phillip explained:

Cramfs was the inspiration for squashfs. Squashfs basically gives better compression, bigger files/filesystem support, and more inode information.

  1. Blocks up to 32K are supported - data is compressed in units of 32K which achieves better compression ratios than compressing in 4K blocks. Generally using blocks bigger than 4K is a bad idea, because the VFS calls the filesystem in 4K pages. Squashfs explicitly pushes the extra block data into the page cache.
  2. Squashfs compresses inode and directory information in addition to file data. Inodes/directories generally compress down to 50%, or say on average 8 bytes or less per inode.
  3. All fs data is packed on byte alignments, saving a couple of bytes per inode and directory.
  4. Full 32 bit uids/gids are stored (4 bits stored in inode, uses a lookup table, to give 48 uids/16 gids). File sizes up to 2^32 are supported. Timestamp info is stored. Cramfs truncates uids to 16 bits, gids to 8 bits. Cramfs file sizes are up to 2^24. No timestamp info. Squashfs takes advantage of metadata compression to have more info with smaller metadata overhead.
  5. Symbolic link contents/file indexes are stored inside the inode table, giving better compression than if they were compressed individually, or not compressed.
  6. The mksquashfs program doesn't store/mmap the filesystem as it is created (it performs file duplicate checking against the partially written out compressed fs), and so allows larger filesystems to be created.

Further info on the fs is contained in the README...
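
Point 4 of Phillip's list, storing a small index in the inode and the full 32-bit id in a shared table, might look roughly like the sketch below. It models a single 16-entry table addressed by a 4-bit index. The actual Squashfs on-disk layout is not described in the post, so the structure and function names here are assumptions made purely for illustration.

```c
#include <stdint.h>

#define ID_SLOTS 16  /* a 4-bit index addresses 16 slots (assumed) */

/* Hypothetical shared table of full 32-bit ids. */
struct id_table_sketch {
    uint32_t ids[ID_SLOTS];
    unsigned used;
};

/* Return the 4-bit index for a 32-bit id, adding the id if it is new.
 * Returns -1 once the table is full, which is the price of the scheme:
 * only a limited number of distinct owners fit in one filesystem. */
static int id_to_index(struct id_table_sketch *t, uint32_t id)
{
    for (unsigned i = 0; i < t->used; i++)
        if (t->ids[i] == id)
            return (int)i;
    if (t->used == ID_SLOTS)
        return -1;
    t->ids[t->used] = id;
    return (int)t->used++;
}

/* Recover the full 32-bit id from the index stored in an inode. */
static uint32_t index_to_id(const struct id_table_sketch *t, unsigned index)
{
    return t->ids[index & (ID_SLOTS - 1)];
}
```

Each inode then spends 4 bits instead of 32 on ownership, while ids are not truncated the way the post says Cramfs truncates them.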

Jeff Garzik mentioned that a read/write compressed filesystem would also be really great. Phillip replied:

A r/w compressed filesystem may be my next project... As a couple of people have mentioned there are compressed r/w filesystems already out there.

As you'll know, there are always tradeoffs with filesystem design, it is very difficult to get as good compression with a r/w fs as with a read only filesystem. I wanted to get maximum compression, and quite a few of the techniques I use rely on its read-only nature.

An append only (i.e. files can be added, but not modified), fs might be a useful compromise. With compressed metadata, any modification of files will inevitably achieve different compression ratios, and so modification of metadata/files in place is not an option. Appending modified metadata/data brings you to log-structured (journalling) filesystems and compaction (log cleaning) requirements with consequent loss of compression.

Rob Landley replied:

A compressed filesystem with dynamically updated random-access files will fragment the heck out of itself darn fast. (Seek into the middle of a file somewhere, write a block, seek somewhere else, write another block, repeat 1000 times... Keep in mind that the new compressed block of data will almost certainly not be the same size as the old one... It's a mess.)

My potential usage is: I've got a little linux distribution I put together called "firmware linux", which builds from source and outputs a zisofs image that gets loopback mounted as the root filesystem. (A very alpha version could be sucked off of, edit "" to specify the output directory, and then run it. The point is, the whole OS and all applications can be upgraded as one file. No package management, it's basically a big firmware image.)

The reason I used a zisofs instead of cramfs is that cramfs has a LOT of problems with big filesystems. (The finished root partition, with apache and samba and ntop and python and rsync and openssh and everything, is currently around 100 megs. Yeah, I can trim that down by quite a bit if I get time, I'm currently compiling and developing it under itself so I have gcc in there and the full set of man pages and everything...)

Mkcramfs seems to barf at somewhere around 16 megs, which is really limiting. AND it seems to try to open every file simultaneously (hardlink detection?) so it runs out of file handles. (Again, that could be adjusted under /proc somewhere, but isn't worth it.)

So it seems that the thing to test this against isn't cramfs, but zisofs. Have you looked at that?

I'll take a look myself and get back to you...

And Phillip said:

I looked at both cramfs and zisofs when writing squashfs. Zisofs is a nice fs, but tends to have greater overhead due to the isofs filesystem. On tests I've found zisofs images to be between 5% (a single directory with lots of small binaries) and 61% (lots of nested directories) bigger than the squashfs filesystem.

I believe that squashfs is useful for what you're doing. I'm a bit hesitant in saying that, because I'd rather people downloaded it and made up their own minds :-)

Thank you for trying it out, and I hope you like it. I'll obviously be interested in your thoughts on it.

4. Linux 2.5.45 Released

30 Oct 2002 - 1 Nov 2002 (22 posts) Archive Link: "Linux v2.5.45"

Topics: Device Mapper, Disk Arrays: LVM, Networking, Sound: ALSA, USB, Virtual Memory

People: Linus Torvalds, Alexander Viro, Roman Zippel, Aaron Lehmann

Linus Torvalds announced 2.5.45 and said:

Big changes, lots of merges. A number of the merges are fairly substantial too.

Device mapper (LVM2), crypto/ipsec stuff for networking, epoll and giving the new kernel configurator a chance. Big things.

And a _lot_ of maintenance, from various architecture updates to USB and ISDN and ALSA. Merges with Andrew & Alan etc.. Go out and test

Aaron Lehmann pointed out that 'make oldconfig' and 'make menuconfig' now depended on the QT graphics library, which made no sense to him. Alexander Viro remarked, "Remove "false" from the rule that spits out annoying shit about absence of QT (_yes_, I _know_ that I don't have that shite installed, thank you very much for reminder). Doesn't solve the annoyance problem, though." And Roman Zippel, close by, said, "Yes, it's a bug. The patch below fixes this without breaking xconfig. Linus, please apply."

5. Kconfig Documentation

31 Oct 2002 - 2 Nov 2002 (22 posts) Archive Link: "Where's the documentation for Kconfig?"

Topics: Kernel Build System

People: Rusty Russell, Roman Zippel, Russell King, Matthew Wilcox, Christoph Hellwig

Matthew Wilcox asked where he could find documentation for the new kconfig configuration system recently added to the kernel. Roman Zippel gave a pointer to it. Christoph Hellwig suggested updating Documentation/kbuild/config-language.txt with the new information, and Roman replied that he'd get to that soon. Rusty Russell remarked, "Doco is great, and it'd be nice to replace what's there, but I think it's remarkably easy to use in a monkey-see-monkey-do fashion, which is *really* good because that's how people will use it. Plus, I never realized how slow the old "make oldconfig" was."

Elsewhere, Russell King asked if any tool had been written to convert a file to a kconfig equivalent. He pointed out that the existing lkcc tool was too extensive, converting whole trees, but not individual files on their own. Russell had some partially converted directories that were in a somewhat inconsistent state, that lkcc wouldn't work on. Roman suggested, "You could put it into arch/tmp/ and do 'lkcc tmp'. But converting the whole tree is the preferred solution, because lkcc needs all the type information of every symbol used in the config file to do a good job. The easiest solution is probably to get the 2.5.44 patch from my page, generate a diff to your converted 2.5.44 tree and apply this patch to 2.5.45. If you send me a 2.5.44 patch of your tree, I can do it for you." This worked for Russell.

6. Status Of Xiafs In 2.5

31 Oct 2002 - 1 Nov 2002 (11 posts) Archive Link: "Xiafs inclusion in 2.5?"

Topics: FS: XiaFS, Feature Freeze, Forward Port

People: Carl-Daniel Hailfinger, Linus Torvalds, Andries Brouwer, H. Peter Anvin

Carl-Daniel Hailfinger quoted a post from Linus Torvalds back in the year 2000, when he said, "Who still remembers xiafs? We have 33 different filesystems in the kernel tree - something that is quite impressive, and something that I don't think anybody else has ever tried to support. But we could have had 34.." Carl replied now, "Out of curiosity, would you reaccept xiafs in 2.5, if it was cleaned up and forward ported to use the new interfaces? And if you accept it, what's the latest date I could submit it? Technically, it is a regression, ;-) so the feature freeze date might not apply." H. Peter Anvin felt there would be no point whatsoever to this, but Carl replied that it would be fun! Andries Brouwer looked around his house and found an old floppy with an Xiafs format. He said he'd like to be able to read it again without booting a 2.0 kernel. Elsewhere, Linus replied to Carl's initial post, "Quite frankly, I probably _would_ accept it, if it's cleanly done. If only because of the fact that it's such a ridiculous thing to do, and thus gets high points on my "surreality meter"." And he added, "Yeah, I think xiafs has little to do with a feature freeze. It has little to do with sanity too, for that matter. I saw that Andries still has one xia floppy somewhere, and that probably puts him in a rather unique position. I can't imagine that very many people really care, but it's an ironic form of retrocomputing..." This inspired Carl to ask Andries for an image of his floppy, and Andries gave him a pointer.

7. Swap-Space Mini-Howto Documentation

1 Nov 2002 - 5 Nov 2002 (20 posts) Archive Link: "[announce] swap mini-howto"

People: Gabor MICSKO, Randy Dunlap

Randy Dunlap was surprised to find no mini-howto on the web that covered swap-space, so he created one and gave a temporary link to it. A number of folks also expressed surprise at not finding a similar doc anywhere. Various people read Randy's doc and offered suggestions; and a number of folks said that the Linux Documentation Project was the best place to put it. At one point Gabor MICSKO also said, "I translated this doc to Hungarian language. You can read the translated doc at this url:"

8. Status Of initramfs

2 Nov 2002 - 4 Nov 2002 (18 posts) Archive Link: "[BK PATCHES] initramfs merge, part 1 of N"

Topics: FS: NFS, FS: initramfs, FS: ramfs, FS: rootfs, Klibc

People: Jeff Garzik

Jeff Garzik said:

The attached below is the first of several changes for initramfs / early userspace.

This change is intentionally very simple, not really proving its worth until next week when patches 2 and 3 in this series arrive in your inbox. A description of "the future" follows description of this specific cset.

  1. Introduce init/initramfs.c itself, which is a module that uncompresses a .cpio.gz archive, and uses it to populate rootfs with files very early in the bootup process (between signals_init and proc_root_init in init/main.c). People will see a small listing in dmesg of unpacked files. We need to keep this for now (and for now it's small), but we may want to remove this output or turn the knob down to KERN_DEBUG before 2.6.x release:

       -> file1
       -> file2
       -> etc...

    (architecture maintainers note!)

  2. Introduce ARCHBLOBLFLAGS in arch/$arch/Makefile, for turning an arbitrary binary object into a .o file using objcopy.
  3. Link the initramfs cpio archive in vmlinux image via arch/$arch/, in the init section.
  4. Introduce the new linux/usr directory. Currently it is not very interesting, only containing a small host-built proggie that generates the initial cpio archive, gen_init_cpio. This program will go away when early userspace is further along. It currently exists to show initramfs is working, by allowing us to remove three simple lines from init/do_mounts.c.
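
For readers curious what a generated cpio archive contains, the sketch below formats a single entry header, assuming the ASCII "newc" cpio layout (a 6-character magic followed by 13 fields of 8 hex digits each). Whether gen_init_cpio emitted exactly this at the time is an assumption on our part. `newc_header()` and `pad4()` are hypothetical helper names, and uid, gid, and timestamps are simply zeroed.

```c
#include <stdio.h>
#include <string.h>

#define NEWC_HEADER_LEN 110  /* 6-byte magic + 13 fields * 8 hex digits */

/* Format one "newc" header into buf, which must hold at least
 * NEWC_HEADER_LEN + 1 bytes. Returns the number of characters written. */
static int newc_header(char *buf, size_t buflen, unsigned ino,
                       unsigned mode, unsigned filesize, const char *name)
{
    return snprintf(buf, buflen,
                    "%s%08X%08X%08X%08X%08X%08X%08X"
                    "%08X%08X%08X%08X%08X%08X",
                    "070701",   /* magic for the newc format */
                    ino, mode,
                    0u, 0u,     /* uid, gid */
                    1u,         /* nlink */
                    0u,         /* mtime */
                    filesize,
                    0u, 0u,     /* device major, minor */
                    0u, 0u,     /* rdev major, minor */
                    (unsigned)strlen(name) + 1, /* name length incl. NUL */
                    0u);        /* checksum field, unused here */
}

/* Header plus name, and file data, are each padded to 4-byte boundaries. */
static size_t pad4(size_t n)
{
    return (n + 3) & ~(size_t)3;
}
```

An archive is a sequence of such header, name, and data records, ending with an entry named `TRAILER!!!`.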

He went on to describe future development:

Early userspace is going to be merged in a series of evolutionary changes, following what I call "The Al Viro model." NO KERNEL BEHAVIOR SHOULD CHANGE. [that's for the lkml listeners, not you <g>] "make" will continue to simply Do The Right Thing(tm) on all platforms, while the kernel image continues to get progressively smaller. Here is the initial plan for early userspace, i.e. the patches you are going to be seeing next week:

#2 - merge klibc.

As I said earlier, I am not sure if we will wind up removing klibc just before 2.6.x release or not. Comments welcome. But for now, klibc will be merged into the kernel tarball, because otherwise version drift during the evolution of early userspace will be a huge PITA, and slow things down. It is a tiny libc written specifically for the kernel.

This patch will add klibc to the build system, and create a tiny, statically-linked binary "kinit". kinit is the beginnings of early userspace. Some tiny, token amount of do_mounts.c code will be moving into kinit in patch #2, only enough to prove the system is working.

#3 - move initrd to userspace

Unfortunately we don't start seeing tangible benefits to early userspace until this patch, but that's how evolution works :) Here, initrd unpacking code is moved to userspace, as much as possible. Some initrd code will inevitably stay in the kernel, because it is arch-specific how to grab the initrd image from bootmem [or wherever], but the vast majority of initrd code goes poof (yay!). No initrd behavior will change at all, from current kernels. It is simply getting moved to early userspace. Users will not need to do anything on their end to make sure their existing setups continue to work -- any such actions are a bug on my part.

This patch will also turn "kinit" into a shared binary, and introduce the gzip binary into early userspace. [see "Items For Discussion" below, too, WRT this.]

#4 - move mounting root to userspace

People probably breathed a sigh of relief at patch #3, they will heave a bigger sigh for this patch :) This moves mounting of the root filesystem to early userspace, including getting rid of NFSroot/bootp/dhcp code in the kernel.

#N - to infinity... and beyond!

I, and hopefully others, will continue in the series of evolutionary patches, moving more and more stuff to early userspace. There are a lot of possibilities, and I will be looking for input from others on useful things to move, as well as continuing my own work of finding items that can be moved.

9. New Open POSIX Test Suite

4 Nov 2002 - 5 Nov 2002 (24 posts) Archive Link: "[ANNOUNCE] Open POSIX Test Suite"

Topics: POSIX, Version Control

People: Geoff Gustafson, Larry McVoy, Jeff Garzik, Andreas Dilger, Rik van Riel

Geoff Gustafson announced:

I would like to announce a new project to develop and/or assemble a GPL test suite for POSIX APIs. The tests will focus on conformance to the IEEE Std 1003.1-2001, but will also include separate functional and stress tests.

The project's current approach to conformance testing is to record assertions from a close reading of the POSIX specifications, and write minimal test cases that prove or disprove these assertions. The test suite will be independent of specific API implementations, and will eventually be easily configurable to work with different implementations. The project aims for OS independence, using only POSIX APIs, the autoconf suite, and simple shell support. However, it is currently only being tested on Linux.

Ultimately, the plan is to use the test suite to evaluate current support in Linux, as well as new implementations being considered in the open source community, and then contribute patches or at least bug reports (with a minimal test case) to the appropriate places, like LKML.

Contributions of any test cases, review of the work, discussion of the approach, etc. are very welcome. Join the development mailing list, posixtest-discuss. The initial focus is on Signals, Message Queues, Threads, Semaphores, and Clocks & Timers, based on current interests and resources. You can help in these areas, or start work on another area of the spec. There will need to be some uniformity across the suite, but many details have yet to be worked out, so your involvement in those decisions would help a lot.

For more information, see the project website at
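
The approach Geoff described, one recorded assertion paired with one minimal test case, could look something like the sketch below. It is not taken from the project. The two assertions are our own paraphrases of the POSIX signal-set functions, and the function names and the `TEST_PASS`/`TEST_FAIL` convention are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <signal.h>

#define TEST_PASS 0
#define TEST_FAIL 1

/* Assertion (paraphrased): after sigemptyset(), no signal is a
 * member of the set. */
static int test_sigemptyset_excludes(void)
{
    sigset_t set;

    if (sigemptyset(&set) != 0)
        return TEST_FAIL;
    return sigismember(&set, SIGINT) == 0 ? TEST_PASS : TEST_FAIL;
}

/* Assertion (paraphrased): sigaddset() makes the given signal a
 * member of the set and leaves other signals out. */
static int test_sigaddset_includes(void)
{
    sigset_t set;

    if (sigemptyset(&set) != 0 || sigaddset(&set, SIGUSR1) != 0)
        return TEST_FAIL;
    if (sigismember(&set, SIGUSR1) != 1)
        return TEST_FAIL;
    return sigismember(&set, SIGUSR2) == 0 ? TEST_PASS : TEST_FAIL;
}
```

Because each test uses only POSIX calls and reports a plain pass or fail, a driver can run it unchanged against any implementation.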

Larry McVoy replied:

Great idea. We can help in the following way: BitKeeper has an extremely simple test harness used for regressions. It's well thought out in that it is trivial to write simple tests and run them in isolation or to run the whole suite. If you want the harness, we'll give it to you under whatever license you want, I assume GPL, but we don't care.

You can see what the tests look like in BK, if you have it installed, we ship all the tests, they are in `bk bin`/t

A simple test might be


        # test that touch creates a file
        touch foo
        test -f foo || {
                echo failed to create foo
                exit 1
        }

The harness takes care of putting you in a clean isolated environment.

Geoff said this would be great, and that he'd check out BitKeeper.

Elsewhere, Jeff Garzik speculated:

I wonder if any vendors, or independent groups, would be interested in maintaining a POSIX compliancy patchkit for the Linux kernel?

IMO such a "POSIX Linux" project would be useful for several reasons. Overall, I think there is pressure from several directions to get all sorts of POSIX APIs into the kernel. On occasion, kernel hackers are confronted with a situation where complete POSIX compliancy may mean a compromise in some area, be it performance, security, API issues, code cleanliness issues, etc. Or simply that the POSIX-related code just isn't ready to be merged into the mainline kernel yet.

The vendors also benefit by this, because the barrier to entry in POSIX-related cases would be lowered, which would in turn satisfy the demands of customers. Which would in turn give the mainline kernel all the software engineering benefits that come from a more reasoned and gradual review and merge of new features.

Does something like this already exist? This would need to be an open, vendor-neutral project...

Rik van Riel volunteered to help, and Geoff added, "I agree this sounds very useful. I could do something like this as part of the test suite project; this would expand the scope to include testing and reporting the status of the latest patches." Andreas Dilger suggested that it might be better to use the existing test suite from X/Open; but a couple folks (including Geoff) pointed out that the X/Open tests of POSIX extensions more recent than 1990 were not free.

A number of folks suggested that Geoff's work really belonged in the LTP project. Geoff pointed out that LTP seemed to concern itself with testing interfaces that had already been implemented in the kernel; while his project focused on testing interfaces that had not yet been implemented, and testing them whether in the kernel or user-space. But several folks from LTP said LTP would love to extend their test suite in this area. The thread ended inconclusively.

10. Paying For Patches

5 Nov 2002 - 6 Nov 2002 (5 posts) Archive Link: "PATCH: Driver Maintainers"

People: Alan Cox, Linus Torvalds

Alan Cox posted an odd patch and explained:

I've been getting more and more people talking to me looking to pay people to fix small Linux bugs but having problems finding smaller companies. Obviously wanting to send $1000 to have someone fix a driver simply doesn't work when you talk to big companies.

One thing the FSF do which is rather sensible is keep a list in the packages of people who you can pay to fix stuff in them. I asked on Linux-kernel and got a small initial set of company responses. Hopefully more will appear once it's merged.

The order is alphabetical logically enough

[Marcelo this seems to apply cleanly to 2.4 as well]

Linus Torvalds replied:

I would really prefer for there to be some kind of explicit requirements for this. Even if we don't endorse the thing, I'd hate to have a bad egg or two (assuming this expands a lot, which I think it might) causing trouble.

I'd also like for it to be explicitly only for individuals or small companies ( "less than x people" ), or some other way make sure that the thing is balanced and we set people's expectations right (both users of the list as well as people who want to be on the list).

Also, is the kernel source really the right place for this, considering that many people will have sources that are years old and there is no way to remove potential problematic entries from already-released kernels? In other words, wouldn't it be better to have some nice place on the web and a pointer to that in the kernel sources?

Alan said:

Fair comment. I can happily put it on a web site with a pointer instead any preferences to a location like or just 'wherever'

Splitting it up by company is easy enough to do - split the web page into "Interested in contracts below $1000, $10000, $100000, $1M, ..." sections

One could construct a verifiable non-repudiable rating scheme I guess. That depends if it's worth it

"People who paid for bug fixes in the 3c501 driver also bought MacIIfx support contracts..."

11. Graphing Kernel Development

6 Nov 2002 (1 post) Archive Link: "Charts of the evolution of 2.5"

Topics: Feature Freeze

People: Guillaume Boissiere

Guillaume Boissiere said:

I put together some graphs showing the evolution of features for the 2.5 kernel here:

Since most features evolve over time as opposed to being a one-time deal, it doesn't pretend to be fully accurate but it does give a good sense of the development lifecycle.

Funny how the rate of merges grew rapidly just before feature freeze :-)







Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.