Kernel Traffic

Kernel Traffic #208 For 7 Mar 2003

By Zack Brown

Mailing List Stats For This Week

We looked at 1867 posts in 8755K.

There were 430 different contributors. 248 posted more than once. 169 posted last week too.

1. Minutes From Kernel Conference Call

21 Feb 2003 - 1 Mar 2003 (297 posts) Archive Link: "Minutes from Feb 21 LSE Call"

Topics: SMP, Version Control, Virtual Memory

People: Hanna Linder, Larry McVoy, Martin J. Bligh, Alan Cox, Gerrit Huizenga, Cliff White, Ben LaHaise, Dave McCracken, Rik van Riel, Andrew Morton

Hanna Linder said:

LSE Con Call Minutes from Feb21

Minutes compiled by Hanna Linder, please post corrections to

Object Based Reverse Mapping:

(Dave McCracken, Ben LaHaise, Rik van Riel, Martin Bligh, Gerrit Huizenga)

Dave coded up an initial patch for partial object based rmap which he sent to linux-mm yesterday. Rik pointed out there is a scalability problem with the full object based approach. However, a hybrid approach between regular rmap and object based may not be too radical for 2.5/2.6 timeframe.

Ben said none of the users have been complaining about performance with the existing rmap. Martin disagreed and said Linus, Andrew Morton and himself have all agreed there is a problem. One of the problems Martin is already hitting on high cpu machines with large memory is the space consumption by all the pte-chains filling up memory and killing the machine. There is also a performance impact of maintaining the chains.

Ben said they shouldn't be using fork; bash is the main user of fork and should be changed to use clone instead. Gerrit said bash is not used as much as Ben might think on these large systems running real world applications.

Ben said he doesn't see the large systems problems with the users he talks to and doesn't agree the full object based rmap is needed. Gerrit explained we have very complex workloads running on very large systems and we are already hitting the space consumption problem which is a blocker for running Linux on them.

Ben said none of the distros are supporting these large systems right now. Martin said UL is already starting to support them. Then it degraded into a distro discussion and Hanna asked for them to bring it back to the technical side.

In order to show the problem with object based rmap you have to add vm pressure to existing benchmarks to see what happens. Martin agreed to run multiple benchmarks on the same systems to simulate this. Cliff White of the OSDL offered to help Martin with this.

At the end Ben said the solution for now needs to be a hybrid with existing rmap. Martin, Rik, and Dave all agreed with Ben. Then we all agreed to move on to other things.

*ActionItem - someone needs to change bash to use clone instead of fork.

Scheduler Hang as discovered by restarting a large Web application multiple times:

Rick Lindsley / Hanna Linder

We were seeing a hard hang after restarting a large web serving application 3-6 times on the 2.5.59 (and up) kernels (also seen as far back as 2.5.44). It was mainly caused when two threads each have interrupts disabled and one is spinning on a lock that the other is holding. The one holding the lock has sent an IPI to all the other processors telling them to flush their TLBs. But the one waiting for the spinlock has interrupts turned off and does not receive that IPI request. So they both sit there waiting forever.
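The hang described above can be reproduced in a toy model. The sketch below is not kernel code: the structure, the function names and the step-by-step scheduling are all assumptions made for illustration. It models only the two facts given in the minutes: the lock holder waits for an IPI acknowledgement, and the spinning CPU cannot take that IPI while its interrupts are off.

```c
#include <stdbool.h>

/* Toy model of the hang: CPU A holds a spinlock and waits for CPU B to
 * acknowledge a TLB-flush IPI; CPU B spins on the same lock. */
struct model {
    bool lock_held_by_a;   /* A owns the spinlock */
    bool ipi_pending_on_b; /* A's flush request has not been handled yet */
    bool b_irqs_enabled;   /* can B take the IPI while it spins? */
};

/* Advance both CPUs by one step; return true once B can take the lock. */
static bool step(struct model *m)
{
    /* CPU B: an interrupt is only delivered if B has interrupts enabled. */
    if (m->ipi_pending_on_b && m->b_irqs_enabled)
        m->ipi_pending_on_b = false;      /* B flushes its TLB and acks */

    /* CPU A: it releases the lock only after the IPI has been acked. */
    if (m->lock_held_by_a && !m->ipi_pending_on_b)
        m->lock_held_by_a = false;

    /* CPU B: the spin succeeds as soon as the lock is free. */
    return !m->lock_held_by_a;
}

/* Return the number of steps B needed to get the lock, or -1 if it never did. */
int steps_until_b_gets_lock(bool b_irqs_enabled, int max_steps)
{
    struct model m = { true, true, b_irqs_enabled };

    for (int i = 1; i <= max_steps; i++)
        if (step(&m))
            return i;
    return -1; /* both CPUs are still waiting on each other */
}
```

With interrupts enabled on the spinning CPU the model makes progress at once; with them disabled it never does, no matter how many steps are allowed.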

The final fix will be in mainline kernel version 2.5.63. Here are the individual patches which should apply with fuzz to older kernel versions:

Shared Memory Binding :

Matt Dobson -

Shared memory binding API (new). A way for an application to bind shared memory to Nodes. Motivation is for large databases support that want more control over their shared memory.

The current allocation scheme is that each process gets a chunk of shared memory from the same node the process is located on. Instead of page faulting around to different nodes dynamically, this API will allow a process to specify which node or set of nodes to bind the shared memory to.

Work in progress.

Martin - gcc 2.95 vs 3.2.

Martin has done some testing which indicates that gcc 3.2 produces slightly worse code for the kernel than 2.95 and takes a bit longer to do so. gcc 3.2 -Os produces larger code than gcc 2.95 -O2. On his machines -O2 was faster than -Os, but on a cpu with smaller caches the inverse may be true. More testing may be needed.

To Ben LaHaise's statement that none of the Linux distributions were supporting the really big systems, Larry McVoy said:

Ben is right. I think IBM and the other big iron companies would be far better served looking at what they have done with running multiple instances of Linux on one big machine, like the 390 work. Figure out how to use that model to scale up. There is simply not a big enough market to justify shoveling lots of scaling stuff in for huge machines that only a handful of people can afford. That's the same path which has sunk all the workstation companies, they all have bloated OS's and Linux runs circles around them.

In terms of the money and in terms of installed seats, the small Linux machines out number the 4 or more CPU SMP machines easily 10,000:1. And with the embedded market being one of the few real money makers for Linux, there will be huge pushback from those companies against changes which increase memory footprint.

Alan Cox said he thought people generally vastly overestimated the number of multi-processor machines on the market. He pointed to some big-machine bugs that had gone unnoticed for long periods of time as evidence of this. Elsewhere, Martin J. Bligh pointed out to Larry that multiple instances of the OS running on a single machine would not be as good a solution as Larry hoped. Martin said, "this doesn't work in practice. Workloads may not be easily divisible amongst machines, and you're just pushing all the complex problems out for every userspace app to solve itself, instead of fixing it once in the kernel." There followed a huge, unfocused debate about the profitability of the computer hardware industry.

2. Status Of GCC 3.3

24 Feb 2003 - 28 Feb 2003 (32 posts) Archive Link: "[PATCH] s390 (7/13): gcc 3.3 adaptions."

Topics: Compiler

People: Richard B. Johnson, Arnd Bergmann, Linus Torvalds, Alan Cox, Andreas Schwab, Martin Schwidefsky

Martin Schwidefsky posted some patches to allow Linux to be compiled for the s390 architecture, using GCC version 3.3 pre-releases. One of his modifications entailed stopping the compiler from warning about comparisons between signed and unsigned numbers. Richard B. Johnson objected, "I think you must keep these warnings in! There are many bugs that these uncover including loops that don't terminate correctly but seem to work for "most all" cases. These are the hard-to-find bugs that hit you six months after release." Arnd Bergmann replied, "Obviously the warning is a good idea in general, but I don't see the point of scrolling through hundreds of lines with the same warning in someone else's code. I actually plan to fix these warnings in arch/s390 and drivers/s390 as well as include/ and make the s390 kernel compile with -Werror, but the rest looks more like a task for the Janitors. Note that before gcc-3.3, -Wsign-compare has not been part of -Wall." Close by, Linus Torvalds remarked:

At least historically gcc has been so f*cking bad at the "unsigned vs signed" warnings that they are totally useless.

Maybe things are better in gcc-3.3.

Maybe not.

He posted an example of correct code that would cause GCC to produce a warning, and Andreas Schwab pointed out that there was no way for the compiler to distinguish between the code in Linus' example, and actual bad code. Linus replied:

Which is indeed my point. If you cannot distinguish it from incorrect uses, you shouldn't be warning the user, because the compiler obviously doesn't know enough to make a sufficiently educated guess.

That said, a good compiler _can_ make a good warning. But to do so, you have to actually do value analysis, instead of just blindly warning about code that is obviously correct to a human.

Until gcc does sufficient value analysis, that signed warning is annoying, worthless and a damn pain in the ass.
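For readers who have not run into the warning, here is a minimal sketch of the class of bug it targets. It is not the example Linus posted, which is not reproduced in this summary; the function names are made up for illustration. The first function spells out, with an explicit cast, the conversion that C applies implicitly when an int is compared with an unsigned int.

```c
/* What C actually evaluates for `i < u` when i is int and u is unsigned int:
 * the usual arithmetic conversions turn i into an unsigned value first. */
int less_after_conversion(int i, unsigned int u)
{
    return (unsigned int)i < u;
}

/* The comparison a reader usually has in mind: any negative i is smaller. */
int less_mathematically(int i, unsigned int u)
{
    return i < 0 || (unsigned int)i < u;
}
```

The two functions disagree only when i is negative. That is the value analysis Linus refers to: if the compiler can prove i is never negative, the warning adds nothing.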

Close by, Alan Cox remarked, "gcc gives the warning only when you ask it to annoy you. Seems a good trade off. There are about 15 bug fixes in 2.4.21-pre4ac4,ac5,ac6 solely from that, all real bugs and some very non obvious." Linus pointed out, "That _used_ to be true. Look at the subject line. gcc-3.3 gives the warning for -Wall." But Alan said, "gcc-3.3 doesn't exist yet. Maybe it won't do that now 8)." Linus replied:

Right now there are some other problems with gcc-3.3 too, ie the inlining is apparently broken enough that we'll either have to start using __attribute__((force_inline)) or we'd better hope that the gcc people decide to take the "inline" keyword more seriously (it's being discussed on the gcc lists, so we'll see)

But yes, these are all obviously with "early versions", and it may be that it changes before the real release.
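As a point of reference for the inlining remark, the sketch below shows how a forced-inline helper is usually written with GCC. It assumes the attribute GCC documents under the name always_inline; "force_inline" as written in the quote is, as far as I know, not a GCC attribute name. The macro and function names are hypothetical.

```c
/* GCC spells the attribute always_inline; FORCE_INLINE is just a local
 * shorthand chosen for this example. */
#define FORCE_INLINE static inline __attribute__((always_inline))

FORCE_INLINE unsigned int low_byte(unsigned int x)
{
    return x & 0xffu;
}

unsigned int sum_low_bytes(unsigned int a, unsigned int b)
{
    /* Both calls are expanded in place, regardless of optimization level. */
    return low_byte(a) + low_byte(b);
}
```

The alternative Linus mentions, a compiler that honours plain "inline" more strictly, would make such a macro unnecessary.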

3. PCI Hotplugging Updates

24 Feb 2003 - 1 Mar 2003 (20 posts) Archive Link: "[BK PATCH] PCI hotplug changes for 2.5.63"

Topics: FS: sysfs, Hot-Plugging, PCI

People: Greg KH, Russell King, Christoph Hellwig

Greg KH announced:

Here's some patches that clean up the remove logic a lot for the PCI hotplug drivers. The main PCI patches were done by Russell King and Christoph Hellwig, and then I went and cleaned up the PCI Hotplug drivers a lot based on their changes. I also fixed up some exit logic in the IBM PCI hotplug driver, as it was a mess.

Scott, I modified the cPCI core in order to get it to build and link properly again, but as I don't have the hardware to test it, you should probably look over the change and see if I messed anything up or not. Also, I think you are the last user of the pci_visit structure, make sure you really need it, otherwise we can get rid of it entirely from the PCI core.

Also included in his patches was code to migrate the entire PCI /proc interface to SysFS.

4. Replacing DevFS

25 Feb 2003 - 28 Feb 2003 (12 posts) Archive Link: "Patch: 2.5.62 devfs shrink"

Topics: FS: devfs, FS: ramfs

People: Adam J. Richter, Maneesh Soni, Richard Gooch, Andrew Morton, Steven Cole

Adam J. Richter announced:

Here is an update to my patch to shrink devfs for linux-2.5.62. The patch is a net deletion of 2407 lines. It contains the following new changes:

Presumably because of the size of all of the "-" lines in the patch, the linux-kernel mailing list filters it out, so I'll just post a URL for it:

Also, here is the URL for the latest devfs_helper user level program (version 0.2, unchanged). It is a reduced functionality replacement for devfsd.

I'll also describe my "to do" list for this software, in case anyone spots something I've forgotten:

Andrew Morton asked for a list of incompatibilities between Adam's new DevFS code and the existing one, along with a description of how to migrate from the old to the new setup. Adam replied, "OK. Here is a first draft of what I plan to put in linux/Documentation/filesystems/devfs/small-devfs. Corrections and comments are welcome." He went on:

This document describes the differences between Richard Gooch's original devfs and my "small" devfs.

This new devfs replaces the internal devfs file system with one derived from ramfs, a reduction of more than 2400 lines of source code, although file systems based on ramfs rely on the 345 line file fs/libfs.c.

User level differences:

  1. devfsd replaced by devfs_helper

    devfs_helper implements a subset of devfsd functionality. devfs_helper is not a daemon. Instead, the new devfs invokes devfs_helper with arguments for each event. The new devfs currently only calls devfs_helper for "LOOKUP" and "REGISTER" events. devfs_helper uses the existing /etc/devfsd.conf file and supports devfsd's regular expression matching. Like devfsd, devfs_helper is optional. It is available from the following FTP directory.

  2. Old device names not automatically installed.

    Unlike devfsd, devfs_helper does not install old "compatible" device names. This keeps devfs_helper small, which is particularly important since devfs_helper is invoked repeatedly.

    If you want to install a bunch of alternate device names (such as /dev/hda1 for /dev/ide/host0/bus0/target0/lun0/part1), you can do this at boot time after /dev has been mounted. For example, you could maintain a tree of device nodes to overlay on /dev in, say /dev.overlay, and then add something like the following to a boot script:

    ( cd /dev.overlay && tar cf - . ) | ( cd /dev && tar xfp - )

    Note that you should not use "cp" or even "cp -a" for this operation, as "cp" will always try to open devices and read from them.

    If you want to save the current /dev every time you shut your system down, you could add a line like the following to a halt script:

    ( cd /dev && tar cf - . ) | ( cd /dev.overlay && tar xfp - )

    Note that if you want to support booting both with and without devfs, a simpler approach might be to convert your non-devfs system to use devfs-style names, at least for the devices that are needed for booting (/dev/vc/0, /dev/vc/1... for virtual consoles, /dev/discs/disc0/disc for the first whole hard disk, /dev/discs/disc0/part1 for the first partition of the first disk, /dev/floppy/0).

  3. Future: DEVFS_MOUNT and "devfs=nomount" may disappear.

    The option to have the kernel automatically mount /dev may disappear in the future. As with old devfs, you can already eliminate this feature by not defining DEVFS_MOUNT. If you do this, the kernel will not be able to open /dev/console before invoking /sbin/init. Eliminating DEVFS_MOUNT shrinks the kernel, allowing this functionality to be provided by user level programs (which don't necessarily remain resident in memory and which may want to do something different anyhow). The init program can do something like the following untested code to mount /dev and open /dev/console:

            mount("", "/dev", "devfs", 0, NULL);
            close(0); close(1); close(2);   /* Just to make sure. */
            open("/dev/console", O_RDONLY); /* This will return fd 0. */
            open("/dev/console", O_WRONLY); /* This will return fd 1. */
            dup2(1, 2);                     /* stderr = stdout. */

  4. Partition table support now matches non-devfs systems (i.e., no automatic partition table rereading, which was causing problems).

    The old devfs would automatically reread partition tables at various times. This was a functional difference with non-devfs systems, and made it nearly impossible to use drivers that returned incorrect "media changed" information such as with CompactFlash cards on systems that used user level partition reading programs like partx to keep the kernel small. Basically, the old devfs would make the kernel forget CompactFlash partition tables on nearly every operation. This misfeature is removed in smalldevfs. smalldevfs systems now handle partition tables just like non-devfs systems.

Kernel differences:

  1. "ops" argument to devfs_register is temporarily ignored

    If you are using devfs to register a character or block device, you should not notice any difference. The difference is that the ops argument to devfs_register is currently ignored. So, for the time being, access to all devices still goes through major and minor device numbers. Eventually, I would like to restore the functionality of potentially eliminating major and minor device numbers, but, for now, this functionality is temporarily gone.

    Because this functionality is gone, you can only register character or block devices to get device-like behavior. The only users of this functionality were a couple of interfaces that duplicated /proc interfaces. They were removed from the kernel recently anyhow.

    In the future, I hope to restore this functionality in a way that will allow even more device support code to be removed (or "configured out") as a result than under the old devfs. So, please continue to pass the character or block device operations pointer to devfs_register, even if devfs_register is currently not using it.

  2. devfs_only() always returns 0

    devfs_only() is supposed to return 1 on systems that always use the ops field in devfs_register and therefore do not need to reference devices by number. Because of #1, devfs_only() currently always returns 0. This should change in the future, so please do not delete code that tests devfs_only(). The compiler will optimize out the unnecessary code in the meantime.
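The last point, that code testing devfs_only() costs nothing while the function is a constant, can be sketched as follows. Both functions here are hypothetical stand-ins written for this example; neither is taken from Adam's patch, and the enum is invented to give the branch something to return.

```c
/* How a device ends up being reached in this toy example. */
enum access { BY_NUMBER = 0, BY_OPS = 1 };

/* Hypothetical stand-in for devfs_only(); the small devfs is described as
 * always returning 0 for the time being. */
static inline int devfs_only(void)
{
    return 0;
}

/* Hypothetical driver helper that keeps its devfs_only() test in place. */
enum access device_access(void)
{
    if (devfs_only()) {
        /* Dead code today: with a constant 0 the compiler can drop it. */
        return BY_OPS;
    }
    return BY_NUMBER;
}
```

Keeping the branch means nothing has to be rewritten if devfs_only() starts returning 1 again.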

5. kexec Updates Ready For 2.5

25 Feb 2003 - 1 Mar 2003 (8 posts) Archive Link: "[KEXEC][2.5.63] Partially tested patches available"

Topics: Kexec

People: Andy Pfiffer, Eric W. Biederman, Bill Davidsen, Werner Almesberger

Andy Pfiffer said to Eric W. Biederman:

I have carried forward the kexec patch set to 2.5.63. I have checked it on a 1-way system, and 2-way tests are still pending.

There were additional syscall hijinks in the merge to 2.5.63, so anyone that uses this patch set will need to recompile their kexec tools.

Minor changes to the base patch include the removal of two compile-time warnings for unused variables.

The patches are available for download from OSDL's patch lifecycle manager (PLM):

Patch Stack for 2.5.63:

kexec base for 2.5.63 (based upon 2.5.54 version)

kexec hwfixes for 2.5.63 (based upon 2.5.5[89] version)

kexec usemm change (allowed 2-way to work for me):

optional change to defconfig to CONFIG_KEXEC=y

The patches are also available (with matching kexec-tools-1.8) here:

Eric was happy to see this work, but he said that for various reasons, he was too strapped for time to do much on the kexec patches. He added, "We need to get up some steam and see what it will take for Linus to notice and actually get this patch included." Bill Davidsen replied, "I hate to say it, but "notice" and "include" are two different things. He noticed the "write oops to disk" feature, he just didn't like it. Linus is a great developer, but he has limited sys admin experience, if any. Hopefully he will think it's cool, but don't assume that if you can get his attention he will respond as you wish. Best of luck on this." And Eric replied:

The code has already gotten tentative approval from Linus. And I suspect the biggest reason it isn't in is that I have gotten distracted lately and have not been asking for it to be included.

Being able to use this for processing panics is one of the side features of kexec. Admittedly one of the more useful ones, but definitely not a core feature.

Given the encouragement I have received, until I actually get negative feedback from Linus I will continue to figure it has not made it into the kernel because Linus has limited hours in the day, and an overflowing inbox.

Werner Almesberger remarked, "After that tentative approval, kexec finally has gotten the attention it deserves, and there was quite a bit of development on and surrounding it, so I guess Linus may just have decided to wait until the storm has calmed down a little."

6. S4bios Updated; Troubles With Software Suspend In 2.5

26 Feb 2003 - 4 Mar 2003 (29 posts) Archive Link: "S4bios support for 2.5.63"

Topics: Disks: IDE, Ioctls, Software Suspend

People: Pavel Machek, Alan Cox, Roger Luethi, Bert Hubert, Nigel Cunningham

Pavel Machek announced, "This is S4bios support for 2.5.63. I'd like to see it in since it is easier to understand and more foolproof." Bert Hubert said he hadn't been able to get software suspend (swsusp) to work since 2.5.61, with or without the S4bios patches. Nigel Cunningham recommended trying the latest snapshot, which he thought had a fix. Bert tried this, but got a different error: "BUG_ON (HWGROUP(drive)->handler);". Alan Cox replied, "Looks like swsuspend attempted to run an operation while one was in progress. IDE tries to catch that because the result of missing it isn't very pretty at fsck time." Roger Luethi also said:

That problem has been around for a while. I reported it for 2.5.59 which just happened to be the first 2.5 kernel I tested with swsuspend.

I'm seeing the bug every time I try swsuspend on 2.5. The same Vanilla kernels seem to work for other people, though.

The only thing that came up at the time was a suggestion to replace BUG_ON with while (which I didn't try because I'd like to keep my data).

Alan replied, "That isn't far off what you want. IDE has proper command queuing functionality and providing you are suspending in a sleeping context you can do what you are trying to do through the IDE layer politely. Take a look at how the various ide taskfile ioctls issue commands." Close by, Bert reported that the most recent kernel that would give him working software suspend was 2.5.53; he and Roger managed to eliminate the compiler as a source of the problem, since they were both using fairly disparate GCC versions. Pavel Machek suggested the problem might show up on systems with two disk drives, but Roger and Bert pointed out that Bert's system had only one. Alan remarked that having two disk drives might trigger the bug more easily, while it still might take place on systems with only one. He remarked, "An IDE command can only be outstanding per interface not per device." A bunch of developers piled onto the problem, but the thread ended inconclusively.
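Alan's remark about commands being outstanding per interface can be pictured with a small model. The sketch below is an assumption-laden illustration, not the IDE driver: the structure, its fields and the function names are invented here. It keeps one handler slot per interface, so a second command is refused while one is in flight, whichever drive it targets. The real driver hits BUG_ON at that point rather than returning an error.

```c
#include <stddef.h>

typedef void (*handler_t)(void);

/* One handler slot per interface, shared by every drive attached to it. */
struct ide_if {
    handler_t handler;   /* non-NULL while a command is in flight */
    int active_drive;    /* which drive the in-flight command belongs to */
};

static void dummy_completion(void) { }

/* Try to start a command on one drive; 0 on success, -1 if the interface
 * already has a command outstanding. */
int start_command(struct ide_if *hwif, int drive)
{
    if (hwif->handler != NULL)
        return -1;
    hwif->handler = dummy_completion;
    hwif->active_drive = drive;
    return 0;
}

/* Completion: clear the slot so the next command may be issued. */
void finish_command(struct ide_if *hwif)
{
    hwif->handler = NULL;
}
```

In this model a second drive only adds more ways to collide. A single drive can still trip the check if a new command is issued before the previous one completes, which matches Bert's one-disk report.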

7. ioctl32 Consolidation

26 Feb 2003 - 28 Feb 2003 (14 posts) Archive Link: "ioctl32 consolidation -- call for testing"

Topics: Ioctls

People: Pavel Machek, Ben Collins, David S. Miller

Pavel Machek announced:

This is next version of ioctl32 consolidation. At one point it compiled on x86-64 and sparc64. I'm not 100% sure it still does...

Could you try to apply it on your architecture, fix whatever breakage it causes, and submit patch back to me?

ia64 has very different ioctl32 emulation (and very short). What is going on there? Also not all architectures knew about register_ioctl32_translation. Ouch.

Ben Collins pointed out that this broke the Sparc64 code. It seemed that none of the 32-bit ioctls were registered, so the system, being entirely 32-bit, couldn't boot to usermode fully. Later he posted a patch, saying, "Here it is. Sparc64's macros for ioctl32's assumed that cmd was u_int instead of u_long. This look ok to you, Dave?" Pavel applied the patch, but David S. Miller didn't like it, as it doubled the size of the data structure on Sparc64. Ben felt there was no way out of it, and they went over some of the implementation details together.
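David's size objection can be illustrated with a hypothetical table entry. The field layout below is assumed for the example and is not copied from the Sparc64 source. Fixed-width types stand in for u_int (32 bits) and u_long (64 bits on Sparc64), and the 32-bit handler field is a guess.

```c
#include <stdint.h>

/* Hypothetical translation-table entries. Only the width of cmd differs. */
struct entry_u_int  { uint32_t cmd; uint32_t handler; };
struct entry_u_long { uint64_t cmd; uint32_t handler; };

/* Bytes needed for a table with n entries of each flavour. */
unsigned long table_bytes_u_int(unsigned long n)
{
    return n * sizeof(struct entry_u_int);
}

unsigned long table_bytes_u_long(unsigned long n)
{
    /* On targets that align 64-bit integers to 8 bytes, padding after
     * handler rounds each entry up to 16 bytes. */
    return n * sizeof(struct entry_u_long);
}
```

Under these assumptions, widening one field doubles every entry because of alignment padding, which is consistent with the doubling David objected to.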

8. Spell-Checking Kernel Comments

26 Feb 2003 - 5 Mar 2003 (47 posts) Archive Link: "[PATCH] kernel source spellchecker"

People: Dan Kegel

Dan Kegel posted a script to perform spellchecking on the kernel sources. He said, "Since the main remaining feature before release of the 2.6 kernel is fixing all the remaining spelling errors, this patch seems appropriate. This is against 2.4 but should apply to other versions as well. It's not very smart, but should help get us to our all-important goal of 100% correctly spellt kernel source. Todo: make it ignore names from the MAINTAINERS file, the list of signals and syscalls, and other well-known english words seem mostly in Webster's Posix edition; rewrite in Perl rather than C, or add real Makefile entry. Enjoy!" The script would go through the kernel sources, operating only on C comments, reporting on all words that appeared to be misspelled. Later he admitted he'd only been joking, but then he ran the script himself and got a lot of output. He posted a long list of words that were misspelled in five or more files. Matthias Schniederme posted a snippet of Perl to actually do the corrections. Dan Kegel pointed out that things like "borken", "dain bramaged", "controllen" and "callin" were not typos, and remarked, "The above examples make me think the list of corrections will have to be very carefully vetted before we turn this thing loose." A number of folks agreed with this. After some reworking of his and Matthias' work, Dan said:

My corrections file is up at and the patch that produces is The perl script took about an hour of 450MHz cpu time. (Might be worth adding a quick path to detect and skip files with none of the misspelled words. Or just run on a fast machine...)

I did a spot check, and it looked pretty good, but some of the fixes are just too pedantic. In particular,


should probably be dropped from the fix list.

Any other changes people want to see in the script or the corrections file? Should I add fixes for uncommon errors (those that happen only in one or two files)?

There followed a nice discussion of possible misspellings, Britishisms, Americanisms, and other corner cases. At one point Dan cautioned users of his and Matthias' tools, "BTW Linus has been accepting so many spell fixes it's probably important to work with very fresh sources..."
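The central step of such a checker, looking only at the text inside C comments, might be sketched like this. It is not Dan's script, whose source is not shown here; the function name and the simplifications (block comments only, no awareness of string literals) are assumptions of this example.

```c
#include <stddef.h>

/* Copy the text of every C block comment in src into out, one comment per
 * line, and return the number of complete comments found. Simplifications:
 * a comment opener inside a string literal is treated as a comment, and
 * // comments are ignored. */
size_t extract_comments(const char *src, char *out, size_t outsz)
{
    size_t n = 0, used = 0;
    int in_comment = 0;

    if (outsz == 0)
        return 0;

    for (const char *p = src; *p != '\0'; p++) {
        if (!in_comment && p[0] == '/' && p[1] == '*') {
            in_comment = 1;
            p++;                        /* skip the '*' of the opener */
        } else if (in_comment && p[0] == '*' && p[1] == '/') {
            in_comment = 0;
            n++;
            p++;                        /* skip the '/' of the closer */
            if (used + 1 < outsz)
                out[used++] = '\n';
        } else if (in_comment && used + 1 < outsz) {
            out[used++] = *p;           /* comment text, kept verbatim */
        }
    }
    out[used] = '\0';
    return n;
}
```

The extracted text could then be split into words and checked against a dictionary plus an exception list for deliberate spellings such as "borken".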

9. Linux 2.5.63-mm1 Released

27 Feb 2003 - 4 Mar 2003 (24 posts) Archive Link: "2.5.63-mm1"

Topics: FS: devfs, Kernel Release Announcement

People: Andrew Morton

Andrew Morton announced 2.5.63-mm1.

10. Rhine-II Stable For 2.5 And 2.4

27 Feb 2003 - 1 Mar 2003 (5 posts) Archive Link: "[0/2][via-rhine][ANNOUNCE] 1.17rc"

People: Roger Luethi

Roger Luethi said, "With these patches, the Rhine-II passes stress testing for the first time. There are still a few issues, but the driver doesn't break down under load like all previous ones did." There were no replies on the list, but he posted a little later:

the private feedback I have received so far on the recent changes has been excellent. The Rhine-II is now finally usable with via-rhine. Time to call it 1.17. -- Please apply.

FWIW I think the four patches (including this one) leading up to 1.17 are 2.4 material, too. The drivers were identical at 1.16, and some kind souls successfully tested 1.17 on 2.4. Given the low frequency of 2.4 releases and the brokenness of the driver until now, it would seem like a good idea to have it in 2.4.21.

11. Handling Out-Of-Memory

27 Feb 2003 - 3 Mar 2003 (9 posts) Archive Link: "Protecting processes from the OOM killer"

Topics: OOM Killer

People: Dan Kegel, Alan Cox, Jesse Pollard

Dan Kegel had spent a lot of time thinking about how to protect certain processes from the out-of-memory (OOM) killer. The OOM killer tried to intelligently guess which processes to kill when system RAM ran short, but it had never gotten the algorithm quite right. Dan suggested, "How about rewarding processes that have an RSS limit if they stay well below it? The operator can then mark processes that are important by using 'ulimit -m'." Alan Cox replied bluntly, "How about by not allowing your system to excessively overcommit. Everything else is armwaving "works half the time" stuff. By the time the OOM kicks in the game is already over. The rlimit one doesn't deal with things like fork explosions where you have lots of processes all under 1/4 of the rlimit range who cumulatively overcommit. In fact you now pick harder on other tasks..." Dan replied, "Even with overcommit disallowed, the OOM killer is going to run when my users try to run too big a job, so I would still like the OOM killer to behave "well"." James Antill and Jesse Pollard said the OOM killer shouldn't run in that case, because the user process itself would simply fail, when trying to allocate all that memory. Alan remarked, "The one case you can't cover cleanly in C is a stack grow exceeding memory usage. At that point it requires a tiny bit of magic. You can do it, but the overcommit blocker has to armwave a little for the kernel and other things so I've never seen it happen in a normal situation."
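Dan's proposal relies on processes carrying an RSS limit. For reference, this is a minimal sketch of how a process can set one for itself with the standard setrlimit() call, which is what 'ulimit -m' does from a shell. The helper name is made up. Whether a given kernel enforces RLIMIT_RSS, or would ever feed it into OOM decisions as Dan suggests, is not something this example assumes.

```c
#include <sys/resource.h>

/* Lower this process's soft RSS limit to at most `bytes`. Returns 0 on
 * success, -1 on error. The hard limit is left untouched. */
int set_rss_soft_limit(rlim_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_RSS, &rl) != 0)
        return -1;
    /* The soft limit may never exceed the hard limit. */
    if (rl.rlim_max != RLIM_INFINITY && bytes > rl.rlim_max)
        bytes = rl.rlim_max;
    rl.rlim_cur = bytes;
    return setrlimit(RLIMIT_RSS, &rl);
}
```

A service wrapper could call this before exec'ing an important daemon, so the limit is inherited by the process the operator wants to mark.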

12. Documentation For The Virtual Memory Subsystem

27 Feb 2003 (2 posts) Archive Link: "VM Documentation Release Day"

Topics: Virtual Memory

People: Mel Gorman, Martin J. Bligh

Mel Gorman announced:

This is a beginning of the end release of the VM documentation against 2.4.20 as it contains information on pretty much all of the VM. A lot of the older chapters have been cleaned up in terms of language, font usage and presentation and a few chapters are new. Please excuse if the swapping chapter is a bit rough, I wanted to get this done by the weekend so I can head away offline and not have to worry about it.

The whole documentation is broken up into two major sets of documents. is the main document describing how the VM works and is a fairly detailed code commentary to guide through the sticky parts. It can be found in PDF(preferred format), HTML or plain text at

Understand the VM

Code Commentary

This is a huge milestone for me (I'm actually quite proud of myself!) It has come a *long* way since I wrote which was around when I first untarred the source with a view to seriously reading it :-) (The larger project never really got as far as I thought, I drastically underestimated how long this would take and it was large enough project as it was)

At this stage, I'm nearing the end of the documentation work for the 2.4.20 VM. If I write anything for 2.5, it'll be in the shape of addendums where I describe the differences rather than going through all this again. All that I have left really is to polish it (especially the later chapters like swap management) and fill in some gaps (particularly filling out the page cache management a bit more). I'm now hoping people will read through it, tell me where and if I've made technical errors, suggestions for improvements or tell me where I've missed on topics that really should have been covered.

When the final polish is done, the whole document, LaTeX source and all will be uploaded to somewhere more accessible than my webpage. At this stage, presuming people do not start pointing out horrible mistakes I've made, I'm hoping that the final version is not too far away. Suggestions, comments and feedback are welcome.

Martin J. Bligh said, "Congratulations - this must have been a huge amount of effort, and will be a most valuable resource to have ... and freely available to everyone too."

13. Support For The Promise PDC 20376 Serial ATA / RAID Controller

27 Feb 2003 - 28 Feb 2003 (5 posts) Archive Link: "Promise PDC 20376"

Topics: Disk Arrays: RAID, Disks: IDE, Serial ATA

People: Alan Cox, David Monniaux, Andre Hedrick

David Monniaux asked if anyone, perhaps Andre Hedrick, was working on support for the Promise PDC 20376 Serial ATA / RAID controller; Alan Cox replied:

No. The SII is supported and the HPT with SATA bridges should work. Some informal discussion has occurred with two other vendors who will be releasing SATA products in time.

It is probably possible to reverse engineer the 20376 since I suspect it will behave like the older devices but with the registers memory mapped.

For Andre's take on Promise support, see Issue #206, Section #6 (12 Feb 2003: Promise Spits On Free Software)

14. DKMS: Dynamic Kernel Module Support

28 Feb 2003 - 3 Mar 2003 (4 posts) Archive Link: "[ANNOUNCE] DKMS: Dynamic Kernel Module Support"

Topics: Kernel Build System

People: Gary Lerhaupt, Sam Ravnborg

Gary Lerhaupt from Dell announced:

DKMS is a framework where device driver source can reside outside the kernel source tree so that it is very easy to rebuild modules as you upgrade kernels. This allows Linux vendors to provide driver drops without having to wait for new kernel releases (as a stopgap before the code can make it back into the kernel), while also taking out the guesswork for customers attempting to recompile modules for new kernels.

For veteran Linux users it also provides some advantages since a separate framework for driver drops will remove kernel releases as a blocking mechanism for distributing code. Instead, driver development should speed up as this separate module source tree will allow quicker testing cycles meaning better tested code can later be pushed back into the kernel at a more rapid pace. It's also nice for developers and maintainers as DKMS only requires a source tarball in conjunction with a small configuration file in order to function correctly.

The latest DKMS version is available at It is licensed under the GPL. You can also find a sample DKMS enabled QLogic RPM to show you how it all works (or, a mocked-up tarball if you don't like RPMs). If you use the sample RPM, you'll have to install it with --nodeps as it requires the DKMS RPM to be installed (which I haven't provided).

===Using DKMS===

DKMS is one bash executable that supports 7 sub-actions: add, build, install, uninstall, remove, status and match.

add: Adds an entry into the DKMS tree for later builds. It requires that source be located in /usr/src/<module>-<module-version>/ as well as the location of a properly formatted dkms.conf file (each dkms.conf is module specific and is the configuration file that tells DKMS how to build and where to install your module).

build: Builds your module but stops short of installing it. The resultant .o files are stored in the DKMS tree.

install: Installs the module in the LOCATION specified in dkms.conf.

uninstall: Uninstalls the module and replaces it with whatever original module was found during install (returns your module to the "built" state).

remove: Uninstalls and expunges all references of your module from the DKMS tree.

status: Displays the current state (added, built, installed) of modules within the DKMS tree as well as whether any original modules have been saved for uninstallation purposes.

match: Allows you to take the configuration of DKMS installed modules for one kernel and apply this config to some other kernel. This is helpful when upgrading kernels where you would like to continue using your DKMS modules instead of certain kernel modules.

Check out the man page for more details.
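The states named above (added, built, installed) and the actions that move a module between them can be pictured as a small state machine. The following is only an illustrative sketch in Python, not DKMS code: the class `DkmsTreeModel`, its method signatures, and the choice to key entries by module name and version are assumptions made for the example, and the match action and the saving of original modules are left out.

```python
from enum import Enum


class State(Enum):
    ADDED = "added"
    BUILT = "built"
    INSTALLED = "installed"


class DkmsTreeModel:
    """Toy in-memory stand-in for the DKMS tree (hypothetical, not the real tool)."""

    def __init__(self):
        # Maps (module, version) to its current State. The real tree lives
        # on disk; this dictionary is an assumption made for illustration.
        self._entries = {}

    def _expect(self, key, state):
        # Refuse an action if the module is not in the state it starts from.
        if self._entries.get(key) is not state:
            raise RuntimeError(f"{key} must be in state {state.value!r}")

    def add(self, module, version):
        key = (module, version)
        if key in self._entries:
            raise RuntimeError(f"{key} is already in the tree")
        self._entries[key] = State.ADDED

    def build(self, module, version):
        key = (module, version)
        self._expect(key, State.ADDED)
        self._entries[key] = State.BUILT

    def install(self, module, version):
        key = (module, version)
        self._expect(key, State.BUILT)
        self._entries[key] = State.INSTALLED

    def uninstall(self, module, version):
        # Per the announcement, uninstall returns the module to "built".
        key = (module, version)
        self._expect(key, State.INSTALLED)
        self._entries[key] = State.BUILT

    def remove(self, module, version):
        # Expunges the entry whatever state it is currently in.
        if self._entries.pop((module, version), None) is None:
            raise RuntimeError(f"{(module, version)} is not in the tree")

    def status(self, module, version):
        state = self._entries.get((module, version))
        return state.value if state is not None else None
```

In this model, calling add, build, install and then uninstall on a module leaves status reporting "built", and remove makes status return None.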

A few days later he replied to himself:

I wanted to post a follow-up as I have seen only a few downloads of DKMS since my original posting and also given that the Linux Development Group here at Dell is very interested in feedback from the community. The problem of chasing kernel drops is a very real issue for Linux solution providers. With our constant work with new hardware and large deployments involving many customers, at times we simply cannot afford to wait for functional drivers in the kernel. This is especially true for the discovery and resolution of high severity issues. At the same time, we cannot just hand updated source tarballs to our customers and expect that to be an appropriate customer experience. Further, it is just not feasible for us to continue to produce kernel specific module RPMs for every kernel that we support for every module that we support.

What is needed instead is a framework that can hold module source and can recompile that source directly on users' systems for whichever kernel they are running. As well, this entire process must be non-painful. We believe that DKMS is this solution and we'd like to know if you agree and how it can be improved.

Lastly, as I realize some might take a *don't care* approach to such a problem given their personal Linux comfort level, I'd like to reiterate from my previous post how such a framework could possibly yield benefits to the entire process of Linux development. We at Dell are very committed to merging code into the kernel, and if a separate framework to deploy (and test) module source existed apart from the kernel, we envision both an improvement in the speed and quality of driver development that can later be pushed back into the kernel.

So, at your convenience we invite you to give DKMS a whirl (and to try out the sample QLogic driver included for the full experience). Thanks.

Sam Ravnborg replied:

I have taken a brief look at the shell script. It assumes .o for modules, which is not true for 2.5.

When building a module it simply executes $MAKE - which is plain wrong. As has been discussed in several threads you cannot reliably track changes in CFLAGS etc. without utilising the kbuild infrastructure.

DKMS is also highly connected to the usage of /lib/modules/... and naming of config files. It looks to me as if it is very distribution specific.

And Gary replied, "I will take up your suggestion and remove the assumptions that modules end with .o. I should note that we don't see 2.6 making it into production environments within the next year so my focus has been solely on 2.4 at this point. Though, the kbuild infrastructure will actually mesh nicely with DKMS as it will simplify the mess of makefiles that it has to deal with. As for $MAKE, I believe there is some confusion here. $MAKE comes from sourcing in the dkms.conf file which is required for each module in DKMS. One of the directives in dkms.conf must be a MAKE which is the specific make command needed to build your module. So $MAKE should represent the right thing to do for the module in question." He added, "DKMS is very intertwined with /lib/modules as this is where it installs modules. I was not aware that this was distro specific. As for the kernel config files, you are correct. By default it does assume Red Hat's distro specific scheme, but when building your module, you can pass a --config option and specify the alternate path for your .config if it does not follow this scheme. I hope this clears this up."
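Gary's point about $MAKE is easier to see with a concrete configuration file. The sketch below, in Python, reads a dkms.conf-style file and extracts the MAKE command. Only the directive names MAKE and LOCATION come from the thread; the sample file contents and the helpers `parse_conf` and `build_command` are made up for illustration. The real tool sources the file from bash, so it can hold arbitrary shell, whereas this sketch handles plain KEY="value" lines only.

```python
import shlex

# Hypothetical dkms.conf contents; the values are invented for this example.
SAMPLE_CONF = '''
# configuration for an imaginary driver
MAKE="make -C src all"
LOCATION="/kernel/drivers/scsi"
'''


def parse_conf(text):
    """Collect simple shell-style KEY=value assignments into a dict."""
    directives = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        # shlex.split removes the shell quoting around the value.
        parts = shlex.split(value)
        directives[key.strip()] = parts[0] if parts else ""
    return directives


def build_command(directives):
    """Return the module's own make command as an argument list."""
    if "MAKE" not in directives:
        # The thread says every dkms.conf must supply a MAKE directive.
        raise KeyError("dkms.conf must define MAKE")
    return shlex.split(directives["MAKE"])
```

The design choice under debate is visible here: the build command is whatever the module's configuration says it is, which is flexible but, as Sam notes, does not by itself track kernel build flags the way kbuild does.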

15. ACPI Updated For 2.4 And 2.5

28 Feb 2003 (1 post) Archive Link: "ACPI patches updated (20030228)"

Topics: Power Management: ACPI

People: Andrew Grover

Andrew Grover announced:

The ACPI patches against 2.4 and 2.5 have been updated and are now available from The non-Linux-specific releases should be available from hopefully by tonight but possibly as late as Monday evening.

This includes a LOT of fixes for longstanding bugs. If you have had issues in the past with long delays or oopses on reads from the battery interface, hangs on boot, or excessive ACPI interrupts causing system slowness, please try this patch.

16. USB Updates

28 Feb 2003 (1 post) Archive Link: "[BK PATCH] USB changes for 2.5.63"

Topics: USB, Version Control

People: Greg KH, Duncan Sands, David Brownell

Greg KH announced:

Here are some more USB changes. There are a lot of speedtouch driver updates from Duncan Sands, and a bunch of usb-serial changes by me, as I go through and try to audit all of them for locking issues. There's also some ohci and ehci controller driver updates, and a bunch of other minor changes. David Brownell also created a new usb document for all of the related USB documentation (pulling it out of the kernel-api document).

Oh, and I fixed the bug that caused the unload of the usbcore module to hang, which a number of people have reported in the past.

Please pull from: bk://

17. perfctr 2.4.6 Code Profiler Released

1 Mar 2003 - 3 Mar 2003 (3 posts) Archive Link: "perfctr-2.4.6 released"

Topics: POSIX, Profiling

People: Mikael Pettersson, Albert Cahalan

Mikael Pettersson announced perfctr-2.4.6 at, saying, "This is a minor maintenance release of the stable perfctr-2.4 branch, to fix compilation problems in the recent 2.4.21-pre5 and 2.5.63 kernels. It will NOT work in 2.4.21-pre1 to -pre4." Albert Cahalan asked what exactly perfctr was, and Mikael explained it was a code profiler. He said:

It virtualises the performance counters, so it's per-process just like the integer and f.p. state. Actually there are three components: a low-level x86-specific driver, a driver for per-process performance counters, and a driver for global non-virtualised performance counters. The latter is rather rudimentary.

The low-level driver caches control data and uses an accumulating-differences approach for event counting, which keeps context switching costs down in common cases. (Writing to the performance counter and control registers is expensive, so the driver avoids that as far as possible.)
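One way to picture an accumulating-differences scheme is that the hardware counter is only ever read, never reset: the counter value is recorded when a process is scheduled in, and the difference is added to a per-process sum when it is scheduled out. The Python sketch below illustrates that idea under stated assumptions. It is not perfctr's code; the class `VirtualCounter`, its method names, and the 40-bit default counter width are chosen purely for the example.

```python
class VirtualCounter:
    """Per-process view of a free-running hardware event counter (sketch)."""

    def __init__(self, bits=40):
        # Counter width is an illustrative assumption; the mask keeps the
        # differences correct when the hardware value wraps around.
        self._mask = (1 << bits) - 1
        self.total = 0        # events accumulated over finished time slices
        self.start = 0        # hardware value seen at the last resume
        self.running = False

    def _delta(self, hw_now):
        return (hw_now - self.start) & self._mask

    def resume(self, hw_now):
        # Scheduled in: remember the current hardware value, write nothing.
        self.start = hw_now & self._mask
        self.running = True

    def suspend(self, hw_now):
        # Scheduled out: fold the events of this time slice into the sum.
        self.total += self._delta(hw_now)
        self.running = False

    def sample(self, hw_now):
        # Current per-process count, including the slice still in progress.
        return self.total + (self._delta(hw_now) if self.running else 0)
```

Because each context switch costs one read and one addition per counter, two processes can share the same hardware counter without either of them writing to it.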

The package also has a user-space access library. The driver allows a process to mmap() its counter state, and the library uses this to implement a low-overhead syscall-free algorithm for sampling the counters in user-space. The overhead for sampling a single counter is around 50-250 clock cycles, depending on CPU generation: approximately 45 cycles for P5 MMX, 115 cycles for P6, 50-60 cycles for K7, and 230 cycles for P4. Sampling all counters a process is using is less expensive than sampling them one by one.

Other people have higher-level libraries on top of this, for things like posix threads, user-friendly abstractions, and portable interfaces.

18. Status Of Major And Minor Device Number Allocation

1 Mar 2003 - 2 Mar 2003 (5 posts) Archive Link: "[PATCH] remove DEVFS_FL_AUTO_DEVNUM"

Topics: FS: devfs

People: Christoph Hellwig, H. Peter Anvin, Neil Brown

Christoph Hellwig posted a patch and explained:

Remove the DEVFS_FL_AUTO_DEVNUM flag that makes devfs_register() allocate a dev_t for its caller.

Rationale: while dynamic major/minors are a good idea, devfs is the wrong layer to do it because all code relying on it would break without devfs.

H. Peter Anvin said that dynamic major and minor device numbers were not necessarily a good idea at all. But Neil Brown reminded him that Linus had declared no new device numbers would be accepted, so dynamic numbers simply had to be accommodated. H. Peter replied, "It's also a totally nonrealistic premise, which is why new allocations are still happening at the request of Alan and Marcelo."
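For readers unfamiliar with the terms, a device number in kernels of this era packs a major and a minor number into one value, with eight bits each, and dynamic allocation means handing out an unused major on request instead of reserving one in advance. The Python sketch below illustrates both ideas. It is not kernel or devfs code; the class `MajorAllocator` and its top-down search order are assumptions made for the example.

```python
MINORBITS = 8  # 8-bit major and 8-bit minor, as in the 16-bit dev_t of the time
MINORMASK = (1 << MINORBITS) - 1


def mkdev(major, minor):
    """Pack a major/minor pair into a single device number."""
    return (major << MINORBITS) | (minor & MINORMASK)


def major_of(dev):
    return dev >> MINORBITS


def minor_of(dev):
    return dev & MINORMASK


class MajorAllocator:
    """Hands out unused major numbers on request (hypothetical helper)."""

    def __init__(self, reserved):
        # Majors that are statically assigned and must not be reused.
        self._used = set(reserved)

    def alloc(self):
        # Search from the top down so that low, well-known numbers are
        # left alone. The search order is a choice made for this sketch.
        for candidate in range(255, 0, -1):
            if candidate not in self._used:
                self._used.add(candidate)
                return candidate
        raise RuntimeError("no free major numbers")
```

With only 255 usable majors in this layout, the space is small enough that both positions in the thread are understandable: static assignments run out, while dynamic ones change from boot to boot.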

19. mdadm 1.1.0: Soft RAID Manager

2 Mar 2003 (4 posts) Archive Link: "ANNOUNCE: mdadm 1.1.0 - A tool for managing Soft RAID under Linux"

Topics: Disk Arrays: RAID

People: Neil Brown

Neil Brown announced:

I am pleased to announce the availability of
mdadm version 1.1.0
It is available at

as a source tar-ball and (at the first site) as an SRPM, and as an RPM for i386.

mdadm is a tool for creating, managing and monitoring device arrays using the "md" driver in Linux, also known as Software RAID arrays.

Release 1.1.0 contains a number of spelling corrections and bug fixes.
It has improved support for MULTIPATH arrays.
It has some new features including:
--daemonise for use with --monitor
--config=partitions to find devices by examining /proc/partitions
--update=super-minor to change the recorded minor-number for an array
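To illustrate what finding devices by examining /proc/partitions might involve, here is a minimal Python sketch that pulls device names out of text in that file's layout. It is not mdadm's implementation: the function `partition_devices` and the sample listing are invented for the example, and how mdadm goes on to use the resulting names is not described in the announcement.

```python
# Sample text in the layout of /proc/partitions; the entries are made up.
SAMPLE_PARTITIONS = """\
major minor  #blocks  name

   8     0   17921835 sda
   8     1     104391 sda1
   3    64   39082680 hdb
"""


def partition_devices(text):
    """Return /dev paths for every block device listed in the given text."""
    devices = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows begin with a numeric major number; this skips the
        # header row and the blank line that follows it.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        devices.append("/dev/" + fields[3])
    return devices
```

On a live system the same function could be fed the contents of /proc/partitions, giving a candidate list without a hand-maintained device list in the configuration file.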

Many of the improvements are due to user feedback. Thanks are due to all who gave suggestions and reported problems.

I expect the next major release to be 2.0.0 which will include support for a new super-block format soon to be supported by 2.5 series kernels.

Development of mdadm is sponsored by CSE@UNSW:
The School of Computer Science and Engineering
The University of New South Wales

20. Linux 2.2.24-rc5 Released

3 Mar 2003 (1 post) Archive Link: "Linux 2.2.24-rc5"

Topics: Networking

People: Alan Cox, Ion Badulescu, Neale Banks, James Morris, Paul Gortmaker, Paul Fulghum

Alan Cox announced:

Ok this should be it

Linux 2.2.24-rc5

o       Fix n_hdlc globals pollution                    (Paul Fulghum)
o       Fix initialisation of sk->sleep                 (Holger Smolinksi)
o       Handle init_ethdev returning null in tulip      (Neale Banks)
o       Backport rtc wildcard fix to 2.2                (Paul Gortmaker)
o       Correct wireless config help                    (Neale Banks)
o       Fix smc9194 build                               (me)

Linux 2.2.24-rc4

o       Fix ethernet as modules problems                (me)
o       Fix 8139too and rtl8139 padding                 (me)

Linux 2.2.24-rc3

o       Backport the ethernet padding fixes             (me)
        | All done except 8139too, rtl8139]

Linux 2.2.24-rc2

o       Apply AMD fix correctly                         (Bruce Robson)
o       Fix possible memory scribble in starfire        (Ion Badulescu)

Linux 2.2.24-rc1

o       Fix a typo in the maintainers                   (James Morris)   
o       Dave Niemi has moved                            (Dave Niemi)
o       Fix incorrect blocking on nonblock pipe         (Pete Benie)
o       Fix misidentification of some AMD processors    (Bruce Robson)
o       Fix a very obscure skb_realloc_headroom bug     (James Morris)
o       Fix warning in lance driver                     (Thomas Cort)
o       Fix sign handling bug in pms driver             (Silvio Cesare)
o       Drop mmap on /proc/<pid>/mem as 2.4/2.5 did     (Michal Zalewski)
        (also fixes some bugs)

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.