Kernel Traffic #40 For 25 Oct 1999

By Zack Brown


Thanks go out to Clemens Wehrmann for pointing out some embarrassing typos both last week and the week before. Thanks, Clemens! Can I come to Red Dwarf night? ;-)

Thanks also go to my Linuxcare compatriot Brett Neely for finding some other embarrassing typos. You rock, Brett!

Mailing List Stats For This Week

We looked at 1623 posts in 6557K.

There were 435 different contributors. 219 posted more than once. 137 posted last week too.

The top posters of the week were:

1. SMP CPU-Binding Discussion

3 Oct 1999 - 15 Oct 1999 (21 posts) Archive Link: "[PATCH] Binding processes to selected CPUs"

Topics: SMP

People: Andrea Arcangeli, Matti Aarnio, Tim Hockin, Rik van Riel

Avi Kivity posted a patch against 2.3.18 to allow processes to specify which CPUs they may execute on. He added a small program to demonstrate the feature. Rik van Riel pointed out that the patch added some time-heavy code to goodness(), which is called once for each process on the runqueue for each reschedule. Any code added to goodness() would slow down every part of the system. He suggested setting the processor during a rebind instead.
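To make Rik's objection concrete, here is a minimal Python sketch of a scheduler loop. It is not the kernel's code and not Avi's patch: the Task fields, the allowed_cpus bitmask, NR_CPUS and the call counter are all assumptions invented for illustration. It shows that whatever goodness() does is paid once per runnable task on every reschedule, and it models one possible reading of the rebind suggestion, namely doing the validation work once in rebind() so that goodness() only needs a single bit test.

```python
from dataclasses import dataclass

NR_CPUS = 4                      # hypothetical machine size
ALL_CPUS = (1 << NR_CPUS) - 1    # bitmask with one bit per CPU


@dataclass
class Task:
    name: str
    priority: int
    allowed_cpus: int = ALL_CPUS  # hypothetical field, not taken from the patch


calls = {"goodness": 0}  # instrumentation only, to count hot-path calls


def rebind(task, cpus):
    """Validate and store the CPU set once, when the binding changes."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < NR_CPUS:
            raise ValueError(f"no such CPU: {cpu}")
        mask |= 1 << cpu
    if mask == 0:
        raise ValueError("a task must be allowed on at least one CPU")
    task.allowed_cpus = mask


def goodness(task, cpu):
    """Weight of running `task` on `cpu`; runs for every runnable task."""
    calls["goodness"] += 1
    if not (task.allowed_cpus >> cpu) & 1:   # cheap check in the hot path
        return -1
    return task.priority


def pick_next(runqueue, cpu):
    """One reschedule: scan the whole runqueue and keep the best task."""
    best, best_weight = None, -1
    for task in runqueue:
        weight = goodness(task, cpu)
        if weight > best_weight:
            best, best_weight = task, weight
    return best
```

With three tasks on the runqueue, two calls to pick_next() cost six goodness() calls, which is why even a small amount of extra work there is multiplied across the whole system.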

Tim Hockin also pointed out that pset provided all the functionality of Avi's patch and more. He added that it was under active (though slow) development, and that as the author, he received daily emails from users.

Elsewhere, Andrea Arcangeli asked why the feature would be needed, and Avi said he'd implemented it just to try his hand at kernel hacking. Andrea replied that he thought his (Andrea's) SMP scheduling patch would optimize better than merely assigning a process to a CPU. Avi ran some benchmarks that maximized the CPU migration penalty, and found that there was no benefit to be gained from binding processes to specific CPUs, at least on smaller machines. He wasn't sure whether larger machines would show a difference.

Matti Aarnio objected that folks' perspectives tended to be too narrow. He agreed that on UMA (Uniform Memory Access) machines, i.e. tightly-coupled SMP systems, Avi's observations and benchmarks might well apply; but on NUMA (Non-Uniform Memory Access) machines, which have fast access to local memory but slower access to other memory, CPU-binding facilities would be very useful.
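Matti's distinction can be put as a small back-of-the-envelope model. The Python sketch below is only an illustration: the function names are made up here, and any latencies passed in are hypothetical inputs, not measurements of any real machine. It computes the average memory latency a process sees given the fraction of its accesses that hit local memory, and the slowdown if the process is moved away from the node that holds its data.

```python
def average_access_ns(local_fraction, local_ns, remote_ns):
    """Mean memory latency when `local_fraction` of accesses hit local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


def migration_penalty(local_ns, remote_ns):
    """Slowdown factor if a task moves away from the node holding its memory."""
    all_remote = average_access_ns(0.0, local_ns, remote_ns)
    all_local = average_access_ns(1.0, local_ns, remote_ns)
    return all_remote / all_local
```

On a UMA machine local_ns and remote_ns are equal, so the penalty factor is 1 and binding buys nothing, which matches Avi's benchmark result. On a NUMA machine the factor grows with the gap between local and remote latency, which is the case Matti had in mind.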

2. Big Devfs Discussion

3 Oct 1999 - 14 Oct 1999 (631 posts) Archive Link: "USB device allocation"

Topics: FS: devfs, USB

People: Dan Hollis, Alan Cox, Steffen Grunewald, Pavel Machek, Linus Torvalds

This debate first came up in Issue #25, Section #2 (10 Jun 1999: devfs). This time it started innocently enough with a discussion of USB device number allocation. Pavel Machek pointed out that USB was finally starting to get useful, which meant it was time to allocate /dev entries for various USB devices. He allocated 32 entries for 16 devices. Steffen Grunewald asked about other USB devices like monitors, speakers, etc., and Dan Hollis replied, "The desperate need for devfs becomes all more clear." At this point there was no turning back. The debate raged for about a week and a half, generating over 600 posts. Linus Torvalds, although back from vacation, posted nothing in any of the related threads, while Alan Cox confined his posts strictly to peripheral technical details.

3. Universal BIOS Problems

10 Oct 1999 - 14 Oct 1999 (17 posts) Archive Link: "2.3.20 will not boot"

Topics: PCI

People: Linus Torvalds, Michael Cummins, Graham Murray, Tim Waugh, Horst von Brand

Graham Murray tried 2.3.20 but found his kernel would oops during bootup. He couldn't report the exact message because the text scrolled by too quickly; Horst von Brand confirmed the problem, as did Tim Waugh. Tim stuck a 'for(;;)' in the code to freeze the system right after the oops, to catch the text. He posted it to the list, and Linus Torvalds replied:

It's a almost certainly a buggy BIOS.

Not surprising, it's one of the issues we've always had - the 32-bit BIOS interfaces tend to be buggy because they are never tested. When the PCI code calls into the BIOS to find out the interrupt routing, the BIOS gets confused and craps all over. The code to call into the BIOS for the irq number is new as of 2.3.19 or so..

Martin, let's just change the defaults: NOT call the BIOS by default (and maybe have a kernel command line to say "pciirq=bios" for the two people who need it and have a working BIOS) because I'll bet this is not going to be the only report on machines not booting when more people start testing. And it's not as if we got the interrupt numbes wrong by just looking them up by hand.

Getting a irq wrong occasionally is better than crashing mysteriously at boot. A device may not work, but at least it is a lot more debuggable. And it's probably (almost certainly) more likely that there are more broken 32-bit BIOS interfaces than there are broken machines where we have trouble guessing the irq number without the BIOS.

Alex Nicolaou tried to have it both ways, suggesting comparing the kernel's results with the results of a BIOS call, and displaying an error if the two disagreed. But Linus replied:

It's not about "when you disagree"..

If the BIOS is buggy and you call into it, the machine will crash. Hard. There is no way to recover gracefully.

That's why I don't want BIOS calls. Every single time we've called a BIOS routine (APM, standard PCI config routines, and now PCI interrupt info), there has been a non-negligible subset of BIOSes that have been buggy enough to crash the machine when called.

That's why parsing tables in memory is fine - when you parse the tables you can at least try to recover from buggy tables.

Note that this is partially why I moved the APM stuff into a separate process: it still crashes, but now a crashing BIOS can at least occasionally be somewhat contained (I'm not saying it's secure or anything like that, but the stupid random bugs that are due to the BIOS expecting DOS and Windows data structures at certain addresses tend to cause a clean kill rather than anything worse).

But I don't want to have something as critical as PCI scanning be dependent on something that is known to be unreliable. Doing it by hand may be painful too, but at least we can fix the bugs and we can analyze what goes wrong when it is done by hand.
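Linus's point about parsing tables, rather than calling into the BIOS, is that a parser can check its input and back off. The Python sketch below illustrates that idea only. The table layout (a "$TBL" signature, a length, an entry count, a checksum byte, then slot/IRQ pairs) is hypothetical and is not the format of any real BIOS table; parse_table() and build_table() are names invented here.

```python
import struct

SIGNATURE = b"$TBL"               # hypothetical signature, not a real BIOS table
HEADER = struct.Struct("<4sHBB")  # signature, total length, entry count, checksum
MAX_IRQ = 15


def parse_table(blob):
    """Return a list of (slot, irq) pairs, or None if the table looks broken."""
    if len(blob) < HEADER.size:
        return None
    signature, length, count, _checksum = HEADER.unpack_from(blob)
    if signature != SIGNATURE:
        return None
    if length != HEADER.size + 2 * count or length > len(blob):
        return None
    if sum(blob[:length]) % 256 != 0:   # all bytes must sum to zero mod 256
        return None
    entries = []
    for i in range(count):
        slot, irq = blob[HEADER.size + 2 * i: HEADER.size + 2 * i + 2]
        if irq > MAX_IRQ:
            return None
        entries.append((slot, irq))
    return entries


def build_table(entries):
    """Helper for experiments: build a table with a valid checksum."""
    body = b"".join(bytes([slot, irq]) for slot, irq in entries)
    length = HEADER.size + len(body)
    raw = HEADER.pack(SIGNATURE, length, len(entries), 0) + body
    checksum = (-sum(raw)) % 256
    return HEADER.pack(SIGNATURE, length, len(entries), checksum) + body
```

A corrupted or truncated table makes parse_table() return None, so the caller can fall back to another method. A buggy BIOS routine offers no equivalent: once control has been handed to it, nothing can be checked or undone.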

Michael Cummins said, "I bet you wish the bug reports generated from the buggy bioses problem ended up in the BIOS manufacturer E-mail box, not yours," and Linus replied:

Well, the thing is that even if they ended up there, they'd just shake their heads and say "tough luck". Even if they cared enough to fix the bug for newer versions of their BIOS (big if), that would still leave existing BIOSes with the bug. And while you can update the BIOS on pretty much all modern machines (that didn't use to be the case - remember EPROMs?), it's still not something most people want to do..

And yes, it's a bad circle. Because BIOSes have been buggy in this area, nobody uses them (including MS), so nobody tests them in real use, so they never get fixed, so...

This is why it's best to consider the BIOS just a glorified loader and not much more. Depend on it to set up the machine in a close to usable state, but be ready to do everything else on your own.

4. IDE SMP Messiness In The Stable Series

10 Oct 1999 - 12 Oct 1999 (7 posts) Archive Link: "SMP-CPU + IDE-HD + 2.2.13pre15"

Topics: Disks: IDE, SMP

People: Marc Duponcheel, Alan Cox, Andre Hedrick

Marc Duponcheel was getting an oops with 2.2.13pre15, and guessed that "something generic between 2.2.13pre14 and 2.2.13pre15 has given rise to this SMP-CPU + IDE-HD conflict." Alan Cox replied, "It is trickier than that. The problem is that 2.2.13pre14 has a deadlock condition in the IDE code for SMP. 2.2.13pre15 fixes the deadlock but opens a race condition in the request handling. Trying to fix that looks like someone will finally have to fix the locking in the ide driver and the request queue handling instead of continually hacking up an existing bad job." He added, "And it won't be me..."

Andre Hedrick replied, "I heard you the first time..........sheesh....." He went on to explain, "Yes it is going to be messy and long...... I am going to be out of pocket for two or three weeks during a transition/ any initial grunt work would be useful. Since the old guard is now effectively gone and not to return, I dread going back into history to try and catch the races during the introduction of SMP........2.1.90 -> 95 was the intro date/kernel. Recall that it was around 2.1.122 that offered to pick up the pieces and go.......Thus I have been playing catchup from before day one."

Alan said, "I dont envy any one trying to fix that IDE locking bug - its a nasty one." He went on, "I've been going through the locking and its really hard to follow quite what is being locked in places. The irq one is pretty nasty. We can't allow an IRQ to come in - even momentarily during the lock and disable irq sequence, yet we can't disable the irq with locks held as the IRQ might already be running. I have the request queue stuff partly fixed now, I need to sort out the error propogation bit."

Digging in, Andre reported, "I found several old __cli and ide__sti that dangle and never set a spinlock. These are not paired either and get set and cleard in different calls." Later in the same post, he added, "there are unlimited combinations of deadlock that can randomly clear one or another by know the old russian game...BANG!!!!"

Marc tried 2.2.13pre16 and reported, "The 2.2.13pre16 version 'fix' does work fine for several hours now so *thanks* everybody for the work (to be?) done!"

5. Paper On Fine-Grained OS Timers

12 Oct 1999 (1 post) Archive Link: "paper on fine-grained OS timers"

Topics: Networking

People: Mohit Aron

Mohit Aron announced, "I'd like to tell the Linux community about my paper entitled "Soft timers: efficient microsecond software timer support for network processing" that's going to appear in SOSP '99. The abstract for the paper is attached below. The gzip'd postscript for the paper can be downloaded from"

He included an abstract:

This paper proposes and evaluates soft timers, a new operating system facility that allows the efficient scheduling of software events at a granularity down to tens of microseconds. Soft timers can be used to avoid interrupts and reduce context switches associated with network processing without sacrificing low communication delays.

More specifically, soft timers enable transport protocols like TCP to efficiently perform rate-based clocking of packet transmissions. Experiments show that rate-based clocking can improve HTTP response time over connections with high bandwidth-delay products by up to 89% and that soft timers allow a server to employ rate-based clocking with little CPU overhead (2--6%) at high aggregate bandwidths.

Soft timers can also be used to perform network polling, which eliminates network interrupts and increases the memory access locality of the network subsystem without sacrificing delay. Experiments show that this technique can improve the throughput of a Web server by up to 25%.
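The abstract does not spell out the mechanism, so the following Python sketch rests on an assumption: that pending events are checked opportunistically at convenient points in execution (called trigger points here, a name chosen for illustration) instead of each event raising its own hardware interrupt. The SoftTimers class and its methods are hypothetical, and time is a plain integer in microseconds supplied by the caller rather than a real clock.

```python
import heapq


class SoftTimers:
    """Toy event queue whose timers fire only when someone checks it."""

    def __init__(self):
        self._heap = []
        self._seq = 0   # tie-breaker so callbacks are never compared

    def schedule(self, due_us, callback):
        """Ask for `callback(now_us)` to run at or after time `due_us`."""
        heapq.heappush(self._heap, (due_us, self._seq, callback))
        self._seq += 1

    def pending(self):
        """Number of events that have not fired yet."""
        return len(self._heap)

    def trigger_point(self, now_us):
        """Run every event that is due at `now_us`; return how many fired."""
        fired = 0
        while self._heap and self._heap[0][0] <= now_us:
            _, _, callback = heapq.heappop(self._heap)
            callback(now_us)
            fired += 1
        return fired
```

Rate-based clocking in this model amounts to scheduling one transmission every fixed interval, for example timers.schedule(100 * (i + 1), send) for packet i. The cost is that an event can fire later than requested if no trigger point happens to come along in time.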

6. Linus Weighs In On Direction Of PCI Development

13 Oct 1999 - 14 Oct 1999 (6 posts) Archive Link: "PCI patch for 2.3.21"

Topics: PCI, USB

People: Martin Mares, Linus Torvalds, Doug Ledford

Martin Mares announced, "If you have any PCI related problems with 2.3.21, please try my new patch available from"

He posted the changelog:

Someone reported that with USB IRQ disabled in BIOS, they'd get a hard lock, while with USB IRQ enabled in BIOS, everything was fine. Regarding the lock-up, Linus Torvalds replied:

This seems to be due to the excessively clever IRQ routing code that is new as of 2.3.19, which thinks that it should fix up things that the BIOS left disabled.

Which is absolutely deadly, because the BIOS in this case apparently left the irq routing disabled for a very good reason: it is probably routing the USB IRQ into an SMI, and doing the magic "emulate old devices with USB" in SMM mode thing.

When the new PCI code then changes the IRQ routing without being aware of the two levels of drivers that are using the interrupts (the kernel driver for a PS/2 mouse, and the SMM-mode BIOS driver that has the USB device enabled), you end up with an endless stream of USB interrupts that go to the wrong driver (which won't know what to do with them, so they'll keep coming - PCI interrupts are level-triggered, and once they start with nobody to shut them off they will just never stop).

And regarding the nonlocking case, he went on:

That's because when the USB irq is enabled, the BIOS won't have activated the USB controller because it thinks that the OS will handle the USB interrupt natively (you don't happen to have the USB driver enabled, so in your case the OS will _not_ have enabled the USB controller either, and you never get an endless stream of interrupts, because in this case there won't be any confusion - the only driver that will actually use irq12 is the PS/2 driver and it will be able to correctly handle all incoming interrupts - none of the crossed wires as in the bad case).

Martin, the 2.3.19 code must go. It cannot be fixed up, and this cannot continue.

Maybe NOW you understand why I have harped and harped on the issue of NOT trying to fix up random PCI state without having the driver explicitly ask for it? This is going to keep on happening as long as the PCI subsystem continues to think that it can know what the "RightThing(tm)" to do is. But the PCI subsystem really doesn't know enough in the absense of a driver, and there may be some really good reason why an IO area is not mapped or an IRQ is not enabled.

This is why missing interrupt routing stuff and missing IO mappings etc should be enabled ONLY by the driver. Because by the time the driver enables them, we know that they will be managed properly (or at least at that time it can be considered a driver bug and fixed at the proper level). Not before.
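The endless stream of interrupts Linus describes follows from how a level-triggered line behaves: it stays asserted until some handler quiets the device that raised it. The Python sketch below is a toy simulation under that assumption. Device, make_handler() and deliver() are invented for illustration and do not correspond to kernel interfaces, and max_rounds exists only so the simulation can stop where a real machine would hang.

```python
class Device:
    """A device that holds its interrupt line high until it is serviced."""

    def __init__(self, name):
        self.name = name
        self.asserting = False


def make_handler(serviced_name):
    """Build a driver handler that only knows how to quiet one device."""
    def handler(devices):
        for dev in devices:
            if dev.name == serviced_name:
                dev.asserting = False
    return handler


def deliver(devices, handlers, max_rounds=1000):
    """Deliver a shared level-triggered IRQ until the line goes quiet.

    Returns the number of interrupts delivered, or None if the line is
    still asserted after `max_rounds` (an interrupt storm).
    """
    rounds = 0
    while any(dev.asserting for dev in devices):
        if rounds == max_rounds:
            return None
        for handler in handlers:
            handler(devices)
        rounds += 1
    return rounds
```

If a USB device asserts a line on which only a PS/2 handler is registered, deliver() returns None: the handler runs on every round but never clears the source. Registering a handler that understands the USB device, which is what a driver explicitly enabling the IRQ would do, ends the storm after a single interrupt.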

7. FPU Emulation

15 Oct 1999 - 16 Oct 1999 (16 posts) Archive Link: "PATCH: (on Alpha) emulating missing instructions"

People: Linus Torvalds, Alan Cox

Luke Deller and Daniel Potts patched the kernel to emulate some Alpha instructions, and in the course of discussion, Alan Cox opined that the kernel's floating-point emulation should probably have been done in userspace. Linus Torvalds replied:

I don't think you lived through the horror of Minix and having user-space FP emulation. It's a horrible pain - either overloading SIGFPE with all that implies, or another magic hidden signal with loading a FP library on demand etc - and then core-dumps tend to be impossible to figure out etc etc.

So it definitely was a lot better off in kernel space - the user space solution would just have made everybody go slightly mad after the five hundreth bug in some shared library configuration. With the kernel solution there were no surprising interactions, just a lot of reasonably complex code that had mostly been written earlier anyway.

HOWEVER, it may be that times have changed, and that it _now_ would be better off in user space simply because it's almost never an issue any more, and user space can do some hacks that you can't do in kernel space due to security reasons etc.

At the same time it also doesn't really matter any more. I don't think the FP emulator has been changed in the last year or so, and it works. It has some slightly painful issues with fp state saving etc, but they are at least _solved_ issues.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.