Kernel Traffic #56 For 28 Feb 2000

By Zack Brown

Mailing List Stats For This Week

We looked at 1481 posts in 6099K.

There were 522 different contributors. 232 posted more than once. 191 posted last week too.


1. devfs Bug-Hunt And Reorganization
9 Feb 2000 - 17 Feb 2000 (32 posts) Archive Link: "[PATCH] devfs v99.11 available"
Topics: Disks: SCSI, FS: devfs
People: Richard Gooch, B. D. Elliott, James Simmons, Victor Khimenko, Linus Torvalds

Richard Gooch gave a pointer to version 99.11 of his devfs patch for the stable series. Sergey Kubushin took the opportunity to ask why 2.3.42 with devfs v.154 would crash on boot. He posted an oops, and Richard asked for his /etc/lilo.conf file to see what parameters Sergey was giving the kernel. Regarding the posted oops, Richard replied:

If it crashes during boot (say mounting root), I would have expected mount_root() to appear in this traceback. Odd that it's not. Can you start sprinkling printk()'s around to figure out where blkdev_open() is being called from when this happens?

Also, I'll need details of how you're using initrd (whether you leave it mounted after switching root and so on).

At this point B. D. Elliott stumbled in at 5 AM his time; he didn't think the problem was related to devfs at all, but had been introduced during the recent reworking of the block device interface, sometime after 2.3.35; he posted some patches to boot and run 2.3.42 with devfs. He rubbed his eyes and explained that the patches did not completely fix some recent ramdisk breakage, adding, "The symptom is that __sometimes__ files in the page cache are not flushed to the ramdisk buffer cache. Sometimes it works correctly." There was no reply.

Elsewhere, James Simmons replied to Richard's original announcement, asking why Richard had moved /dev/tty0 to /dev/vc/0; since (he asserted) the console system would become multihead-aware in 2.5, "we might end up with /dev/tty representing physical heads and /dev/vty to represent virtual consoles." Richard replied that the change was done for consistency with other virtual console devices. James asked where the /dev/vc idea came from. Richard explained, "I can't remember who chose /dev/vc/ in this particular case, but the principle of moving things into subdirectories is fully supported by Linus. In fact, he wanted me to go further than I had :-)" James asked what Linus Torvalds had had in mind, and Richard went on:

He wanted me to move the SCSI device entries from directories like /dev/sd, /dev/sg, /dev/sr and so on to a "logical physical" heirarchy under /dev/scsi, where subdirectory names map to the SCSI address components (i.e. bus, target, lun).

Also, he wanted me to rip out the compatibility entries so the kernel was left with a clean namespace.

James liked Linus' idea of the /dev/scsi hierarchy, and asked why Richard hadn't done it. Richard replied, "It's just not the way I did it originally, though I didn't mind doing it that way (after discussions with SGI I was heading in that direction anyway). I just wanted to keep the old names as well, but Linus wanted them removed from the patch."
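
(ed. The idea of deriving the directory names from the SCSI address can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions: the `struct scsi_addr` type, the `scsi_devfs_path()` helper and the "busN/targetN/lunN" spelling of the directories are all hypothetical, chosen to mirror the "bus, target, lun" components Richard mentioned. It is not the layout or the code from the devfs patch.)

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical address triple; the fields follow the quoted "bus, target, lun". */
struct scsi_addr {
    unsigned bus;
    unsigned target;
    unsigned lun;
};

/*
 * Build an illustrative path under /dev/scsi from the address components.
 * The directory spelling is an assumption made for this sketch.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int scsi_devfs_path(char *buf, size_t len, const struct scsi_addr *a)
{
    int n;

    assert(buf != NULL && a != NULL);
    n = snprintf(buf, len, "/dev/scsi/bus%u/target%u/lun%u",
                 a->bus, a->target, a->lun);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With an address of bus 0, target 3, lun 1, the helper writes "/dev/scsi/bus0/target3/lun1" into the buffer. The point of such a scheme is that the name is a pure function of where the device sits, so no table of names has to be kept in step with the hardware.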

Elsewhere and in the course of discussion, Victor Khimenko explained some of the new situation to James. He said that if there was more than one of a given type of device, there would be a directory for that device. For instance, "/dev/tts/* for serial ports, /dev/printer/* for line printer ports, etc." James hadn't understood that a directory hierarchy was so integral to devfs. He pointed out, "You know this is going to break alot of things :(" Richard replied, "If you mean that a devfs kernel has entries in different places, then yes, nearly everything breaks. The old scheme was a mess, and Linus wanted to take an iron broom to it. However, if you run devfsd, then you can get back all the old names if you so desire. In fact, that's what I do on my systems. So devfs+devfsd shouldn't break anything." And Victor added, "Default layout was designed to be clean and manageable, NOT to be 100% backword compatible."


2. Fixing Stalls At Heavy I/O Writes
9 Feb 2000 - 16 Feb 2000 (40 posts) Archive Link: "Re: elevator-starvation-4 (2.2.14 && 2.3.42) [was Re: 2.3.42 elevator latency] (fwd)"
People: Linus Torvalds, Andrea Arcangeli

This was first covered in Issue #25, Section #1 (31 May 1999: Performance/DoS Patch; Kernel Stabilizes). This time Andrea Arcangeli had posted another patch to stop Linux from stalling during heavy I/O writes. Linus Torvalds replied:

I'd MUCH rather have something like:

  1. each IO queue has a sequence number
  2. each incoming request increments the sequence number, and sets req->seq to be the new sequence number allocated.
  3. re-do the request queue to be a regular "struct list_head" thing, so that we can go both forwards and backwards.
  4. start adding requests from the BACK instead of the front like we do now. That's usually the right thing to do anyway, so it makes us use less CPU to find the right position. It also makes the next rule trivial to implement:
  5. refuse to move a new request forward past a request that has a sequence number that is too much in the past. Here "too much" depends on what kinds of requests we're talking about.

I don't like the "writebomb" logic - rather than have a separate writebomb thing, it should be much easier to make the "too much in the past" check do this particular logic. So the logic may be something like

  • writes may occur earlier than reads, but we will do that ONLY if
    • the read is really recent (ie the distance between the "current sequence number" and the "read request sequence number" is short)
    • the write is closer to the proper elevator sequence than the read was.

Reads work the same way, except the "distance" requirement can be much less strict - let's say that writes can pass reads only if the read is within the last 10 requests handled, while reads can pass other reads as long as there have been less than 100 other reads in between (made-up numbers, you get the idea).

Passing old writes is even easier, so there the distance could be something like "it's ok to pass an old write as long as the old writes sequence number is within 1000 of the current one". This is also where we could easily have "generation of write" logic for sorting between two writes - to force a partial ordering on the queue level.

So I think the sequence numbers should be able to handle =both= the latency issue and the write bomb issue. With some simple rules like the above, you KNOW that you'll never starve a read from writes, in fact you'll be guaranteed to do the read with no more than X (in above example 10) writes coming between it and execution.

Comments? It doesn't seem to be too hard to do, and I'd hate to apply your current patch that does something similar but has other things I disagree with.
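
(ed. Linus' five steps can be sketched in user-space C roughly as follows. This is a simplified illustration, not kernel code and not Andrea's patch: the `struct request` and `struct io_queue` types and the `queue_init()`, `pass_limit()` and `queue_add()` names are hypothetical, the limits 10, 100 and 1000 are the made-up numbers from the post, "distance" is measured as a plain difference of sequence numbers, and a simple "lower sector goes first" comparison stands in for the real elevator ordering.)

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a circular doubly linked request list. */
struct request {
    struct request *prev, *next;
    unsigned long seq;      /* sequence number assigned on arrival */
    unsigned long sector;   /* sort key for the simplified elevator */
    int is_write;
};

struct io_queue {
    struct request head;    /* sentinel: head.next is the front of the queue */
    unsigned long seq;      /* per-queue sequence counter */
};

void queue_init(struct io_queue *q)
{
    q->head.prev = q->head.next = &q->head;
    q->seq = 0;
}

/* How old (in sequence numbers) a queued request may be and still be passed. */
static unsigned long pass_limit(const struct request *incoming,
                                const struct request *queued)
{
    if (queued->is_write)
        return 1000;                        /* old writes are easy to pass */
    return incoming->is_write ? 10 : 100;   /* reads are protected, mostly from writes */
}

void queue_add(struct io_queue *q, struct request *rq)
{
    struct request *pos;

    assert(q != NULL && rq != NULL);
    rq->seq = ++q->seq;
    pos = q->head.prev;                     /* start scanning from the BACK */

    /* Move forward only while sort order wants it and the passed request is recent. */
    while (pos != &q->head &&
           rq->sector < pos->sector &&
           rq->seq - pos->seq <= pass_limit(rq, pos))
        pos = pos->prev;

    /* Link rq immediately after pos. */
    rq->prev = pos;
    rq->next = pos->next;
    pos->next->prev = rq;
    pos->next = rq;
}
```

In this sketch, queueing one read and then a long stream of writes at lower sectors lets only the first ten writes get ahead of the read; every later write stops behind it, which is the bounded-starvation property Linus described.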

Andrea Arcangeli replied, "As far I can tell you disagree on the implementation that is the way best I could do in 2.2.x. Actually I developed the code for 2.2.x since the write hang is been reported to me as a bug in 2.2.x thus my primary object was to fix the production kernel since the bug is a showstopper. Now that I did the way best for 2.2.x I'll try to do the way best for 2.3.x starting from the working point I reach in 2.2.x. Also I am completly satisfyed by driving the I/O layer in the way I am just doing now, thus I'll only change the complexity of the implementation now to allow it to do the calculation faster. Fixing up all the drivers won't be a few hours work (how it's instead replacing the dirtyfing with a sequence number) so stay tuned and you'll get the -5 revision in a few days instead."

There followed a long implementation discussion.


3. To Do For 2.4: Saga Continues
10 Feb 2000 - 18 Feb 2000 (28 posts) Archive Link: "Linux Status For 2.3.x: v 2.3.43"
Topics: Big Memory Support, Disk Arrays: RAID, Disks: IDE, Disks: SCSI, FS: NFS, FS: NTFS, Networking, Power Management: ACPI, SMP, Version Control, Virtual Memory
People: Alan Cox, Matthew Wilcox, Andre Hedrick, Stephen C. Tweedie, Alexander Viro, Ingo Molnar

Alan Cox posted his latest task list for 2.3.x; his first task list was covered in Issue #52, Section #1 (4 Jan 2000: ToDo Before 2.4); a revised version was covered in Issue #54, Section #1 (28 Jan 2000: ToDo Before 2.4: Saga Continues). His latest was subdivided into sections:


  1. SCSI needs allocate/free functions to fix the gdth stuff
  2. Fixing scsi blocking and cleanups
  3. PAE36 failures (? - ok now )

In Progress

  1. Merge the network fixes (DaveM)
  2. Merge 2.2.13/14 changes (Alan, all done barring COMX and Sk98)
  3. Get RAID 0.90 in (Ingo)

Fix Exists But Isnt Merged

  1. Signals leak kernel memory (security)
  2. msync fails on NFS

To Do

  1. Truncate races (Debian apt shows it nicely)
  2. Restore O_SYNC functionality
  3. vmalloc(GFP_DMA) is needed for DMA drivers
  4. VM needs rebalancing
  5. Fix eth= command line
  6. Check O_APPEND atomicity bug fixing is complete
  7. Incredibly slow loopback tcp bug
  8. Finish softnet driver port over and cleanups
  9. Page cache high on PAE36 boxes is very slow, maybe disable ?
  10. Protection on isize (sct)
  11. Mikulas claims we need to fix the getblk/mark_buffer_uptodate thing for 2.3.x as well
  12. Fix SPX socket code
  13. NCR5380 isnt smp safe
  14. Finish 64bit vfs merges (lockf64 and friends missing)
  15. Make syncppp use new ppp code
  16. Fbcon races
  17. Fix all remaining PCI code to use new resources and enable_Device
  18. Get the Emu10K merged
  19. Fix module remove race bug (-- not in open so why did I see crashes ??? --)
  20. Per Process rtsigio
  21. VFS?VM - mmap/write deadlock
  22. initrd is bust
  23. rw sempahores on page faults (mmap_sem)
  24. kiobuf seperate lock functions/bounce/page_address fixes
  25. per super block write_super needs an async flag
  26. addres_space needs a VM pressure/flush callback
  27. per file_op rw_kiovec
  28. enhanced disk statistics
  29. Fix routing by fwmark
  30. put_user appears to be broken for i386 machines
  31. Some FB drivers check the A000 area and find it busy then bomb out
  32. NTFS needs updating/binning or something
  33. ACPI hangs on boot for some systems
  34. rw semaphores on inodes to fix read/truncate races ?
  35. Not all device drivers are safe now the write inode lock isnt taken on write
  36. File locking needs checking for races
  37. Multiwrite IDE breaks on a disk error
  38. AFFS doesn't work on current page cache
  39. DMFE is not SMP safe

To item 9 of the To Do list, Ingo Molnar hadn't noticed any slowdown in the page cache of highmem as opposed to lowmem. Stephen C. Tweedie also didn't notice any difference, but there was no reply.

To item 18 of the To Do list, Rui Sousa asked what else was needed for the Emu10K merge, and if Alan hadn't liked the patch for some reason. Alan explained, "Its not far off. I need to grab another copy from CVS and look at the osutils stuff and the IRQ stuff where it registers dynamic callbacks that are always going to be to the same function..." In his post, Rui had also asked if it would be good to put the emu10k1 driver in a new drivers/sound/emu10k1 directory, to which Alan replied, "definitely".

To item 22 of the To Do list, Matthew Wilcox posted a patch he'd been using with the parisc port. He added, "It seems to work. I dislike the way we're constructing a dentry and a file, but at least we're no longer constructing our own inode. Credit should also go to Thomas Bogendorfer who also worked on this fix." Alexander Viro and Matthew then had a brief implementation discussion.

To item 37 of the To Do list, Andre Hedrick said, "I do not think that I am still sitting on that fix, but I know it is done (for now); however, I bet it is wrapped up in the code I am working on now, drat........"


4. Kernel Documentation
16 Feb 2000 - 21 Feb 2000 (26 posts) Archive Link: "Kernel developers"
Topics: Backward Compatibility
People: Brian Parris, Mathias Waack, Alan Cox, Alessandro Rubini, Jonathan Corbet, Aki M Laukkanen, James W. Laferriere

Brian Parris praised the kernel developers, "i've been on this list for several months and have been watching all the developments emerge on this list, you guys are doing a great job. I'm amazed at how quickly bugs get taken care of and how much work gets accomplished here and i hope to join you someday in hacking the linux kernel, you all have been a great inspiration to me to keep on learning." Mathias Waack agreed, but added, "the most important part of the job of a good programmer is writing a good documentation for other people. This job is disregarded by most of the kernel programmers. Its very hard to find actual and good documentation." Alan Cox replied:

Docs are important. Very important. There are now some passable books on the Linux kernel although the rather nice device driver writing book is still due its much needed update.

BTW, if you look at 2.3.45 and friends you'll notice one or two files now using gdoc so that you can generate function references directly from the drivers.

James W. Laferriere clicked his heels together to hear about the use of gdoc, and Aki M Laukkanen asked if there was any news on when the Device Drivers book would be coming out. Alessandro Rubini, the original author of the book, replied, "Jonathan Corbet is working with me at updating the book. We are centering on 2.4, with backward compatibility to 2.2 and 2.0 (although 2.0 support will be missing for some advanced stuff). We are also going to cover more hardware platforms than x86/sparc/alpha. I can't tell the schedule for publication (I don't even know)." Jonathan Corbet also replied to Aki, "I am updating it, with Alessandro's help. It will cover 2.2 and 2.4 both. Publication is somewhat unclear... I'm rather behind compared to where I was supposed to be, but, given all of the driver changes that have gone in recently I don't feel entirely bad about that. It's going to be a little while yet, but it may just beat the 2.4 kernel :)"

Aki had also asked in his post, "I'd like to know if there is a more general consensus about the kernel needing an automatic documentation system. Or is this old news and there's already such an effort going on?" To which Alan replied, "I've been documenting my drivers right now to see how it works out. I'd love to get gdoc (or some variant for kernels 8)) outputting a full function spec for everything that is exported or inline and meant to be used by drivers"
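
(ed. For readers who have not seen the style, a documentation comment of the kind such a tool extracts might look roughly like the sketch below. The layout shown, an opening `/**`, a "name - summary" line, one `@argument:` line per parameter and then free text, is an assumption based on the gnome-doc conventions, and the `ring_advance()` function is a hypothetical example rather than anything from the kernel tree.)

```c
#include <assert.h>
#include <stddef.h>

/**
 * ring_advance - step a ring-buffer index forward by one slot
 * @idx: current index, must be less than @size
 * @size: number of slots in the ring, must be non-zero
 *
 * Returns the index of the next slot, wrapping to zero at the end
 * of the ring.
 */
size_t ring_advance(size_t idx, size_t size)
{
    assert(size != 0 && idx < size);
    return (idx + 1 == size) ? 0 : idx + 1;
}
```

Because the comment sits directly above the function and names each argument, a script can pair it with the prototype and emit a function reference without anyone maintaining a separate document.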


5. Usenet Feeds For Linux Mailing Lists
8 Feb 2000 - 10 Feb 2000 (5 posts) Archive Link: "Access to fa.linux.kernel - please help"
Topics: Spam
People: Kjetil Torgrim Homme, H. Peter Anvin, Matti Aarnio, Tom Crane, Frank v Waveren

Tom Crane asked if anyone knew a reliable public read access nntpserver that carried fa.linux.kernel; the one he had been using stopped working, and he didn't want to go back to the mailing list if he could help it. Kjetil Torgrim Homme replied, "I run the fa.* gateways (it's not as well maintained as I'd like, but fa.linux.kernel works fine :-) All these are one-way, since that is the easy way of avoiding spews, spam and clueless posters. We feed this hierarchy onto the Usenet backbone." He offered to give Tom's news admin a feed to fa.linux.* if desired.

Elsewhere, H. Peter Anvin asked, "Anyone still think it was a good thing that linux.* was shut down?" Frank v Waveren replied that he was still getting a steady feed for those groups, and Matti Aarnio added his explanation:

Weird claims are made all the time... I don't know, how linux.* groups are working, and (more importantly) is there a feed from linux-kernel LISTS to that/those newsgroup(s).

Because BIDIRECTIONAL list/newsgroup interconnect is extremely difficult to do (and keep working) so that there won't happen any sort of loops, we are actively discouraging such things. (Do it, blunder, and your feed is killed.)

For UNIDIRECTIONAL list->newsgroup feed we listkeepers won't care (much), except perhaps dislike of address harvesters sending spams..

I don't now remember what was the motivation for forbidding UNIDIRECTIONAL (local) newsfeed, unless it was to cut down the number of "That is nice rule for average John Doe, but I am a Wizard at this.." blunderers.. (Which were abundant a year or two ago.)







We Hope You Enjoy Kernel Traffic

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License, version 2.0.