Kernel Traffic #280 For 25 Oct 2004

By Zack Brown

Table Of Contents

Mailing List Stats For This Week

We looked at 1941 posts in 12166K.

There were 467 different contributors. 266 posted more than once. 177 posted last week too.

The top posters of the week were:

1. Redesigning Readahead

23 Sep 2004 - 5 Oct 2004 (33 posts) Archive Link: "[PATCH/RFC] Simplified Readahead"

Topics: Disk Arrays: RAID, Disks: IDE, Disks: SCSI, FS: JFS

People: Steven Pratt, Andrew Morton

Steven Pratt said:

The readahead code has undergone many changes in the 2.6 kernel and the current implementation is in my opinion obtuse and hard to maintain. We would like to offer up an alternative simplified design which will not only make the code easier to maintain, but as performance tests have shown, results in better performance in many cases.

We are very interested in having others review and try out the code and run whatever performance tests they see fit.

Quick overview of the new design:

The key design point of the new design is to make the readahead code aware of the size of the I/O request. This change eliminates the need for treating large random I/O as sequential, and all of the averaging code that exists just to support this. In addition, the new design ramps up quicker and shuts off faster. This, combined with the request-size awareness, eliminates the so-called "slow read path" that we try so hard to avoid in the current code. For complete details on the design of the new readahead logic, please refer to

There are a few exception cases which still concern me.

  1. pages already in cache
  2. I/O queue congestion.
  3. page stealing

The first of these is a file already residing in page cache. If we do not code for this case we will end up doing multiple page lookups for each page. The current code tries to handle this using the check_ra_success function, but this code does not work. check_ra_success will subtract 1 page each time an I/O is completely contained in page cache; however, on the main path we will increment the window size by 2 for each page in the request (up to max_readahead), thus negating the reduction. My question is: what is the right behavior? Reducing the size of the ahead window doesn't help. You must turn off readahead to have any effect. Once you believe all pages to be in page cache we should just immediately turn off readahead. What is the trigger point? 4 I/Os in a row? 400? My concern is that the savings for skipping the double lookup appears to be on the order of .5% CPU in my tests, but the penalty for small I/O in the sequential read case can be substantial. Currently the new code does not handle this case, but it could be enhanced to do so.

The second case is queue congestion. The current code does not submit the I/O if the queue is congested. This will result in each page being read serially on the cache miss path. Does submitting 32 4k I/Os instead of 1 128k I/O help queue congestion? Here is one place where the current code gets really confusing. We reduce the window by 1 page for queue congestion (treated the same as a page cache hit), but we leave the information about the current and ahead windows alone even though we did not issue the I/O to populate them, and so we will never issue the I/O from the readahead code. Eventually we will start taking cache misses since no one read the pages. That code decrements the window size by 3, but as in the cache hit case, since we are still reading sequentially we keep incrementing by 2 for each page; the net effect is -1 for each page not in cache. Again, the new code ignores the congestion case and still tries to do readahead, thus minimizing/optimizing the I/O requests sent to the device. Is this right? If not, what should we do?

The third exception case is page stealing, where the page into which readahead was done is reclaimed before the data is copied to user space. This would seem to be a somewhat rare case only happening under severe memory pressure, but my tests have shown that it can occur quite frequently with as little as 4 threads doing 1M readahead or 16 threads doing 128k readahead on a machine with 1GB memory of which 950MB is page cache. Here it would seem the right thing to do is shrink the window size to reduce the chance of page stealing; this, however, kills I/O performance if done too aggressively. Again, the current code may not perform as expected. As in the 2 previous cases, the -3 is offset by a +2, and so unless more than 2/3 of the pages in a given window are stolen the net effect is to ignore the page stealing. The new code will slowly shrink the window as long as stealing occurs, but will quickly regrow it once stealing stops.


A large number of tests have been run to test the performance of the new code, but the testing is in no way comprehensive. Primarily I ran tiobench to ensure that the major code paths were hit. I also ran sysbench in the mode which has been reported to cause problems on the current/recent readahead code. I used multiple machines and disk types to test on. The small system is a 1-way Pentium III 866MHz with 256MB memory and a dedicated IDE test disk. The second machine is an 8-way Pentium IV 2.0GHz with 1GB memory, and tests were run on both a dedicated 10k-rpm SCSI disk on an on-board Adaptec adapter and on a 2GBit QLA2300 fiber-attached FAStT700 7-disk RAID0 array.

In summary, the new code was always equal to or better than the current code in all variations and all test configurations. Tests were run multiple times on freshly formatted JFS file systems on both kernel versions. Results for the 2 kernel versions were similar, so I am including only the 2.6.9-rc2-mm1 results. Tiobench results:


For single-threaded sequential reads the code is equal on IDE and single SCSI disks, but the new code is 10-20% faster on RAID. For multi-threaded sequential reads the new code is always faster (20-50%). For random reads the new code is equal to the old for all cases where the request size is less than or equal to the max_readahead size. For request sizes larger than max_readahead the new code is as much as 50% faster.

tiobench --block n --size 4000 --numruns 2 --threads m
(where n is one of 4096, 16384, 524288 and m is one of 1, 4, 16, 64)

Graph lines with "new7" are the new readahead code, all others are the stock kernel.

Single CPU IDE

8way Single SCSI

8way RAID0

Sysbench results:

sysbench --num-threads=254 --test=fileio --file-total-size=4G --file-test-mode=rndrw

IDE Disk
Current: 1.303 MB/sec average
New: 1.314 MB/sec average

Current: 2.713 MB/sec average
New: 2.746 MB/sec average

For full results see:

Joel Schopp thought this design was superior to the existing code in all ways. And Andrew Morton agreed that the current design did get a bit ugly in places. Andrew offered some technical criticism of the patch, adding that it needed some cleaning up before it could be accepted. Steven and Andrew hashed out some of the technical details, along with various other folks, and the thread ended.

2. Linux 2.6.9-rc2-mm4 Released

26 Sep 2004 - 1 Oct 2004 (45 posts) Archive Link: "2.6.9-rc2-mm4"

Topics: Kernel Release Announcement, PCI, Version Control

People: Andrew Morton

Andrew Morton announced Linux 2.6.9-rc2-mm4, saying:

3. Making Keyboard LEDs Blink On Kernel Panic

29 Sep 2004 - 30 Sep 2004 (6 posts) Archive Link: "[PATCH] Readd panic blinking in 2.6"

Topics: FS: sysfs

People: Andi Kleen

Andi Kleen said:

Later 2.4 had a feature that would make the keyboard LEDs blink when a panic occurred. This patch adds it to 2.6 too.

This is useful when your machine is in X and locks up. With the blinking keyboard lights you at least know that a panic happened, not that it randomly locked up.

I cleaned it up a bit and ported it to the new keyboard driver. Unlike 2.4 it also works now with panic=... and doesn't rely on the timer interrupt ticking anymore. It's also cleaner now: it uses a generic callback, no ifdefs. It should also work now with a modular keyboard driver, and the panic blink frequency can be configured in sysfs (this is useful for a few KVMs that don't like this). Setting it to 0 turns it off.

Work left to do: find some way to use HLT in the busy loops. Currently machines eat a lot of power in panic and sometimes they hang in this state for days until somebody can reset them. Unfortunately this will require relying on the timer interrupts again.

P.S. Before anyone asks: no, i'm not interested in morse code output.

4. Linux 2.6.9-rc3

29 Sep 2004 - 1 Oct 2004 (31 posts) Archive Link: "Linux 2.6.9-rc3"

Topics: Kernel Release Announcement

People: Linus Torvalds, Bill Davidsen

Linus Torvalds announced Linux 2.6.9-rc3, saying:

Ok, this 2.6.9 cycle is getting too long, but here's a -rc3 and hopefully we're getting there now.

Architecture updates, networking, drivers, sparse annotations. You name it.

Bill Davidsen commented, "A recent note related to read vs. write speed actually shows about a 40% degrade in write speed from 2.6.8. I hope 2.6.9 will be held back at least a few days in hopes of verifying or debunking that. I have some results showing "slower" by about 30%, but it was just production runs, not benchmarks."

5. Context-Switching Multiple Linux Instances

1 Oct 2004 (6 posts) Archive Link: "OS Virtualization"

Topics: Microkernels: Adeos, User-Mode Linux

People: Arvind Kalyan, Frederik Deweerdt, Martin Waitz, Adam Heath, Chris Wright

Arvind Kalyan asked:

I'm trying to load and run two linux kernels simultaneously; trying to demonstrate virtualization as a first step.

Anyone have pointers to where I can start? I looked into plex, bochs, vmware, usermode linux.. they only simulate an architecture upon which another kernel runs.

My intentions are to give control to both the kernels to directly control the hardware and do "context switch" between those two based on time-slice.

Frederik Deweerdt recommended, "Maybe you could have a look at Adeos:". Martin Waitz suggested, "Have a look at Xen: They don't really allow direct hardware manipulation but use drivers of their own." To this, Adam Heath corrected, "For 2.0, they allow direct hardware manipulation." Chris Wright also recommended Xen.

6. Memory Defragmentation

1 Oct 2004 - 4 Oct 2004 (31 posts) Archive Link: "[RFC] memory defragmentation to satisfy high order allocations"

Topics: SMP

People: Marcelo Tosatti, Nick Piggin

Marcelo Tosatti said:

I've been playing with memory defragmentation for the last couple of weeks.

The following patch implements a "coalesce_memory()" function which takes "zone" and "order" as a parameter.

It tries to move enough physically nearby pages to form a free area of "order" size.

It does that by checking whether the page can be moved, allocating a new page, unmapping the pte's to it, copying data to new page, remapping the ptes, and reinserting the page on the radix/LRU.

It's very incomplete yet - for one, on SMP concurrent radix lookups will screw file page unmapping (swapcache lookup should be safe), and there are lots of other buggies inside. For example, it doesn't re-establish ptes once it has unmapped them.

I'm working on those.

But it works fine on UP (for a few minutes :)), and easily creates large physically contiguous areas of memory.

With such a thing in place we can build a mechanism for kswapd (or a separate kernel thread, if needed) to notice when we are low on high-order pages, and use the coalescing algorithm instead of blindly freeing individual pages from the LRU in the hope of building large physically contiguous memory areas.

Nick Piggin liked the idea, although he felt Marcelo might run into some kswapd problems. Marcelo replied:

I understand that kswapd is broken, and it needs to go into the page reclaim path to free pages when we are out of high-order pages (which is what your "beat kswapd" patches do, fixing high-order failures by doing so), but Linus's argument against it seems to be that "it potentially frees too many pages", causing harm to the system. He also says this has been tried in the past, with not nice results.

And that is why it has not been merged into mainline.

Is my interpretation correct?

Nick explained the situation with his own patch, "Basically, it gets kswapd doing the work when it would otherwise have to be done in direct reclaim, *OR* otherwise indefinitely fail if the allocations aren't blockable." He added, "Linus was silent on the issue after I answered his concerns. I mailed him privately and he basically said that it seems sane, and he is waiting for patches. Of course, by that stage it was fairly late into 2.6.9, and the current behaviour isn't a regression, so I'm shooting for 2.6.10. Your defragmentor should sit very nicely on top of it, of course."

7. Some Clarification Of Patch Authentication Policy

1 Oct 2004 - 2 Oct 2004 (10 posts) Archive Link: "Loops in the Signed-off-by process"

People: Dave Hansen, Linus Torvalds, Paul Jackson

Dave Hansen had some questions about the recent patch submission policy changes, first covered in Issue #264, Section #19  (22 May 2004: Linus Proposes New Patch Attribution Convention) . Dave said:

With the recent ppc64 updates, a few patches in my tree didn't merge very easily. Being lazy, I asked one of the ppc64 developers to resync them for me. But, it happened to be someone other than the original author that did this.

When they got sent to me again, the original author's (and my) Signed-off-by: lines were gone, replaced by the nice fellow who merged them. This was certainly an artifact of how he generates patches and obviously not malicious, but I still wonder what the "right" thing to do is.

Do we show the logical flow?

Signed-off-by: original author
Signed-off-by: patch merger
Signed-off-by: tree maintainer

Or the actual flow of the patches, showing that they came back to the tree maintainer twice?

Signed-off-by: original author
Signed-off-by: tree maintainer
Signed-off-by: patch merger
Signed-off-by: tree maintainer

Or, does it even really matter?

Linus Torvalds replied:

I don't think it matters that much, although I personally prefer to see the person who sent it to me ("touched it last") be last in the list. That's partly because of the fact that especially with bigger merges (ie with Andrew), I just do a search-and-replace, and replace any "signed off by sender" with "signed off by sender and me".

At the same time, I think it's pretty unnecessary (and possibly confusing) to have somebody mentioned twice, so I'd actually prefer to see people just move their (previous) sign-off to be last when they send it on.

Side note: I also like seeing "Acked-by:" or "Cc:" things just above the sign-off lines, because it ends up being useful if there are any technical issues with the patch - if a bug is found, it's very convenient to just take all the sign-off people _and_ the other "involved" people and send off a query to them all. Even if that "Acked-by:" has no other meaning than as a mention of the fact that somebody else was involved in discussions, even if they may not have been involved in actually writing or passing off the patch.
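Put together, Linus's preferences (Acked-by and Cc just above the sign-offs, the last sender signing last, nobody listed twice) would yield a trailer block along these lines; the names here are placeholders, not from the thread:

```
Acked-by: Joe Reviewer <joe@example.com>
Cc: Jane Observer <jane@example.com>
Signed-off-by: Original Author <author@example.com>
Signed-off-by: Tree Maintainer <maint@example.com>
```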

Paul Jackson said:

The protocol for adding an Acked-by line mystifies me a little.

If I submit a patch after having a good discussion of it with Joe Blow, is it appropriate for me to add an Acked-by line for Joe on my own, or should I get his consent (or know him well enough to know he consents) or should I only so add if Joe asks me to?

In other words, does the presence of such a line commit Joe to any position on the patch, beyond perhaps not being too annoyed if he gets queries on it?

Linus replied, "The "acked-by" thing doesn't mean anything, so you should just use your own judgement." He reiterated that the 'acked-by' line committed nobody to anything, and that "The annoyance factor is the only factor to take into account."

8. Linux 2.6.9-rc3-mm1 Released

2 Oct 2004 (10 posts) Archive Link: "2.6.9-rc3-mm1"

Topics: Kernel Release Announcement

People: Andrew Morton

Andrew Morton announced Linux 2.6.9-rc3-mm1, saying:

9. Software Suspend 2.1-rc1 For And 2.6.9-rc3

2 Oct 2004 (1 post) Subject: "[Announce]: Software Suspend 2.1-rc1 for and 2.6.9-rc3."

Topics: Software Suspend

People: Nigel Cunningham

Nigel Cunningham said:

Software suspend 2.1-rc1 for the above kernels has been uploaded to

Changes since are pretty minimal, amounting almost exclusively to compile fixes. There is no new functionality, but some refrigerator calls have been fixed (hvc console and pdflush in particular).

Since the announcement on the suspend-devel list, one minor issue has been found which applies to the version only. An additional patch (which can be applied after the others or popped into the patch directory before applying) is attached.

10. Merging DRM And fbdev

2 Oct 2004 - 4 Oct 2004 (18 posts) Archive Link: "Merging DRM and fbdev"

Topics: BSD, Framebuffer, PCI, Small Systems

People: Jon Smirl, Dave Airlie, Vladimir Dergachev, Alan Cox, Bill Davidsen

Jon Smirl said:

I've started on a merged fbdev and DRM driver model. It doesn't work yet but here's what the modules look like:

Module Size Used by
fbcon 38080 0
radeon 123598 1
fb 34344 2 fbcon,radeon
drm 59044 1 radeon

fbcon and fb modules are almost unmodified from the kernel source. radeonfb and radeondrm have been merged into a single driver. The merged driver uses both the drm and fb modules as libraries. It wasn't possible to build this model until drm supported drm-core.

The radeon and fb modules will get smaller, I'm just beginning to use the delete key on them. There is still a lot of duplicated code inside the radeon driver.

In this model a non-drm, fb only driver like cyber2000 could load only the fb and fbcon modules. I need to do some work rearranging generic library support functions to allow this.

This is the next phase in the work described in this email:

Dave Airlie remarked, "I think the stated issue with this is, how big the fb driver now becomes because all the DRM stuff is in it... I think a radeon common, with radeonfb/radeondrm is probably going to be needed." Jon said:

Resource reservations are not the central problem with merging fbdev and drm. The central problem is that both card specific drivers initialize the hardware, program it in conflicting ways, allocate the video memory differently, etc. Moving to a single card specific driver lets me fix that.

In the final form both the VGA scheme and my code provide shared resource reservation code. The main difference between the schemes is that the VGA scheme allows multiple independent card drivers while mine allows only a single merged one.

Multiple card drivers have in the past resulted in conflicting programming of the hardware. I suppose we could write a bunch of rules about how to share the hardware, but that seems like a lot of complicated work. The radeon has over 200 registers that would need rules for what settings are allowed. It's a lot easier to simply merge 20K of radeonfb driver into the radeondrm and eliminate this error-prone process.

If we could all just concentrate on fixing the radeondrm driver we could build a complete driver for the radeon cards instead of the ten half finished ones we have today. Once we get a complete driver the incentive for people to write new ones will be gone.

The two models look like this:

vga - attached to hardware
      drm - library
      fb - library
         fbcon - library

My model....

radeon - attached to hardware
   drm - library
   fb - library
      fbcon - library

vga - independent driver, there is only one VGA device even if multiple radeons. This driver is responsible for secondary card resets.

In the first model radeon-drm and radeon-fb can run independently. This requires duplication of the initialization code. Since they are separate drivers they can and do have completely different models for programming the hardware. At VT switch time the drivers have to save/restore state.

In the second model it is not required that a driver support both fb and drm. Something like cyber2000 does not have to link in drm since it has no use for it.

A complaint in the second model might be that the radeon driver is 120K. If some embedded system is really, really tight on RAM and they are embedding a radeon but don't want to use its advanced abilities, there is nothing stopping someone from splitting the radeon driver up into pieces. I will happily take the patch. Doing this is probably a week's worth of coding and testing to get maybe 50K memory savings. Simplest way to do this is to add IFDEFs to remove drm support from the merged radeon driver.

Vladimir Dergachev said of Jon's model:

Can we add to this "km" library ? (That's the GATOS v4l module)

In particular, I can contribute the code that does Framebuffer->System Ram transfers over PCI/AGP. It is currently GPL licensed, but there is no problem if BSD folks want it too.

This is also potentially useful for any Mesa functions that want to transfer data back from video RAM - using plain reads for this is really slow.

Alan Cox said of the RAM transfers over PCI/AGP, "This will do *wonders* to X render performance if used properly on those cards we can't do render in hardware." And regarding the Mesa comment, Alan said, "Agreed - and Mesa tends to skip even tricks like SSE2 that can quadruple read performance." Vladimir replied:

I am glad to see such enthusiasm :)

The code I have only does it on ATI cards (all radeons, all rage128, some mach64). The radeon code is the one that is known to work well.

My personal interest is that Framebuffer -> System Ram transfer is needed if one wants to use Radeon GPUs for numerical computation. Thus, if there is an agreement on what needs to be done and what modifications are acceptable I can make this a priority.

What kind of interface would different projects want? Should I wait for Jon's modifications to be complete? Which people should we include on the CC list?

Also here is a short description of current km design:

The first two pieces can be ported with ease - there are few modifications to be made, just cut the code that registers the driver.

The km_api piece will need to be replaced with interface everyone agrees on.

Completely elsewhere in the discussion, Bill Davidsen also remarked, "Perhaps there might be some feedback from the embedded folks and/or those who decide if these changes are what they want to go in the kernel. If you're going to do something like this, one of the embedded vendors might want to contribute to development. Clearly smaller software parts have advantages, if resources were available to do the split as part of the modification. That would probably reduce the maintenance effort in the future as well."

11. Linux 2.6.9-rc3-mm2 Released

4 Oct 2004 - 6 Oct 2004 (48 posts) Archive Link: "2.6.9-rc3-mm2"

Topics: Kernel Release Announcement

People: Andrew Morton

Andrew Morton announced Linux 2.6.9-rc3-mm2, saying:

12. Software-Suspend Wakeup Behavior

4 Oct 2004 - 5 Oct 2004 (3 posts) Archive Link: "PATCH/RFC: driver model/pmcore wakeup hooks (0/4)"

Topics: FS: sysfs, PCI, Power Management: ACPI, USB

People: David Brownell, Len Brown, Pavel Machek

David Brownell said:

There's been some discussion about limitations of the current pmcore for systems that want to be partially suspended most of the time. That is, where the power management needs to affect ACPI G0 states, not G1 states like S1/S3/S4, and isn't cpufreq.

One significant example involves USB mice. If they were to be suspended (usb_suspend_device) after a few seconds of inactivity, that change could often spread up the device tree and let the USB host controller stop DMA access. Some x86 CPUs could then enter C3 and save a couple Watts of battery power ... until the mouse moved, and woke that branch of the device tree for a while (until the mouse went idle again).

Most of the parts for that are now in place. But trying to use them will turn up places where the pieces don't fit together very well yet ... and wakeup support is one of them! So for example it's not possible to disable such an autosuspend mechanism for mice that can't actually issue wakeups.

So here are a few patches that add some driver model support for wakeup capabilities, and use it for PCI and USB.

The patches follow this, going to LKML.

Pavel Machek asked how fast the wakeup was, and David replied, "30+msec for the USB-specific signaling; plus a bit more for each layer of USB hubs in its tree. Lots more if power glitches force re-enumeration of anything; but if those happen, they'd happen during normal use too." Pavel also asked if the mouse would jump during wakeup, and David replied:

I actually don't have a wakeup-capable mouse (the one I have lies about it, fwiw) so I don't know how that acts. My current wakeup testing uses a USB keyboard instead.

It may even need to be a click that wakes up the mouse; you don't much want the system to wake up if you nudge the table (and hence mouse), or a truck goes by, etc.

Len Brown may have details; he was particularly keen on having this scenario work, given the number of Intel laptops that would last longer under Linux this way ... it was evidently a big win under Windows.







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.