Kernel Traffic #201 For 17 Jan 2003

By Zack Brown


Well, I haven't received any actual feedback on the People indices, but the web logs show a lot of people using them. Check them out (quotes.html), and please let me know what you think, and what changes you'd like to see.

There have been a couple enhancements since last issue:

I'm also trying to change the publication schedule for Kernel Traffic. Instead of going up on Mondays, I'm going to shoot for Friday evening or Saturday morning. We'll see how that goes.

Mailing List Stats For This Week

We looked at 1431 posts in 6753K.

There were 425 different contributors. 211 posted more than once. 163 posted last week too.

The top posters of the week were:


1. Plans For Framebuffer Code
4 Jan 2003 - 10 Jan 2003 (18 posts) Archive Link: "[PATCH][FBDEV]: fb_putcs() and fb_setfont() methods"
Topics: Framebuffer
People: Antonino Daplas, James Simmons, Petr Vandrovec, Geert Uytterhoeven

Antonino Daplas posted a patch against 2.5.54, "to add putcs() and setfont() methods for fbdev drivers that require them." Some folks began discussing implementation issues, when James Simmons said definitively, "Rejected. I have put thought into it and the whole point was to not allow the fbdev layer to touch console data. I stand firm on this!!! The reason being is the core console layer is going to change the next development cycle. We have to change to deal with things like the PC9800 type hardware that support more than 512 fonts. Do we realy want to break every fbdev driver again. This way the breakage is once and for all. Its is also a pandoras box. If we place these hooks in we end up with the same crappy driver problem we had before. I never heard anyone every say the old api we clean." Antonino had no problem with this, but he urged James to at least include portions of his patch that dealt with actual security fixes. James did so.

Petr Vandrovec felt James was taking an overly hard line on the whole issue, defending the old API. He said there was no need to rip out the guts and try to replace them en masse; the only problems he saw with the API were things that could be fixed incrementally. He explained, "It is like with modules - some believe in evolution, and some in revolution... Fortunately modules situation finally settled down and it is enough just install new app to handle module loading/unloading." But James said, "The current "core" console code screen_buf layout is designed after VGA text mode. 16 bits which only 8 bits are used to represent a character, 9 if you have high_fonts flag set. The other 8,7 bits are for attributes. This is very limiting and it does effect fbcon.c :-( I like to the console system remove these awful limitation in the future. This why I like to see fbdev drivers avoid touching strings from the console layer." Geert Uytterhoeven pointed out that Antonino's patch was actually quite generic. And Antonino explained:

Geert is correct that the functions are generic. The fb_putcs() and fb_setfont() can be compared to Tile blitting. Tile blitting is a common operation in some games such as Warcraft, Starcraft, and most RPG's. I'm think there is Tile Blitting support in DirectFB.

In a tile-based game, the basic unit is a Tile which is just a bitmap with a predefined width and height. The game has several tiles stored in memory each with it's own unique id. To draw the background/layer, a TileMap is constructed which is basically another array. Its format is something like this - TileMap[x] = y which means draw Tile y at screen position x.

In the fbcon perspective, we can think of each character as a Tile, and fontdata as the collection of tiles. is basically a TileMap. Of course, tile blitting in games is more complicated than this, since games have multiple layers for the background, so layer position, transparency, etc has to be considered.

So maybe if we can rename fb_putcs() to fb_tileblit(), fb_setfont() to fb_loadtiles(), struct fb_chars to struct fb_tilemap and struct fb_fontdata to struct fb_tiledata, maybe it will be more acceptable?

It can be even be expanded by including fb_tiledata.depth fb_tiledata.cmap so we can support multi-colored tiled blitting.

James said he had no problem with any of this, as long as data from the console layer was not touched. Antonino then posted a fresh patch, clearing out all cases that touched data in the console layer; and the thread ended.
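
The tile-map analogy Antonino describes can be sketched in a few lines. This is only an illustration of the TileMap[x] = y idea, not the fbdev interface: the names `tiles` and `blit_tilemap`, the 2x2 tile size, and the row-major cell numbering are all assumptions made for the example.

```python
# Hypothetical illustration only: none of these names are kernel APIs.
TILE_W, TILE_H = 2, 2  # tile size in pixels (kept tiny for readability)

# The "fontdata": a collection of tiles keyed by tile id.
# Each tile is TILE_H rows of TILE_W pixels (0 = off, 1 = on).
tiles = {
    0: [[0, 0],
        [0, 0]],
    1: [[1, 1],
        [1, 1]],
    2: [[1, 0],
        [0, 1]],
}

def blit_tilemap(tile_map, cols, rows):
    """Apply tile_map[x] = y: draw tile y at screen cell x (row-major)."""
    screen = [[0] * (cols * TILE_W) for _ in range(rows * TILE_H)]
    for pos, tile_id in enumerate(tile_map):
        cell_row, cell_col = divmod(pos, cols)
        tile = tiles[tile_id]
        for dy in range(TILE_H):
            for dx in range(TILE_W):
                screen[cell_row * TILE_H + dy][cell_col * TILE_W + dx] = tile[dy][dx]
    return screen

# Two cells per row: tiles 1 and 2 on the first row, tiles 0 and 1 on the second.
screen = blit_tilemap([1, 2, 0, 1], cols=2, rows=2)
for row in screen:
    print("".join("#" if px else "." for px in row))
```

In console terms, each tile id plays the role of a character code and the tile map plays the role of the text to be drawn. A driver with hardware support could do the per-cell copy itself, without interpreting what the ids mean.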


2. Subtle Locking Bug In Quota Support In 2.5
5 Jan 2003 - 12 Jan 2003 (11 posts) Archive Link: "2.5.54 - quota support"
Topics: FS: ext2
People: Jan Kara, Lukas Hejtmanek, Andrew Morton

Lukas Hejtmanek couldn't get quota support working in 2.5.54, and asked if it was currently broken. Andrew Morton replied that it worked for him, and suggested quota-3.08. Lukas checked his version number, and found he was using the standard Debian package of version 3.08, but still reported lockups when running 'quotaon'. Under 2.5.53 and 2.4.20 the program worked correctly, although under 2.5.53 he saw errors when running 'quotaoff'. Jan Kara speculated, "It seems like quotaon (or better quotactl()) waits on some lock forever... I'll try to reproduce it but in the mean time can you print list of processes, write down a few addresses from the top of the stack of quotaon and try to match it in the to function in which is process stuck?" Lukas ran some traces, and found that the lockups were not entirely predictable: sometimes there would be a lockup, and sometimes 'quotaon' would simply be unable to find the device. Jan replied:

Reporting 'No such device' was actually bug which was introduced some time ago but nobody probably noticed it... It was introduce when quota code was converted from device numbers to 'bdev' structures.

I also fixed one bug in quotaon() call however I'm not sure wheter it could cause the freeze. Anyway patch is attached, try it and tell me about the changes.

Lukas tried the patch and found that 'quotaon' would still crash under normal circumstances, but that some experimental circumstances would no longer cause a crash, when they had before. Jan took this as an encouraging sign, and dove back into the code for more bug hunting. Finally, he said, "Ok. So I found the bug. Fix was a bit nontrivial (at one path we tried to acquire one lock twice) but know it should work. The patch also contain fix in ext2 - at some time ext2_setattr was written and call of DQUOT_TRANSFER was missing so no quota was being transferred." Lukas replied, "Good job. This patch works for me (tested with kernel 2.5.55, successfully patched with no errors). Thanks a lot."
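
The class of bug Jan describes, one code path acquiring the same lock twice, can be shown with a small userspace sketch. This is not the quota code: `outer_path` and `inner_path` are hypothetical names, and a timeout stands in for what would be an indefinite hang with a real non-recursive lock.

```python
import threading

lock = threading.Lock()  # non-recursive: a second acquire by the holder blocks

def inner_path():
    # A helper that takes the lock itself. The timeout is only here so the
    # demonstration terminates; without it this call would block forever.
    got = lock.acquire(timeout=0.1)
    if got:
        lock.release()
    return got

def outer_path():
    # The caller already holds the lock when it reaches the helper.
    with lock:
        return inner_path()

deadlocked = not outer_path()
print("second acquisition failed:", deadlocked)

# Called without the lock held, the same helper succeeds.
print("standalone acquisition ok:", inner_path())
```

Such bugs tend to show up only on the particular path that reaches the helper with the lock already held, which fits the intermittent behavior Lukas reported.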


3. Userspace Test Framework For Module Loader Porting
5 Jan 2003 - 13 Jan 2003 (10 posts) Archive Link: "Userspace Test Framework for module loader porting"
People: Rusty Russell, David Mosberger, Richard Henderson

Rusty Russell announced:

The userspace test framework I used to develop module loading on different archs is up at:

I found it much easier to use for each arch than doing the crash/reboot cycle (and you can use a real debugger).

BTW, the change to use shared objects for modules is going to be a 2.7 thing: after 10 architectures, MIPS toolchain issues made it non-trivial. So the current stuff is what is going to be there for 2.6, so no point waiting 8)

David Mosberger asked, "What about all the problems that Richard Henderson pointed out with the original in-kernel module loader? Were those solved? My gut feeling is that we really want shared objects for kernel modules on ia64 (and probably alpha?)." Richard Henderson and Rusty both replied that yes, all of Richard's objections had been answered. But as far as actually having shared objects for kernel modules on various architectures, Richard said, "Well, most everyone wants it. Except that MIPS is terminally broken. They need a rewrite of bfd/elfxx-mips.c in order to be able to do non-pic ET_DYN images. Which leaves the rest of us out in the cold."

David Mosberger was pleased that what could be fixed, had been, and asked, "Rusty, have you maintained the ia64 support of your in-kernel loader? To be honest, I have less than zero interest in maintaining such code. I'd rather prefer the old (user-level loader) or the new shared-object loader. (Of course, if someone else wants to volunteer, that would be fine, too... ;-)". Rusty said he hadn't maintained ia64 support, but would give it a shot the following week. For the various alternatives, he added, "I thought about letting archs choose which one they wanted to use, but it would really mess up the core code. Of course, the transition won't break userspace (kind of the whole point of the in-kernel module loader)."


4. IRQ Routing Performance In 2.5
7 Jan 2003 - 9 Jan 2003 (4 posts) Archive Link: "[2.5] IRQ distribution in the 2.5.52 kernel"
Topics: Hyperthreading, SMP
People: Nitin A Kamble

Nitin A Kamble from Intel reported:

We were looking at the performance impact of the IRQ routing from the 2.5.52 Linux kernel. This email includes some of our findings about the way the interrupts are getting moved in the 2.5.52 kernel. Also there is discussion and a patch for a new implementation. Let me know what you think at (

Current implementation:

We have found that the existing implementation works well on IA32 SMP systems with light load of interrupts. Also we noticed that it is not working that well under heavy interrupt load conditions on these SMP systems. The observations are:

  • Interrupt load of each IRQ is getting balanced on CPUs independent of load of other IRQs. Also the current implementation moves the IRQs randomly. This works well when the interrupt load is light. But we start seeing imbalance of interrupt load with existence of multiple heavy interrupt sources. Frequently multiple heavily loaded IRQs gets moved to a single CPU while other CPUs stay very lightly loaded. To achieve a good interrupts load balance, it is important to consider the load of all the interrupts together.

    This further can be explained with an example of 4 CPUs and 4 heavy interrupt sources. With the existing random movement approach, the chance of each of these heavy interrupt sources moving to separate CPUs is: (4/4)*(3/4)*(2/4)*(1/4) = 3/16. It means 13/16 = 81.25% of the time the situation is, some CPUs are very lightly loaded and some are loaded with multiple heavy interrupts. This causes the interrupt load imbalance and results in less performance. In a case of 2 CPUs and 2 heavily loaded interrupt sources, this imbalance happens 1/2 = 50% of the times. This issue becomes more and more severe with increasing number of heavy interrupt sources.

  • Another interesting observation is: We cannot see the imbalance of the interrupt load from /proc/interrupts. (/proc/interrupts shows the cumulative load of interrupts on all CPUs.) If the interrupt load is imbalanced and this imbalance is getting rotated among CPUs continuously, then /proc/interrupts will still show that the interrupt load is going to processors very evenly. Currently at the frequency (HZ/50) at which IRQs are moved across CPUs, it is not possible to see any interrupt load imbalance happening.
  • We have also found that, in certain cases the static IRQ binding performs better than the existing kernel distribution of interrupt load. The reason is, in a well-balanced interrupt load situations, these interrupts are unnecessarily getting frequently moved across CPUs. This adds an extra overhead; also it takes off the CPU cache warmth benefits.

    This came out from the performance measurements done on a 4-way HT (8 logical processors) Pentium 4 Xeon system running 8 copies of netperf. The 4 NICs in the system taking different IRQs generated sizable interrupt load with the help of connected clients.

    Here the netperf transactions/sec throughput numbers observed are:

    IRQs nicely manually bound to CPUs: 56.20K
    The current kernel implementation of IRQ movement: 50.05K

    The static binding of IRQs has performed 12.28% better than the current IRQ movement implemented in the kernel.

  • The current implementation does not distinguish siblings from the HT (Hyper-Threading(tm)) enabled CPUs. It will be beneficial to balance the interrupt load with respect to processor packages first, and then among logical CPUs inside processor packages.

    For example if we have 2 heavy interrupt sources and 2 processor packages (4 logical CPUs); Assigning both the heavy interrupt sources in different processor packages is better, it will use different execution resources from the different processor packages.
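
The probability argument in the first observation above can be checked by brute force. The sketch assumes each heavy IRQ lands on a CPU independently and uniformly at random, which simplifies what the kernel does; `prob_all_separate` is a hypothetical helper. Under that assumption the product (4/4)*(3/4)*(2/4)*(1/4) comes to 3/32 rather than 3/16, so the imbalanced case covers 29/32 (about 90.6%) of placements. That only strengthens the point being made. The 2-CPU figure of 1/2 checks out.

```python
from fractions import Fraction
from itertools import product

def prob_all_separate(n_cpus, n_irqs):
    """Probability that n_irqs IRQs, each placed uniformly at random on one
    of n_cpus CPUs, all end up on different CPUs (exhaustive enumeration)."""
    placements = list(product(range(n_cpus), repeat=n_irqs))
    distinct = sum(1 for p in placements if len(set(p)) == n_irqs)
    return Fraction(distinct, len(placements))

p4 = prob_all_separate(4, 4)   # 4 CPUs, 4 heavy interrupt sources
p2 = prob_all_separate(2, 2)   # 2 CPUs, 2 heavy interrupt sources

print("4x4 balanced:", p4, " imbalanced:", float(1 - p4))
print("2x2 balanced:", p2, " imbalanced:", float(1 - p2))
```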

New revised implementation:

We also have been working on a new implementation. The following points are in main focus.

  • At any moment heavily loaded IRQs are distributed to different CPUs to achieve as much balance as possible.
  • Lightly loaded interrupt sources are ignored from the load balancing, as they do not cause considerable imbalance.
  • When the heavy interrupt sources are balanced, they are not moved around. This also helps in keeping the CPU caches warm.
  • It has been made HT aware. While distributing the load, the load on a processor package to which the logical CPUs belong to is also considered.
  • In the situations of few (lesser than num_cpus) heavy interrupt sources, it is not possible to balance them evenly. In such case the existing code has been reused to move the interrupts. The randomness from the original code has been removed.
  • The time interval for redistribution has been made flexible. It varies as the system interrupt load changes.
  • A new kernel_thread is introduced to do the load balancing calculations for all the interrupt sources. It keeps the balanace_maps ready for interrupt handlers, keeping the overhead in the interrupt handling to minimum.
  • It allows the disabling of the IRQ distribution from the boot loader command line, if anybody wants to do it for any reason.
  • The algorithm also takes into account the static binding of interrupts to CPUs that user imposes from the /proc/irq/{n}/smp_affinity interface.
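
The first two points in the list above (spread the heavy sources, ignore the light ones) can be sketched with a simple greedy placement. This is not the algorithm from the patch: `balance_heavy_irqs`, the load units, and the threshold are assumptions for illustration, and Hyper-Threading awareness, smp_affinity constraints, and the variable redistribution interval are all left out.

```python
def balance_heavy_irqs(irq_load, n_cpus, heavy_threshold):
    """Assign each heavy IRQ to the currently least-loaded CPU.

    irq_load maps IRQ number -> observed load (arbitrary units).
    Returns (assignment, cpu_load); assignment maps IRQ -> CPU index.
    IRQs with load below heavy_threshold are left out of the balancing.
    """
    cpu_load = [0] * n_cpus
    assignment = {}
    heavy = [(load, irq) for irq, load in irq_load.items()
             if load >= heavy_threshold]
    # Place the heaviest sources first so they end up spread across CPUs.
    for load, irq in sorted(heavy, reverse=True):
        cpu = min(range(n_cpus), key=lambda c: cpu_load[c])
        assignment[irq] = cpu
        cpu_load[cpu] += load
    return assignment, cpu_load

# Four busy NIC interrupts plus two nearly idle ones (made-up numbers).
loads = {16: 900, 17: 850, 18: 800, 19: 780, 1: 3, 12: 5}
assignment, cpu_load = balance_heavy_irqs(loads, n_cpus=4, heavy_threshold=100)
print(assignment)
print(cpu_load)
```

With four heavy sources and four CPUs, each heavy IRQ gets its own CPU. That is the outcome the random approach reaches only a small fraction of the time.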

Throughput numbers with the netperf setup for the new implementation:

Current kernel IRQ balance implementation: 50.02K transactions/sec
The new IRQ balance implementation: 56.01K transactions/sec
The performance improvement on P4 Xeon of 11.9% is observed.

The new IRQ balance implementation also shows little performance improvement on P6 (Pentium II, III) systems.

On a P6 system the netperf throughput numbers are:
Current kernel IRQ balance implementation: 36.96K transactions/sec
The new IRQ balance implementation: 37.65K transactions/sec
Here the performance improvement on P6 system of about 2% is observed.
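
The percentages quoted in this post follow from the throughput figures as relative gains over the baseline. A quick check, using a hypothetical `improvement_pct` helper:

```python
def improvement_pct(baseline, new):
    """Relative improvement of new over baseline, in percent."""
    return (new - baseline) / baseline * 100.0

static_vs_current = improvement_pct(50.05, 56.20)  # manual binding vs. current kernel
p4_xeon = improvement_pct(50.02, 56.01)            # new implementation, P4 Xeon
p6 = improvement_pct(36.96, 37.65)                 # new implementation, P6

for label, value in [("static binding", static_vs_current),
                     ("new code, P4 Xeon", p4_xeon),
                     ("new code, P6", p6)]:
    print("%-18s %.2f%%" % (label, value))
```

The results agree with the quoted 12.28%, 11.9%, and "about 2%", allowing for truncation rather than rounding in the first two.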


5. Linux 2.5.55 Released
8 Jan 2003 - 9 Jan 2003 (8 posts) Archive Link: "Linux v2.5.55"
Topics: Device Mapper, FS: sysfs, Networking, USB, Version Control
People: Linus Torvalds, Greg KH

Linus Torvalds announced Linux 2.5.55, saying, "All over the map again: arm, alpha, ppc, sparc, usb, isdn, dm, sysfs, knfsd - you name it." Greg KH noticed that Linus had not included some USB patches Greg had sent him; and sent them again. Linus replied, "I did, but they got applied after 2.5.55 was released (they're part of the current BK tree)." This satisfied Greg.


6. New Kernel Bug Database Continues Development
9 Jan 2003 (7 posts) Archive Link: "[ANNOUNCE] Kernel Bug Database V1.10 on-line"
People: John Bradford, Alex Riesen, Jan-Benedict Glaw, Ingo Molnar

Continuing his work covered in Issue #199, Section #10 (30 Dec 2002: Possible Replacement For Bugzilla), John Bradford announced:

Version 1.10 of my kernel bug database is now on-line at:

Main updates:

  • Automatic account creation

    No need to E-Mail a request for an account to me - there is a link to create one if you don't have one already.

  • Generate a config file with the same options as the one that was uploaded with the bug report.

    If the original submitter of a bug uploaded their config file, you can download a config file with the same options set.

  • Patch database

    Patches can be submitted against a bug report, along with comments, and the facility is in place to automatically test the patch to see if it applies against any number of kernel trees. This will probably not be enabled until the bug database is moved on to another machine which has more disk space for the uncompressed kernel trees.

    It's also possible to browse the available patches, search for strings in patches, and download the patches, (obviously).

  • Command line interface improvements

    Eventually intended to be accessible via E-Mail, you can currently test the command line interface via the web. I've added commands related to patch handling.

  • Minor enhancements

    Various enhancements, including categorising of drop down lists of kernel versions and config options.

  • Various bugfixes

    Various bugfixes and minor enhancaments to improve the bug database overall.

Important note

Bugs in the database are not assigned any kind of status, nor are they assigned to one or more people, for them to work on.

This is intentional - eventually, the best way to use this database will be like this:

  • A user uploads their config file, (or an oops, or searches using keywords).
  • No bugs are found, or only ones that are nothing to do with the bug the user is experiencing.
  • The user submits a bug report
  • That bug report is re-named, re-numbered, commented on, or even deleted if it is a duplicate, by developers, until eventually a patch is posted that fixes it.
  • The original user uploads their config file, again a week later and gets a list of bug reports back which match certain options in it, which the developers have identified as causing the bugs.
  • That list now includes the bug that the user is experiencing, and hopefully also includes a patch to fix it.
  • The user downloads the patch, and can also get information about which new kernel versions it can be applied to, and by going back to the bug list, can also find out which new kernel versions the bug is actually fixed in.

Note that if a user's original bug report is actually a duplicate of an existing bug in the database, the bug report can simply be deleted, (possibly after moving comments, patches, etc, from it to the original bug).

As long as the original user does not rely on tracking the bug report by number, and instead searches via config options, (which can be as easy as uploading the relevant .config file), they should still find any applicable comments and patches that the developers have submitted. A list of kernels that any available patches successfully apply to can easily be downloaded, saving even more time in cases where a patch is made against one tree, and the user wants to apply it to another tree, (for example, because of other bugs preventing the latest kernel version from being usable on their machine).

Jan-Benedict Glaw was very excited about this, and wanted more information. Regarding the feature for downloading a config file with the same options as one uploaded with a bug report, he asked what specifically would be downloaded: the original config file, or something else? John replied:

No, you don't just get a copy of the original config file:

When a config file is uploaded to the system, it's parsed and the actual config options are stored in a database. If comments are present in a form that resembles what the existing kernel configurators use to indicate different sections, then those comments are used to categorise the config options in the database.

The main reason for this is so that if somebody reports a bug, and includes their config information, a developer can select one of their config options from a list, and indicate that the bug is triggered by it.

Re-generating the config file from that database, so that somebody else can download it was added as an afterthought :-). Comments are re-inserted, as well as an additional comment showing which kernel version the config was originally intended for.
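
The parse-then-regenerate scheme John describes can be sketched as follows. This is not his implementation: `parse_config` and `generate_config` are hypothetical names, a list of tuples stands in for his database, and the kernel version string is an arbitrary example. The sketch relies only on the standard .config conventions of `CONFIG_X=value` lines, `# CONFIG_X is not set` lines, and comment lines used as section headings.

```python
import re

def parse_config(text):
    """Parse .config text into a list of (section, option, value) tuples.
    A value of None means the option was recorded as 'is not set'."""
    section = None
    options = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        unset = re.match(r"^# (CONFIG_\w+) is not set$", line)
        if unset:
            options.append((section, unset.group(1), None))
        elif line.startswith("#"):
            comment = line.lstrip("#").strip()
            if comment:
                section = comment  # any other comment is treated as a heading
        elif "=" in line:
            name, value = line.split("=", 1)
            options.append((section, name, value))
    return options

def generate_config(options, kernel_version):
    """Rebuild .config text from stored options, re-inserting the section
    comments plus a note about the kernel version it was intended for."""
    lines = ["# Originally intended for kernel " + kernel_version]
    current = object()  # sentinel that matches no real section
    for section, name, value in options:
        if section != current:
            if section is not None:
                lines += ["#", "# " + section, "#"]
            current = section
        if value is None:
            lines.append("# %s is not set" % name)
        else:
            lines.append("%s=%s" % (name, value))
    return "\n".join(lines) + "\n"

sample = """\
#
# Processor type and features
#
CONFIG_SMP=y
# CONFIG_PREEMPT is not set
#
# File systems
#
CONFIG_QUOTA=y
CONFIG_EXT2_FS=m
"""

opts = parse_config(sample)
regenerated = generate_config(opts, "2.5.54")
print(regenerated)
```

Storing options individually, rather than keeping the file as a blob, is what lets a developer pick a single option and mark it as the trigger for a bug.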

Elsewhere, Ingo Molnar asked why it was required that users register before they actually browse the bug database. John replied that username "guest" and password "guest" would let anyone into the system. But Alex Riesen felt it was pointless to require a login at all, if someone just wanted to browse around. This made sense to John, and he said he'd fix it in the next release.


7. Linux Test Project Version 20030110 Released
10 Jan 2003 (1 post) Archive Link: "[ANNOUNCE] LTP-20030110"
Topics: Bug Tracking, Version Control
People: Jeff Martin

Jeff Martin announced:

The Linux Test Project test suite LTP-20030110.tgz has been released. Visit our website ( to download the latest version of the testsuite, and for information on test results on pre-releases, release candidates & stable releases of the kernel. There is also a list of test cases that are expected to fail, please find the list at (

The highlights of this release are:

  • Many new tests from Wipro.
  • Many new SPIE tests ported.
  • More than 40 new tests.
  • LTP now has over 900 tests.
  • Many bug-fixes

We encourage the community to post results, patches, or new tests on our mailing list, and to use the CVS bug tracking facility to report problems that you might encounter. More details available at our web-site.


8. NGPT Threading Library Version 2.2.0 Released
Topics: POSIX
People: Bill Abt, Jeff Garzik, Dan Kegel, Linus Torvalds, Valdis Kletnieks, Marc-Christian Petersen

Bill Abt from IBM announced:

NGPT - Next Generation POSIX Threading

NGPT Release 2.2.0, released today, 10 January 2003, is the next release of the "Next Generation" of Linux pthreads support. This release is fully suitable as a replacement for LinuxThreads by either a single user or group or an entire distribution.

In this release, the primary focus was performance. Significant performance and scalability enhancements have been made to this release making it the fastest and most scalable POSIX compliant threads package available on the Linux platform.

In this release, performance and scalability were the key focus of NGPT developers. Performance and scalability were improved to the point where NGPT bests both LinuxThreads and the new NPTL threading package in benchmarks. No changes were made to the kernel patches and thanks to the NPTL effort, all changes required to run NGPT on the latest 2.5.x kernels are already included.

Performance and scalability were measured using a benchmark program developed by Sun Microsystems to "prove" that a 1:1 threading model is better than the M:N threading model. As can be seen in the benchmark results NGPT is the performance and scalability leader on both a 2-way and 4-way machine running this benchmark. The benchmark results can be found on the NGPT website. The benchmark itself can be downloaded from the Sun Microsystems site.

The NGPT website can be found at

Marc-Christian Petersen was doubtful of Bill's performance claims, but some guy named Joe at Lexus said the benchmarks were probably pretty accurate. He added that for a more accurate measurement of NPTL, tests would have to be done with a recent glibc that contained NPTL-specific enhancements. Jeff Garzik confirmed, "You are correct: you need a recent 2.5 kernel and a recent glibc." Valdis Kletnieks asked if Red Hat's 2.3.1 RPM would qualify as recent enough, and Jeff said:

AFAIK, yes, it was included in the Phoebe beta.

However, I also pretty sure that fixes have been made since then, so I would grab the latest glibc from cvs... This is unfortunately a better question for the glibc lists ;-)

Dan Kegel also said to Valdis: lists what sources are needed for the latest nptl. Phoebe beta had a slightly earlier snapshot of nptl and glibc.

As far as the kernel goes, it's rumored ( that you're better off using a recent 2.5 kernel than the 2.4 backport in phoebe.

I haven't tried NPTL myself, though, so what do I know...

Way at the beginning of the thread, Marc-Christian noticed that the NGPT web site had an apparently misleading quote by Linus Torvalds. As presented on the site, it went like this:

Linus Torvalds: Look at Next Generation POSIX Threads (NGPT) for the future of threads, he advised. "pthreads are horrible, and Linux has a very different model, and there was no glue between the two." NGPT could be that glue.

Notice how a nonquote appears to be presented as a quote, until you read far enough into it. To IBM's credit, the part in actual quotes can be accurately attributed to Linus. But in archives going back to 1999, I can find no email where Linus recommends that people look to NGPT for the future of threads (or even an email where he mentions the project). There was no reply to Marc-Christian's query on the list.


9. Linux 2.5.56 Released
10 Jan 2003 - 11 Jan 2003 (7 posts) Archive Link: "Linux v2.5.56"
Topics: Forward Port, Power Management: ACPI, USB
People: Linus Torvalds, Dave Jones

Linus Torvalds announced Linux 2.5.56:

Trying to make releases slightly more often and slightly smaller.

ACPI, USB, networking (mainly netfilter) updates. Some syscall path updates and a thread bug in mm_release() that would miss updating the TID and cause a few extra traps at exec time.

And a watchdog forward port from 2.4.x by DaveJ.

Dave Jones added, "just to stem the number of 'this still isnt finished' reports I'm getting, I'm working through the 2.4 diffs incrementally. I'm not done yet, so please, be patient.."


10. Mysterious New Linux Project Seeks Developers
10 Jan 2003 (1 post) Archive Link: "new linux site: message inviting participation by top linux advocates"
Topics: Spam
People: Luke Kenneth Casson Leighton

Luke Kenneth Casson Leighton announced:


a new linux project is soon to be announced and as part of the preparation for its launch, this is an invitation for the top linux and open source people to participate.

once announced, the project will be open to everyone, world-wide, to the benefit of linux and open source, and the advance participation of a few key people will help enormously to pave the way.

this message is therefore intended to reach, in what i believe to be an appropriate way (all things considered), the top linux developers and the most active and recognised open source advocates.

for those people who believe that this approach is inappropriate, i can only apologise in advance: please simply hit delete, now: (hit it _really_ hard - get it out your system, that's right :), and save everyone some further bandwidth.

please contact me direct

if you are one of the ten or so people that have received an email directly from me recently, and you read this first, i would greatly appreciate you taking the time to locate my message to you, or to email me at for more information, if that is more appropriate.

advice sought on reaching the top linus and OS community leaders

if you know of any more appropriate forums, or any more appropriate methods by which the top linux developers and advocates and the pioneers of open source may be contacted

... bearing in mind that they are incredibly busy and receive hundreds of email messages per day...

i would love to hear from you (at my address).

please help me contact the linux and OS community leaders

if you are personally in touch with, on a regular basis, one of the recognised leaders of the linux and open source communities, then i would greatly appreciate it if you could draw their attention to this message and also ask them to contact me at my email address.

if you are NOT in touch with, on a regular and day-to-day basis, the recognised leaders of the linux and open source communities, please do NOT spam their inboxes irresponsibly with "oh, there's this guy who posted on the linux mailing lists who wanted to get in touch with you" style messages, you will only alienate them.



if you believe that someone, anywhere in the world, is a recognised leader in the open source community and is actively involved in promoting open source and linux, then please email me with:

  • their name
  • their email address and web site, and best contact method.
  • whether you are willing to help assist in contacting them (you know them personally)
  • references to some appropriate URLs that describe what they have achieved.

to everyone with the patience and time to read this far:

many, many thanks.

if you love linux and believe in open source, i believe that you will love the new project when it is announced and ready to launch.

There was no reply.


11. sl82c105 Driver Updates For 2.4 And 2.5; IDE Code Stability In 2.4
11 Jan 2003 - 12 Jan 2003 (8 posts) Archive Link: "[PATCH] sl82c105 driver update"
Topics: Disks: IDE
People: Benjamin Herrenschmidt, Russell King

Benjamin Herrenschmidt announced:

Enclosed is an update to the sl82c105 driver against 2.4.21-pre3, I'll produce a 2.5 version once this is accepted by Alan.

It adds a pio_speed field to the generic IDE struct drive. This field is currently only used by this driver, not by the core, and stores the last used PIO speed for use when disabling DMA.

This patch fix the current oops caused by this driver on boot, along with other fixes & HW bugs workarounds by Russel King and me.

Alan, please send to Marcelo if you are ok. Currently tested on a briQ HW (one channel, one master disk).

Note that I intentionally stop force-enabling the second channel (the old driver did that) since this cause problems on machines with only one channel wired and no pull down resistor on D7. It's the responsibility of the BIOS or arch fixup of machines with 2 channels to properly set the enable bits for the second one. The first one is always assumed enabled for now (though I have nothing against changing that too).

He posted a quick fix on top of his patch, for a small bug, but Russell King felt the patch was still broken. He said:

Its still broken - if it uses DMA, the ide core will call ide_dma_on, which will call config_for_dma(), which will call ide_config_drive_speed, which will then call ide_dma_on, etc.

Sorry, I don't have a solution off hand for this. I just wish that the IDE core didn't change in these incompatible ways during a stable kernel release.

Benjamin replied that he didn't see the DMA behavior Russell described. A few folks talked it over, but the discussion reached no conclusion.


12. 2.5.56-mm1 Released; Subtle Race Condition Fixed
11 Jan 2003 - 13 Jan 2003 (4 posts) Archive Link: "2.5.56-mm1"
Topics: FS: ext3, FS: ramfs
People: Andrew Morton, Jeff Garzik, Dipankar Sarma, Ingo Oeser

Andrew Morton announced:

Nothing much new here except for a fix for the ext3-related memory leak which Con reported recently.

The main items which remain unmerged from the -mm patch series are now:

  • red/black-tree based insertion and sorting for the I/O scheduler.

    Jens will be submitting this next week. It's completely stable, and the patch includes the addition of the I/O scheduler tunables in /sys/block/hda/iosched/, which is fairly important.

  • Code to automatically unplug request queues on the basis of their occupancy and a timeout.

    Jens will be reviewing this soon.

  • dcache-RCU.

    This was recently updated to fix a rename race. It's quite stable. I'm not sure where we stand wrt merging it now. Al seems to have disappeared.

  • Ingo Oeser's user page walking rework. This appears to be stable, although I'm not sure what testing it has had apart from a lot of direct-io testing.
  • Quite a lot of misc stuff which I need to go through and either send or toss.

Regarding the dcache-RCU fix, Jeff Garzik replied:

I talked to him in person last week, and this was one of the topics of discussion. He seemed to think it was fundamentally unfixable. He proceed to explain why, and then explained the scheme he worked out to improve things. Unfortunately my memory cannot do justice to the details.

Next time he explains it, I will write it down :)

Sorry for so lame a data point :)

Dipankar Sarma replied:

The rename race is fixed now. Yes, it was unfixable using *existing* RCU techniques, but one has to invent new tricks when the old bag of tricks is empty :)

Fundamentally what happens is that rename may be *two* updates - delete from one hash chain and insert into another hash chain. In order for lockfree traversal to work correctly, you must have a grace period after each update. If we do a grace period between these two updates in a rename, it slows down renames to unacceptable levels. So we had a problem there.

The solution lies in the dcache itself - it has a fast path (cached_lookup) and a slow path (real_lookup). So all we had to do was to detect that a rename had happened to the dentry while we looked it up lockfree. This is done by a generation counter (d_move_count) in the dentry and is protected by the per-dentry spinlock which we take during rename and a successful cache lookup.

Two things can happen due to the rename race - lookup incorrectly succeeds or lookup incorrectly fails. The success case is easily handled by the lockfree lookup code.

He posted some sample code and continued:

If the lookup fails due to rename race, then there will anyway be a slow real_lookup which is serialized with rename.

Maneesh did a lot of testing using many ramfs and many millions of renames with millions of lookups going on at the same time and slow path was hit only 100 times or so. For practical workloads, this should have absolutely no performance impact.
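
The sample code itself is not reproduced here. As a rough illustration of the generation-counter idea Dipankar describes, here is a minimal userspace sketch in C. It is not the dcache-RCU patch: the struct, the toy_* function names, and the race_hook parameter are all hypothetical, and only the role of the counter corresponds to the d_move_count mentioned above. The lookup compares the name without holding the lock, then takes the per-entry lock and checks that the counter has not moved; if it has, the fast path reports failure and the caller would fall back to the slow, serialized lookup.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a dentry; only the fields the sketch needs. */
struct toy_dentry {
    atomic_flag lock;          /* plays the role of the per-dentry spinlock */
    unsigned long move_count;  /* generation counter, bumped by every rename */
    char name[32];
};

static void toy_lock(struct toy_dentry *d)
{
    while (atomic_flag_test_and_set_explicit(&d->lock, memory_order_acquire))
        ;  /* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_dentry *d)
{
    atomic_flag_clear_explicit(&d->lock, memory_order_release);
}

static void toy_init(struct toy_dentry *d, const char *name)
{
    assert(strlen(name) < sizeof(d->name));
    atomic_flag_clear(&d->lock);
    d->move_count = 0;
    strcpy(d->name, name);
}

/* Rename: change the name and bump the generation, both under the lock. */
static void toy_rename(struct toy_dentry *d, const char *new_name)
{
    assert(strlen(new_name) < sizeof(d->name));
    toy_lock(d);
    strcpy(d->name, new_name);
    d->move_count++;
    toy_unlock(d);
}

/* Helper used to simulate a rename that lands in the middle of a lookup. */
static void toy_racing_rename(struct toy_dentry *d)
{
    toy_rename(d, "moved");
}

/*
 * Fast-path lookup. Returns true only if the name matched AND no rename
 * happened between the unlocked comparison and the locked recheck.
 * A false return means "use the slow, serialized lookup instead".
 *
 * race_hook (may be NULL) runs between the two steps so that a racing
 * rename can be simulated deterministically in a single thread. Real
 * concurrent code would also need proper barriers or RCU primitives for
 * the unlocked reads; this sketch leaves those out.
 */
static bool toy_fast_lookup(struct toy_dentry *d, const char *name,
                            void (*race_hook)(struct toy_dentry *))
{
    unsigned long seen = d->move_count;       /* snapshot, no lock held */
    bool match = strcmp(d->name, name) == 0;  /* unlocked comparison */
    bool valid;

    if (race_hook)
        race_hook(d);

    if (!match)
        return false;

    toy_lock(d);
    valid = (d->move_count == seen);          /* did a rename slip in? */
    toy_unlock(d);
    return valid;
}
```

With no hook, a lookup of a matching name succeeds. With toy_racing_rename passed as the hook, the same lookup is rejected because the counter changed. That rejection covers the "lookup incorrectly succeeds" case; the "incorrectly fails" case needs no check because the slow path runs anyway.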


13. Virtual Memory Subsystem Documentation
11 Jan 2003 - 13 Jan 2003 (10 posts) Archive Link: "Linux VM Documentation - Draft 1"
Topics: Big Memory Support, Version Control, Virtual Memory
People: Mel Gorman, Willy Tarreau, Marcus Alanen

Mel Gorman announced:

Well, despite numerous setbacks, disasters and various panic-attacks, I've finally got a first draft together for documentation of the Linux VM. This is still incomplete but will hopefully still be a valuable resource to those wishing to understand the VM.

It is based on 2.4.20 as the 2.5.x one still changes too much too regularly to make documenting it feasible. I do believe though that having a good understanding of the 2.4.20 VM is 80% of the work to understanding the 2.5.x one at least. There is a few notable areas not covered yet but will be over the next month or two but I am releasing this early so I can start getting feedback and correcting any errors or poor assumptions now rather than later. The areas are;

  • Swap area management (swap.c, swapfile.c etc)
  • High memory management (highmem.c)
  • Memory locking (mlock.c)
  • Mem init (May not cover as it's very arch specific and there is docs out there on the subject already)
  • Shared memory (May not cover this at all as it is really an IPC field)
  • Buffer management (Same, except it's of more importance to IO)

The documentation comes in two parts. The first is "Understanding the Linux Virtual Memory Manager" and it does pretty much as described. It is available in three formats, PDF, HTML and plain text.

Understanding the Linux Virtual Memory Manager

The second part is a code commentary which is literally a guided tour through the code. It is intended to help decipher the more cryptic sections as well as identify the code patterns that are prevalent through the code. I decided to have the code separate from the first document as maintaining the code in the document would be too painful

Code Commentary on the Linux Virtual Memory Manager

Any feedback, comments or suggestions are welcome from anyone with a VM interest but I would appreciate if people already familiar with the VM would even give a brief read to check for technical accuracy. There was rarely an authoritative source to check to make sure I was right and I didn't want to be asking questions every 5 minutes on IRC or mailing lists :-)

Willy Tarreau replied, "one feedback : Thanks a lot !!! This is invaluable work. I don't have the skills to tell you if/where you let mistakes, but your documents will help me (and probably many people) understanding this important kernel part."

Marcus Alanen was also overjoyed at this accomplishment, and asked if Mel would take patches for his docs. Mel replied:

I wasn't sure how suitable patches would be for documentation but I'll try anything once. A tar ball of the current tex source is at . There is a CVS tree but it's on a computer thats already heavily loaded so I don't want to have it hammered.

The tex sources are in tex/understand and tex/code . To create a DVI, simply ./make dvi . If you add "understand" or "code", it'll just generate that book.

Andrea Glorioso suggested starting a SourceForge project for this, and Mel replied, "There is a savannagh project called the Linux Kernel Documentation Project (LKDP) set up by Abhishek Nayani but it has been inactive for some time. I will eventually merge with it (I have made contributions to it in the past) but am waiting to get the last chapters finished first. It might be me being awkward but it's difficult to have a number of people working on one document and keeping the writing style consistent."


14. Moderated linux-kernel Forum
12 Jan 2003 (10 posts) Archive Link: "Moderated forum for linux-kernel"
Topics: Mailing List Administration
People: Andrew Walrond, Russell King, David Truog, Oliver Neukum, Olivier Galibert, Axel Siebenwirth

Andrew Walrond suggested:

Forgive if this has been discussed before, but has anyone considered hosting the linux-kernel on a web-based forum as used extensively elsewhere?

I can think of advantages;

  • Better Thread organisation and seperate topic areas for drivers, patches, ide, ...
  • Being able to cheery pick threads of interest, and completely ignore others
  • Not having to dump your inbox after a week away just to catch up
  • Moderated forums (Off-topic threads policed and deleted)
  • Read only forums (write for registered/invited members)

I'm sure somebody will enlighten me regarding the disadvantages. :)

Russell King replied:

Web-based - pain in the ass to use. Especially for people who are not on-line all the time.

Moderated linux-kernel - lots of traffic, too much to be individually moderated.

Certainly the second has been discussed before many many many times.

People, please, if you think you have an damned obvious answer to a problem, at least check the many archives before posting it.

Let us *ALL* try to avoid linux-kernel turning into tens of trolling flamewars.

David Truog also said to Andrew, "large posts (patches) and exporting data would be the two biggest" [disadvantages] "i personnaly see. also, some of us (I) use various methods to sort/search posts." Oliver Neukum also replied to Andrew's initial suggestion. He said, tongue in cheek, "Sure. How many full time moderators are you willing to employ?" Olivier Galibert also listed the disadvantages he saw:

  • Much slower than a local mailbox.
  • No filtering.
  • No choice of presentation (or not enough).
  • No scoring.
  • Much higher bandwidth needs.
  • Hard to archive.
  • Can't forward posts.
  • Can't grep posts.
  • Can't save some posts in a contiguous mailbox and patch -p1 them.

And the most annoying part, people feel anonymous on web forums and as a result post any crap just because they can, while most of tend to take having to put their email address with usually they real name in it more seriously.

He suggested that a good mail client would solve most linux-kernel problems better than a moderator. A number of folks agreed with this throughout the thread, and Axel Siebenwirth added, "I'm using procmail to filter certain patterns in lkml subjects into different mailboxes."

At a certain point, Andrew had had enough. He said, "That'll be a no then :) Ok I'm convinced. Please - no more replys!" He also asked which mail client was used by "folks in the know". Axel directed him to Mutt.

For the record, I use Mutt to write Kernel Traffic each week. It's the only tool I know that can really handle such a large list. The Debian package also has a feature by Cedric Duval that allows dynamic restructuring of broken threads. And on a big thread, a few missing References headers can really ruin your day. Anyone interested in Cedric's patches should check out his Mutt patch site.


15. Linux 2.5.58 Released
13 Jan 2003 (2 posts) Archive Link: "Linux v2.5.58"
Topics: FS: sysfs, USB, Version Control
People: Linus Torvalds

Linus Torvalds announced Linux 2.5.58 and said:

I'm still on my accelerated release schedule, trying to make slightly smaller patches more frequently, instead of having humungous patches and having people forced to either wait or use the BK trees.

HOWEVER, that's going to change. I'm actually leaving for a two-week vacation on Friday, so not only will we have a lull in the merges due to that (I probably won't be on the 'net at all, since I'm travelling with my family), but I'll also have to slow down patches before leaving to try to leave with a fairly stable kernel.

While I'm away, I'm sure the regular suspects are going to work on merging stuff (Andrew & co), so it shouldn't be a big deal, but it still helps to not have major quakes just before going away for a while.

The 2.5.58 stuff is largely a merge of a lot of smaller stuff (tons of trivial patches, for example), with some bigger things: a parisc update, IPMI driver, USB updates, sysfs updates, and RPCSEC_GSS support.


16. Linux 2.5.58-mm1 Released
13 Jan 2003 (1 post) Archive Link: "2.5.58-mm1"
Topics: FS: ReiserFS, FS: ext3, POSIX
People: Andrew Morton

Andrew Morton announced:

  • Added an implementation of posix_fadvise().

    This can be used for providing the kernel hints about desired readahead patterns, and for launching asynchronous readahead (what sys_readahead does).

    But its main application is for program-directed freeing of pagecache against large streamed files. This is what O_STREAMING gives, only posix_fadvise() is harder to use, less efficient and standards-based.

    There is a test app in

  • The direct-to-BIO readahead for reiserfs works fine.
  • Ported one of Andrea's -aa patches into 2.5: merging of file-backed VMAs.
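
To make the posix_fadvise() item above concrete, here is a minimal sketch in C of the "program-directed freeing of pagecache" use. It relies on the interface as POSIX standardizes it (posix_fadvise(fd, offset, len, advice), declared in <fcntl.h>), which may not match the behaviour of the 2.5.58-mm1 patch in every detail. The stream_file() helper and the 64K buffer size are illustrative choices, not anything from Andrew's announcement or his test app.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Read a file front to back, hinting that access is sequential and
 * asking the kernel to drop each chunk from the pagecache once it has
 * been consumed. Returns the number of bytes read, or -1 on error.
 * The advice calls are only hints, so their return values are ignored.
 */
static long stream_file(const char *path)
{
    char buf[65536];
    long total = 0;
    ssize_t n;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* offset 0, len 0: the advice applies to the whole file. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* We are done with [total, total + n): let the kernel free it. */
        (void)posix_fadvise(fd, total, n, POSIX_FADV_DONTNEED);
        total += n;
    }

    close(fd);
    return n < 0 ? -1 : total;
}
```

A caller would simply invoke stream_file("/path/to/bigfile") and check for -1. The readahead use mentioned in the announcement corresponds to POSIX_FADV_WILLNEED, which would be issued on a range before reading it rather than after.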







We Hope You Enjoy Kernel Traffic

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License, version 2.0.