Kernel Traffic #272 For 5 Sep 2004

By Zack Brown

Table Of Contents


I'd like to thank the folks who sent me pointers to tech-writing jobs, or who recommended me to their companies. It's making a big difference, and I really appreciate it. Thanks!

For the moment, though, I'm still available, and anyone who wants to check out my resume (resume.html) is more than welcome.

Mailing List Stats For This Week

We looked at 1660 posts in 10280K.

There were 432 different contributors. 222 posted more than once. 142 posted last week too.

The top posters of the week were:

1. More Sistina Software Available; Clustering Workshop Organized

4 Jul 2004 - 26 Jul 2004 (55 posts) Archive Link: "[ANNOUNCE] Minneapolis Cluster Summit, July 29-30"

Topics: Clustering, Ottawa Linux Symposium

People: Daniel Phillips, Christoph Hellwig, Lars Marowsky-Bree, Arjan van de Ven, Steven Dake, Nick Piggin, Andrew Morton

Daniel Phillips said:

Red Hat and (the former) Sistina Software are pleased to announce that we will host a two-day kickoff workshop on GFS and Cluster Infrastructure in Minneapolis, July 29 and 30, not too long after OLS. We call this the "Cluster Summit" because it goes well beyond GFS, and is really about building a comprehensive cluster infrastructure for Linux, which will hopefully be a reality by the time Linux 2.8 arrives. If we want that, we have to start now, and we have to work like fiends; time is short. We offer, as a starting point, functional code for a half-dozen major, generic cluster subsystems that Sistina has had under development for several years.

This means not just a cluster filesystem, but cluster logical volume management, generic distributed locking, cluster membership services, node fencing, user space utilities, graphical interfaces and more. Of course, it's all up for peer review. Everybody is invited, and yes, that includes OCFS and Lustre folks too. Speaking as an honorary OpenGFS team member, we will be there in force.

Tentative agenda items:

Further details, including information on travel and hotel arrangements, will be posted over the next few days on the Red Hat sponsored community cluster page:

Unfortunately, space is limited. We feel we can accommodate about fifty people comfortably. Registration is first come, first served. The price is: Free! (Of course.) If you're interested, please email me.

Let's set our sights on making Linux 2.8 a true cluster operating system.

Christoph Hellwig replied, "Don't you think it's a little too short-term? I'd rather see the cluster software that could be merged mid-term on KS (and that seems to be only OCFS2 so far)". Daniel said that the project was not short term; on the contrary, he felt the project was several months overdue. He asked, "Don't you think we ought to take a look at how OCFS and GFS might share some of the same infrastructure, for example, the DLM and cluster membership services?" Lars Marowsky-Bree said ominously, "Indeed. If your efforts in joining the infrastructure are more successful than ours have been, more power to you ;-)"

Daniel asked what problems Lars was talking about, as the technical issues did not seem insurmountable; Lars replied:

The problems were mostly political. Maybe we tried to push too early, but 1-3 years back, people weren't really interested in agreeing on some common components or APIs. In particular, a certain Linux vendor didn't even join the group ;-) And the "industry" was very reluctant too. Which meant that everybody spent ages talking and not much happened.

However, times may have changed, and hopefully for the better. The push to get one solution included into the Linux kernel may be enough to convince people that this time it's for real...

There still is the Open Clustering Framework group though, which is a sub-group of the FSG and maybe the right umbrella to put this under, to stay away from the impression that it's a single vendor pushing.

If we could revive that and make real progress, I'd be as happy as a well fed penguin.

Now with OpenAIS on the table, the GFS stack, the work already done by OCF in the past (which is, admittedly, depressingly little, but I quite like the Resource Agent API for one) et cetera, there may be a good chance.

Daniel, a Red Hat employee, admitted that Lars' reference to a "certain Linux vendor" referred to Red Hat. But he said that Red Hat was fully behind the idea now, and had a lot of recently acquired Sistina code to put into the pot. He invited everyone who had expressed interest in this in the past to join in the discussion and in the project. And he promised to look over all available relevant code. He and Lars, during this conversation, also went back and forth on various technical issues, which Daniel summarized at one point, saying:

OK, what I've learned from the discussion so far is, we need to avoid getting stuck too much on the HA aspects and focus more on the cluster/performance side for now. There are just too many entrenched positions on failover. Even though every component of the cluster is designed to fail over, that's just a small part of what we have to deal with:

Out of that, we need to pick the three or four items we're prepared to address immediately, that we can obviously share between at least two known cluster filesystems, and get them onto lkml for peer review. Trying to push the whole thing as one lump has never worked for anybody, and won't work in this case either. For example, the DLM is fairly non-controversial, and important in terms of performance and reliability. Let's start with that.

Furthermore, nobody seems interested in arguing about the cluster block devices either, so let's just discuss how they work and get them out of the way.

Then let's tackle the low-level infrastructure, such as CCS (Cluster Configuration System), which does a simple job: it distributes configuration files racelessly.

I heard plenty of fascinating discussion of quorum strategies last night, and have a number of papers to read as a result. But that's a diversion: it can and must be pluggable. We just need to agree on how the plugs work, a considerably less ambitious task.

In general, the principle is: the less important it is, the more argument there will be about it. Defer that, make it pluggable, call it policy, push it to user space, and move on. We need to agree on the basics so that we can manage network volumes with cluster filesystems on top of them.

The technical discussion continued, with contributions from Daniel, Lars, Arjan van de Ven, Nick Piggin, Steven Dake, and even a small contribution from Andrew Morton.

2. New 'Voluntary' Preemption Patch Avoids Existing Preemption Problems

9 Jul 2004 - 29 Jul 2004 (267 posts) Archive Link: "[announce] [patch] Voluntary Kernel Preemption Patch"

Topics: FS: ReiserFS, FS: ext3, FS: sysfs, Framebuffer, I2C, Real-Time, SMP, Sound: ALSA

People: Ingo Molnar, Andrew Morton, Robert Love, Arjan van de Ven, Mikulas Patocka

Ingo Molnar said:

as most of you are probably aware of it, there have been complaints on lkml that the 2.6 kernel is not suitable for serious audio work due to high scheduling latencies (e.g. the Jackit people complained). I took a look at latencies and indeed 2.6.7 is pretty bad - latencies up to 50 msec (!) can be easily triggered using common workloads, on fast 2GHz+ x86 system - even when using the fully preemptible kernel!

to solve this problem, Arjan van de Ven and I went over various kernel functions to determine their preemptability and we re-created from scratch a patch that is equivalent in performance to the 2.4 lowlatency patches but is different in design, impact and approach: mpt-2.6.7-bk20-H2

(Note to kernel patch reviewers: the split voluntary_resched type of APIs, the feature #ifdefs and runtime flags are temporary and were only introduced to enable easy benchmarking/comparisons. I'll split this up into small pieces once there's testing feedback and actual audio users have had their say!)

unlike the lowlatency patches, this patch doesn't add a lot of new scheduling points to the source code, it rather reuses a rich but currently inactive set of scheduling points that already exist in the 2.6 tree: the might_sleep() debugging checks. Any code point that does might_sleep() is in fact ready to sleep at that point. So the patch activates these debugging checks to be scheduling points. This reduces complexity and impact quite significantly.

but even using these (over one hundred) might_sleep() points there were still a number of latency sources in the kernel - we identified and fixed them by hand, either via additional might_sleep() checks, or via explicit rescheduling points. Sometimes lock-break was necessary as well.

as a practical goal, this patch aims to fix all latency sources that generate higher than ~1 msec latencies. We'd love to learn about workloads that still cause audio skipping even with this patch applied, but i've been unable to generate any load that creates higher than 1msec latencies. (not counting driver initialization routines.)

this patch is also more configurable than the 2.4 lowlatency patches were: there's a .config option to enable voluntary preemption, and there are runtime /proc/sys knobs and boot-time flags to turn voluntary preemption (CONFIG_VOLUNTARY_PREEMPT) and kernel preemption (CONFIG_PREEMPT) on/off:

        # turn on/off voluntary preemption (if CONFIG_VOLUNTARY_PREEMPT)
        echo 1 > /proc/sys/kernel/voluntary_preemption
        echo 0 > /proc/sys/kernel/voluntary_preemption

        # turn on/off the preemptible kernel feature (if CONFIG_PREEMPT)
        echo 1 > /proc/sys/kernel/kernel_preemption
        echo 0 > /proc/sys/kernel/kernel_preemption

the 'voluntary-preemption=0/1' and 'kernel-preemption=0/1' boot options can be used to control these flags at boot-time.

all 4 combinations make sense if both CONFIG_PREEMPT and CONFIG_VOLUNTARY_PREEMPT are enabled - great for performance/latency testing and comparisons.

The stock 2.6 kernel is equivalent to:

voluntary_preemption:0 kernel_preemption:0

the 2.6 kernel with voluntary kernel preemption is equivalent to:

voluntary_preemption:1 kernel_preemption:0

the 2.6 kernel with preemptible kernel enabled is:

voluntary_preemption:0 kernel_preemption:1

and the preemptible kernel enhanced with additional lock-breaks is enabled via:

voluntary_preemption:1 kernel_preemption:1

it is safe to change these flags anytime.

The patch is against 2.6.7-bk20, and it also includes fixes for kernel bugs that were uncovered while developing this patch. While it works for me, be careful when using this patch!

A huge discussion ensued. There was a good bit of support initially among some developers, and a lot of folks piled on with comments and criticisms. But Andrew Morton did not feel that the situation was at all clear. He thought it would be useful to examine the particular workloads that folks had experienced when the problems occurred; and he said:

Certainly 2.6+preempt is not as good as 2.4+LL at this time, but 2.6 isn't too bad either. Even under heavy filesystem load it's hard to exceed a 0.5 millisecond holdoff. There are still a few problems in the ext3 checkpoint buffer handling, but those seem pretty hard to hit. I doubt if the `Jack' testers were running `dbench 1000' during their testing.

All of which makes me suspect that the problems which the `Jack' testers saw were not directly related to long periods of non-preemption in-kernel. At least, not in core kernel/fs/mm code. There have been problems in the past in places like i2c drivers, fbdev scrolling, etc.

What we need to do is to encourage audio testers to use ALSA drivers, to enable CONFIG_SND_DEBUG in the kernel build and to set /proc/asound/*/*/xrun_debug and to send us the traces which result from underruns.

As for the patch, well, sprinkling rescheduling points everywhere is still not the preferred approach. But adding more might_sleep() checks is a sneaky way of making it more attractive ;)

In a nearby post he also added:

Let me repeat that I am unconvinced as to the diagnosis of the current audio problems - more analysis might prove me wrong of course.

And I'm unconvinced that we need to do anything apart from identifying and fixing the remaining spinlocks which are holding off preemption for too long.

IOW, I am questioning the very need for a "voluntary preemption" feature at all when "involuntary preemption" works perfectly well.

A few posts down the road, Ingo replied:

the reason is difference in overhead (codesize, speed) and risks (driver robustness). We do not want to enable preempt for Fedora yet because it breaks just too much stuff and is too heavy. So we looked for a solution that might work for a generic distro.

here are the code size differences. With a typical .config (debugging options disabled), the 2.6.7-mm7(+voluntary-preempt) UP x86 kernel gets the following .text sizes:

orig:      1776911 bytes
preempt:   1855519 bytes  (+4.4%)
voluntary: 1783407 bytes  (+0.3%)

so if voluntary-preempt can get close to real preempt's numbers for practical stuff then we get most of the benefits while excluding some of the nastiest risks and disadvantages.

(Long-term i'd like to see preempt be used unconditionally - at which point the 10-line CONFIG_VOLUNTARY_PREEMPT Kconfig and kernel.h change could go away.)

Robert Love suggested that in this case, instead of coming up with complicated workarounds, the thing to do was to work on getting preemption working on Fedora. He also disagreed with the idea that preemption would have too much overhead associated with it. He said, "I have seen no specific arguments that show a significant overhead. Heck, when people tried to show that kernel preemption hurt throughput, we saw tests that showed improved throughput (probably due to better utilization of I/O)." Andrew also added, close by, "I don't recall any testing results which showed a significant performance difference from CONFIG_PREEMPT." He too felt the thing to do was to fix preemption on Fedora, or at least identify the bugs so they could be addressed.

Arjan van de Ven replied, "just look over all the "fix preempt" stuff that got added to the kernel in the last 6 months. Sometimes subtle, sometimes less so. From a distribution POV I don't want a potential slew of basically impossible to reproduce problems, especially this young in 2.6; there are plenty of other problems already (and before you ask "which", just look at how many bugs got fixed in the last X weeks for any value of X, and look at say acpi issues). Yes, I understand this puts you into a bit of a bad position; several distros not enabling preempt means that it gets less testing than it should. However... there are only so many issues distros can take, and with 2.6 still quite fresh..."

Andrew felt that Arjan was equivocating here, and suggested that Arjan really couldn't identify any specific bugs and was just spreading FUD. Mikulas Patocka replied with a concrete case. He said, "For example the recent race that corrupted file content on ext3 and reiserfs when fsync and write were called simultaneously ... it was possible on SMP too, but with tiny probability --- CONFIG_PREEMPT triggered it wide open." But there was no reply to this.

Elsewhere, some audio folks actually followed Andrew's initial suggestion of performing more tests, and so a large subthread was taken up with a preemption debugging session. Throughout the discussion, Andrew reiterated that "preemption is the way in which we wish to provide low-latency. At this time, patches which sprinkle cond_resched() all over the place are unwelcome. After 2.7 forks we can look at it again." Though even then he acknowledged that he had not yet gone over Ingo's and Arjan's patch.

Elsewhere, and sometimes overlapping with the preempt discussion, Ingo and Arjan continued to develop and release new versions of their patch.

3. Linux 2.6.8-rc1-mm1 Released; Quick Fix For Big Slowdown Followed

13 Jul 2004 - 22 Jul 2004 (19 posts) Archive Link: "2.6.8-rc1-mm1"

Topics: Big Memory Support, FS: ext3, Kernel Release Announcement, Ottawa Linux Symposium

People: Andrew Morton, J. A. Magallon, Jens Axboe

Andrew Morton announced Linux 2.6.8-rc1-mm1, posting it at akpm/patches/2.6/2.6.8-rc1/2.6.8-rc1-mm1/.

He posted a quick patch six hours later, at around midnight, saying, "This kernel runs like a desiccated slug if you have more than 2G of memory due to a 32-bit overflow." He acknowledged that it wasn't a great fix, but it would do for the moment. Elsewhere, J. A. Magallon reported that 2.6.8-rc1-mm1 would oops if the user tried to write to a CD without a disc in the drive. He said:

Who would do such a stupid thing? Me, the impatient, trying to write before the drive finishes loading the disc...

The bad thing is that it leaves the drive in an unusable state:

Error trying to open /dev/hdc exclusively (Device or resource busy)... retrying in 1 second.
Error trying to open /dev/hdc exclusively (Device or resource busy)... retrying in 1 second.

It does not happen with 2.6.8-rc2.

Jens Axboe posted a fix, adding that Andrew already had the fix in hand, though Jens wasn't sure whether it had actually been merged yet.

4. I2C Updates For 2.6

14 Jul 2004 - 29 Jul 2004 (23 posts) Archive Link: "[BK PATCH] I2C update for 2.6.8-rc1"

Topics: FS: sysfs, I2C

People: Greg KH, Jean Delvare, Mark M. Hoffman, Eugene Surovegin

Greg KH, working with Alexandre d'Alton, Eugene Surovegin, Luiz Capitulino, Andras Bali, Mark M. Hoffman, and primarily with Jean Delvare, posted a bunch of I2C patches, split up into bite-sized chunks for easy approval. He said, "Here are some i2c driver fixes and updates for 2.6.8-rc1. There are a few new i2c chip drivers, and the biggest chunk is the new w1 (1-wire) driver subsystem contributed by Evgeniy Polyakov. The sysfs interface for w1 isn't finished quite yet (just need to create some more sysfs files, nothing major), but the main functionality is there, and this allows more w1 drivers to be contributed more easily."

One patch aimed at providing support for the adm1030 and adm1031 sensor chips. Another, as Greg quoted an email from Jean, "adds support for the LM86, MAX6657 and MAX6658 sensor chips to the lm90 driver. These are less popular than the LM90 and ADM1032, but several users have reported using these, so I added support to the lm90 driver. All these chips are fully compatible so that's just a matter of accepting the new chip ids. I also slightly simplified the detection code." Another, as Greg quoted an email from Andras, "adds support for the LM77 sensor chips made by National Semiconductor. Formerly this was claimed by the LM75 driver, but when I got hold of an embedded board (built around the National Geode SC1100 CPU), which was equipped with an LM77, it turned out that the two chips are not compatible." Another, as Greg quoted another email from Jean, "is my port of the adm1025 driver to 2.6. It has been tested by a few users and reported to work OK."

Another patch added a Dallas 1-wire protocol driver subsystem. As stated in the configuration help text included in the patch, "Dallas's 1-wire bus is useful to connect slow 1-pin devices such as iButtons and thermal sensors."

The remaining patches were bug fixes and documentation updates (and the removal of the i2c/i2c-pport documentation, because the relevant driver was not in the official sources and had not yet even been ported to 2.6).

5. Limiting the Number Of Concurrent Hotplug Processes

20 Jul 2004 - 27 Jul 2004 (13 posts) Archive Link: "[PATCH] Limit number of concurrent hotplug processes"

Topics: Hot-Plugging, Virtual Memory

People: Hannes Reinecke, Christian Borntraeger, Andrew Morton

Hannes Reinecke said:

the attached patch limits the number of concurrent hotplug processes. The main idea behind it is that currently each call to call_usermodehelper() will result in an execve() of "/sbin/hotplug", without any check whether enough resources are available for successful execution. This leads to hotplug getting stuck, and in the worst cases to machines being unable to boot.

This check cannot be implemented in userspace, as the resources are already taken by the time any resource check can be done; for the same reason, any 'slim' program such as /sbin/hotplug will only delay the problem.

Andrew Morton was a little surprised to hear about Hannes' symptoms. He asked what was causing enough module probes to trigger the lockup. Christian Borntraeger replied:

I don't know exactly which scenario Hannes has seen, but it's quite simple to trigger this scenario with almost any s390/zSeries.

Using the Hardware Management Console or z/VM, you are able to hotplug (deactivate/activate/attach/detach) almost every channel path. As a channel path can connect a big bunch of devices, lots of machine checks are triggered, which causes lots of hotplugs. The same number of machine checks could happen if a hardware failure occurs.

Some months ago I played around with a diet version of hotplug. This program was fast and small enough to make my scenario work properly. Nevertheless, as Hannes already stated, this will only delay the problem.

And Hannes confirmed, "As Christian Borntraeger already said, it's not so much an explosion of module probes but rather the triggering of quite a lot of events. Imagine loading scsi_debug with 512 devices or more ..."

Andrew and Hannes went back and forth on it for a bit, but had a strange sort of disconnect, in which Andrew wasn't able to understand what Hannes was attempting, and Hannes wasn't able to explain it clearly. Patches came in from both sides throughout, and the thread ended abruptly, with no conclusion.

6. Possible Scheduler Improvements

22 Jul 2004 - 27 Jul 2004 (21 posts) Archive Link: "[RFC] Patch for isolated scheduler domains"

Topics: Hyperthreading, SMP

People: Dimitri Sivanich, Ingo Molnar, Nick Piggin

Dimitri Sivanich said:

I'm interested in implementing something I'll call isolated sched domains for single cpus (to minimize the latencies involved when doing things like load balancing on certain select cpus) on IA64.

Below I've included an initial patch to illustrate what I'd like to do. I know there's been mention of 'platform specific work' in the area of sched domains. This patch only addresses IA64, but could be made generic as well. The code is derived directly from the current default arch_init_sched_domains code.

The patch isolates cpus that have been specified via a boot time parameter 'isolcpus=<cpu #>,<cpu #>'.

Ingo Molnar liked the idea, and suggested, "put it into sched.c. Every architecture benefits from the ability to define isolated CPUs." Nick Piggin asked if Dimitri or anyone else had tried actually running the code, and Dimitri said he'd been running it on an Altix. Nick also pointed him to a thread Nick had started, with Subject: "[PATCH] consolidate sched domains", and suggested that Dimitri consider working on top of the patch Nick described there.

In that thread, Nick said:

The attached patch is against 2.6.8-rc1-mm1. Tested on SMP, UP and SMP+HT here and it seems to be OK.

I have included the cpu_sibling_map for ppc64, although Anton said he did have an implementation floating around which he would probably prefer, but I'll let him deal with that.

Anyway, x86-64 is not equivalent before and after this patch. The main thing is that they've been using SD_CPU_INIT for NUMA nodes, but will now use SD_NODE_INIT. Probably neither is optimal, but I don't think Andi has had much time to look at it. I should be able to take a look at it soon.

Nick's changelog entry for the patch read:

Teach the generic domains builder about SMT, and consolidate all architecture specific domain code into that. Also, the SD_*_INIT macros can now be redefined by arch code without duplicating the entire setup code. This can be done by defining ARCH_HASH_SCHED_TUNE.

The generic builder has been simplified with the addition of a helper macro which will probably prove to be useful to arch specific code as well and should be exported if that is the case.

Dimitri waded into the code, asking questions and offering bug reports, along with several other developers.

7. Process Aggregates (PAGG) For Grouping Processes

22 Jul 2004 - 26 Jul 2004 (2 posts) Archive Link: "[PATCH] (update) Process Aggregates (PAGG) for 2.6.7"

Topics: SMP

People: Erik Jacobson, Andrew Morton

For an earlier PAGG patch, see Issue #190, Section #10  (17 Oct 2002: CSA System Resource Accounting Tool)

This time, Erik Jacobson said:

Attached is the PAGG patch for kernel 2.6.7. Some may recall I posted an earlier PAGG patch for 2.6.7. This version has improved handling for the init function pointer functionality introduced earlier.

We'd be very interested in seeing this be accepted into the kernel. If there is anything we should adjust to make it more likely to be accepted, please reply.

The patch included documentation as well, in the form of a Documentation/pagg.txt file:

The process aggregates infrastructure, or PAGG, provides a generalized mechanism for providing arbitrary process groups in Linux. PAGG consists of a series of functions for registering and unregistering support for new types of process aggregation containers with the kernel. This is similar to the support currently provided within Linux that allows for dynamic support of filesystems, block and character devices, symbol tables, network devices, serial devices, and execution domains. This implementation of PAGG provides developers the basic hooks necessary to implement kernel modules for specific process containers, such as the job container.

The do_fork function in the kernel was altered to support PAGG. If a process is attached to any PAGG containers and subsequently forks a child process, the child process will also be attached to the same PAGG containers. The PAGG containers involved during the fork are notified that a new process has been attached. The notification is accomplished via a callback function provided by the PAGG module.

The do_exit function in the kernel has also been altered. If a process is attached to any PAGG containers and that process is exiting, the PAGG containers are notified that a process has detached from the container. The notification is accomplished via a callback function provided by the PAGG module.

The sys_execve function has been modified to support an optional callout that can be run when a process in a pagg list does an exec. It can be used, for example, by other kernel modules that wish to do advanced CPU placement on multi-processor systems (just one example).

Andrew Morton replied, "Seems straightforward enough, but until we've decided to merge features which actually _use_ the mechanism, I'd be reluctant to send any of this Linuswards."

8. Joining Keyboards To Braille Displays

23 Jul 2004 (5 posts) Archive Link: "User-space Keyboard input?"

Topics: Braille

People: Mario Lang, Marcel Holtmann, Samuel Thibault

Mario Lang said:

I'm working on BRLTTY[1], a user-space daemon which handles braille displays on UNIX platforms. One of our display drivers recently gained the ability to receive (set 2) scancodes from a keyboard connected directly to the display. This is a very cool feature, since the display in question has a bluetooth interface, making it effectively into a complete wireless terminal (input and output through the same connection).

However, this creates some problems. First of all, we now have to deal with keyboard layouts. Additionally, since we currently insert via TIOCSTI I think this might get problematic as soon as one switches to an X Windows console and modifiers come into play.

Does anyone know (and can point me in the right direction) if Linux has some mechanism to allow user-space keyboard data to be processed by the kernel as if it were received from the system keyboard? I.e., keyboard layout would be handled by the same mapping which is configured for the system.

Marcel Holtmann suggested, "Take a look at the user level driver support (uinput)." And Samuel Thibault also said to Mario:

About modifiers, I submitted a patch to Dave to handle them properly.

But ascii to scancode translation still depends on scancode to ascii translation performed by the kernel indeed and the question still applies. I'll have a look at uinput.

Several days later, Mario posted an update, "uinput support is now committed to scr_linux.c. I am using the external keyboard of my bluetooth capable braille display to type this email already via uinput :-). The same layout is used as is configured on the box. Our generic AT2 support maps to VAL_PASSKEY commands and the AT2 support for Linux maps the AT2 scancode set to what Linux internally uses for scancodes (a sort of XT scancode set, but not really)."

9. Kernel Events Layer For Asynchronous Communication

23 Jul 2004 - 27 Jul 2004 (54 posts) Archive Link: "[patch] kernel events layer"

Topics: FS: sysfs

People: Robert Love, Chris Wedgwood, Andrew Morton, Greg KH, Inaky Perez-Gonzalez

Robert Love said:

Following patch implements the kernel events layer, which is a simple wrapper around netlink to allow asynchronous communication from the kernel to user-space of events, errors, logging, and so on.

Current intention is to hook the kernel via this interface into D-BUS, although the patch is intended to be agnostic to any of that and policy free.

D-BUS can be found here:

Other user-space utilities (including code to utilize this) can be found here:

This patch only implements a single event, processor temperature detection. Other useful events include md sync, filesystem mount, driver errors, etc. We can add those later, on a case-by-case basis. I would like to be more careful with adding events than we are with adding printk's, with some aim toward a stable interface.

Usage of the new interface is simple:

send_kmessage(group, interface, message, ...)

Credit to Arjan for the initial implementation, Kay Sievers for some updates, and the netlink code.

A lot of folks went over the patch and offered comments and criticism; Andrew Morton also asked approximately how many send_kmessage() calls there were likely to be in total, in a couple of years. Robert replied:

Predicting the future is hard, but I suspect this number to be small. Let's say 10 in core kernel code?

If this takes off as a solution for error reporting, that number will be much larger in drivers.

Chris Wedgwood was OK with the 10 calls in the core kernel, but a large number in driver code would worry him. He said:

I would almost rather all possible messages get stuck somewhere common so driver writers can't add these ad-hoc and we can avoid a proliferation of either similar or pointless messages.

Forcing these into a common place lets people eyeball whether a new message really is necessary --- and it makes writing applications to deal with these things easier (since you don't have to scan the entire kernel tree).

Robert compared the send_kmessage() situation with that of printk()s, in that printk() messages were intended simply to be messages the developer thought worthwhile to post. The send_kmessage() call would presumably be similar. But he still agreed that refining and collecting them into a single group might make sense as well. He said, "the common base of errors could be certified as supported by the error daemon, translated, etc. etc. I am not sure how realistic this goal is, but I do like it, at least for the general case of the usual errors in drivers." Chris guessed that most messages would be selected from a small pool of common errors. The discussion continued, but at one point Andrew Morton said:

I must say that my gut feeling here is that bolting an arbitrary new namespace into the kernel in this manner is not the way to proceed.

I hope we'll hear more from Greg on this next week - see if we can come up with some way to use the kobject/sysfs namespace for this.

Although heaven knows how "tmpfs just ran out of space" would map onto kobject/sysfs.

Inaky Perez-Gonzalez suggested using kobject representations if any existed for a given circumstance, and dealing with other situations on a case-by-case basis. But Robert said, "That introduces two orthogonal name spaces, and that really doesn't cut it. If Greg can come up with a solution for using kobjects, I am all for that - that would be great - but I really do not see kobject paths working out. I think the best we have is the file path in the tree." Greg KH replied, "Give me a few days, I'm working on it, but have been traveling too much. Robert and I will sit down during OSCON this week and try to work out something along these lines, and then post it again here."

10. ketchup Version 0.8 Released

23 Jul 2004 - 24 Jul 2004 (6 posts) Archive Link: "[ANNOUNCE] ketchup 0.8"

Topics: Version Control

People: Matt MackallLee Revell

Matt Mackall said:

ketchup is a script that automatically patches between kernel versions, downloading and caching patches as needed, and automatically determining the latest versions of several trees. Available at:

New in this version by popular demand:

Example usage:

$ ketchup 2.6-mm
2.6.3-rc1-mm1 -> 2.6.5-mm4
Applying 2.6.3-rc1-mm1.bz2 -R
Applying patch-2.6.3-rc1.bz2 -R
Applying patch-2.6.3.bz2
Applying patch-2.6.4.bz2
Applying patch-2.6.5.bz2
Downloading 2.6.5-mm4.bz2
Downloading 2.6.5-mm4.bz2.sign
Verifying signature...
gpg: Signature made Sat Apr 10 21:55:36 2004 CDT using DSA key ID 517D0F0E
gpg: Good signature from "Linux Kernel Archives Verification Key <...>"
gpg:                 aka "Linux Kernel Archives Verification Key <...>"
gpg: WARNING: This key is not certified with a trusted signature!
gpg: There is no indication that the signature belongs to the owner.
Primary key fingerprint: C75D C40A 11D7 AF88 9981 ED5B C86B A06A 517D 0F0E
Applying 2.6.5-mm4.bz2

Lee Revell gave ketchup a try, but had trouble getting it to work under Debian Unstable; and the two of them went back and forth on it for a bit.

11. GPL Being Tested In German Courts

23 Jul 2004 - 29 Jul 2004 (3 posts) Archive Link: "[OT] German court says the GPL is effective"

People: Adrian BunkPrakash K. CheemplavamMatthias Andree

Adrian Bunk said:

I know this is off-topic, but a court in my home town of Munich has decided that a cease and desist letter Harald Welte sent to a router producer (Sitecom), which used netfilter/iptables in its router but didn't publish the sources of the firmware, is valid, with a penalty of up to 100 000 Euro (a German version of the decision of the three judges is at

It's quite nice to hear that a court has decided that the GPL is enforceable under German law.

Prakash K. Cheemplavam remarked, "As far as I have understood it: This was just a "quick" decision; a "real" court case comes later. So far the court finds it reasonable to enforce the cease and desist letter until the final decision comes, as the "probability" is high that the GPL is compatible with German laws, and even if not, the company wouldn't be allowed to use the GPLed software."

Matthias Andree asked if the verdict had become final yet, and Adrian replied that no, it had not.

12. Driver For IBM Multiport Serial Adapters

26 Jul 2004 - 27 Jul 2004 (3 posts) Archive Link: "new device driver to enable the IBM Multiport Serial Adapter in Linux"

Topics: Modems

People: Janice M. GirouardAndrew MortonJeff Garzik

Janice M. Girouard said:

The patch below enables the IBM Multiport Serial Adapter for the Linux OS. This driver is for a family of multiport serial adapters including 2 port RVX, 2 port internal modem, 4 port internal modem and a split 1 port RVX and 1 port internal modem. We have applied & tested this patch against our iSeries and pSeries machines (there are no known existing defects following this testing).

We would like this code accepted into the kernel. It was previously submitted to the linux-serial mailing list, and we believe we have addressed any coding issues raised. Are there any additional concerns? How do you suggest we proceed to have this code accepted?

Andrew Morton replied:

Looks sane. Did you really intend that it be buildable for all architectures? (There's nothing wrong with this - it's good. But it's a bit unusual for this sort of driver).

"IBM multiport serial adapter" seems to be a rather generic description of this device. Is this the only multiport serial adapter which IBM has ever, or will ever make? If not, then perhaps we could make the description a little more specific?

There was no reply to this, but elsewhere, Jeff Garzik gave a line-by-line code commentary, with many suggestions and bug reports.

13. IRQ Threads; Real-Time Issues

27 Jul 2004 - 29 Jul 2004 (30 posts) Archive Link: "[patch] IRQ threads"

Topics: BSD, Disks: IDE, Microkernels: Adeos, Patents, Real-Time: RTAI, SMP

People: Scott WoodIngo MolnarBill HueyKarim YaghmourPhilippe GerumAlbert D. CahalanLee Revell

Scott Wood said:

I have attached a patch for implementing IRQ handlers in threads, for latency-reduction purposes. It requires that softirqs be run in threads (or else they could end up running inside the IRQ threads, which will at the very least trigger bugs due to in_irq() being set). I've tested it with Ingo's voluntary-preempt J7 patch, and it should work with the TimeSys softirq thread patch as well (though you might get a conflict with the PF_IRQHANDLER definition; just merge them into one).

Some notes:

  1. This may not work properly with some interrupt controller code, which doesn't do the obvious thing with mask_and_ack() and end(). This includes the IO-APIC code, which has an empty end() for edge triggered interrupts and an empty mask_and_ack() for level-triggered interrupts. The mask_and_ack() needs to really mask the interrupt, as otherwise the hardware will not deliver lower-priority (to it) interrupts, which may have a higher-priority thread.
  2. This patch does not disable local interrupts when running a threaded handler. SMP-safe drivers shouldn't be directly bothered by this (as the interrupt could as easily have happened on another CPU), but there may be some interactions with softirqs and per-cpu data, if a softirq thread preempts an IRQ thread, or an IRQ thread gets migrated to a different CPU. I'm particularly worried about the network code. If possible, I'd like to find and fix such breakages rather than use local_irq_disable(), as that would prevent IRQ prioritization from working, and prevent IRQ threads from being used to isolate the rest of the system from long-running IRQs (such as non-DMA IDE).
  3. The i8042 driver had to be marked SA_NOTHREAD, as there are non-preemptible regions where it spins, waiting for an interrupt. Ideally, this driver (and others like it) should be fixed to either do a cond_resched() or use a wait queue.
  4. This might be a good time to get around to moving the bulk of arch/whatever/kernel/irq.c into generic code, as was supposed to happen in 2.5. This patch is currently only for x86 (though we've run IRQ threads on many different platforms in the past).
  5. Is there any reason why an IRQ controller might want to have its end() called even if IRQ_DISABLED or IRQ_INPROGRESS is set? It'd be nice to merge those checks in with the IRQ_THREADPENDING/IRQ_THREADRUNNING checks.
  6. This patch causes in_irq() to return true if an IRQ thread is running, as some drivers use it in common code to determine how to act. in_interrupt(), however, will return false in such a case. The exact meaning of these macros in the presence of IRQ threads isn't very well defined, and I hope this results in sane behavior.

Ingo Molnar had some specific criticisms, but affirmed, "I agree with the concept of using multiple threads for interrupts - i'll add that to the voluntary-preempt patch too. This is an essential feature to prioritize interrupts." Bill Huey remarked, "The way I picture the problem permits those threads to migrate across CPUs and therefore kill interrupt performance from cache thrashing. Do you have a solution for that? I like the way you're doing it now with irqd() in that it's CPU-local, but as you know it's not priority sensitive." Scott replied:

Wouldn't the IRQ threads be subject to the same heuristics that the scheduler uses with ordinary threads, in order to avoid unnecessary CPU migration? Plus, IRQs ordinarily get distributed across CPUs, and in most cases shouldn't have a very large cache footprint (especially data; the code can be in multiple CPU caches at once), so I don't think this is a substantial degradation from the way things already are.

If desired by the user, an IRQ thread could be bound to a specific CPU to avoid such problems (in which case, they'd probably want to set the smp_affinity of the hard IRQ stub to the same CPU).

Bill replied, "I get a number of gripes from SMP aware folks that the context switching overhead is significant as well as cache issues. That's what the concern is about." He agreed with Scott's suggestion to bind IRQ threads to specific CPUs, saying, "this is an obvious next step in order to get better performance." Ingo, on the other hand, said of Scott's suggestion, "i fixed this problem in -M5 the other way around: the IRQ threads follow the affinity settings. They will bind themselves to the first CPU in the affinity mask and they migrate only at 'safe' points (between hardirqs). This way e.g. user-space irqbalance will automatically move the IRQ threads around too."

Elsewhere, in response to Scott's original post, Karim Yaghmour leveled some criticism, saying:

My experience with clients who have been using TimeSys' stuff has been abysmal. The fact of the matter is that most people who used this were practically locked-in to TimeSys' services, unable to download anything "standard" off the net and use it with their kernel. In one example, we had to ditch the kernel the client got from TimeSys because we had spent 10+ hours trying to get LTT to work on it without any success whatsoever.

As I had said on other lists before, I don't see the point of creating that much complexity in the kernel in order to try to shave a little more off of the kernel's interrupt response time. The fact of the matter is that neither this patch nor most of the other patches suggested makes the kernel truly hard-rt. These patches only make the kernel respond "faster". If you really need hard-rt, then you should be using the Adeos nanokernel. With Adeos, you can even get a hard-rt driver without using RTAI or any of the other rt derivatives.

Lee Revell objected to Karim's vague characterizations of TimeSys, saying it was difficult to defend against such assertions; Karim agreed, but said it was the only way he could make the point he wanted to make, without violating his client's confidentiality. Lee cautioned Karim to just try to be more careful in future; and that particular disagreement ended.

Bringing the discussion back on track, Bill said:

there are really two camps emerging in the real time Linux field, dual and single kernel. The single kernel work currently being done could very well get Linux to being hard RT, assuming that you solve all of the technical problems with things like RCU, etc... in 2.6.

The dual kernel folks would be in less of a position to VAR their own stuff and sell proprietary products if Linux were to get native hard RT performance, if you accept that economic argument. Who knows what the actual results will be.

It could be that all of this work with Linux could bury proprietary OS products (such as LynxOS here), or it could open doors to other, unknown things that were never possible before Linux got some kind of hard RT capability. It's certainly a scary notion to think about, with many variables to consider. Linux getting hard RT is inevitable. It's just a question of how it'll be handled by proprietary OS vendors; witness IBM for a positive example. A negative one would be Sun.

Now that Windriver Systems (the idiot folks that never understood Linux before laying off tons of folks and disbanding the rather famous BSD/OS group, which I was a part of, etc...) and Red Hat are in the picture, it's all starting to cook up.

Karim pointed out that the dual-kernel concept was patented; and reiterated that he advocated a third approach: the Adeos nanokernel. Philippe Gerum also said:

The hard RT people I know of and work with want to be able 1) to get microsecond level bounded interrupt latency with no exception to this rule, and 2) to be able to choose the right level of dispatch latency on a thread-by-thread basis, from a few microseconds to a few hundred, but in any case _bounded_ and predictable in the worst case. For this to happen, they are willing to accept stringent limitations functionality-wise if need be to obtain the first, but still get access to the regular Linux programming model and APIs if the second fits their apps. They already know how they could mix both properly in what would look like a single system from the application's point of view.

For these people, the ongoing work aimed at improving the determinism of the vanilla kernel is everything but a danger: it's fundamental and very good news, because it could make 2) a reality sooner or later. However, point 1) remains an issue, and unless you find a solution for mixing fire and water, i.e. determinism, which requires unfairness by design, and throughput-seeking fairness on the average case, you would likely end up considering that the Linux RT people's radical approach of using a dual-kernel does not make them uneducated bozos (Ok, except me perhaps, but this is not my point). To get microsecond level guaranteed interrupt latencies, the problem is far beyond solving random latency spots here and there: it's an architectural issue.

To achieve this, we (i.e. the educated ones like Karim helping the uneducated bozo like myself; yep, this is teamwork) have come to the conclusion that we needed a portable infrastructure that allows a complete prioritization of interrupts, and hw events of interest in general (e.g. traps/exceptions). Some infrastructure that exposes the same interface regardless of the platform it runs on. It's called Adeos; the source code indentation is terrible (after 20 years practicing it, I still find means to worsen my coding style, funky eh?!) but it's a working example of this kind of infrastructure. The advantage of such a thin layer is that you can plug any hard real-time core over it. This layer can remain silent when unused, it can be configured out, it is just an enabler. You don't have to wreak havoc in a stable GPOS core, modifying key architectural characteristics of the Linux kernel, in order to buy hard RT capabilities for everyone, which could be construed as smashing a squadron of flies with nukes.

Elsewhere but close by, Albert D. Cahalan said:

There is no such thing as hard-RT in the real world.

In reality, there's no point in making the software far more reliable than the hardware, power supply, and so on. Somebody may pour a can of Mountain Dew into the vent holes.

Your software is OK as long as other causes of failure are much more likely. One might even say you spent too much of your budget perfecting the software! In the end it all comes down to $$$ (or Euros, or Yen...), doesn't it?

People don't mathematically demonstrate anything about modern systems, at least not while being honest.

The thread petered out around here.

14. loop-AES Update

28 Jul 2004 (1 post) Archive Link: "Announce loop-AES-v2.1c file/swap crypto package"

Topics: Advanced Encryption Standard

People: Jari RuusuRussell King

Jari Ruusu announced a new version of loop-AES, saying:

loop-AES changes since previous release:

bzip2 compressed tarball is here:
md5sum b404d9d679b7096dd3fb089345c52320

Additional ciphers package changes since previous release:

bzip2 compressed tarball is here:
md5sum 1a5e1d967bca0cde71a32e533ef26ce9

15. PPC8xx Maintainership

29 Jul 2004 (1 post) Archive Link: "PPC8xx Maintainer patch"


People: Paul MackerrasTom Rini

Paul Mackerras posted a patch to the MAINTAINERS file, listing Tom Rini as the official maintainer of Linux for PowerPC embedded PPC8xx and the PowerPC boot code.







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.