Kernel Traffic

Kernel Traffic #171 For 16 Jun 2002

By Zack Brown

Table Of Contents

Mailing List Stats For This Week

We looked at 1177 posts in 5497K.

There were 376 different contributors. 179 posted more than once. 163 posted last week too.

The top posters of the week were:

1. Status Of kbuild And CML2

2 Jun 2002 - 12 Jun 2002 (25 posts) Archive Link: "Announce: Kernel Build for 2.5, release 3.0 is available"

Topics: Kernel Build System

People: Keith Owens, Eric S. Raymond

Keith Owens announced kbuild 3.0 for the unstable kernel series. One of the changes in that version was the removal of CML2 support, and Hayden A. James asked why this had been done. Someone gave a link into the email archives, and Keith explained that Eric S. Raymond "has dropped off the list. CML2 and kbuild 2.5 are completely independent and having the two in the same patch was getting messy. The config rules in kbuild 2.5 are clean, support for other variants of CML can be added at any time."

2. Adeos, A New Nanokernel Under Linux

3 Jun 2002 - 8 Jun 2002 (59 posts) Archive Link: "[ANNOUNCE] Adeos nanokernel for Linux kernel"

Topics: BSD, Microkernels: Adeos, Patents, Real-Time: RTAI, Real-Time: RTLinux, SMP, User-Mode Linux

People: Karim Yaghmour, Alessandro Rubini, Erik Andersen, Daniel Phillips

Karim Yaghmour announced:

We have released the initial implementation of the Adeos nanokernel. The following is a complete description of its background, its implementation, its API, and its potential uses. Please also see the press release and the project's workspace. The Adeos code is distributed under the GNU GPL.

The Adeos nanokernel is based on research and publications made in the early '90s on the subject of nanokernels. Our basic method was to reverse the approach described in most of the papers on the subject. Instead of first building the nanokernel and then building the client OSes, we started from a live and known-to-be-functional OS, Linux, and inserted a nanokernel beneath it. Starting from Adeos, other client OSes can now be put side-by-side with the Linux kernel.

To this end, Adeos enables multiple domains to exist simultaneously on the same hardware. None of these domains see each other, but all of them see Adeos. A domain is most probably a complete OS, but there is no assumption being made regarding the sophistication of what's in a domain.

To share the hardware among the different OSes, Adeos implements an interrupt pipeline (ipipe). Every OS domain has an entry in the ipipe. Each interrupt that comes in the ipipe is passed on to every domain in the ipipe. Instead of disabling/enabling interrupts, each domain in the pipeline only needs to stall/unstall his pipeline stage. If an ipipe stage is stalled, then the interrupts do not progress in the ipipe until that stage has been unstalled. Each stage of the ipipe can, of course, decide to do a number of things with an interrupt. Among other things, it can decide that it's the last recipient of the interrupt. In that case, the ipipe does not propagate the interrupt to the rest of the domains in the ipipe.

Regardless of the operations being done in the ipipe, the Adeos code does __not__ play with the interrupt masks. The only case where the hardware masks are altered is during the addition/removal of a domain from the ipipe. This also means that no OS is allowed to use the real hardware cli/sti. But this is OK, since the stall/unstall calls achieve the same functionality.
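As an illustration that is not part of the announcement, the pipeline behavior described above can be modelled in a few lines of user-space C. This is a minimal sketch under stated assumptions: the struct layouts, the ipipe_* function names, and the fixed array sizes are all invented here, and nothing in it touches real interrupt hardware or reflects how Adeos itself is coded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DOMAINS 4
#define MAX_IRQS    8

/* One pipeline stage per domain. All names in this file are illustrative. */
struct stage {
    bool stalled;                /* set by stall, cleared by unstall */
    bool pending[MAX_IRQS];      /* irqs held at this stage while stalled */
    bool terminates[MAX_IRQS];   /* true: this stage is the last recipient */
    unsigned handled[MAX_IRQS];  /* deliveries seen, for inspection */
};

struct ipipe {
    struct stage stages[MAX_DOMAINS];  /* index 0 is the head of the pipeline */
    size_t nstages;
};

/* Deliver irq to stage `from` and onward until a stage stalls or ends it. */
static void ipipe_walk(struct ipipe *p, size_t from, unsigned irq)
{
    for (size_t i = from; i < p->nstages; i++) {
        struct stage *s = &p->stages[i];
        if (s->stalled) {
            s->pending[irq] = true;  /* repeated irqs collapse into one */
            return;
        }
        s->handled[irq]++;
        if (s->terminates[irq])
            return;                  /* not propagated any further */
    }
}

/* A hardware interrupt enters at the head of the pipeline. */
void ipipe_raise(struct ipipe *p, unsigned irq)
{
    if (irq < MAX_IRQS)
        ipipe_walk(p, 0, irq);
}

/* Stand-in for cli: hold interrupts at this stage. */
void ipipe_stall(struct ipipe *p, size_t idx)
{
    assert(idx < p->nstages);
    p->stages[idx].stalled = true;
}

/* Stand-in for sti: release the stage and replay what was held there. */
void ipipe_unstall(struct ipipe *p, size_t idx)
{
    assert(idx < p->nstages);
    struct stage *s = &p->stages[idx];
    s->stalled = false;
    for (unsigned irq = 0; irq < MAX_IRQS; irq++) {
        if (s->pending[irq]) {
            s->pending[irq] = false;
            ipipe_walk(p, idx, irq);
        }
    }
}
```

In this model, an interrupt raised while the head stage is stalled reaches no domain at all until that stage is unstalled, and a stage that marks itself as the last recipient of an irq hides that irq from every stage behind it. The hardware interrupt mask never appears, which is the point of the design.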

Our approach is based on the following papers (links to these papers are provided at the bottom of this message):

  1. D. Probert, J. Bruno, and M. Karaorman. "Space: a new approach to operating system abstraction." In: International Workshop on Object Orientation in Operating Systems, pages 133-137, October 1991.
  2. D. Probert, J. Bruno. "Building fundamentally extensible application-specific operating systems in Space", March 1995.
  3. D. Cheriton, K. Duda. "A caching model of operating system kernel functionality". In: Proc. Symp. on Operating Systems Design and Implementation, pages 179-194, Monterey CA (USA), 1994.
  4. D. Engler, M. Kaashoek, and J. O'Toole Jr. "Exokernel: an operating system architecture for application-specific resource management", December 1995.

If you don't want to go fetch the complete papers, here's a summary. The first 2 discuss the Space nanokernel, the 3rd discusses the cache nanokernel, and the last discusses exokernel.

The complete Adeos approach has been thoroughly documented in a whitepaper published more than a year ago entitled "Adaptive Domain Environment for Operating Systems". The current implementation is slightly different. Mainly, we do not implement the functionality to move Linux out of ring 0. Although of interest, this approach is not very portable.

Instead, our patch taps right into Linux's main source of control over the hardware, the interrupt dispatching code, and inserts an interrupt pipeline which can then serve all the nanokernel's clients, including Linux.

This is not a novelty in itself. Other OSes have been modified in such a way for a wide range of purposes. One of the most interesting examples is described by Stodolsky, Chen, and Bershad in a paper entitled "Fast Interrupt Priority Management in Operating System Kernels" published in 1993 as part of the Usenix Microkernels and Other Kernel Architectures Symposium. In that case, cli/sti were replaced by virtual cli/sti which did not modify the real interrupt mask in any way. Instead, interrupts were deferred and delivered to the OS upon a call to the virtualized sti.

Mainly, this resulted in increased performance for the OS. Although we haven't done any measurements on Linux's interrupt handling performance with Adeos, our nanokernel includes by definition the code implementing the technique described in the abovementioned Stodolsky paper, which we use to redirect the hardware interrupt flow to the pipeline.

In terms of implementation, the Adeos' code is rather short since we focused on setting the foundations for sharing interrupts between domains. Here are the files we added:
kernel/adeos.c: Architecture-independent domain code.
arch/i386/kernel/adeos.c: Architecture-dependent domain code.
arch/i386/kernel/ipipe.c: Interrupt pipeline code.
include/asm-i386/adeos.h: Arch-dependent Adeos header.
include/linux/adeos.h: Main (arch-independent) Adeos header.

As you can see, only the i386 is currently supported. Nonetheless, most of the architecture-dependent code is easily portable to other architectures.

We also modified some files to tap into Linux interrupt dispatching (all the modifications are encapsulated in #ifdef CONFIG_ADEOS/#endif):


We modified the idle task so it gives control back to Adeos in order for the ipipe to continue propagation:

We modified init/main.c to initialize Adeos very early in the startup.

Of course, we also added the appropriate makefile modifications and config options so that you can choose to enable/disable Adeos as part of the kernel build configuration.

Here is Adeos' public API:

int adeos_register_domain(adomain_t *adp, adattr_t *attr);
Register a new domain using the properties defined in the attribute.

void adeos_unregister_domain(adomain_t *adp);
Remove "adp" domain.

void adeos_renice_domain(adomain_t *adp, int newpri);
Change "adp"'s priority in the ipipe.

void adeos_suspend_domain(void);
This domain is done dealing with the current interrupt. This signals the ipipe to provide the interrupt to the next ipipe stage.

void adeos_virtualize_irq(unsigned irq, void (*handler)(void), int (*acknowledge)(unsigned));
Provide a handler and an acknowledgment function for "irq".

void adeos_control_irq(unsigned irq, unsigned clrmask, unsigned setmask);
Change the current domain's handling mask for irq "irq". "clrmask" is applied first and then "setmask" is applied.

void adeos_stall_ipipe(void);
Stall the current domain's ipipe stage. Alternative to cli.

void adeos_unstall_ipipe(void);
Unstall the current domain's ipipe stage. Alternative to sti.

void adeos_restore_ipipe(unsigned x);
Restore the ipipe from its saved state. Alternative to __restore_flags() and local_irq_restore(). This is used with the following defines:
#define adeos_test_ipipe() test_bit(IPIPE_STALL_FLAG,&adp_current->status)
#define adeos_test_and_stall_ipipe() test_and_set_bit(IPIPE_STALL_FLAG,&adp_current->status)
Which replace __save_flags and local_irq_save(), respectively.
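As an illustration that is not part of the announcement, here is how the save/stall/restore pattern might look to a caller. This is a user-space mock: the two defines are the ones quoted above, but the bit helpers, the struct domain layout, the value of IPIPE_STALL_FLAG, and bump_counter() are assumptions made so the sketch compiles on its own.

```c
#include <assert.h>
#include <limits.h>

#define IPIPE_STALL_FLAG 0  /* bit number is an assumption for this sketch */

/* Minimal stand-in for a domain descriptor; only the status word is modelled. */
struct domain { unsigned long status; };

static struct domain linux_domain;
static struct domain *adp_current = &linux_domain;

/* Non-atomic user-space stand-ins for the kernel's test_bit() and
 * test_and_set_bit(). */
static int mock_test_bit(int nr, const unsigned long *addr)
{
    assert(nr >= 0 && nr < (int)(sizeof(*addr) * CHAR_BIT));
    return (int)((*addr >> nr) & 1UL);
}

static int mock_test_and_set_bit(int nr, unsigned long *addr)
{
    int old = mock_test_bit(nr, addr);
    *addr |= 1UL << nr;
    return old;
}

#define adeos_test_ipipe() \
    mock_test_bit(IPIPE_STALL_FLAG, &adp_current->status)
#define adeos_test_and_stall_ipipe() \
    mock_test_and_set_bit(IPIPE_STALL_FLAG, &adp_current->status)

void adeos_stall_ipipe(void)
{
    adp_current->status |= 1UL << IPIPE_STALL_FLAG;
}

void adeos_unstall_ipipe(void)
{
    adp_current->status &= ~(1UL << IPIPE_STALL_FLAG);
}

/* Put the stage back into the state captured earlier. A real implementation
 * would presumably also replay interrupts held while stalled; that part is
 * not modelled here. */
void adeos_restore_ipipe(unsigned x)
{
    if (x)
        adeos_stall_ipipe();
    else
        adeos_unstall_ipipe();
}

static int shared_counter;

/* A critical section written the way local_irq_save()/local_irq_restore()
 * would be used, but against the pipeline stage instead of the CPU flags. */
int bump_counter(void)
{
    unsigned flags = (unsigned)adeos_test_and_stall_ipipe();
    int value = ++shared_counter;
    adeos_restore_ipipe(flags);
    return value;
}
```

Because the previous stall state is captured and restored rather than unconditionally cleared, the pattern nests: a caller that was already stalled stays stalled after bump_counter() returns.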

In Linux's case, adeos_register_domain() is called very early during system startup. set_intr_gate() in arch/i386/kernel/traps.c is then modified to call on adeos_virtualize_irq() so that Linux would tell the ipipe that it needs the irq passed to set_intr_gate().

To add your domain to the ipipe, you need to:

  1. Register your domain with Adeos using adeos_register_domain()
  2. Call adeos_virtualize_irq() for all the IRQs you wish to be notified about in the ipipe.

That's it. Provided you gave Adeos appropriate handlers in step #2, your interrupts will be delivered via the ipipe.
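As an illustration that is not part of the announcement, the two steps can be mocked up in user space. The prototypes of adeos_register_domain() and adeos_virtualize_irq() are the ones listed above; everything else is an assumption, since the announcement does not give it: the fields of adomain_t and adattr_t, the "0 means success" return convention, NR_IRQS, the irq number used, and every my_* name.

```c
#include <assert.h>
#include <stddef.h>

#define NR_IRQS 16  /* arbitrary size for the mock */

/* Hypothetical layouts: the announcement names these types, not their fields. */
typedef struct adattr {
    const char *name;
    int priority;
} adattr_t;

typedef struct adomain {
    const char *name;
    int priority;
    int registered;
    void (*handler[NR_IRQS])(void);
    int (*acknowledge[NR_IRQS])(unsigned);
} adomain_t;

static adomain_t *adp_current;  /* the domain making the calls */

/* Mock with the announced prototype. Returns 0 on success (assumed). */
int adeos_register_domain(adomain_t *adp, adattr_t *attr)
{
    if (adp == NULL || attr == NULL)
        return -1;
    adp->name = attr->name;
    adp->priority = attr->priority;
    adp->registered = 1;
    adp_current = adp;
    return 0;
}

/* Mock with the announced prototype: record the handlers for this irq. */
void adeos_virtualize_irq(unsigned irq, void (*handler)(void),
                          int (*acknowledge)(unsigned))
{
    assert(handler != NULL);
    if (adp_current == NULL || irq >= NR_IRQS)
        return;
    adp_current->handler[irq] = handler;
    adp_current->acknowledge[irq] = acknowledge;
}

/* ---- client side: what a new domain would write ---- */

static adomain_t my_domain;
static unsigned my_ticks;

static void my_timer_handler(void) { my_ticks++; }
static int my_timer_ack(unsigned irq) { (void)irq; return 1; }

int my_domain_init(void)
{
    adattr_t attr = { .name = "mydomain", .priority = 10 };

    /* Step 1: register the domain. */
    if (adeos_register_domain(&my_domain, &attr) != 0)
        return -1;

    /* Step 2: ask for the irqs this domain cares about (irq 0 here). */
    adeos_virtualize_irq(0, my_timer_handler, my_timer_ack);
    return 0;
}
```

The client side is the part worth reading: one registration call, then one adeos_virtualize_irq() call per interrupt of interest.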

During runtime, you may change your position in the ipipe using adeos_renice_domain(). You may also stall/unstall the pipeline and change the ipipe's handling of the interrupts according to your needs.

We currently don't support SMP, but we do have APIC support on UP.

Here are some of the possible uses for Adeos (this list is far from complete):

  1. Much like User-Mode Linux, it should now be possible to have 2 Linux kernels living side-by-side on the same hardware. In contrast to UML, this would not be 2 kernels one on top of the other, but really side-by-side. Since Linux can be told at boot time to use only one portion of the available RAM, on a 128MB machine this would mean that the first could be made to use the 0-64MB space and the second would use the 64-128MB space. We realize that many modifications are required. Among other things, one of the 2 kernels will not need to conduct hardware initialization. Nevertheless, this possibility should be studied closer.
  2. It follows from #1 that adding other kernels beside Linux should be feasible. BSD is a prime candidate, but it would also be nice to see what virtualizers such as VMWare and Plex86 could do with Adeos. Proprietary operating systems could potentially also be accommodated.
  3. All the previous work that has been done on nanokernels should now be easily ported to Linux. Mainly, we would be very interested to hear about extensions to Adeos. Primarily, we have no mechanisms currently enabling multiple domains to share information. The papers mentioned earlier provide such mechanisms, but we'd like to see actual practical examples.
  4. By incorporating Adeos into the main kernel tree (I know my inbox is probably going to fill up because of this one), kernel debuggers' main problem (tapping into the kernel's interrupts) is solved and it should then be possible to provide patchless kernel debuggers. They would then become loadable kernel modules.
  5. Drivers who require absolute priority and dislike other kernel portions who use cli/sti can now create a domain of their own and place themselves before Linux in the ipipe. This provides a mechanism for the implementation of systems that can provide guaranteed realtime response.

Of course, we are interested in hearing about comments and suggestions you have about Adeos.


Links to papers:

  1. (not working)

Two main threads emerged from the ensuing discussion. Erik Andersen asked about performance. Traditionally, because micro-kernels employ a formal wall of message-passing between the upper and lower layers of the system, it has been thought that micro-kernels would incur significant performance overhead that monolithic kernels like Linux wouldn't suffer from. But Karim replied that according to some preliminary tests, their implementation was less than 1% slower than the standard kernel.

Erik had also asked in his post about the impact of existing software patents. Would Adeos violate anyone's intellectual property or not? There was some dispute about this. Alessandro Rubini felt that "To me this looks definitely clear of the FMSLabs patent, since RT and non-RT live side by side, not on a master-slave relationship." Karim agreed, and added, "grab the papers, the code and the patent and have a look for yourself, you will see that we're clear. Apart from having the kernels side-by-side, Adeos is based on classic early '90s nanokernel work. No secrets there." But Erik was unconvinced. He asked, "will we soon be seeing a port of RTAI to a linux kernel module which is implemented as a separate Adeos domain, allowing RTAI apps to bypass US patent 5995745? A quick glance over that patent leaves me uncertain whether this indeed bypasses the fundamental "invention" of a "process for running a general purpose computer operating system using a real time operating system". It still looks to me like a real time operating system (Adeos) running real time and non-real time tasks with a general purpose operating system as one of the non-real time tasks... Could you summarize (for non-lawyers such as myself) how this bypasses the claims in the patent?" Alessandro replied:

I'll quote the patent for you:

A process for [...] providing a general purpose operating system as one of the non-real time tasks; preempting the general purpose operating system as needed for the real time tasks; and preventing the general purpose operating system from blocking preemption of the non-real time tasks.

Nothing of this is in adeos. And nothing of this will be in the adeosized RTAI.

At one point, Daniel Phillips gave a link to all the RTLinux patent claims.

3. Status Of kbuild 2.5 Integration

3 Jun 2002 - 10 Jun 2002 (32 posts) Archive Link: "If you want kbuild 2.5, tell Linus"

Topics: Development Philosophy, Kernel Build System, Sound: ALSA

People: Keith Owens, Nicolas Pitre, Kai Germaschewski, Kai Henningsen, Daniel Phillips, Tomas Szepe, Bill Davidsen, Jesse Pollard, Linus Torvalds

In frustration at being ignored by Linus Torvalds, Keith Owens posted a request for folks to pester Linus about getting kbuild into the mainstream sources. He remarked, "It is a sad day when a fully tested and documented system that is faster and, above all, more accurate, cannot get into the kernel. Linus is judging kbuild 2.5 on its popularity and on personalities, not on its technical merits." Nicolas Pitre replied, "Linus became interested in kbuild-2.5 when someone else than you decided to feed him with small patches, exactly what I told you a while ago and what you called a "stupid comment"." A couple posts later, he added, "Keith is certainly doing excellent work, no doubt. Unfortunately he just showed how inapt he is to deal with Linus."

Elsewhere, Daniel Phillips gave a pointer to a post from Linus several weeks before, asking Kai Germaschewski to handle the kbuild integration. A few posts down the road in the current thread, Kai also added, "I'm currently on it! Please don't cry out so badly." But elsewhere, Kai Henningsen said:

I fail to see how this is supposed to work, and I guess so does Keith.

Kai (a different Kai!) does not seem to want to integrate the core part of kbuild2.5. He seems to want to only pick the low-hanging fruits and make unsupported (and unbelievable) noises about the rest.

And Linus seems to want to ignore the fact that the core portion of kbuild2.5 is, by its very nature, not something that can be merged "gradually" - just like ALSA, or a new architecture, can't meaningfully be merged "gradually". (And he *also* said that he wasn't interested in pseudo-gradually, i.e. getting the stuff in parts but still making a big exchange.)

Frankly, I see *absolutely no way* how the current Kai-Linus "merge" can possibly end with something even remotely like Keith's kbuild2.5. Unless Linus changes his approach radically.

If I were Keith, I'd be rather upset, too.

But Kai G. replied, "Anyway, he shouldn't be. There are lots of people appreciating his work, and many of us are very grateful for it. So am I. It's going to be a long way, but finally kbuild-2.4 can't stand while kbuild-2.5 (hopefully) goes on. So we'll get it in by some time in the future." But Daniel said, "There's certainly been time enough for action, and I don't see any. I'd say Kai" [Germaschewski] "is stalling and not being cooperative at all."

Jesse Pollard offered a proposal for how to merge kbuild effectively, but Tomas Szepe complained, "Please note that there have already been innumerable proposals of how to merge kbuild 2.5, and all of them have been silently rejected." And Bill Davidsen replied, "As I see the problem, the proposal has been either rejected or postponed, and some people refuse to accept that. I would love to see the best of O1 and preempt in 2.4, too, but I'm not going to ask anyone to write Linus, or Marcel, or Sen. Hollings and tell them my idea is a good idea. As long as patches are available and I'm able to apply them, I will grumble under my breath and move on. That's my vote for a solution to the kb25 merge problem, accept that a decision has been made and move on."

4. Scheduler Hints

4 Jun 2002 - 12 Jun 2002 (17 posts) Archive Link: "[PATCH] scheduler hints"

Topics: Big O Notation, Real-Time, SMP, Scheduler

People: Robert Love, Rick Bressler, Simon Trimmer

Robert Love implemented scheduler 'hints' on top of the O(1) scheduler. He explained, "scheduler hints are a way for a program to give a "hint" to the scheduler about its present behavior in the hopes of the scheduler subsequently making better scheduling decisions." For example, he said, "consider a group of SCHED_RR threads that share a semaphore. Before one of the threads were to acquire the semaphore, it could give a "hint" to the scheduler to increase its remaining timeslice in order to ensure it could complete its work and drop the semaphore before being preempted. Since, if it were preempted, it would just end up being rescheduled as the other real-time threads would eventually block on the held semaphore." Rick Bressler replied:

Sequent had an interesting hint they cooked up with Oracle. (Or maybe it was the other way around.) As I recall they called it 'twotask.' Essentially Oracle clients processes spend a lot of time exchanging information with its server process. It usually makes sense to bind them to the same CPU in an SMP (and especially NUMA) machine. (Probably obvious to most of the folks on the group, but it is generally lots better to essentially communicate through the cache and local memory than across the NUMA bus.)

As I recall it made a significant difference in Oracle performance, and would probably also translate to similar performance in many situations where you had a client and server process doing lots of interaction in an SMP environment.

But Robert said this wasn't necessary because Linux 2.5 implemented a command to bind a process to a specific CPU already.

Elsewhere, Simon Trimmer mentioned:

This isn't my thing but my flatmate had left a copy of solaris internals on the table ;)

This is briefly mentioned around about page 384 and appears to be targeted at userspace processes for exactly the cases you're suggesting (holding global resources).

A good entry point into the sun online documentation for this stuff is schedctl_init() -

Robert said he'd thought Solaris Internals would have it in there, but hadn't had his copy around to confirm. But he also said regarding the online doc:

Hm, what they export is a bit different. I wonder what the internal kernel interface is like (i.e. how close to sched_hint it is)?

Since they have a start_hint and stop_hint, that is where they are able to enforce their fairness. When you call stop, I suspect they penalize your timeslice by some amount similar to the duration from start to stop. If you don't call stop before you reschedule, then you probably forfeit a large chunk of your timeslice.

This would be doable with our scheduler - and perhaps even with minimal impact (which is my goal). However, since I wrote this more as an exercise in fun than something to merge, I do not know if it is worth it to make a whole infrastructure around this. Those who really see benefit (scientific computing or real-time or whatever) could just grab the patch, remove the permission check, and code their applications to fit -- they trust their application base.

Anyhow, to pique interest, here are some benchmark numbers. I have 5 pthreads contesting over a single semaphore. They loop, doing some busy looping, down the semaphore, busy loop, and then up the semaphore. Thus they use a lot of their timeslice and spend the rest of the time blocking on the semaphore. I let them loop a fixed number of times before exiting.

(These are average of ~10 runs)

With a call to sched_hint(HINT_TIME) after successfully downing the semaphore the avg total duration is 7233459 us. Without the sched_hint, the avg total duration is 7683220 us.

That is an improvement of 6% - with only 5 threads.

A quick glance shows a reduction in context switches, but what really matters is if we are entering schedule and neither (a) rescheduling the same task, or (b) running another thread that quickly blocks on the semaphore.

It is all academic anyhow...
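To check the arithmetic behind the "6%" figure in Robert's benchmark: the improvement is the drop in average run time relative to the run without the hint. The helper below is only a worked calculation, and its name is made up for this illustration.

```c
#include <assert.h>

/* Relative reduction in run time, as a percentage of the baseline. */
double percent_improvement(double baseline_us, double hinted_us)
{
    assert(baseline_us > 0.0);
    return (baseline_us - hinted_us) / baseline_us * 100.0;
}
```

With the quoted averages, percent_improvement(7683220.0, 7233459.0) is 449761 / 7683220 * 100, or roughly 5.85, which rounds to the 6% Robert reported.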

5. Laptop Battery Conservation

4 Jun 2002 - 6 Jun 2002 (31 posts) Archive Link: "[rfc] "laptop mode""

Topics: Disks: IDE, FS: sysfs, Laptop Support, Virtual Memory

People: Andrew Morton, Andreas Dilger

Andrew Morton announced:

Here's a patch which is designed to make the kernel play more nicely with portable computers. I've been using it for a couple of days and it seems to do the right thing. I'm wondering if anyone has any comments/suggestions/etc.

To test this code you'll also need (hmm. Server seems to be dead. So the patches are here, as attachments)

Here's the algorithm, from the Documentation/filesystems/proc.txt section describing /proc/sys/vm/:


Setting this entry to '1' will put the kernel's dirty data writeout algorithms into a mode which is better suited to laptop/notebook computers. This mode is specifically designed to minimise the frequency of disk spinups. Laptop mode works as follows:

- Dirty data remains in memory for longer periods of time (controlled by laptop_writeback_centisecs).

- If there is pending dirty data and the disk is spun up for any reason (even for a read) then all dirty data will be written back shortly afterwards. ie: when the disk is spun up, make good use of it.

- When the decision is made to write back some dirty data, the kernel will write back all dirty data.


This tunable determines the maximum age of dirty data when the machine is operating in Laptop mode. The default value is 30000 - five minutes. This means that if applications are generating a small amount of write traffic, the disk will spin up once per five minutes.

If the disk is spun up for any other reason (such as for a read) then all dirty data will be flushed anyway, and this timer is reset to zero.

laptop_writeback_centisecs has no effect when the machine is not operating in Laptop mode.

This implementation doesn't try to be very smart - there's a direct call out of do_ide_request() into the writeback code. This couldn't be done from within ll_rw_blk.c because then a write to the ramdisk would spin the disk up. Even as-is, a read from the IDE CDROM drive will cause the IDE hard disk to spin up and flush data, so probably that call in do_ide_request() should only be made if the device is writable. Suggestions are sought, but let's try not to get too fancy here...
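The policy Andrew describes can be summarized in a small decision function. The following is a sketch of the rules as quoted from the documentation, not code from the patch: the struct, its field names other than laptop_writeback_centisecs, and both function names are assumptions, and normal (non-laptop) writeback is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>

#define CENTISECS_PER_MINUTE (100UL * 60UL)

/* Hypothetical bookkeeping for the sketch; times are in centiseconds. */
struct laptop_state {
    bool laptop_mode;                          /* assumed name for the on/off entry */
    unsigned long laptop_writeback_centisecs;  /* maximum age of dirty data */
    bool have_dirty;                           /* is any dirty data pending? */
    unsigned long oldest_dirty;                /* when the oldest dirty data was created */
};

/* Should the kernel write back all dirty data now?
 * Outside laptop mode this sketch has no opinion and returns false. */
bool laptop_should_flush_all(const struct laptop_state *s,
                             unsigned long now, bool disk_spun_up)
{
    assert(now >= s->oldest_dirty);
    if (!s->laptop_mode || !s->have_dirty)
        return false;
    if (disk_spun_up)
        return true;   /* the disk is spinning anyway: make good use of it */
    return now - s->oldest_dirty >= s->laptop_writeback_centisecs;
}

/* Convert a spin-up interval in minutes to the tunable's unit. */
unsigned long minutes_to_centisecs(unsigned long minutes)
{
    return minutes * CENTISECS_PER_MINUTE;
}
```

For reference, minutes_to_centisecs(5) gives the documented default of 30000, and a twenty-minute interval would be 120000.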

There was general praise for the idea, and some technical discussion. Andreas Dilger made a small point regarding the five-minute spin-up time for hard disks. He said, "FYI, this is probably an optimally bad choice for the default disk spinup interval, as many laptops spindown timers in the same ballpark. I would say 15-20 minutes or more, unless there is a huge amount of VM pressure or something. Otherwise, you will quickly have a dead laptop harddrive from the overly-frequent spinup/down cycles." Andrew replied:

Twenty it is, thanks.

BTW, the "use a gigabyte of readahead" idea would cause VM hysteria if you access a 600 megabyte file, so I've wound that back to twenty megs.

Also, it has been suggested that the feature become more fully-fleshed, to support desktops with one disk spun down, etc. It's not really rocket science to do that - the `struct backing_dev_info' gives a specific communication channel between the high-level VFS code and the request queue. But that would require significantly more surgery against the writeback code, so I'm fishing for requirements here. If the current (simple) patch is sufficient then, well, it is sufficient.

There were various feature requests and some implementation discussions.

6. Kernel Versioning

5 Jun 2002 - 6 Jun 2002 (6 posts) Archive Link: "Question Regarding "EXTRAVERSION" Specification"

Topics: Kernel Build System, Source Tree

People: John L. Males, Keith Owens, Alan Cox, Marcelo Tosatti

John L. Males asked:

I have had a recent experience in using the "EXTRAVERSION" in the Linux 2.2.x Kernel series. The context of my question applies to both the 2.2.x and 2.4.x Linux Kernels.

The questions are:

  1. Is there a specification that states the maximum length that the "EXTRAVERSION" string may be?
  2. Does the Kernel make/build process enforce any specified limit of (1) above?

To question 1, Keith Owens replied, "The total length $(VERSION).$(PATCHLEVEL).$(SUBLEVEL)$(EXTRAVERSION) must not exceed 64 characters. Break that limit and you get garbage in uname -r." To question 2, he said, "kbuild 2.5 enforces the limit, the existing kernel build code does not. I sent a patch to Linus four times back in the 2.4.15 days, he completely ignored it. Linus does not care about kernel build problems." He added that he'd dig up the patch and send it to Marcelo Tosatti for inclusion in 2.4, and Alan Cox replied, "Please CC me a copy and I'll merge it into -ac in case Marcelo loses it or doesn't want it before 2.4.19 final." John asked whether there was any chance the patch would make it back into the 2.2 world, but there was no reply.
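The limit Keith describes is easy to check by hand. The following is a sketch of such a check, not the patch Keith mentions; the function name and the KERNEL_RELEASE_MAX constant are made up for this illustration, and the only fact taken from the thread is the 64-character ceiling on the combined string.

```c
#include <assert.h>
#include <stdio.h>

#define KERNEL_RELEASE_MAX 64  /* limit quoted by Keith Owens */

/* Length of "$(VERSION).$(PATCHLEVEL).$(SUBLEVEL)$(EXTRAVERSION)",
 * or -1 if the combined string would exceed the limit. */
int kernel_release_length(int version, int patchlevel, int sublevel,
                          const char *extraversion)
{
    assert(extraversion != NULL);
    /* snprintf with a zero-sized buffer only reports the needed length. */
    int n = snprintf(NULL, 0, "%d.%d.%d%s",
                     version, patchlevel, sublevel, extraversion);
    if (n < 0 || n > KERNEL_RELEASE_MAX)
        return -1;
    return n;
}
```

For example, kernel_release_length(2, 4, 19, "-pre10-ac2") returns 16, while an EXTRAVERSION that pushes the total past 64 characters yields -1.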

7. New EVMS Version Released

7 Jun 2002 - 8 Jun 2002 (3 posts) Archive Link: "[ANNOUNCE] EVMS Release 1.1.0-pre1"

Topics: Disk Arrays: EVMS, Disk Arrays: RAID, FS: JFS, FS: ReiserFS, FS: ext2, FS: ext3

People: Kevin Corry, Svetoslav Slavtchev

Kevin Corry announced:

The EVMS team is announcing the next development release of the Enterprise Volume Management System, which will eventually become EVMS 2.0. Package 1.1.0-pre1 is now available for download at the project web site:

As this package is just a pre-release, only the source tarball is available for download. RPM files will be available when 1.1.0 is released.

Also, please use the appropriate level of caution when using this version! There are several very new features which have not yet undergone extensive testing! In other words, you probably shouldn't run this version on any critical systems.

Please report any problems or bugs to the EVMS mailing list:

Highlights for version 1.1.0-pre1 include:

v1.1.0-pre1 - 6/7/02

Svetoslav Slavtchev asked, "is there any chance to run it with a 2.5 kernel? the latest cvs is synced with 2.5.11 and latest 2.5 is 2.5.20 [ 2.5.20-dj3]" Someone replied to him privately, saying that EVMS would likely be resynced with the latest 2.5 kernel in a week or so, and Svetoslav thanked them for the info.







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.