Kernel Traffic #298 For 6 Mar 2005

By Zack Brown

Mailing List Stats For This Week

We looked at 1994 posts in 11MB. See the Full Statistics.

There were 623 different contributors. 238 posted more than once. The average length of each message was 98 lines.

The top posters of the week were:

74 posts in 556KB by Tejun Heo
66 posts in 470KB by Eric W. Biederman
62 posts in 260KB by Bartlomiej Zolnierkiewicz
53 posts in 495KB by Adrian Bunk
43 posts in 186KB by Greg KH

The top subjects of the week were:

63 posts in 303KB for "Patch 4/6 randomize the stack pointer"
44 posts in 213KB for "i8042 access timings"
39 posts in 228KB for "[PATCH] OpenBSD Networking-related randomization port"
38 posts in 245KB for "[PATCH] Dynamic tick, version 050127-1"
36 posts in 144KB for "[PROPOSAL/PATCH] Remove PT_GNU_STACK support before 2.6.11"

These stats generated by mboxstats version 2.2

1. kexec And crashdump

18 Jan 2005 - 4 Feb 2005 (99 posts) Archive Link: "[PATCH 0/29] overview"

Topics: Executable File Format, Kexec

People: Eric W. Biederman, Andrew Morton, Vivek Goyal, Koichi Suzuki, Itsuro Oda

Eric W. Biederman said:

This patchset is a major refresh of the kexec on panic functionality in the kernel. The primary aim of which was to take the requirements capture of the kernel crashdump patches and start integrating the functionality cleanly into the kexec patches.

Major accomplishments:

The crashdump code is currently slightly broken. I have attempted to minimize the breakage so things can quickly be made to work again.

With respect to a final design discussion there are two remaining open issues. The first is how little hardware shutdown we can get away with in the kernel that is panicking. I believe we can reduce this to a simple NMI to the other cpus telling them to stop. This has been addressed as a major concern in previous conversations.

The second issue is the most significant with respect to the design of a kernel based crash dump capture implementation. How does the crashdump capture process discover relevant information about the kernel that just crashed? There are two options.

  1. As represented by the current crashdump patches the crashdump kernel and the kernel in which it loads are kept in sync so that it has up-to-date versions of all of the crashed kernel's data structures because it is built from the same source. So it only needs to find the address of the data structures it would like to look at.
  2. The relevant information if it is available when sys_kexec_load is called is exported to user space, or the machine_crash_shutdown method marshalls what little information must be captured when the machine dies in a well known standard format (most likely ELF notes). Allowing the crashdump capture process to simply pass on the information or utilize it as appropriate.

    If the second method can successfully represent all of the interesting information then we can allow kernel version skew, between the two kernels, and potentially implement the entire crash dump capture process in user space.

As best as I have been able to discover the interesting information includes: the cpu state (registers) at the time of the crash/panic; the list of memory regions the kernel that has crashed was using; and potentially the list of pages dedicated to kernel data as opposed to user space, so that the people with insane amounts of memory (1TB+) don't require unmanageably large core files.

He quoted an earlier message by Andrew Morton, in which Andrew had said, "I don't want us to be in a position of merging all that code and then finding out that it cannot be made to work "sufficiently well", forcing us to revert it and find a new crashdump solution. You guys know far better than I when we will reach that threshold. If the kexec/dump developers can say "yup, this is going to work (because X)" then I'm happy." Eric now offered:

So here is my subjective view.

In the interests of full disclosure my main interest is using the kernel as a bootloader for other kernels and that has been working fairly well for years now :)

He posted a couple dozen atomic patches for this. Vivek Goyal replied:

We have started doing changes to make crashdump up and running again. Following are few identified items to be done.

  1. Reserve the backup region (640k) during kernel bootup.
  2. Copy the data to backup region during crash. (moved to kexec user space code, patch posted in separate mail)
  3. Prepare elf headers while loading kexec panic kernel and store in reserved memory area.
  4. Pass required information to crashdump kernel, which parses it and exports through /proc/vmcore. (may be user space utility, open to discussion)

Following patch implements item 1) in the list. Soon we shall be rolling out the patches for rest.

In going over some of the implementation details, Eric found a number of problems with Vivek's patch; for a while it seemed the discussion would descend into confusion, when Eric felt Vivek was only producing minimal changes in response to Eric's design suggestions. This had not been Vivek's intention, however, and they soon were 'back on the same page', as Eric put it. Vivek described the new design, saying, "The whole idea is that Crash image is represented in ELF Core format. These ELF Headers are prepared by kexec-tools user space and put in one segment. Address of start of image is passed to the capture kernel (or user space) using one command line (eg. crashimage=). Now either kernel space or user space can parse the elf headers and extract required information and export final kernel elf core image." He went on:

If I prepare one elf header for each physical contiguous memory area (as obtained from /proc/iomem) instead of per zone, then number of elf headers will come down significantly. I don't have any idea on number of actual physically contiguous regions present per machine, but roughly assuming it to be 1 per node, it will lead to 256 + 1024 = 1280 program headers. At 56 bytes per 64 bit program header this will amount to 70KB.

This is a worst case estimate and on lower end machines this will require much less space. On machines as big as 1024 cpus, this should not be a concern, as big machines come with big RAMs.
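
As an aside to Vivek's estimate, the arithmetic can be checked with a short C sketch. The struct below writes out the standard 64-bit ELF program header fields with fixed-width types; the names phdr64 and phdr_table_bytes are illustrative and are not taken from kexec-tools. Reading 256 as one memory region per node and 1024 as one entry per cpu is an interpretation of the post, not something it states outright.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field layout of a 64-bit ELF program header, written out with
 * fixed-width types so the size does not depend on <elf.h>. */
struct phdr64 {
    uint32_t p_type;    /* segment type, e.g. loadable memory or note */
    uint32_t p_flags;   /* permission flags */
    uint64_t p_offset;  /* offset of the segment in the file */
    uint64_t p_vaddr;   /* virtual address */
    uint64_t p_paddr;   /* physical address */
    uint64_t p_filesz;  /* bytes in the file image */
    uint64_t p_memsz;   /* bytes in memory */
    uint64_t p_align;   /* alignment */
};

/* The quoted figure: 56 bytes per 64-bit program header. */
static_assert(sizeof(struct phdr64) == 56, "64-bit program header is 56 bytes");

/* Hypothetical helper: bytes needed for a table of n program headers. */
size_t phdr_table_bytes(size_t n_headers)
{
    return n_headers * sizeof(struct phdr64);
}
```

Under these assumptions, phdr_table_bytes(256 + 1024) comes to 71680 bytes, which is exactly 70KB and matches the figure in the post.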

Eric, do you still think that ELF headers are inappropriate to be passed across interface boundary?

ELF headers can be prepared by kexec-tools in advance and put into one of the data segments. This requires following information to be available to user space.

Regarding Backup Region

Eric had some criticisms, but felt this was a "good place to start". Itsuro Oda asked why, in all this, the ELF format was considered necessary. Eric replied that the ELF format itself was not necessary, but that an ELF header held exactly the kind of information needed here, which made it a good fit. When Koichi Suzuki echoed Itsuro's concerns, saying, "Format conversion should be done in healthy system separately and we should restrict what to do while taking the dump as few as possible," Eric expanded:

The big part of the conversation that is happening right now is how do we uncouple dependencies between the various parts as much as possible. There is nothing here about format conversions except as to convert weird kernel formats into a stable interface.

There are 3 pieces of code interacting.

  1. The primary kernel that will call panic.
  2. The kernel+initrd that takes over.
  3. The user space that sets it all up (/sbin/kexec) while the primary kernel is still in a sane state.

The goal is to make those 3 pieces as independent of each other as reasonably possible.

So the kernel+initrd that captures a crash dump will live and execute in a reserved area of memory. It needs to know which memory regions are valid, and it needs to know small things like the final register state of each cpu. For the set of valid memory regions it is the intention to encode that as an array of ELF program headers. The information of what the final register contents were will be encoded as ELF notes. There will be one PT_NOTE segment per cpu that holds the notes needed to encode a given cpu's final state. It really does not matter to the implementation that captures each cpu's final register state which format we record the data in so using a format designed not to change is not a problem. So all that needs to be communicated to the kernel+initrd that captures a crash dump is the location of an ELF header and it can figure out all of the rest.

For the primary kernel except for remembering its final cpu register state as it dies it does nothing except jump to the crash recovery kernel. All of the interesting information will be exported to user space.

/sbin/kexec is the glue that fills in the cracks. While the primary kernel is in a sane state it sets everything up including finding out which memory areas need to be looked at. And it stashes it all in a reserved area of memory, that has never been the target of DMA transfers.

The goal is to reduce the dependencies as much as possible. So an old stable kernel can take a crash dump of a new buggy kernel. And so that you don't have to be running the latest and greatest user space simply to set everything up. Although it is still better to require a user-space upgrade to cope with new kernels than to require the crash capture kernel+initrd to be upgraded.
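
The layout Eric describes (one note segment per cpu, plus one entry per valid memory region) can be sketched in C as below. This is only a model: struct segment is a simplified stand-in for an ELF program header, SEG_NOTE and SEG_MEMORY stand in for PT_NOTE and PT_LOAD, and describe_crash_image is a hypothetical name. None of it is taken from the kexec patches or from /sbin/kexec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the ELF segment types PT_NOTE and PT_LOAD. */
enum seg_type { SEG_NOTE, SEG_MEMORY };

/* Simplified stand-in for one ELF program header. */
struct segment {
    enum seg_type type;
    uint64_t paddr;   /* physical start address */
    uint64_t size;    /* length in bytes */
};

/* A physically contiguous memory area, e.g. as listed in /proc/iomem. */
struct mem_region {
    uint64_t start;
    uint64_t size;
};

/* Fill `table` with one note segment per cpu followed by one memory
 * segment per region. Returns the number of entries written, or 0 if
 * the table is too small to hold them all. */
size_t describe_crash_image(struct segment *table, size_t capacity,
                            const uint64_t *note_addrs, size_t n_cpus,
                            uint64_t note_size,
                            const struct mem_region *regions, size_t n_regions)
{
    size_t n = 0;

    assert(table != NULL);
    if (n_cpus + n_regions > capacity)
        return 0;

    /* Where each cpu will save its final register state. */
    for (size_t cpu = 0; cpu < n_cpus; cpu++) {
        table[n].type = SEG_NOTE;
        table[n].paddr = note_addrs[cpu];
        table[n].size = note_size;
        n++;
    }

    /* Which memory the capture kernel should treat as valid. */
    for (size_t i = 0; i < n_regions; i++) {
        table[n].type = SEG_MEMORY;
        table[n].paddr = regions[i].start;
        table[n].size = regions[i].size;
        n++;
    }
    return n;
}
```

In this model the capture side needs only the address of the table to find everything else, which is the decoupling the post argues for.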

2. New scrubd Page Zeroing Daemon

21 Jan 2005 - 8 Feb 2005 (38 posts) Archive Link: "A scrub daemon (prezeroing)"

Topics: FS: sysfs, SMP

People: Christoph Lameter, Andrea Arcangeli, Andrew Morton

Christoph Lameter posted a patch that "Adds management of ZEROED and NOT_ZEROED pages and a background daemon called scrubd." He went on:

scrubd is disabled by default but can be enabled by writing an order number to /proc/sys/vm/scrub_start. If a page is coalesced of that order or higher then the scrub daemon will start zeroing until all pages of order /proc/sys/vm/scrub_stop and higher are zeroed and then go back to sleep.

In an SMP environment the scrub daemon is typically running on the most idle cpu. Thus a single threaded application running on one cpu may have the other cpu zeroing pages for it etc. The scrub daemon is hardly noticeable and usually finishes zeroing quickly since most processors are optimized for linear memory filling.

Note that this patch does not depend on any other patches but other patches would improve what scrubd does. The extension of clear_pages by an order parameter would increase the speed of zeroing and the patch introducing alloc_zeroed_user_highpage is necessary for user pages to be allocated from the pool of zeroed pages.

There was a good bit of wrangling over implementation, and later he posted an update, saying:

Changes from V4 to V6:

More information and a combined patchset is available at

The most expensive operation in the page fault handler is (apart from SMP locking overhead) the touching of all cache lines of a page by zeroing the page. This zeroing means that all cachelines of the faulted page (on Altix that means all 128 cachelines of 128 byte each) must be handled and later written back. This patch allows to avoid having to use all cachelines if only a part of the cachelines of that page is needed immediately after the fault. Doing so will only be effective for sparsely accessed memory which is typical for anonymous memory and pte maps. Prezeroed pages will only be used for those purposes. Unzeroed pages will be used as usual for file mapping, page caching etc etc.

The patch makes prezeroing very effective by:

  1. Applying zeroing operations only to pages of higher order, which results in many pages that will later become zero order pages to be zeroed in one step.
  2. Hardware support for offloading zeroing from the cpu. This avoids the invalidation of the cpu caches by extensive zeroing operations.

The scrub daemon is invoked when an unzeroed page of a certain order has been generated so that it's worth running it. If no higher order pages are present then the logic will favor hot zeroing rather than simply shifting processing around. kscrubd typically runs only for a fraction of a second and sleeps for long periods of time even under memory benchmarking. kscrubd performs short bursts of zeroing when needed and tries to stay off the processor as much as possible.

The benefits of prezeroing are reduced to minimal quantities if all cachelines of a page are touched. Prezeroing can only be effective if the whole page is not immediately used after the page fault.

The patch is composed of 3 parts:

[1/3] clear_pages(page, order) to zero higher order pages
Adds a clear_pages function with the ability to zero higher order pages. This allows the zeroing of large areas of memory without repeatedly invoking clear_page() from the page allocator, scrubd and the huge page allocator.

[2/3] Page Zeroing
Adds management of ZEROED and NOT_ZEROED pages and a background daemon called scrubd.

[3/3] SGI Altix Block Transfer Engine Support
Implements a driver to shift the zeroing off the cpu into hardware. This avoids the potential impact of zeroing on cpu caches.
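
To illustrate what the order parameter in part 1/3 buys, here is a user-space analogue in C. It is not the kernel's clear_pages: clear_pages_model and pages_in_order are hypothetical names, MODEL_PAGE_SIZE is an assumed page size, and memset stands in for whatever zeroing primitive the architecture provides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size for the model */

/* Number of base pages in a block of the given order (2^order). */
size_t pages_in_order(unsigned order)
{
    return (size_t)1 << order;
}

/* Zero a contiguous block of 2^order pages with a single linear fill,
 * instead of calling a one-page clear routine 2^order times. */
void clear_pages_model(void *base, unsigned order)
{
    assert(base != NULL);
    memset(base, 0, pages_in_order(order) * MODEL_PAGE_SIZE);
}
```

One call with order 3 covers eight pages in a single pass, which is the sort of linear fill the post says most processors handle well.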

Andrew Morton seemed interested in accepting the patch, but he required some benchmarks showing a real improvement, and he needed the patch to adhere to existing APIs for starting, binding, and stopping kernel threads. Christoph started to comply, but the thread petered out.

3. ST M41T00 I2C RTC Chip Driver Released

31 Jan 2005 - 4 Feb 2005 (8 posts) Archive Link: "[PATCH][I2C] ST M41T00 I2C RTC chip driver"

Topics: I2C

People: Mark A. Greer, Greg KH, Jean Delvare

Mark A. Greer said:

This patch adds support for the ST M41T00 RTC chip.

You will likely notice that it implements a PPC-specific interface (/dev/rtc->drivers/char/genrtc.h->include/asm-ppc/rtc.h->this file). This was necessary to support a subset of ppc platforms that need to hook up the rtc support at runtime. If I implemented /dev/rtc directly or interfaced to genrtc.c directly, those platforms couldn't use this driver. Eventually, I hope to work on more uniform rtc support across all the processor architectures.

Also, on ppc at least, the hw clock can be set from a timer interrupt if STA_UNSYNC is not set (e.g., ntpd is running). To handle this, a tasklet is used to set the clock if in_interrupt() is true.

Jean Delvare, although not intimately familiar with the hardware involved, still offered some comments, mainly on typos and naming conventions, along with some memory management advice. Mark posted an updated patch, taking all of Jean's suggestions. Several days later, with no further replies, he asked if his patch could be accepted for inclusion at that point. Greg KH asked if Mark could send the patch with a proper Changelog blurb, and Mark did so. The blurb read:

This patch adds support for the ST M41T00 I2C RTC chip.

This rtc chip has no mechanism to freeze its registers while being read; however, it will delay updating the external values of the registers for 250ms after a register is read. To ensure that a sane time value is read, the driver verifies that the same register values were read twice before returning.

Also, when setting the rtc from an interrupt handler, a tasklet is used to provide the context required by the i2c core code.
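
The read-twice check from the blurb can be sketched in portable C. This is not the driver's code: rtc_regs, rtc_read_stable and the fake chip are all hypothetical, and a function pointer stands in for the I2C register read so that the sketch runs without hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical snapshot of the chip's time registers. */
struct rtc_regs {
    unsigned char sec, min, hour, day, mon, year;
};

/* Stand-in for the bus read: fills `out`, returns false on I/O error. */
typedef bool (*rtc_read_fn)(void *ctx, struct rtc_regs *out);

/* Keep reading until two consecutive snapshots are identical, or give
 * up after max_tries reads. On success the stable value is in *out. */
bool rtc_read_stable(rtc_read_fn read_regs, void *ctx, int max_tries,
                     struct rtc_regs *out)
{
    struct rtc_regs prev, cur;

    assert(read_regs != NULL && out != NULL);
    if (max_tries < 2 || !read_regs(ctx, &prev))
        return false;

    for (int i = 1; i < max_tries; i++) {
        if (!read_regs(ctx, &cur))
            return false;
        if (memcmp(&prev, &cur, sizeof cur) == 0) {
            *out = cur;
            return true;
        }
        prev = cur;   /* registers changed under us: compare again */
    }
    return false;
}

/* Fake chip for demonstration: the clock rolls over from 00:59 to
 * 01:00 between the first and second read, then holds still. */
struct fake_rtc {
    int reads;
};

bool fake_rtc_read(void *ctx, struct rtc_regs *out)
{
    struct fake_rtc *chip = ctx;
    int n = chip->reads++;

    memset(out, 0, sizeof *out);
    out->sec = (n == 0) ? 59 : 0;
    out->min = (n == 0) ? 0 : 1;
    return true;
}
```

With the fake chip, the first two snapshots disagree, so the helper takes a third read before returning one minute and zero seconds. A lone first read would have returned the stale 59 seconds.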

4. Linux 2.6.11-rc3 Released

2 Feb 2005 - 4 Feb 2005 (9 posts) Archive Link: "Linux 2.6.11-rc3"

Topics: Disks: SCSI, FS: XFS, Kernel Release Announcement, Power Management: ACPI, Sound: ALSA, Version Control

People: Linus Torvalds

Linus Torvalds announced Linux 2.6.11-rc3, saying:

This has a number of architecture updates (mips, arm, ppc, x86-64, ia64), and updates ACPI, DRI, ALSA, SCSI, XFS and InfiniBand.. And a lot of small one-liners all over.

I'd _really_ like to calm down for a final 2.6.11 now, so please note anything really important I missed, but keep the rest pending. And give this a good testing..

Oh, and the automated bitkeeper mirroring to seems slightly broken right now (hasn't updated in the last 48 hours), but the tar-balls are all there, and the BK updating mechanism will hopefully be fixed soon.

(I've got a few BK trees in private places, it's only the public one that hasn't gotten mirrored out yet - many other BK developers will know where to find my secondary trees and can pull from them instead).

5. FUSE Version 2.2 Released

3 Feb 2005 (2 posts) Archive Link: "[ANNOUNCE] Filesystem in Userspace - 2.2 "

People: Miklos Szeredi, Franco Broi

Miklos Szeredi announced:

FUSE version 2.2 is out there:

This can be used standalone or with recent -mm kernels (with the exception of -rc2-mm2).

Most notable changes since 2.1:


In the long run I hope to solve both problems, but neither is trivial. Ideas are welcome, as well as bugreports of course.

Franco Broi reported excellent success, saying, "I've just ported my filesystem to 2.2-pre6 and was able to throw away about 300 lines of code, the filehandle stuff is great. I was hoping to give it a thorough test and report back before 2.2 was released but you beat me to it. It just keeps getting better and better, well done!"

6. Linux 2.6.11-rc3-mm1 Released

4 Feb 2005 - 9 Feb 2005 (59 posts) Archive Link: "2.6.11-rc3-mm1"

Topics: Device Mapper, Kernel Release Announcement, PCI, USB, Version Control

People: Andrew Morton, Greg KH, Laurent Riffard, Christoph Hellwig, Alexander Viro

Andrew Morton announced Linux 2.6.11-rc3-mm1, saying:

Greg KH said:

Ok, I've cleaned up the bk-usb tree a bunch. If anyone had a previous copy of it, please just delete it and clone it again. It's at:


and is safe for consumption.

Andrew, can you put it back into the next -mm release?

Oh, and below is the diffstat and changelog of the patches in it. I've also placed a full patch of it, against the 2.6.11-rc3-bk1 tree for those who don't like to use bk, or are just curious about putting this on top of the latest -mm release: (

Also, if you have sent me a USB patch that is not already in the mainline tree, and is not included in this big patch-bundle, please resend it, as my USB patch queue is now empty.

Oops, no, I have a pending patch from Petko Manolov that didn't make it into here, sorry about that Petko, I'll get to that one next week.

Next up, the bk-pci and bk-driver-core mess...

Elsewhere, Laurent Riffard reported:

loading dm-mod module fails with this message :

FATAL: Error inserting dm-mod (/lib/modules/2.6.11-rc3-mm1/kernel/drivers/md/dm-mod.ko): Device or resource busy

The following line appears in dmesg :

register_blkdev: failed to get major for device-mapper

It was OK with kernel 2.6.11-rc2-mm2. Same config, did "make oldconfig".

Andrew replied:

You've enabled CONFIG_BASE_SMALL and so the major_names[] hashtable has just one element. device-mapper uses dynamic major allocation, the range of which is limited to the size of the top-level major_names[] array. You ran out of slots and register_blkdev() failed.

So for now I guess we must drop base-small-shrink-major_names-hash.patch.

Al, that code looks rather crappy. Shouldn't we be using an idr tree or something?

Also, we can never generate a major number of zero if the caller passed in major=0. How come?
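
Andrew's diagnosis can be reproduced with a small C model of dynamic major allocation. It assumes, as his description suggests, that the search covers only the slots of the major_names[] table and never hands out slot 0. The downward search order, the names major_table and alloc_dynamic_major, and the MODEL_EBUSY value are assumptions of this sketch, not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_EBUSY 16   /* stand-in for the kernel's EBUSY error code */

/* Toy version of the major_names[] table: one name pointer per slot. */
struct major_table {
    const char **names;
    size_t size;
};

/* Pick a free slot for a caller that passed major=0. Slot 0 is never
 * returned, so a one-element table can never satisfy the request. */
int alloc_dynamic_major(struct major_table *t, const char *name)
{
    assert(t != NULL && name != NULL);
    assert(t->size >= 1);

    for (size_t index = t->size - 1; index > 0; index--) {
        if (t->names[index] == NULL) {
            t->names[index] = name;
            return (int)index;
        }
    }
    return -MODEL_EBUSY;   /* no free slot: "Device or resource busy" */
}
```

With a table of size 1 the loop body never runs and the call fails at once, which is consistent with the error Laurent saw under CONFIG_BASE_SMALL.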

Laurent confirmed that selecting CONFIG_BASE_FULL=y solved his problem. Close by, Christoph Hellwig remarked, "It'd be nice to see major_names just gone completely. It's only used for /proc/devices output, and with the infrastructure for easily sharing majors that one is completely misleading.." Alexander Viro replied:

ACK. Moreover, dynamic registration of *majors* makes very little sense these days - about as much as setting lower limit on IP block registration to /12.

IMO we should put a large part of device number space for dynamic allocations (current static ones barely scratch the surface - we could easily leave upper half and nobody'd notice) and use e.g. buddy allocator within it. With allocation requests taking size of area as argument (rounded up to power of 2, which it normally would be anyway).

Any objections to that? Hell, we can even have register_blkdev() without a fixed major calling blkdev_allocate(name, 1<<20) and then eliminate the callers in favour of saner-sized requests. Then kill register_blkdev() completely...

There was no reply to this on the list.

7. RelayFS Updated

4 Feb 2005 - 5 Feb 2005 (9 posts) Archive Link: "[PATCH] relayfs redux, part 3"

Topics: SMP

People: Tom Zanussi, Christoph Hellwig, Andi Kleen

Tom Zanussi said:

Here's the latest version of relayfs, against 2.6.10. It includes a bunch of cleanup and restructuring prompted by the previous round of comments, but the major change that people would care about would probably be the changes to the logging functions relay_write(), __relay_write(), and relay_reserve(). They've been rewritten to be more efficient, or so I hope - I'm sure I'll hear about how they should be improved for the next version in any case. ;-) Thanks to everyone who commented on the previous version.

This is what the API currently looks like:

rchan *relay_open(chanpath, subbuf_size, n_subbufs, flags, callbacks);
void relay_close(chan);
unsigned relay_write(chan, data, length);
unsigned __relay_write(chan, data, length);
void *relay_reserve(chan, length);
void relay_subbufs_consumed(chan, subbufs_consumed, cpu);
extern void relay_reset(chan);
void relay_commit(buf, subbuf_idx, count);

helper macros:

relay_get_buffer(chan, cpu)
relay_get_padding(buf, subbuf_idx)
relay_get_commit(buf, subbuf_idx)

callbacks:

int subbuf_start(buf, subbuf, prev_subbuf_idx);
int deliver(buffer, subbuf, subbuf_idx);
int fileop_notify(buf, filp, fileop);

As before, I've tested this code on a single proc machine using a hacked version of the kprobes network packet tracing module, which can be found here:

Once everyone's more or less happy with the API and implementation, I'll do some SMP testing and write some Documentation.
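
The split between relay_reserve() and relay_write() in the listing above is a common logging pattern: reserve hands back a slot in the current sub-buffer, and write is a reserve followed by a copy. The toy C model below illustrates that pattern only. Every toy_ name and both size constants are hypothetical, the idea that write is built on reserve is an assumption, and the model ignores per-cpu buffers, callbacks and consumer tracking, so it simply overwrites old sub-buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_SUBBUF_SIZE 8u   /* bytes per sub-buffer (toy value) */
#define TOY_N_SUBBUFS   2u   /* sub-buffers per channel (toy value) */

/* Toy channel: a ring of fixed-size sub-buffers. */
struct toy_chan {
    char subbufs[TOY_N_SUBBUFS][TOY_SUBBUF_SIZE];
    size_t offset;       /* next free byte in the current sub-buffer */
    unsigned cur;        /* index of the current sub-buffer */
    unsigned switches;   /* how many times we moved to a new sub-buffer */
};

/* Reserve `length` bytes and return where the caller may write them.
 * Moves on to the next sub-buffer when the current one cannot hold the
 * record. Returns NULL if the record can never fit. */
void *toy_reserve(struct toy_chan *chan, size_t length)
{
    void *slot;

    assert(chan != NULL);
    if (length > TOY_SUBBUF_SIZE)
        return NULL;

    if (chan->offset + length > TOY_SUBBUF_SIZE) {
        chan->cur = (chan->cur + 1) % TOY_N_SUBBUFS;
        chan->offset = 0;
        chan->switches++;
    }

    slot = &chan->subbufs[chan->cur][chan->offset];
    chan->offset += length;
    return slot;
}

/* Write = reserve + copy. Returns the number of bytes written. */
size_t toy_write(struct toy_chan *chan, const void *data, size_t length)
{
    void *slot = toy_reserve(chan, length);

    if (slot == NULL)
        return 0;
    memcpy(slot, data, length);
    return length;
}
```

In this model a record never straddles two sub-buffers, so each sub-buffer can be passed on as a self-contained unit.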

Christoph Hellwig and Andi Kleen both had nitty-gritty objections to various lines of the patch; but neither had any serious problems with it, and Tom said he'd incorporate all their corrections into a subsequent version.

8. Linux Test Project Updated

7 Feb 2005 (1 post) Archive Link: "[ANNOUNCE] February release of LTP"

Topics: Networking

People: Marty Ridgeway, David Stevens

Marty Ridgeway announced the February release of the Linux Test Project (LTP), saying:


9. New Marvell MV64xxx I2C Driver

8 Feb 2005 - 9 Feb 2005 (5 posts) Archive Link: "[PATCH][I2C] Marvell mv64xxx i2c driver"

Topics: I2C

People: Mark A. Greer, Bartlomiej Zolnierkiewicz, Jean Delvare

Mark A. Greer said:

Marvell makes a line of host bridges for PPC and MIPS systems. On those bridges is an i2c controller. This patch adds the driver for that i2c controller.

Please apply.

Depends on patch submitted by Jean Delvare:

Bartlomiej Zolnierkiewicz offered some minor fixes and criticisms of the patch, and Mark went through several patch iterations with him.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.