Kernel Traffic

Kernel Traffic #212 For 6 Apr 2003

By Zack Brown

Mailing List Stats For This Week

We looked at 1025 posts in 4893K.

There were 381 different contributors. 184 posted more than once. 185 posted last week too.

1. Support For More Than 256 Disks (Thousands, Really)

21 Mar 2003 - 28 Mar 2003 (22 posts) Subject: "[patch for playing] 2.5.65 patch to support > 256 disks"

Topics: FS: sysfs

People: Douglas Gilbert, Andrew Morton, Jens Axboe, Nick Piggin, Andries Brouwer, Badari Pulavarty

Badari Pulavarty submitted a patch to support 2.5.65 systems with over 256 disks. His patch was intended to sit on top of Andries Brouwer's 32-bit dev_t patches, which were still waiting in Andrew Morton's -mm tree. Badari said he'd tested his patches with 4000 disks, and had successfully read from and written to them. The only problems he'd been able to uncover so far had been some data structures getting really big, and a big drain on available RAM - at least with 4000 active disks. Douglas Gilbert poked around, and noticed, among other things, "a rather large growth of nodes in sysfs. For 84 added scsi_debug pseudo disks the number of sysfs nodes went from 686 to 3347. Does anybody know what is the per node memory cost of sysfs?" Andrew Morton asked to see Douglas' /proc/slabinfo, and when Douglas complied, Andrew said:

OK, thanks. So with 48 disks you've lost five megabytes to blkdev_requests and deadline_drq objects. With 4000 disks, you're toast. That's enough request structures to put 200 gigabytes of memory under I/O ;)

We need to make the request structures dynamically allocated for other reasons (which I cannot immediately remember) but it didn't happen. I guess we have some motivation now.
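As a rough check of the scale Andrew describes, the quoted figures can be extrapolated linearly. This is only a back-of-the-envelope sketch: it assumes the per-disk cost stays constant as disks are added and reads "five megabytes" as 5 MiB, neither of which is stated in the thread. The function name is made up for illustration.

```python
# Back-of-the-envelope extrapolation of the request-structure overhead.
# Assumptions (not from the thread): cost scales linearly with disk count,
# and "five megabytes" means 5 MiB.

MIB = 1024 * 1024


def request_overhead_bytes(disks, measured_disks=48, measured_bytes=5 * MIB):
    """Scale the measured slab usage to a different number of disks."""
    per_disk = measured_bytes / measured_disks
    return per_disk * disks


per_disk_kib = request_overhead_bytes(1) / 1024
total_mib = request_overhead_bytes(4000) / MIB

print(f"per disk: {per_disk_kib:.1f} KiB")
print(f"4000 disks: {total_mib:.1f} MiB")
```

Under those assumptions the cost is a little over 100 KiB per disk, and several hundred megabytes at 4000 disks, which fits the RAM drain Badari reported.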

Douglas, Andrew and Badari forked off a subthread where they looked into clarifying the size issue; and elsewhere, Jens Axboe said:

Here's a patch that makes the request allocation (and io scheduler private data) dynamic, with lower and upper bounds of 4 and 256 respectively. The numbers are a bit random - the 4 will allow us to make progress, but it might be a smidgen too low. Perhaps 8 would be good. 256 is twice as much as before, but that should be alright as long as the io scheduler copes. BLKDEV_MIN_RQ and BLKDEV_MAX_RQ control these two variables.

We lose the old batching functionality, for now. I can resurrect that if needed. It's a rough fit with the mempool, it doesn't _quite_ fit our needs here. I'll probably end up doing a specialised block pool scheme for this.

Hasn't been tested all that much, it boots though :-)
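The idea of a small guaranteed reserve plus a hard cap can be sketched with a toy model. This is not Jens's patch and not the kernel's mempool interface: the class and method names are hypothetical, and only the values 4 and 256 come from the quoted message.

```python
# Toy model of a bounded request pool; NOT the kernel's mempool API.
# MIN_RQ and MAX_RQ mirror the values quoted for BLKDEV_MIN_RQ and
# BLKDEV_MAX_RQ; the class and method names are hypothetical.

MIN_RQ = 4     # reserve kept so I/O can always make progress
MAX_RQ = 256   # cap on requests outstanding per queue


class RequestPool:
    def __init__(self, min_rq=MIN_RQ, max_rq=MAX_RQ):
        self.min_rq = min_rq
        self.max_rq = max_rq
        self.reserve = [object() for _ in range(min_rq)]  # preallocated
        self.in_flight = 0

    def alloc(self, memory_available=True):
        """Return a request, or None if the caller would have to wait."""
        if self.in_flight >= self.max_rq:
            return None                      # upper bound reached
        if memory_available:
            rq = object()                    # ordinary dynamic allocation
        elif self.reserve:
            rq = self.reserve.pop()          # fall back to the reserve
        else:
            return None                      # reserve exhausted
        self.in_flight += 1
        return rq

    def free(self, rq):
        """Release a request, refilling the reserve before discarding."""
        self.in_flight -= 1
        if len(self.reserve) < self.min_rq:
            self.reserve.append(rq)


pool = RequestPool()
under_pressure = [pool.alloc(memory_available=False) for _ in range(5)]
print([rq is not None for rq in under_pressure])
```

In this model a queue under memory pressure can still issue up to four requests from the reserve, while a queue with plenty of memory stops at 256 outstanding requests.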

He posted a quick revision, then found fault with that as well, and continued work on a new version. Nick Piggin was very interested in the patch, but Jens asked him to wait till he put out a known non-broken one. Badari posted his own patch to fix some of the size issues; and the thread petered out.

2. Power PC Profiling

24 Mar 2003 - 28 Mar 2003 (7 posts) Archive Link: "[patch] oprofile + ppc750cx perfmon"

People: Bryan Rittmeyer, Benjamin Herrenschmidt, John Levon

Bryan Rittmeyer announced:

I've created very preliminary patches to add PPC Performance Monitor support to oprofile cvs and linux 2.4.20-ben8:

Sadly, this approach seems very unstable on a 750CX imac (PVR 0008 2214). The box freezes hard and requires a cold reboot after just a few minutes of profiling. As benh hinted in a previous thread, I suspect there's an undocumented erratum for this CPU related to decrementer + pm use. If anyone has contacts in IBM, further info would be helpful to rule out software error and possibly to obtain a workaround...

Assuming lots of PPC perfmon hardware is effectively useless with the decrementer, some solutions are:

  1. use the pm irq for all timer-related stuff in Linux, and turn off the decrementer completely. May mean we need to disable idle loop HLTs, hurting thermal dissipation / power consumption--I believe the PMC cycle counters stop incrementing inside power_save_6xx().
  2. use the decrementer for profiling. we'd forfeit the perfmon's ability to sample when MSR[EE]=0 (irqs disabled), taking a big chunk out of oprofile's appeal imo.
  3. hybrid approach. when not using oprofile, the kernel runs as it does now via the decrementer. when profiling, we switch everything over to the pm system, disabling the power_save stuff. I suspect this is the best approach on broken silicon, though there may be some minor nastiness with swapping between different exception mechanisms for timer_interrupt()

I'm going to try the hybrid approach and will post new patches soon.

My work is commercially sponsored by Ixia and therefore focuses on 2.4 and the IBM PPC750FX/CXe. However, I'm happy to discuss a 2.5 port, support for other chips, an eventual upstream merge, and any other issues related to bringing up oprofile on the PPC.

He posted more patches, and Benjamin Herrenschmidt, John Levon, and Andrew Fleming showed some interest.

3. Linus On BitKeeper

25 Mar 2003 - 29 Mar 2003 (9 posts) Archive Link: "[BK PATCH] USB changes for 2.5.66"

Topics: USB, Version Control

People: Bill Davidsen, Linus Torvalds, Alan Cox, Thomas Molina, John Goerzen, Greg KH, Ben Collins

Greg KH made some USB changes and asked Linus to pull from his BitKeeper tree. Bill Davidsen replied, "Another "bk-only" patch. Guess I'd better look at the free (as in license, not cost) clone again." And Linus Torvalds replied:

Well, since BK has made it so trivial for me to merge with Greg, the thing is already integrated into my tree, and as a result the patches should already have been sent out on the patch lists by the robots, and the snapshots will follow shortly as the automation decides to kick in.

In short, give BK the credit it deserves. You get all the information you want, and the fact that you depend on and force yourself to use inferior tools is not the fault of BK.

In other words: _despite_ your luddite ways you actually have more information available to you than you would have had without BK.

So stop whining about BK. Put up or shut up - you get timely non-BK snapshots, and the fact that others see the value of their tools in the things they do for them shouldn't be an issue for _you_.

Stay in the stone age if you wish, but don't expect your stone-age muscle-propellered log car to go as fast as the rocket of the future. And don't complain to us who don't want to expend energy on stuff that shouldn't need it. We've got better tools.

Alan Cox quipped, "Rocket propelled cars cause lots of flames unfortunately 8)" And Thomas Molina cautioned, "Can we hold the bk flamewars to one per week please ;)"

Elsewhere, John Goerzen said to Linus, "Not everyone that doesn't use BitKeeper is a luddite. For instance, Ben Collins has been explicitly prevented from using it simply because he happens to work on Subversion. His only option is to shell out thousands for the commercial version. It would serve everyone well to remember that a minority of contributors to this list have the financial resources to work around the onerous licensing terms of BK."

Elsewhere, Greg and Alan both pointed out that actually, Greg did post traditional patches to the USB mailing list; he just felt it was inappropriate to also post them on the linux-kernel mailing list, since they were USB specific.

4. Linux 2.5.66-mm1 Released; Status Of UMSDOS

26 Mar 2003 - 28 Mar 2003 (20 posts) Archive Link: "2.5.66-mm1"

Topics: FS: UMSDOS, FS: ext2, FS: ext3, Ioctls, Real-Time

People: Andrew Morton, Ingo Molnar, Zwane Mwaikambo, Andries Brouwer, Dave Jones, Ed Tomlinson, Mike Galbraith

Andrew Morton announced:

Ed Tomlinson reported an oops after less than a day of uptime; Andrew took a look and said Ingo Molnar would be the right person to analyze this, since Ed had enabled preemption. Ingo replied, "hm, this is an 'impossible' scenario from the scheduler code POV. Whenever we deactivate a task, we remove it from the runqueue and set p->array to NULL. Whenever we activate a task again, we set p->array to non-NULL. A double-deactivate is not possible. I tried to reproduce it with various scheduler workloads, but didn't succeed." It seemed similar to a crash Mike Galbraith had been having; Ingo asked Mike for a backtrace, but Mike said he'd had too many additional patches applied to make the backtrace worth anything.

Zwane Mwaikambo also suspected he had the same problem, "but i never posted due to being unable to reproduce it on a vanilla kernel or the same kernel afterwards (which was hacked so i won't vouch for its cleanliness). I think preempt might have bitten him in a bad place (mine is also CONFIG_PREEMPT), is it possible that when we did the task_rq_unlock we got preempted and when we got back we used the local variable requeue_waker which was set before dropping the lock, and therefore might not be valid anymore due to scheduler decisions done after dropping the runqueue lock?"

Ingo replied, "yes, this one was my only suspect, but it should really never cause any problems. We might change sleep_avg during the wakeup, and carry the requeue_waker flag over a preemptible window, but the requeueing itself re-takes the runqueue lock, and does not take anything for granted. The flag could very well be random as well, and the code should still be correct - there's no requirement to recalculate the priority every time we change sleep_avg. (in fact we at times intentionally keep those values detached.)" No solution (or clear identification of the problem) came up during the thread, but it seems as though they at least gathered some affected systems together.
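The invariant Ingo describes in this thread (a task's array pointer is NULL exactly when the task is off the runqueue) can be modeled in a few lines. This is a toy illustration, not scheduler code: only the `array` field name comes from the discussion, and the classes, methods and error messages are hypothetical.

```python
# Toy model of the invariant Ingo describes; not kernel code.
# Only the `array` field name comes from the thread, the rest is hypothetical.

class Task:
    def __init__(self, name):
        self.name = name
        self.array = None          # None plays the role of p->array == NULL


class RunQueue:
    def __init__(self):
        self.active = []           # stands in for the priority array

    def activate(self, task):
        if task.array is not None:
            raise RuntimeError("double activate")
        self.active.append(task)
        task.array = self.active   # non-NULL while the task is queued

    def deactivate(self, task):
        if task.array is None:
            raise RuntimeError("double deactivate")
        self.active.remove(task)
        task.array = None


rq = RunQueue()
t = Task("worker")
rq.activate(t)
rq.deactivate(t)

try:
    rq.deactivate(t)               # the 'impossible' case Ingo mentions
    double_deactivate_caught = False
except RuntimeError:
    double_deactivate_caught = True

print("double deactivate caught:", double_deactivate_caught)
```

As long as every state change goes through these two paths under the runqueue lock, the second deactivate cannot happen, which is why Ingo called the reported oops an 'impossible' scenario.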

Elsewhere, in a completely different subthread, Andries Brouwer remarked:

struct umsdos_ioctl has twice dev_t followed by padding. Probably these should become unsigned longs. I'll send a patch later tonight.

Is it used anywhere? That requires detective work. It is used by the utilities udosctl (a useless demo utility), umssync and umssetup. I do not know of any others. No doubt people will tell me what I overlooked. Less conservative people will tell me that umsdos has to be killed entirely.

Dave Jones asked, "Isn't it still horribly broken ? I remember Al putting it on the "To be fixed later" burner, but never saw anything happen to it after that aside from janitor style fixes." And Andries replied, "Yes, umsdos is victim of bitrot. Al broke it with his patch called MA37-break-umsdos-C3-pre3. Afterwards people doing global changes failed to do them on umsdos, so the amount of work required to get umsdos in the shape it was before (working, but with races and other problems) increases in time."

5. Initial Port Of Software Suspend (SWSUSP) From 2.4 To 2.5

26 Mar 2003 - 28 Mar 2003 (4 posts) Archive Link: "Annouce: Initial SWSUSP 2.4 port to 2.5 available."

Topics: Big Memory Support, Software Suspend

People: Nigel Cunningham, Patrick Mochel

Nigel Cunningham announced:

It is with delight that I write to announce the first release of the port of Software Suspend for 2.4 to 2.5. This version has all the functionality of the 2.4 version beta19-17 - the current development version for 2.4. This includes the following enhancements over the version currently included in the 2.5 kernel:

There are issues still to be dealt with, but these should not in any way interfere with testing at this stage. They are:

You can find the patch against 2.5.66 on It's in the swsusp-devel section at the bottom of the list.

Patrick Mochel replied, "I'm glad that you have done this work, and I look forward to merging it with the power management infrastructure work I have been doing. However, there are several outstanding issues." He went on:

I will not take the code which still relies on the page flag bits. I look forward to your dynamic bitmap implementation.

As for the current patch, I was unable to test its actual functionality, which I will get to in a moment.

First, I request that you please name the patches something sensible that won't collide with anything else, like swsusp-2.5.66-<n>.diff, not 'patch-2.5.66-<n>.diff'.

Also, please make it explicitly clear which patch(es) are needed for download. On the sourceforge site, both -01 and -02 are highlighted as current, though they appear to be different versions of the same thing.

I'm glad to hear that you have completed the full port, but many people appreciate incremental patches, especially if the cumulative changes touch multiple parts of the kernel. Please consider breaking the one large patch into multiple, easily digestible, chunks.

6. Death Of An ioctl

28 Mar 2003 - 31 Mar 2003 (2 posts) Archive Link: "TIOCTTYGSTRUCT"

Topics: Ioctls

People: Andries Brouwer, Theodore Y. Ts'o

Andries Brouwer asked Theodore Y. Ts'o:

Would you mind if I removed TIOCTTYGSTRUCT?

I suppose you don't need it any longer, and otherwise could easily add some debugging stuff again when needed. This ioctl exports lots of kernel-internal stuff that userspace has no business looking at. The direct reason I ask is that it also exports a kdev_t, and the meaning of that will change.

The reason he asked Ted dated back to a changelog entry from November 1994, when Ted had added initial support for that ioctl, saying, "Add support for the new ioctl TIOCTTYGSTRUCT, which allow a kernel debugging program direct read access to the tty and tty_driver structures."

Now, Ted replied to Andries, "Sure, go ahead; I'm pretty sure no one has used it for at least 6-7 years..."

7. Modutils 2.4.25 Released

29 Mar 2003 (1 post) Archive Link: "Announce: modutils 2.4.25 is available"

Topics: Backward Compatibility

People: Keith Owens

Keith Owens announced modutils v. 2.4.25:

This version of modutils is almost identical to 2.4.23. The changes affect architectures that have function descriptors, i.e. ia64, ppc64, hppa, hppa64. It also adds support for combined s390/s390x utilities.

For historical reasons, insmod and depmod treat modules with neither EXPORT_SYMBOL() nor EXPORT_NO_SYMBOLS() as exporting everything. This provides backwards compatibility with 2.0 kernels and some 2.2 modules. No new code should be relying on this behaviour and the feature has been removed in 2.5 kernels. Unfortunately some developers are still relying on this default behaviour, even for new code.

When an architecture has function descriptors and uses EXPORT_SYMBOL() on a function, gcc generates a function descriptor and ksymtab contains the address of that descriptor. Without an explicit EXPORT_SYMBOL(), gcc does not generate a function descriptor and the exported symbol points to the start of the function body. Any attempt to call that function tries to use the start of the function code as a descriptor and breaks spectacularly.

To prevent this kernel breakage, I am making an incompatible change to modutils. It only affects ia64, ppc64, hppa and hppa64 users, and only if they are relying on the deprecated feature of all symbols being exported.

Users on these architectures must ensure that their modules still resolve and add EXPORT_SYMBOL() where necessary before doing a permanent upgrade to modutils 2.4.25. The simplest way to check is to build (but not install) modutils-2.4.25 then

./depmod/depmod -nae > /dev/null

Any unresolved references that did not occur with modutils 2.4.23 need an explicit EXPORT_SYMBOL(). If this is too much bother, stay on modutils 2.4.23 and risk the kernel breakage.

Other architectures can safely upgrade to 2.4.25 with no change, or they can stay on 2.4.23.

If anybody fancies a janitorial task, configure modutils 2.4.25 with CFLAGS="-O2 -Wall -DHAS_FUNCTION_DESCRIPTORS" ./configure, build it then run ./depmod/depmod -nae > /dev/null. You can do that on any architecture to find kernel modules that still rely on exporting all symbols.

No, I am not going to fudge modutils 2.4 to allow the continued default export of data symbols but not text symbols on architectures with function descriptors. It is too much extra work just to allow the continued use of a deprecated feature that has already been removed in 2.5 kernels.
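For anyone taking up the janitorial task, a source-level scan can give a first list of candidates before running depmod. The sketch below is a crude heuristic and not what depmod does: it only flags source text that mentions neither export macro, and the sample file names and contents are made up for illustration.

```python
# Source-level heuristic only; this is NOT how depmod detects the problem.
# The sample sources below are made up for illustration.
import re

# Matches EXPORT_SYMBOL, variants such as EXPORT_SYMBOL_GPL, and
# EXPORT_NO_SYMBOLS. A mention inside a comment would also match.
EXPORT_MACROS = re.compile(r"\bEXPORT_(SYMBOL\w*|NO_SYMBOLS)\b")


def relies_on_default_export(source_text):
    """True if the module source uses neither export macro."""
    return EXPORT_MACROS.search(source_text) is None


samples = {
    "explicit.c": "int foo(void) { return 0; }\nEXPORT_SYMBOL(foo);\n",
    "none.c": "static int bar;\nEXPORT_NO_SYMBOLS;\n",
    "legacy.c": "int baz(void) { return 1; }\n",
}

suspects = sorted(name for name, text in samples.items()
                  if relies_on_default_export(text))
print(suspects)
```

Only the file with no export macro at all is reported. Real modules spread over several source files would need the check applied per module rather than per file, so the depmod run Keith describes remains the authoritative test.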

8. 'USB Gadget' API And Driver Framework

31 Mar 2003 (1 post) Subject: "ANNOUNCE: Linux "USB Gadget" API and Driver Framework"

Topics: Networking, PCI, USB, User-Mode Linux, Version Control

People: David Brownell

David Brownell announced:


This is a kernel-mode API, and an initial set of drivers for it, that helps Linux 2.4 and 2.5 kernels support intelligent "USB Device" (peripheral) hardware.

The code is ready for more general use by the Linux community, including development of new drivers. It supports network connections over USB "out of the box", using the NetChip 2280 USB 2.0 high speed controller, and is now being used with high speed USB devices running under Linux.

(Note that such an API is on the 2.5 wishlist. This is the first such API that was designed from the ground up to work with the existing host side Linux-USB stack, and to support USB 2.0 high speed devices.)


Temporary web page with info, including:


BitKeeper repositories


There are lots of opportunities to write drivers here, both for dozens more USB "class" specifications and, if you want to get down'n'dirty, for USB device controller hardware.

Please discuss this on the mailing list, unless/until a new list gets set up.


Since talking about a "USB Device Driver" becomes ambiguous when both sides of the protocol stack can run Linux, Linux-USB developers have chosen new terminology. A "USB Device Driver" is what current Linux kernels have: a Host-side driver. A Device-side driver is instead called a "USB Gadget Driver" ... that's why the new name.

The API is straightforward and thin, just one new header file to shape how "gadget" drivers talk to the underlying controller hardware. There is no "mid-layer" requirement, and all policy for device configuration and management goes above this API. I/O involves just submitting an asynchronous request to the relevant endpoint (like URBs but simpler).

There are currently two USB device controller drivers available implementing that API.

There are currently two gadget drivers using that API:

Both of those gadget drivers have compile-time configuration support to let them work with net2280, pxa25x, or sa1100 usb drivers. Each controller has slightly different endpoint capabilities; gadget drivers must choose endpoints and configurations accordingly, and there's no point in trying to do that at run time.

9. Linux Security Module 2.5.66-lsm1 Released

31 Mar 2003 (1 post) Subject: "[ANNOUNCE] 2.5.66-lsm1"

Topics: Version Control

People: Chris Wright, Jakub Jelinek, Stephen Smalley

Chris Wright announced:

The Linux Security Modules project provides a lightweight, general purpose framework for access control. The LSM interface enables developing security policies as loadable kernel modules. See for more information.

2.5.66-lsm1 patch released. This is a rebase up to 2.5.66 as well as some interface and module updates. Out of tree projects will want to resync with interface changes.

Full lsm-2.5 patch (LSM + all modules) is available at:

The whole ChangeLog for this release is at:

The LSM 2.5 BK tree can be pulled from:

 - merge with 2.5.59-66                                 (me)
 - restore file permission hooks to sendfile            (Stephen Smalley)
 - security.h inclusion in network files                (Stephen Smalley)
 - cleanup init[open]_private_file                      (Stephen Smalley)
 - syslog, sysctl cleanups                              (Stephen Smalley)
 - add CONFIG_SECURITY_NETWORK                          (Stephen Smalley)
 - cleanup for newer skb allocation                     (me)
 - SELinux:                                             (Stephen Smalley)
   - labelled network fixes
   - ptrace fixes, drop support for exec_permission_lite
   - minor fixes
   - use kernel SID in reparent_to_init
 - drop task_kmod_set_label hook                        (me)
 - drop explicit exec_permission_lite hook              (me)
 - drop extra call to security_sock_rcv_skb hook        (me)
 - fix setfs[ug]id return values                        (Jakub Jelinek)

10. Gujin bootloader 0.7 Announced

1 Apr 2003 (1 post) Archive Link: "Announce: Gujin bootloader 0.7"

Topics: Compression, Disks: IDE, FS: FAT

People: Etienne Lorrain

Etienne Lorrain announced:

I would be glad to hear comments on this new release, available at

Gujin is a GPL bootloader rewritten from scratch, it can now be installed on a hard disk (in a small FAT 12/16 partition) or on a floppy.

It is still not perfect, but can still:

There is a lot of stuff to code and test still, for instance the default function to get Linux i386 real-mode parameter can be overloaded by code present at end of the compressed kernel (linux_param_realfct_size in vmlinuz.* shall use a function like the one in linuxparam.c which would be concatenated at the end of linux.kgz) and processor restrictions in the comment field of the GZIPed kernel has also to be tested (get the Gujin loader to complain if the kernel cannot be run on the current processor, using all the CPUID flags).

Also, the late relocation address of the kernel can be modified (because the kernel can now be of any size, it takes DMA-able memory below 16 Mb, precious for the sound buffers, so the default loading address shall be moved to 16 Mbytes and over for up to date systems)

If interested, you should begin by downloading the install.tgz package because the compiler used is GCC-2.95.3/4 or GCC-3.0.4 due to compiler bugs. Read the doc first!

I am currently writing the ANSI font download (to not use the PC fonts) and will next support more keyboards. My days just have 24 hours and have a limited time to play with Gujin, so I would appreciate some help.

11. VM Documentation Nearing Completion

1 Apr 2003 (1 post) Archive Link: "Last major update to VM documentation"

Topics: Virtual Memory

People: Mel Gorman

Mel Gorman announced:

Yet another release of the VM docs and I hope this is the last major update to it. The swapping and page replacement chapters are the two most notable changes with the usual cleanups and embellishments elsewhere. I am tentatively saying it is now fully accurate though and the final few technical errors should have been shaken out of it. If I'm wrong, feel free to point it out and laugh a bit.

Arguably, what is more important is that I've written a set of acknowledgments where I tried to compile a list of everyone that gave me a hand. Thanks to anyone who sent me technical corrections, grammar corrections and the odd word of encouragement, it is much appreciated. If I missed anyone, send me an indignant email.

I am aiming to leave this pretty much as it is for the next two weeks and if nothing major happens, it'll be rubber stamped, finalised and I'll make the TeX source publicly available (hopefully on or somewhere else that doesn't depend on my website existing) and start working on something else

Understanding the Linux Virtual Memory Manager


Code Commentary


As usual, comments and feedback welcome

12. libivykis 0.2 Announced, A New FD Event Handling Library

1 Apr 2003 (1 post) Archive Link: "[ANNOUNCE] libivykis 0.2, fd event handling library"

People: Lennert Buytenhek

Lennert Buytenhek announced:

libivykis is a thin wrapper over various OS'es implementation of I/O readiness notification facilities (such as poll(2), kqueue(2), epoll_create(2)), and is mainly intended for writing portable high-performance network servers.

It has so far been used to implement a streaming video server and proxy servers for various protocols.

This is the first public release.
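To give a feel for what a thin wrapper over I/O readiness notification looks like, here is a small sketch using Python's standard `selectors` module, which likewise sits on top of poll, epoll or kqueue depending on the platform. It is only an analogy: none of the names below are libivykis functions, and the callback-dispatch pattern is an assumption about how such a library is typically used.

```python
# Illustration of readiness notification using Python's standard
# `selectors` module, which also wraps poll/epoll/kqueue.
# This is NOT the libivykis API.
import selectors
import socket

sel = selectors.DefaultSelector()      # picks epoll, kqueue, poll, ... per OS
reader, writer = socket.socketpair()
reader.setblocking(False)

received = []


def on_readable(sock):
    """Callback invoked when `sock` has data to read."""
    received.append(sock.recv(1024))


# Register interest in read readiness and attach the callback as user data.
sel.register(reader, selectors.EVENT_READ, data=on_readable)

writer.send(b"ping")

# One iteration of the event loop: wait for ready fds, dispatch callbacks.
for key, _events in sel.select(timeout=1.0):
    key.data(key.fileobj)

sel.unregister(reader)
sel.close()
reader.close()
writer.close()

print(received)
```

A network server built this way registers one callback per connection and repeats the select-and-dispatch step in a loop, leaving the choice of underlying system call to the wrapper.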

13. Email Notification Of BK->CVS Updates

2 Apr 2003 (2 posts) Subject: "BK->CVS notify?"

Topics: Version Control

People: Larry McVoy, Matti Aarnio

Larry McVoy said:

Hi, I was moving machines around and for the last day and a half or so the CVS trees weren't getting updated. They should be up to date as of now.

Do we need a cvs-updates mailing list which gets notified of updates or do you not care? I'd be interested in knowing how many people are using the CVS trees.

One other comment: there was concern that the incremental update would not be as good as a one pass conversion. So far, the one pass and the incremental results are identical, no exceptions.

Matti Aarnio gave a link to existing vger mailing lists and pointed out, "VGER is already running bk-commit-head and bk-commit-24 lists. Adding bk-cvs-head-notify / bk-cvs-24-notify makes hardly detectable addition to load."

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.