Kernel Traffic #186 For 29 Sep 2002

By Zack Brown

Mailing List Stats For This Week

We looked at 1481 posts in 8463K.

There were 374 different contributors. 196 posted more than once. 162 posted last week too.

1. VM Subsystem Necessitating User-Space Changes; Favorite Kernel Trees

13 Sep 2002 - 19 Sep 2002 (17 posts) Subject: "2.5.34-mm4"

Topics: Disks: IDE, FS: ext3, Networking, Version Control, Virtual Memory

People: Andrew Morton, Axel Siebenwirth, Rik van Riel, Alan Cox, Andi Kleen, Bill Davidsen, Jens Axboe

Andrew Morton gave a URL to his -mm patches and announced:

Some additional work has been performed on the new, faster sleep/wakeup facilities.

I have converted TCP/IPV4 over to use the faster wakeups. It would be appreciated if the people who are interested in (and set up for testing) high performance networking could test this out. Note however that there is no benefit to select()/poll(). That's quite a large change.

So please bear in mind that this code will only help if applications are generally sleeping in accept(), connect(), etc. At this stage I'd like to know whether this work is generally something which should be pursued further - let's be careful that the measurements are not swamped by select()/poll() wakeups.

The individual patches are:

These apply against 2.5.26 and possibly earlier, and testing against earlier kernels would be valid. Thanks.

Changes have been made to /proc/stat which break top(1) and vmstat(1). New versions are available at and newer versions will appear at

Axel Siebenwirth reported:

With changing from 2.5.34-mm2 to -mm4 I have experienced some moments of quite unresponsive behaviour. For example I am building X which at that special moment causes pretty heavy disk load and the system doesn't respond at all. I was using X and was not able to switch consoles or move mouse only extremely sluggish.

I have seen that it used more swap than usual.

             total       used       free     shared    buffers     cached
Mem:        191096     159340      31756          0      10568      94100
-/+ buffers/cache:      54672     136424
Swap:       289160          0     289160

This is how it looks like under normal circumstances and when building X I had 20M in swap usage which seemed quite a lot to me. Maybe I'm just wrong. Unfortunately I was not able to start vmstat, first because I can't start vmstat when system is not responding and second it doesn't work anyway because of your changes.

To the vmstat breakage, Andrew said:

Yeah, sorry. The burden of back-compatibility weighed too heavy and Rik decided that we just have to fix userspace to follow kernel changes. There will be breakage for a while; updates are at

Unfortunately, those updates cause odd-but-not-serious things to happen to Red Hat initscripts. This happens when you install standard util-linux as well. It is due to the initscripts passing in arguments which the standard tools do not understand.

And to the increased swap usage, he went on, "2.5 is much more swaphappy than 2.4. I believe that this is actually correct behaviour for optimum throughput. But it just happens that people (me included) hate it. We don't notice the improved runtimes for the pagecache-intensive operations but we do notice the time it takes to get the xterms working again. We have not yet sat down and worked out what to do about this." Rik replied, saying he was about to send his procps patches upstream to the procps CVS tree, which he expected would fix the breakage soon.

At this point the discussion bent off into which kernel tree seemed best at the moment. A few posts down the line, Rik said, "Current 2.5 is sluggish on systems with a fast CPU and 768 MB of RAM, whereas current -ac runs the same workload smoothly with 128 MB of RAM." Andrew replied:

I've been running 2.5 on my desktop at work (800MHz/256M UP) since 2.5.26 and on the machine at home (Dual 850MHz/768M) on-and-off (recent freizures sent that machine back to Marcelo; need to try again). I also ran 2.4.19-ac-something for a couple of weeks.

Impressions are:

Overall I find Marcelo kernels to be the most comfortable, followed by 2.5. Alan's kernels I find to be the least comfortable in a "developer's desktop" situation.

Rik pointed out that the stalls during heavy write-out were "due to the fact that -ac has an older -rmap VM. As in current 2.5, rmap can write out all inactive pages ... and it did in some worst case situations. This is fixed in rmap14." He added, "I hope Alan is done playing with IDE soon so I can push him a VM update." And Alan Cox replied, "send me rmap-14a patches by all means."

Elsewhere, also in reply to Andrew's assessment of kernel trees, Andi Kleen remarked, "-aa kernels are marcelo kernels, just with the corner cases fixed too. Works very nicely here." But Bill Davidsen said, "Corner cases? The IDE, VM and scheduler are different..." And Jens Axboe replied, "The IDE is the same, I'll refrain from commenting on the rest. There's just an adjustment to the read ahead, which makes sense."

2. Kernel Conf 0.6 Released; Merge With kbuild

16 Sep 2002 - 23 Sep 2002 (14 posts) Subject: "linux kernel conf 0.6"

Topics: Kernel Build System

People: Roman Zippel, Sam Ravnborg, Kai Germaschewski

Roman Zippel announced:

At you can find the latest version of the new config system. Changes this time:

only used for type definitions. They were only needed to keep the old config system working, but shouldn't be needed anymore, this allows to generate slightly better dependencies in the generated configs.

Sam Ravnborg posted a patch and replied:

I have been working on integrating lkc with kbuild. Here is the result.


Added infrastructure to support host-ccprogs, in other words support tools written (partly) in c++.


As kbuild does not distinguish between the individual objects used for a given target, but (tries to) build them all, I have found a solution where I create one Makefile for each executable. I could not see a clean way to integrate this in kbuild, and finally decided that in this special case a number of Makefiles did not hurt too much.


Prepared for "_shipped" files.
Rename lex.zconf.c to lex.zconf.c_shipped etc. in the version ready to go in the kernel.

Roman was very pleased with this work, but to Sam's flex/bison point, he objected:

This works quite well for users, but it's very annoying for the developer. Kai, any chances to use md5sum for this at some point, e.g. with a helper script like this:

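set -e
src=$1
dst=$2
shift 2
# If the checksum recorded on the last line of $dst still matches $src, just touch $dst and stop.
test -f $dst && tail -1 $dst | sed 's,/\* \(.*\) \*/,\1,' | md5sum -c && touch $dst && exit 0
# Otherwise print and run the generator command, then record the new checksum.
echo "$@"
"$@"
echo "/* $(md5sum $src) */" >> $dst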

The only problem with this script is that it only supports a single input and output file.
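The idea behind Roman's helper can be restated as a rough Python sketch: stamp the generated file with the checksum of its source, and rerun the generator only when the stamp no longer matches. The function names, the stamp layout and the way the command is passed in are assumptions made for illustration; none of this comes from the thread or from kbuild. Like the shell version, it handles a single input and a single output file.

```python
import hashlib
import os
import subprocess


def stamp_for(src):
    """Return the trailer line recording src's MD5 (md5sum-like 'digest  name' layout)."""
    with open(src, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return "/* %s  %s */" % (digest, src)


def regenerate_if_changed(src, dst, command):
    """Run command (expected to write dst) unless dst already carries src's stamp.

    Returns True if the command was run, False if dst was already up to date.
    """
    stamp = stamp_for(src)
    if os.path.exists(dst):
        with open(dst) as f:
            lines = f.read().splitlines()
        if lines and lines[-1] == stamp:
            os.utime(dst)  # same effect as 'touch': make sees dst as fresh
            return False
    print(" ".join(command))
    subprocess.run(command, check=True)
    # Append the stamp as the last line of the generated file.
    with open(dst, "a") as f:
        f.write(stamp + "\n")
    return True
```

Here a developer who edits the grammar gets a regenerated parser, while a user with untouched sources never needs bison or flex installed. That is the trade-off Roman and Kai were weighing.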

Kai Germaschewski also said he was happy with Sam's work, and said he (Kai) had also made some improvements. He replied to Roman's objection:

I'm not particularly fond of these md5sum hacks. I don't think it's all that annoying for the developer, either, it's basically just a alias make="make LKC_GENPARSER=1"

(Of course, you'll have to update the _shipped files eventually, but there isn't really any way around that either way)

One might consider setting LKC_GENPARSER based on a test if bison/flex are in the path.

3. Supporting Large Numbers Of Threads

17 Sep 2002 - 22 Sep 2002 (84 posts) Subject: "[patch] lockless, scalable get_pid(), for_each_process() elimination, 2.5.35-BK"

Topics: Big O Notation, FS: sysfs, SMP, Version Control

People: Ingo Molnar, William Lee Irwin III, Linus Torvalds

Ingo Molnar announced:

the attached patch is yet another step towards world-class threading support.

the biggest known load-test of the new threading code so far was 1 million concurrent kernel threads (!) running on Anton's 2.5.34 Linux box. On my testsystem i was able to test 100,000 concurrent kernel threads, which can be started and stopped within 2 seconds.

While even the biggest internet servers are not quite at 1 million parallel users yet, even a much smaller 10,000 threads workload causes the O(N^2) get_pid() algorithm to explode, if consecutive PID ranges are taken and the PID value overflows and reaches the lower end of the PID range.

With 100,000 or more threads get_pid() causes catastrophic, many minutes silent 'lockup' of the system. Plus the current get_pid() implementation iterates through *every* thread in the system if it reaches a PID that was used up before - which can happen quite often if the rate of thread creation/destruction is sufficiently high. Besides being slow, the algorithm also touches lots of unrelated cachelines, effectively flushing the CPU cache. Eg. for the default pid_max value of 32K threads/processes, if there are only 10% processes (3200), statistically we will flush the cache for every 10 threads created. This is unacceptable.

there are a number of patches floating around that try to improve the worst-case scenario of get_pid(), but the best they can achieve is a runtime construction of a bitmap and then searching it => this still sucks performance-wise, destroys the cache and is generally a very ugly approach.

the underlying problem is very hard - since get_pid() not only has to take PIDs into account, but TGIDs, session IDs and process groups as well.

then i found one of wli's older patches for 2.5.23 [grr, it was not announced anywhere, i almost started coding the same problem], which provides the right (and much harder to implement) approach: it cleans up PID-space allocation to provide a generic hash for PIDs, session IDs, process group IDs and TGIDs, properly allocated and freed. This approach, besides paving the way for a scalable and time-bounded get_pid() implementation, also got rid of roughly half of for_each_process() (do_each_thread()) iterations done in the kernel, which alone is worth the effort. Now we can cleanly iterate through all processes in a session group or process group.

i took the patch, adopted it to the recent ->ptrace_children and threading related changes, fixed a couple of bugs and made it work. It really worked well, nice work William!

I also wrote a new alloc_pid()/free_pid() implementation from scratch, which provides lockless, time-bounded PID allocation. This new PID allocator has a worst-case latency of 10 microseconds on a cache-cold P4, the cache-hot worst-case latency is 2 usecs, if pid_max is set to 1 million.

Ie. even in the most hopeless situation, if there are 999,999 PIDs allocated already, it takes less than 10 usecs to find and allocate the remaining one PID. The common fastpath is a couple of instructions only. The overhead of skipping over continuous regions of allocated PIDs scales gracefully with the number of bits to be skipped, from 0 to 10 usecs.
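For readers who want a feel for what a bitmap-based PID allocator does, here is a minimal single-threaded sketch in Python. It is not Ingo's code: it does not attempt the lockless behaviour, it scans one bit at a time rather than skipping whole words, and the class name and the reserved-range value of 300 are assumptions made for illustration.

```python
class PidBitmap:
    """Next-fit PID allocator over a bitmap (illustrative, single-threaded)."""

    def __init__(self, pid_max, reserved=300):
        self.pid_max = pid_max
        self.reserved = reserved      # assumed: after wraparound, restart here
        self.bits = bytearray((pid_max + 7) // 8)
        self.last_pid = 0
        self.bits[0] |= 1             # PID 0 is never handed out

    def _in_use(self, pid):
        return bool(self.bits[pid >> 3] & (1 << (pid & 7)))

    def alloc_pid(self):
        """Return the next free PID after last_pid, or None if none is free."""
        pid = self.last_pid + 1
        for _ in range(self.pid_max):
            if pid >= self.pid_max:
                pid = self.reserved   # wrap around, skipping the reserved range
            if not self._in_use(pid):
                self.bits[pid >> 3] |= 1 << (pid & 7)
                self.last_pid = pid
                return pid
            pid += 1
        return None                   # PID space exhausted: the caller must fail

    def free_pid(self, pid):
        self.bits[pid >> 3] &= ~(1 << (pid & 7)) & 0xFF
```

The point of the structure is that allocation cost depends only on the bitmap, never on walking every task in the system, and that exhaustion is reported to the caller instead of looping.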

(In the fastpath, both the alloc_pid() and free_pid() function falls through to a 'ret' instruction if compiled with gcc 3.2 on x86.)

i tested the new PID allocation functions under heavy thread creation workloads, and the new functions just do not show up in the kernel profiler ...

[ on SMP the new PID allocator has a small window to not follow the 'monotonic forward allocation of PIDs' semantics provided by the previous implementation - but it's not like we can guarantee any PID allocation sequence to user-space even with the current get_pid() allocation. The new allocator still follows the last_pid semantics in the typical case. The reserved PID range is protected as well. ]

memory footprint of the new PID allocator scales dynamically with /proc/sys/kernel/pid_max: the default 32K PIDs cause a 4K allocation, a pid_max of 1 million causes a 128K footprint. The current absolute limit for pid_max is 4 million PIDs - this does not cause any allocation in the kernel, the bitmaps are demand-allocated runtime. The pidmap table takes up 512 bytes.
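The footprint figures follow from one bit per PID. The short calculation below reproduces them, assuming 4096-byte pages and reading "32K", "1 million" and "4 million" as powers of two; the helper names are made up for this example.

```python
PAGE_SIZE = 4096                  # bytes; assumed x86 page size
BITS_PER_PAGE = PAGE_SIZE * 8     # one bit per PID


def bitmap_pages(pid_max):
    """Number of bitmap pages needed to cover pid_max PIDs."""
    return (pid_max + BITS_PER_PAGE - 1) // BITS_PER_PAGE


def bitmap_bytes(pid_max):
    """Bytes of bitmap if every page were allocated."""
    return bitmap_pages(pid_max) * PAGE_SIZE


for label, pid_max in [("default 32K", 32 * 1024),
                       ("1 million", 1024 * 1024),
                       ("4 million limit", 4 * 1024 * 1024)]:
    print("%-16s %4d page(s) %5dK" % (label, bitmap_pages(pid_max),
                                      bitmap_bytes(pid_max) // 1024))
```

At the 4 million limit this gives 128 page slots. A 512-byte table would correspond to 4 bytes per slot, though that is an inference and not something the announcement states.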

and as an added bonus, the new PID allocator fails in fork() properly if the whole PID space is used up. BK-curr's get_pid() still didnt do this properly.

i have tested the patch on BK-curr, on x86 UP and SMP boxes - it compiles, boots and works just fine. X and the other session/pgrp-intensive applications appear to work just fine as well.

William Lee Irwin III, the current maintainer of the patch, said, "Thank you for taking up the completion of development on and maintenance of this patch. I have not had the time resources to advance it myself, though now with your help I would be glad to contribute to the effort. If you would like to assume ownership, I'd be glad to hand it over, and send patches removing additional instances of for_each_process() to you as I find the time." And Ingo replied, "well, it's your baby, i only dusted it off, merged it and redid the PID allocator. I have intentionally left out some of the for_each_task eliminations, to ease the merging - you are more than welcome to extend the patch."

Elsewhere, there was a medium-sized argument over whether Ingo's and William's work was needed at all. To many folks, it seemed that simply making PID a sufficiently large number would solve all the practical problems. Linus Torvalds was one of these folks. He felt Ingo's and William's work was adding too much complexity for very little gain. Ingo and others defended the patch, and by the end of the discussion Linus was willing to allow it, if they'd do some cleanup work and some tighter integration with existing code. Ingo said he was working on it.

4. Zero-Copy NFS For 2.5.36

18 Sep 2002 - 21 Sep 2002 (20 posts) Subject: "[PATCH] zerocopy NFS for 2.5.36"

Topics: FS: NFS, FS: XFS

People: Hirokazu Takahashi

Hirokazu Takahashi announced:

I ported the zerocopy NFS patches against linux-2.5.36.

I made va05-zerocopy-nfsdwrite-2.5.36.patch more generic, so that it would be easy to merge with NFSv4. Each procedure can choose whether it can accept split buffers or not. And I fixed a problem that nfsd couldn't handle NFS-symlink requests which were very large.

    1) This patch enables HW-checksum against outgoing packets including UDP frames.
    2) This patch makes sendfile systemcall over UDP work. It also supports UDP_CORK interface which is very similar to TCP_CORK. And you can call sendmsg/sendfile with MSG_MORE flags over UDP sockets.
    3) This patch fixes the problem of x86 csum_partial() routines which can't handle odd addressed buffers.
    4) This patch makes RPC can send some pieces of data and pages without copy.
    5) This patch makes NFSD send pages in pagecache directly when NFS clients request file-read.
    6) nfsd_readdir can also send pages without copy.
    7) This patch makes per-cpu UDP sockets so that NFSD can send UDP frames on each processor simultaneously. Without the patch we can send only one UDP frame at the time as a UDP socket have to be locked during sending some pages to serialize them.
    8) This patch enables NFS-write uses writev interface. NFSd can handle NFS requests without reassembling IP fragments into one UDP frame.
    9) This patch makes writev for regular file work faster. It also can be found at

    Caution: XFS doesn't support writev interface yet. NFS write on XFS might slow down with No.8 patch. I wish SGI guys will implement it.

    10) This makes NFS buffer much bigger (60KB). 60KB buffer is the same to 32KB buffer for linux-kernel as both of them require 64KB chunk.
    11) If you don't want to use sendfile over UDP yet, you can apply it instead of No.1 and No.2 patches.

5. contest Benchmark Results Comparing 2.5.34 With 2.5.36

18 Sep 2002 - 19 Sep 2002 (10 posts) Subject: "[BENCHMARK] contest results for 2.5.36"

Topics: Virtual Memory

People: Con Kolivas

Con Kolivas reported:

Here are the latest results with 2.5.36 compared with 2.5.34

No Load:
Kernel                  Time            CPU
2.4.19                  68.14           99%   
2.4.20-pre7             68.11           99%
2.5.34                  69.88           99%   
2.4.19-ck7              68.40           98%
2.4.19-ck7-rmap         68.73           99%   
2.4.19-cc               68.37           99%
2.5.36                  69.58           99%

Process Load:
Kernel                  Time            CPU
2.4.19                  81.10           80%
2.4.20-pre7             81.92           80%
2.5.34                  71.39           94%
2.5.36                  71.80           94%

Mem Load:
Kernel                  Time            CPU   
2.4.19                  92.49           77%   
2.4.20-pre7             92.25           77%
2.5.34                  138.05          54%
2.5.36                  132.45          56%

IO Halfmem Load:
Kernel                  Time            CPU
2.4.19                  99.41           70%   
2.4.20-pre7             99.42           71%
2.5.34                  74.31           93%
2.5.36                  94.82           76%

IO Fullmem Load:
Kernel                  Time            CPU
2.4.19                  173.00          41%
2.4.20-pre7             146.38          48%
2.5.34                  74.00           94%
2.5.36                  87.57           81%

The full log for 2.5.34 is:

noload Time: 69.88  CPU: 99%  Major Faults: 247874  Minor Faults: 295941
process_load Time: 71.39  CPU: 94%  Major Faults: 204811  Minor Faults: 256001
io_halfmem Time: 74.31  CPU: 93%  Major Faults: 204019  Minor Faults: 255284
Was writing number 4 of a 112Mb sized io_load file after 76 seconds
io_fullmem Time: 74.00  CPU: 94%  Major Faults: 204019  Minor Faults: 255289
Was writing number 2 of a 224Mb sized io_load file after 98 seconds
mem_load Time: 138.05  CPU: 54%  Major Faults: 204107  Minor Faults: 255695

and for 2.5.36 is:

noload Time: 69.58  CPU: 99%  Major Faults: 242825  Minor Faults: 292307
process_load Time: 71.80  CPU: 94%  Major Faults: 205009  Minor Faults: 256150
io_halfmem Time: 94.82  CPU: 76%  Major Faults: 204019  Minor Faults: 255214
Was writing number 6 of a 112Mb sized io_load file after 104 seconds
io_fullmem Time: 87.57  CPU: 81%  Major Faults: 204019  Minor Faults: 255312
Was writing number 3 of a 224Mb sized io_load file after 119 seconds
mem_load Time: 132.45  CPU: 56%  Major Faults: 204115  Minor Faults: 255234

As you can see, going from 2.5.34 to 2.5.36 has had a minor improvement in response under memory loading, but a drop in response under IO load. The log shows more was written by the IO load during benchmarking in 2.5.36. The values are different from the original 2.5.34 results I posted as there was a problem with the potential for loads overlapping, and doing the memory load before others made for heavy swapping afterwards.
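To put numbers on that summary, the snippet below computes the relative change in elapsed time between the two kernels for the three loads Con singles out. The times are copied from the tables above; the dictionary and function names are just for this example.

```python
# (2.5.34 time, 2.5.36 time) in seconds, taken from the tables above
results = {
    "mem_load": (138.05, 132.45),
    "io_halfmem": (74.31, 94.82),
    "io_fullmem": (74.00, 87.57),
}


def percent_change(old, new):
    """Relative change in elapsed time; negative means 2.5.36 finished sooner."""
    return (new - old) / old * 100.0


for name, (old, new) in results.items():
    print("%-11s %7.2f -> %7.2f  %+.1f%%" % (name, old, new,
                                             percent_change(old, new)))
```

The memory-load run is roughly 4% shorter on 2.5.36, while the two IO-load runs take about 28% and 18% longer.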

contest has been upgraded to v0.34 with numerous small changes and a few fixes. It can be downloaded here:

6. Linux 2.4.20-pre7-ac2 Released

18 Sep 2002 - 19 Sep 2002 (5 posts) Subject: "Linux 2.4.20-pre7-ac2"

Topics: Disks: IDE, FS: JFS, Kernel Release Announcement, USB

People: Alan Cox, Hugh Dickins, Joel Becker, Andre Hedrick, Ivan Kokshaysky, Steven Cole

Alan Cox announced 2.4.20-pre7-ac2 and said:

Ok thats the worst of the queue cleared. I still need to sort out the JFS module cond_resched stuff. This kernel also reports itself as -ac1. I know, but I only just noticed..

Linux 2.4.20-pre7-ac2

7. Supporting Large Numbers Of Threads (Continued)

18 Sep 2002 - 20 Sep 2002 (28 posts) Subject: "[patch] generic-pidhash-2.5.36-D4, BK-curr"

Topics: SMP, Version Control

People: Ingo Molnar, Linus Torvalds

Ingo Molnar announced:

the attached patch is a significantly cleaned up version of the generic pidhash patch, against BK-curr. Changes:

the patch is stable in both UP and SMP tests.

performance and robustness measurements:

thread creation+destruction (full) latency is the one that is most sensitive to the PID allocation code. I've done testing on a 'low end server' system with 1000 tasks running, and i've also done 5000, 10000 and 50000 (inactive) threads test. In all cases pid_max is at a safely large value, 1 million, the PID fill ratio is 1% for the 10K threads test, 0.1% in the 1000 threads test.

i created and destroyed 1 million threads, in a serial way:

./perf -s 1000000 -t 1 -r 0 -T --sync-join

4 of such measurements were running at once, to load the dual-P4 SMP box. All CPUs were 100% loaded during the test. Note that pid_max and the number of threads created is 1 million as well, this is to get a fair average of passing the whole PID range.

BK-stock gives the following results:

        # of tasks:             250     1000    5000    10000   50000
        4x 1m threads (sec):    21.2    23.0    47.4    [NMI]   [NMI]

the results prove that even this extremely low PID space fill rate causes noticeable PID allocation overhead. Things really escalate at 10000 threads, the NMI watchdog triggered very quickly (ie. an irqs-off latency of more than 5 seconds happened). The results would have been well above 200 seconds. Even in the 5000 threads test the system was occasionally very jerky, and probably missed interrupts as well.

Note that the inactive threads (1K, 5K, 10K and 50K threads) were using a consecutive PID range, which favors the get_pid() algorithm, as all cachemisses will be concentrated into one big chunk, and ~0.95 million PIDs can be allocated without any interference afterwards. With a more random distribution of PIDs the overhead is larger.

BK+patch gives the following results:

        # of tasks:             250     1000    5000    10000   50000
        4x 1m threads (sec):    23.8    23.8    24.1    25.5    27.7

ie. thread creation performance is stable, it increases slightly, probably due to hash-chains getting bigger. The catastrophic breakdown in performance is solved.

but i'm still not happy about the 250 tasks performance of the generic pidhash - it has some constant overhead. It's not noticeable in fork latencies though. I think there are still a number of performance optimizations possible that we can do.

(i have not attempted to measure the improvement in those cases where the for_each_task loop was eliminated - it's no doubt significant.)

Linus Torvalds said, "Hmm.. I think I like it." He did have some objections, and after some debate, Ingo posted a new version, saying:

i've attached the latest version of the generic pidhash patch. The biggest change is the removal of separately allocated pid structures: they are now part of the task structure and the first task that uses a PID will provide the pid structure. Task refcounting is used to avoid the freeing of the task structure before every member of a process group or session has exited.

this approach has a number of advantages besides the performance gains. Besides simplifying the whole hashing code significantly, attach_pid() is now fundamentally atomic and can be called during create_process() without worrying about task-list side-effects. It does not have to re-search the pidhash to find out about raced PID-adding either, and attach_pid() cannot fail due to OOM. detach_pid() can do a simple put_task_struct() instead of the kmem_cache_free().

the only minimal downside is the potential pending task structures after session leaders or group leaders have exited - but the number of orphan sessions and process groups is usually very low - and even if it's higher, this can be regarded as a slow execution of the final deallocation of the session leader, not some additional burden.

[Benchmark results comparing stock kernel against patched kernel come after the Changelog.]

Changes since -D4:


single-thread create+exit latency (full latency, including the release_task() overhead) on a UP 525 MHz PIII box:

        stock BK-curr
        -> 3864 cycles [7.025455 usecs]

        BK-curr + generic-pidhash-J2
        -> 3430 cycles [6.236364 usecs]

ie. the patched kernel is 12% faster than the stock kernel. Most of the improvement is due to the fork.c and exit.c optimizations.
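The quoted figures can be cross-checked with a little arithmetic. This is only a calculation on the numbers in the mail, and the variable names are made up for the example.

```python
stock_cycles, stock_usecs = 3864, 7.025455
patched_cycles, patched_usecs = 3430, 6.236364

# Cycles per microsecond implied by each cycles/usecs pair.
stock_rate = stock_cycles / stock_usecs
patched_rate = patched_cycles / patched_usecs

# Two common ways of expressing the improvement.
time_saved = 1 - patched_cycles / stock_cycles      # less time per create+exit
rate_gain = stock_cycles / patched_cycles - 1       # more create+exit per second

print("implied clock: %.1f and %.1f cycles/usec" % (stock_rate, patched_rate))
print("time saved:    %.1f%%" % (time_saved * 100))
print("rate gain:     %.1f%%" % (rate_gain * 100))
```

Both pairs work out to about 550 cycles per microsecond. The improvement is about 11% when measured as time saved per operation and a little under 13% when measured as operations per second, so "12%" is a fair round figure either way.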

Note that this is the best-possible benchmark scenario for the old PID allocator: the new PID allocator's best-case latency is burdened by the fact that it avoids worst-case latencies, and its average latency is independent of layout of the PID space. Yesterday's test that proved the old allocator's high sensitivity to the PID allocation rate and patterns is still valid.

i've compiled, booted & tested the patch on UP and SMP x86 as well.

Linus said:

Ok, applied.

I'm also applying the session handling changes to tty_io.c as a separate changeset, since the resulting code is certainly cleaner and reading people's arguments and looking at the code have made me think it _is_ correct after all.

And as a separate changeset it will be easier to test and perhaps revert on demand.

8. Module Updates

19 Sep 2002 (1 post) Subject: "[PATCH] Updated module rewrite."

Topics: Backward Compatibility

People: Rusty Russell

Rusty Russell announced:

Convenient mega-patch:

You'll want the 0.4 version of module init tools:

Changes (roughly, it's been busy here):

9. Linux 2.4.20-pre7-ac3 Released

19 Sep 2002 - 20 Sep 2002 (5 posts) Subject: "Linux 2.4.20-pre7-ac3"

Topics: Braille, FS: JFS, Hot-Plugging, PCI, USB

People: Alan Cox, Jens Axboe, Dominik Brodowski

Alan Cox announced:

I still need to sort out the JFS cond_resched

Linux 2.4.20-pre7-ac3

10. Linux Trace Toolkit 0.9.6-pre1 Released

19 Sep 2002 (1 post) Subject: "LTT 0.9.6pre1: Lockless logging, ARM, MIPS, etc."

Topics: Real-Time: RTAI, SMP

People: Karim Yaghmour, Frank Rowand, Tom Zanussi

Karim Yaghmour announced:

A new development version of LTT, 0.9.6pre1, is now available. At this point, LTT supports 6 architectures: i386, PPC, S/390, ARM, SuperH, and MIPS.

Here's what's in 0.9.6pre1:

The lockless logging feature is very important because LTT can now trace a system without using any spinlocks or IRQ disabling whatsoever. Though this feature is currently only in the 2.5.35 patch included with LTT, it will be part of all patches for kernels past 2.5.35. At this time, preliminary testing of the 2.5.35 patch included in 0.9.6pre1 has shown some minor issues which will result in lockups. These issues will be addressed in a patch I will soon be releasing for 2.5.36.

This is a development release, so the usual warnings apply.

You will find 0.9.6pre1 here:

LTT's web site is here:

11. Support For GMIIPHY And GMIIREG ioctls In 2.4's 8139cp Ethernet Driver

19 Sep 2002 (1 post) Subject: "[PATCH] 8139cp: SIOCGMIIPHY and SIOCGMIIREG"

Topics: Ioctls, Networking

People: Felipe W Damasio

Felipe W Damasio announced:

This patch adds support to the GMIIPHY and GMIIREG ioctls to the 2.4 version of the 8139cp ethernet driver.

This is required so we don't break apps (eg mii-diag, mii-tools) who rely on these ioctls to get the NIC's settings.

Patch against 2.4.20-pre7.

Please consider pulling it from:

12. AccessFS 0.4 Released For 2.5.34

20 Sep 2002 (1 post) Subject: "[ANNOUNCE][PATCH] accessfs v0.4 - 2.5.34"

Topics: FS: accessfs

People: Olaf Dietsche

Olaf Dietsche announced:

Accessfs is a new file system to control access to system resources. Currently it controls access to inet_bind() with ports < 1024 only.

With this patch, there's no need anymore to run internet daemons as root. You can individually configure which user/program can bind to ports below 1024.

For further information see the help text.

I adapted accessfs to linux 2.5.34.

The patch is attached below. It is also available at:

I did minimal testing using uml 0.58-2.5.34.

13. Linux 2.5.37 Released

20 Sep 2002 (3 posts) Subject: "Linux 2.5.37"

Topics: FS: driverfs, Power Management: ACPI, Virtual Memory

People: Linus Torvalds

Linus Torvalds announced 2.5.37, saying:

Lots of stuff all over the map. Arch updates (ppc*, sparc*, x86 machine reorg), VM merges from Andrew, ACPI updates, BIO layer updates, networking, driverfs, build process, pid hash, you name it it's there.

And that probably still means I missed some stuff.

14. New ext3 Indexed-Directory Patch

20 Sep 2002 - 22 Sep 2002 (3 posts) Subject: "New version of the ext3 indexed-directory patch"

Topics: Backward Compatibility, FS: ext3, Version Control

People: Theodore Y. Ts'o, Andrew Morton

Theodore Y. Ts'o announced:

I've done a bunch of hacking on the ext3 indexed directory patch, and I believe it's just about ready for integration with the 2.5 tree. Testing and comments are appreciated.

The code can be found either via bitkeeper, at:


Or for those people who want straight diffs, patches against 2.4 and 2.5 can be found at:

Here are my Release Notes for my changes: (i.e., the changes from Christopher Li's port of Daniel Phillips's hashtree code)

Andrew Morton was very pleased with this, and asked, "What is the status of e2fsprogs support for htree? Is everything covered?" Theodore replied:

Almost. E2fsck support is fully there. With e2fsprogs 1.28, you still need to manually set up the dir_index feature flag to convert a filesystem to use the directory indexing feature:

debugfs -w -R "features dir_index" /dev/hdXX
debugfs -w -R "ssv def_hash_version tea" /dev/hdXX
debugfs -w -R "ssv hash_seed random" /dev/hdXX

In the 1.29 release, mke2fs will create filesystems with the directory indexing flag set by default, and tune2fs -O dir_index will set up the directory indexing flag and the default hash version automatically.

There is also a bug in e2fsprogs 1.28 so that the -D option to e2fsck (which optimizes all directories) has a 1 in 512 chance of corrupting a directory, due to a fencepost error that escaped my testing. This will be fixed in 1.29, which should be released very shortly.

Folks who want to play with the latest e2fsprogs can grab it here:


15. Controlling Core Dump Filenames

20 Sep 2002 - 23 Sep 2002 (14 posts) Subject: "[PATCH] kernel 2.4.19 & 2.5.38 - coredump sysctl"

Topics: BSD: FreeBSD, Debugging, FS: NFS

People: Michael Sinz, Bill Davidsen, Andrew Morton

Michael Sinz announced:

coredump name format control via sysctl

Provides for a way to securely move where core files show up and to set the name pattern for core files to include the UID, Program, Hostname, and/or PID of the process that caused the core dump. This is very handy for diskless clusters where all of the core dumps go to the same disk and for production servers where core dumps want to be segregated from the main production disks.

I have again updated the patch to work with 2.4.19 and 2.5.36 kernels and am now hosting it on my web site at:

-- Patch background and how it works --

What I did with this patch is provide a new sysctl that lets you control the name of the core file. This name is actually a format string such that certain values from the process can be included. If the sysctl is not used to change the format string, the behavior is exactly the same as the current kernel coredump behavior.

The default name format is set to "core" to match the current behavior of the kernel. Old behavior of appending the PID to the "core" name is also preserved with added logic of only doing so if the PID is not already part of the name format. This fully preserves current behaviors within the system while still providing for the full control of the format of the core file name. Thus current behavior is not a special case but "falls out" of the general code when the format is set to "core".

The following format options are available in that string:

      %P   The Process ID (current->pid)
      %U   The UID of the process (current->uid)
      %N   The command name of the process (current->comm)
      %H   The nodename of the system (system_utsname.nodename)
      %%   A "%"

For example, in my clusters, I have an NFS R/W mount at /coredumps that all nodes have access to. The format string I use is:

sysctl -w "kernel.core_name_format=/coredumps/%H-%N-%P.core"

This then causes core dumps to be of the format:


I used upper case characters to reduce the chance of getting confused with format() characters and to be somewhat similar to the mechanism that exists on FreeBSD.
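As an illustration of how such a format string might be expanded, here is a small Python sketch. It is not Michael's kernel code: the function name, the sample values, the handling of unknown codes and the ".PID" suffix used for the old append-the-PID behavior are all assumptions made for this example.

```python
def expand_core_name(fmt, pid, uid, comm, nodename, append_pid=False):
    """Expand %P, %U, %N, %H and %% in fmt, as described in the announcement."""
    values = {"P": str(pid), "U": str(uid), "N": comm, "H": nodename, "%": "%"}
    out = []
    pid_in_name = False
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt):
            code = fmt[i + 1]
            if code == "P":
                pid_in_name = True
            # Unknown codes are copied through unchanged (a choice made here).
            out.append(values.get(code, "%" + code))
            i += 2
        else:
            out.append(ch)
            i += 1
    name = "".join(out)
    # Old behavior: append the PID, but only if the format did not use it.
    if append_pid and not pid_in_name:
        name += ".%d" % pid
    return name


# Hypothetical process: PID 1234, UID 500, command "myapp", host "node7".
print(expand_core_name("/coredumps/%H-%N-%P.core", 1234, 500, "myapp", "node7"))
```

With the default format "core" and no PID appending, the function simply returns "core", which matches the claim that the existing behavior falls out of the general code.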

Andrew Morton felt that 'sysctl -w "kernel.core_name_format=/coredumps/%H-%N-%P.core"' was too complex; he proposed an alternative, but Bill Davidsen replied:

this way you can do more things with where you put your dumps, such as using one element of this flexible method to select a directory, where the dump directories for various applications would be on a single NFS server, and dumps for another might be on another server, or all dumps of a certain kind could share a filename, where only the latest dump would be of interest (or take space).

The code seems to have very little overhead involved in the parse, and it gives a good deal of flexibility to the admin. I like the idea of a sysctl for setting the value, you don't want to have to reboot the system when an app goes sour and you need to save more than one dump to run it down, or need to move the dump target dir somewhere with more space.

If you're worried about size make it a compile option, but if I (as an admin) need any control I really want a bunch of control I can set right now. I don't think most people will want this option, but it would be really useful in some cases.

This made sense to Andrew, and Michael was also impressed, saying, "Ahh, I never thought of using the program name for a directory. That would be nice as only those programs you pre-made a directory for would dump (as the code does not create directories). Using the UID for a directory name works out to separate the dumps of different users. (And works really well too - albeit on a non-Linux platform that happens to have a similar feature.)"

16. Linux Hardened Device Project

20 Sep 2002 - 24 Sep 2002 (35 posts) Subject: "[ANNOUNCE] Linux Hardened Device Drivers Project"

Topics: BSD, POSIX

People: Rob Rhoads, Andre Hedrick, Greg KH, Patrick Mochel, Jeff Garzik

Rob Rhoads announced:

Project Announcement:

We've started a new project on w/ focus on hardening Linux device drivers for highly available systems. This project is being worked on with folks from OSDL's CGL and DCL projects as well.

Initially we've created a specification, a few kernel modules that implement a set of driver programming interfaces, and a sample device driver that demonstrates those interfaces.

We are actively soliciting involvement with others in the Linux developer community. We need your help to make this project relevant and useful.

Below I've included an overview of the hardened driver project. By no means is this complete or final. It's just our initial attempt at defining what is meant by the term hardened driver and the areas we want to focus on.

For additional info, please check out the links at the bottom of this message and the Hardened Drivers web site at

Hardened Driver Project Overview:

Device drivers have traditionally been a significant source of software faults. For this reason, they are of key concern in improving the availability and stability of the operating system. A critical element in creating a Highly Available (HA) environment is to reduce the likelihood of faults in key drivers, a methodology called driver hardening.

A device driver is typically implemented with emphasis on the proper operation of the hardware. Attention to how it will function in the event of hardware faults is often minimal. Hardened drivers, on the other hand, are designed with the assumption that the underlying hardware that they control will fail. They need to respond to such failures by handling faults gracefully, limiting the impact on the overall system. Hardened device drivers must continue to operate when the hardware has failed (e.g. allow device fail-over), and must not allow the propagation of corrupt data from a failed device to other components of the system.

Hardened device drivers must also be active participants in the recovery of detected faults, by locally recovering them or by reporting them to higher-level system management software that subsequently instructs the driver to take a specific action.

The goal of a hardened driver is to provide an environment in which hardware and software failures are transparent to the applications using their services, where possible. The way to effectively achieve this goal is to analyze a driver's software design and implement appropriate changes to improve stability, reliability and availability, and to provide instrumentation for management middleware.

We believe that improving driver stability and reliability includes such measures as ensuring that all wait loops are limited with a timeout, validating input and output data and structuring the driver to anticipate hardware errors. Improving availability includes adding support for device hot swapping and validating the driver with fault injection. Instrumentation for management middleware includes functions such as reporting of statistical indicators and logging of pertinent events to enable postmortem analysis in the event of a failure.
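
The first of these measures, bounding every wait loop with a timeout, might look like the following Python sketch. It is not taken from the project's specification or code; wait_until_ready, read_status and the timing defaults are hypothetical names and values chosen for illustration, and a real driver would do this in C against hardware registers.

```python
import time


def wait_until_ready(read_status, ready_mask, timeout_s=0.5, poll_interval_s=0.001):
    """Poll a status source until ready_mask is set or the timeout expires.

    Returns True if the device became ready, False if the deadline passed.
    The loop can never spin forever, even if the device never responds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if read_status() & ready_mask:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Simulated devices (hypothetical): one never sets the ready bit, one always does.
print(wait_until_ready(lambda: 0x00, 0x01, timeout_s=0.05))  # prints False
print(wait_until_ready(lambda: 0x01, 0x01, timeout_s=0.05))  # prints True
```

The caller gets an explicit failure result it can act on, for example by reporting the fault or failing over, instead of hanging on dead hardware.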

To minimize instability contributed by device drivers and to enhance the availability of HA systems, we've attempted to define a set of requirements that a device driver should adhere to in order to be considered a hardened driver. We then define different hardening traits and the required programming interfaces to support these hardening traits.

We've identified four areas in which drivers can be hardened:

We've also identified some key areas we feel are most critical to overall system stability and plan to focus initial hardening efforts on drivers for network interface cards, physical storage, and logical storage.

Project Links:

There was a lot of criticism of the whole idea on linux-kernel. Jeff Garzik said the goals were lofty enough, but that it was absurd to try to formalize the 'hardening' process without taking into account the fact that, at the core, people had to be responsible for making sure the code worked properly.

Elsewhere, Andre Hedrick felt even the goals were pretty bad. He said:

Obviously this is a way for the telecom folks to get something for free that really should be paid for by funding the project with CASH. Or funding (a) startup(s) related to generating such support.

Regardless, it takes (fill in the blank) to boldly ask people to add APIs for an industry who is only interested in using and not contributing. Prove that all the stuff which is going to be plugged into these security-hole^Wbug-generators^Wfeatures will be scheduled for open source. Or is this another attempt to try and take over the license and shove BSD down the piles?

Rob replied, "This project is open to anyone who wants to participate and is being paid for by Intel and a host of other companies. The idea is to enable Linux to play in the Carrier space with all the work given away under the GPL." But Andre said, "Explain how it is paid and to whom?" There was no reply.

Greg KH took a look at the published specification, and gave some point by point comments, then summarized:

I think that a lot of people have spent a lot of time in creating this document, and the surrounding code that matches this document. I really wish that a tiny bit of that effort had gone into contacting the Linux kernel development community, and asking to work with them on a project like this. Due to that not happening, and by looking at the resultant spec and code, I'm really afraid the majority of that time and effort will have been wasted.

What do I think can be salvaged? Diagnostics are a good idea, and I think they fit into the driver model in 2.5 pretty well. A lot of kernel janitoring work could be done by the CG team to clean up, and harden (by applying the things in section 2) the existing kernel drivers. That effort alone would go a long way in helping the stability of Linux, and also introduce the CG developers into the kernel community as active, helping developers. It would allow the CG developers to learn from the existing developers, as we must be doing something right for Linux to be working as well as it does :)

Also, open specs for the hardware the CG members produce, to allow existing kernel drivers to be enhanced (instead of having to be reverse engineered), and new kernel drivers to be created, would also go a long way in helping out both the CG's members and the entire Linux community's cause of having a robust, stable kernel be achieved more easily. Closed specs, and closed drivers do not help anyone.

Patrick Mochel agreed with Greg's first paragraph, saying:

in order to gain the support of kernel developers, or even the blessing of a few, you should be working with them on the design from the beginning.

Designing APIs is hard. Doing it well is very hard. I'm not claiming I've done a stellar job, but I have at least learned that. I've made a lot of poor design decisions, many of which are also evident in your code descriptions and examples. I can't tell you how many times I've rewritten things over and over and over because someone hated them (usually Linus or Greg).

There are people that are willing to help, as we are trying to do. But, it's much easier if you do things gradually and get that help from the beginning.

At one point Rob said elsewhere:

I appreciate all the feedback. Based on the wide variety of ideas/comments, it looks like I need to go back and incorporate these ideas into the document, potentially changing areas in major ways where appropriate.

Rather than bog down this mailing list with exchanges, I would like to move this discussion to the hardened driver mailing list. Please don't feel like I'm ignoring your feedback--just moving the forum.

An underlying theme tends to revolve around the binding of the concepts of 'hardening' and RAS features being added to drivers. We will be looking into splitting these two different approaches out from this singular document and into their appropriate locations.

He set up a new mailing list for the discussion, but Greg said this was a bad idea. Major discussions of kernel drivers should be on linux-kernel, he said. Rob dove in, saying:

First throw away any idea of a spec. That was a bad idea. :)

Next, turn the first section, "Stability & Reliability" of our original doc into a "Driver Hardening HOWTO". It would be a list of characteristics that all good drivers should have, packed with examples to back it up.

BTW, by no means did I or anyone involved on this project, ever mean to imply that the current drivers in the kernel are "bad". Rather, I'd like to capture a list of the best practices and document them. In any event our current list needs to be strengthened with concrete examples. My thinking is that we should work with the Kernel Janitor project. This is where Intel can probably really help out.

The section on Instrumentation should be broken up and each piece dealt with separately as separate project. Most likely killed outright or as part of existing efforts. I see this section as not having anything to do with driver hardening and more to do with driver RAS.

POSIX Event Logging-- is a dead issue. The mailing list feedback is making that point very clear, many thanks. The current thread on an alternative, seems like there is some sort of need for event logging. Whatever the final decision that the Linux community decides, we'll do.

There seems to be a desire to have some sort of driver diagnostics. We can work on that with the existing linux-diag project.

Statistics needs to be debated on its own merits. There are some arguments for keeping it, but I think that stats could be better handled in user-space and NOT kernel space. IMHO it's not driver hardening, therefore it's a separate project.

Third, most of the section on High Availability should just be axed. The big exception being "fault injection testing".

I see value in keeping FI testing. I think that getting FI tools into the hands of developers would be worthwhile. Why? Because letting people do more complicated testing, produces better code. I think there is room for us to work on a set of FI tools.

Greg liked these ideas, and Rob said he'd get on it.

17. CPUFreq Released For 2.5.37

21 Sep 2002 (1 post) Subject: "cpufreq for 2.5.37 now available"

Topics: Hyperthreading

People: Dominik Brodowski

Dominik Brodowski announced:

Updated patches for CPU frequency and voltage scaling support are now available at

Changes since the version for 2.5.36:

complete patch for kernel 2.5.37:

step-by-step patches for kernel 2.5.37:

backport to kernel 2.4.19/2.4.20-pre7:

This cpufreq version is included in 2.4.-ac patchsets since 2.4.20-pre7-ac1, a few updates for -ac2 and -ac3 can be found here:

Comments welcome; please ensure that the cpufreq development list at receives a copy of all comments.

18. Problem Report Status For 2.5 For September 21

21 Sep 2002 (3 posts) Subject: "2.5 Problem Report Status"

Topics: FS: JFS, Software Suspend, Version Control

People: Thomas Molina

Thomas Molina reported:

I had a system crash, so this may have some holes in it. My backup was a week old since this is my testing system.

The most up-to-date version of this report can be found on my web page at:

              2.5 Kernel Problem Reports as of 21 Sep
     Problem Title                  Status               Discussion
 1   BUG at kernel/sched.c          open                 15 Sep 2002
 2                                  another report       13 Sep 2002
 3                                  fix under discussion 18 Sep 2002
 4   lockups under X                possible fix in bk   21 Sep 2002
 5                                  additional reports   18 Sep 2002
 6   2.5.37 won't run X             open                 21 Sep 2002
 7   KVM/Mouse problem              open                 20 Sep 2002
 8   AIC7XXX boot failure           open                 20 Sep 2002
 9   Dead loop on virtual device lo open                 18 Sep 2002
10   nmi_watchdog problem           open                 19 Sep 2002
11   JFS software suspend problem   open                 18 Sep 2002
12   preempt related lockup         possible fix in bk   20 Sep 2002
13   ide double init                open                 19 Sep 2002
14   DRM/XFree issue                open                 18 Sep 2002
15   oops in lock_get_status        open                 18 Sep 2002
16                                  additional reports   20 Sep 2002
17   scheduling while atomic oops   possible fix in bk   20 Sep 2002



19. Floppy Driver Broken In 2.5.37

21 Sep 2002 - 23 Sep 2002 (8 posts) Subject: "2.5.37 broke the floppy driver"

Topics: Version Control

People: Mikael Pettersson, Jens Axboe, Alexander Viro, Thomas Molina

Mikael Pettersson reported that doing a 'dd bs=8k if=bzImage of=/dev/fd0' under 2.5.37 "makes the kernel print 'blk: request botched' and a few seconds later instantly reboot the machine (w/o any further messages)." He could reproduce this at will, but 2.5.36 worked fine. Thomas Molina confirmed that he could duplicate this on 2.5.37-bk as well as 2.5.38-bk. Jens Axboe posted a patch, and Mikael replied, "It's an improvement (the kernel doesn't reboot as soon as I read or write /dev/fd0), but there are still some strange things going on with floppy in 2.5.38." They hunted around a bit more, and Alexander Viro identified part of the problem as his own fault; but a conclusive solution did not emerge in the discussion.

20. MMU-Less Patches 2.5.38-uc0 Released

22 Sep 2002 (1 post) Subject: "[PATCH]: linux-2.5.38uc0 (MMU-less support)"

People: Greg Ungerer

Greg Ungerer announced:

Latest MMUless support patches, against linux-2.5.38, are up at:

The patch file is linux-2.5.38uc0.patch.gz

After the hiccup that was 2.5.17 this one patched pretty easily. Happy hacking :-)

21. Wolk 3.6 Released

22 Sep 2002 (1 post) Subject: "[ANNOUNCE] WOLK v3.6 FINAL"

Topics: Compression, Disks: SCSI, FS: JFS, FS: NTFS, FS: ext3, Framebuffer, Networking, PCI, POSIX, Power Management: ACPI, Sound: i810, USB, User-Mode Linux

People: Marc-Christian Petersen

Marc-Christian Petersen gave a link and announced:

this is v3.6 FINAL of WOLK. This is the last release of the 3.x series!

Here we go, Changelog from v3.5 -> v3.6

 o indicates work by WOLK Developers (almost me)
 + indicates work by WOLK Users

+   add:        SuperPage Support for alpha, sparc64 and x86
                 This is an EXPERIMENTAL PATCH. Apply manually! Nr. 990
o   add:        SCSI-Idle for v2.4.19 + SCSI Idle Daemon in WOLK-Tools package
o   add:        oom_killer updates from v2.4.19 final
o   add:        some another ext3 additions from v2.4.20-pre5
o   add:        VFS Soft/Hard-Limit of FileDescriptors
o   add:        ebtables v2.0
+   fixed:      USB v2.4.19 compile problems / missing definitions
+   fixed:      BlueTooth v2.4.19 compile problems / missing definitions
+   fixed:      Some Config dependencies for ISDN / USB Stuff
o   fixed:      LSM compile problems. totally conflicts with CTX(vserver)
o   fixed:      One AIO reversed #ifdef -> #ifndef
o   fixed:      Forgot to add "gr_is_capable(cap)" to #ifndef CONFIG_LSM
                 This broke capabilities to add/remove with grsecurity!
o   update:     MIPL Mobile IPv6 v0.9.4
o   update:     Bridge with Netfilter firewalling v0.0.7
o   update:     ACPI (Sep 18th, 2002) (use pci=noacpi when you have problems)
o   update:     e2compression v0.4.43
o   update:     SOFFIC (Secure On-the-Fly File Integrity Checker) v0.1
o   update:     Crypto v2.4.19-1
                 includes new options:
                 - 3DES cipher (64bit blocksize)
                 - GOST cipher (64bit blocksize)
                 - NULL cipher (NO CRYPTO)
                 - RIPEMD160 digest
                 - SHA256 digest
                 - SHA384 digest
                 - SHA512 digest
                 - Atomic Loop Crypto
                 - Loop IV hack
                 - Loop Crypto Debugging
                 - IPSEC tunneling (ipsec_tunnel) support
o   update:     Compressed Cache v0.24-pre4
o   update:     HTB v3.6-020525
o   update:     grsecurity v1.9.7 final
                 + gradm v1.5 final in the WOLK-tools package
o   update:     IBM's NGPT2 (Next Generation Posix Threading 2) v2.0.2
o   update:     htree ext3 directory indexing 2.4.19-3-dxdir
o   update:     UML - User-Mode-Linux v2.4.18-53
o   update:     NTFS Filesystem Driver v2.1.0a
o   update:     XFree v4.2.0 DRM/DRI Drivers from 2.4.20-preX-acX tree
o   update:     JFS v1.0.22
o   update:     Intel EtherExpress PRO/100 Support (Alternate Driver) v2.1.6
o   update:     Intel EtherExpress PRO/1000 Gigabit NIC Support v4.3.2
o   update:     i810/i815 Framebuffer Device Driver v0.0.33
o   change:     CTX12 (Virtual private servers and security contexts)
                 and disable vservers if grsecurity is selected (breaks gradm)
o   removed:    some ext3 additions
                 (causes system locking at high disk i/o)

22. BitKeeper Behavior

22 Sep 2002 (2 posts) Subject: "boring BK stats"

Topics: Compression, FS: ext2, Version Control, Virtual Memory

People: Larry McVoy, Jeff Garzik

Larry McVoy said:

I should be working on getting the bk-3.0 release done but I'm sick of fixing BK-on-windows bugs...

Linus' kernel tree has 13333 revision controlled files in it. Without repository compression, it eats up 280M in an ext2 fs. With repository compression, that drops to 129M. After checking out all the files, the size of the revision history and the checked out files is 317MB when the revision history is compressed. That means the tree without the history is 188MB, we get the revision history in less space than the checked out tree. That's pretty cool, by the way, I know of no other SCM system which can say that.

Checking out the tree takes 16 seconds. Doing an integrity check takes 10 seconds if the repository is uncompressed, 15 seconds if it is compressed. That's on 1.3GHz Athlon w/ PC133 memory running at the slower CAS rate, but lots of it, around 900MB.

An integrity check checksums the entire revision history and does a checkout into /dev/null to make sure that both the overall and most recent delta checksums are valid.

There are about 8600 changesets in the tree. There have been 76998 deltas made to the tree since Feb 05 2002. That's an average of 37 changesets and 333 deltas per *day* seven days a week. If you assume a 5 day work week then the numbers are 52 csets/day and 466 deltas/day.

Those changerate numbers are pretty zippy. You guys are rockin'.

As for syncs with bkbits, I dunno, my guess is we're pushing 300,000 pulls or so. We're nowhere near to saturating the T1 line so BK compression stuff is working well.
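
Larry's per-day rates and the 188MB figure can be roughly checked with a few lines of Python. This is only a sanity check under one assumption: that the counting period runs from Feb 05 2002 to the date of the post, 22 Sep 2002. The variable names are ours, not BitKeeper's.

```python
from datetime import date

changesets = 8600
deltas = 76998
start = date(2002, 2, 5)
posted = date(2002, 9, 22)  # assumption: the post date is the end of the period

days = (posted - start).days   # calendar days in the period
workdays = days * 5 / 7        # same period counted as 5-day work weeks

print("calendar days:", days)
print("per calendar day:", changesets / days, deltas / days)
print("per work day:", changesets / workdays, deltas / workdays)

# Size of the checked-out tree alone: total minus the compressed history (MB).
print("tree without history:", 317 - 129)
```

Under that assumption the period is 229 days and all four rates land within about two percent of the quoted 37, 333, 52 and 466; the small gap suggests the original figures used a slightly longer period. The subtraction gives the 188MB quoted for the tree without history.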

Jeff Garzik replied:

If you can't fit a whole tree including metadata into RAM, though, BK crawls... Going from "bk citool" at the command line to actually seeing the citool window approaches five minutes of runtime, on this 200MB laptop... [my dual athlon with 512MB RAM corroborates your numbers, though] "bk -r co -Sq" takes a similar amount of time...

I also find that BK brings out the worst in the 2.4 kernel elevator/VM... mouse clicks in Mozilla take upwards of 10 seconds to respond, when "bk -r co -Sq" is running on this laptop [any other read-from-disk process behaves similarly]. And running any two BK jobs at the same time is a huge mistake. Two "bk -r co -Sq" runs easily take four or more times longer than a single run. Ditto for consistency checks, or any other disk-intensive activity BK indulges in.

Next time I get super-annoyed at BK on this laptop, I'm gonna look into beating the disk scheduler into submission... some starvation is clearly occurring.

23. User-Mode Linux 2.5.38-1 Released

23 Sep 2002 (1 post) Subject: "uml-patch-2.5.38-1"

Topics: Kernel Build System, User-Mode Linux

People: Jeff Dike, Nikita Danilov

Jeff Dike announced:

UML has been updated to 2.5.38.

Thanks to comments from Al Viro and fixes from James McMechan, the block driver is up to date with the block layer changes.

There were some fixes to keep up with the latest kbuild, including some changes in the top-level Makefile, which I'll be feeding to Kai.

There were also a number of smaller fixes to update to 2.5.38, a number of which came from Nikita Danilov.

I'll be feeding these changes to Linus.

The patch is available at

For the other UML mirrors and other downloads, see

Other links of interest:

The UML project home page :
The UML Community site :







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.