Kernel Traffic

Kernel Traffic #141 For 12 Nov 2001

By Zack Brown


Mailing List Stats For This Week

We looked at 1672 posts in 7953K.

There were 603 different contributors. 266 posted more than once. 193 posted last week too.

The top posters of the week were:

1. Memory Debugging Tool

24 Oct 2001 - 2 Nov 2001 (42 posts) Archive Link: "xmm2 - monitor Linux MM active/inactive lists graphically"

Topics: Virtual Memory

People: Zlatko Calusic, Andrea Arcangeli, Linus Torvalds

Zlatko Calusic announced a new version of xmm2, a tool to monitor active/inactive lists in the MM code. He added:

As Linus' MM lost inactive dirty/clean lists in favour of just one inactive list, the application needed to be modified to support that.

You can still continue to use the older one for kernels <= 2.4.9 and/or Alan's (-ac) kernels, which continue to use Rik's older VM system.

There was a long discussion about an apparent problem Zlatko was seeing with the VM in Linus Torvalds' tree. Linus, Andrea Arcangeli (author of the code) and others piled on the problem, but it turned out that Zlatko's system used the wrong hdparm parameters. There was no discussion of xmm2.

2. Linus And Alan Outline Their Future Plans (Wow!)

31 Oct 2001 - 5 Nov 2001 (34 posts) Archive Link: "2.4.14-pre6"

Topics: FS: ext3, Kernel Release Announcement, OOM Killer, Virtual Memory

People: Linus Torvalds, Michael Peddemors, Alan Cox, Marcelo Tosatti, David Weinehall

Linus Torvalds announced Linux 2.4.14-pre6, remarking, "Incredibly, I didn't get a _single_ bugreport about the fact that I had forgotten to change the version number in pre5. Usually that's everybody's favourite bug.. Is everybody asleep on the lists?" He also said:

The MM has calmed down, but the OOM killer didn't use to work. Now it does, with heuristics that are so incredibly simple that it's almost embarrassing.

And I dare anybody to break those OOM heuristics - either by not triggering when they should, or by triggering too early. You'll get an honourable mention if you can break them and tell me how ("Honourable mention"? Yeah, I'm cheap. What else is new?)

In fact, I'd _really_ like to know of any VM loads that show bad behaviour. If you have a pet peeve about the VM, now is the time to speak up. Because otherwise I think I'm done.
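Linus doesn't quote the heuristics themselves, but the kind of victim scoring he's describing can be sketched roughly: rank each process by memory footprint, discounted for accumulated CPU time and age, with adjustments for niceness and privilege, and kill the top scorer. The Python below is purely illustrative — the field names, scaling factors, and adjustments are assumptions for the sketch, not the kernel's actual badness() code.

```python
# Rough sketch of an OOM-killer victim-selection heuristic: free the most
# memory while sparing long-running and privileged tasks.  All scaling here
# is an illustrative assumption, not the 2.4.14 kernel's actual code.
import math

def badness(total_vm, cpu_seconds, run_seconds, niced=False, is_root=False):
    """Higher score = better OOM-kill candidate."""
    points = float(total_vm)                         # memory use dominates
    points /= math.sqrt(cpu_seconds + 1)             # spare heavy CPU-time accumulators
    points /= math.sqrt(math.sqrt(run_seconds + 1))  # spare long-running tasks a bit
    if niced:
        points *= 2        # niced tasks are likelier to be expendable
    if is_root:
        points /= 4        # spare root-owned daemons
    return points

def pick_victim(procs):
    """procs: iterable of (name, total_vm, cpu_s, run_s, niced, is_root)."""
    return max(procs, key=lambda p: badness(*p[1:]))[0]
```

The appeal of a heuristic this simple is that every term has an obvious rationale, which is what makes it hard to "break" in the way Linus challenges.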

Michael Peddemors suggested, "Let's let this testing cycle go a little longer before making any changes.. Let developers catch up.." Linus replied:

My not-so-cunning plan is actually to try to figure out the big problems now, then release a reasonable 2.4.14, and then just stop for a while, refusing to take new features.

Then, 2.4.15 would be the point where I start 2.5.x, and where Alan gets to do whatever he wants to do with 2.4.x. Including, of course, just reverting all my and Andrea's VM changes ;)

I'm personally convinced that my tree does the right thing VM-wise, but Alan _will_ be the maintainer, and I'm not going to butt in on his decisions. The last thing I want to be is a micromanaging pointy-haired boss.

(2.5.x will obviously use the new VM regardless, and I actually believe that the new VM simply is better. I think that Alan will see the light eventually, but at the same time I clearly admit that Alan was right on a stability front for the last month or two ;)

Several folks asked for ext3 to be included in 2.4.14, but Michael Peddemors said, "As much as I would like to see ext3 get in, NOT IN THIS RELEASE please... Don't put anything else in, until what we got works.. Hit him up on 2.4.15 :)"

In a completely different forum, Alan Cox wrote in his Advogato Diary:

People will have been wondering about the 2.4 stable kernel progression. Various bizarre rumours in Byte seem to have generated a lot of discussion and rumour. Now that the people concerned are all agreed, it's time to put the entire roadmap out and make it clear.

Linus will be releasing a 2.4.14 and probably a 2.4.15, finishing off the VM stability work and other rough corners. At that point the 2.5 kernel tree will be opened. There is a lot of stuff queued for 2.5. It isn't going to be possible or sensible to throw it all into 2.5.0. One of the tasks is to put changes together in the right order.

Marcelo Tosatti will be the head maintainer over the 2.4 stable kernel tree. This is not the giant change it may seem from the outside. The stable kernel management was and is a group effort. Marcelo and many others have been active in 2.2 and 2.4 stabilisation work. I'll be helping Marcelo with advice when he asks it, and working on feeding him the 2.4 relevant bits of the -ac tree.

I will not be disappearing from the scene, although I might be a little less visible at times. There are various kernel projects I will be working on, as well as spending more time concentrating on Red Hat customer-related needs. I'm hopeful that spending more time closer to customers will help provide more insight into where 2.5 needs to be going.

David Weinehall did a great job on 2.0.39 when he took over 2.0 from me. I'm very confident that Marcelo will do a great job on 2.4.

(Thanks go to Christophe Barbé for the Advogato link.)

3. Andrea's VM Code Performs Better Than Rik's

31 Oct 2001 - 3 Nov 2001 (17 posts) Archive Link: "graphical swap comparison of aa and rik vm"

Topics: Virtual Memory

People: Ed Sweetman, Rik van Riel, Andrea Arcangeli

Ed Sweetman (known to the list as 'safemode') reported:

In an earlier post I mentioned a way of locking up my VM easily and repeatedly, but that has since been fixed in one way or another. I reran the test and took 'vmstat 1' logs of both runs, on a 2.4.14-pre6-preempt kernel and a 2.4.13-ac5-preempt kernel. I began both vmstats at the same time (about 4 seconds before running each). What I did was run kghostview on a postscript file located here. It is 224K. kmail was loaded previously in both trials, so kdeinit was already loaded, as were all libs. After kghostview became responsive, I waited a few seconds (again about 5) and then exited the app.

No other interaction or running programs were present while doing this. I have 771580 KB of ram and 290740 KB of swap.

Now to explain the graphs. The blue is AA's VM. The red is Rik's VM. Rik's VM finished in 66 seconds. AA's VM finished in 52 seconds. Both start at 0 swap usage. Both from clean boots.

Here is the graph. It's about 4.6K.

When you look at the graph, it goes like this: the left side is 0 seconds, the right side is 66 seconds. The bottom is 0KB, the top is 290740KB.

These are generated from data from the original vmstat outputs. These are at and

I'll leave the actual interpretation of the data of both the graph and raw data up to those who actually know the code.

Needless to say, while running the test on either box the entire computer became unresponsive multiple times for extended lengths of time. No OOM was generated on either run.
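Ed's swap graphs were presumably built by pulling the swap columns out of the 'vmstat 1' logs. A minimal sketch of that step is below; it locates the columns by reading vmstat's own header line, so it only assumes the header names swpd/si/so (true for procps vmstat of that era), not any particular column positions. The sample data is invented for illustration.

```python
# Minimal parser for the swap columns of `vmstat 1` output -- roughly the
# step needed to turn logs like Ed's into swap-usage graphs.  Column
# positions are discovered from the header line rather than hard-coded.
# SAMPLE is made-up data in the 2.4-era procps layout.

SAMPLE = """\
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0      0 500000  10000  90000   0   0     5    10  120   200   3   2  95
 1  0  0   2048 480000  10000  90000   0  64    12   300  450  1200  10  40  50
"""

def parse_vmstat(text):
    """Return a list of (swpd, si, so) samples from `vmstat 1` output."""
    lines = text.splitlines()
    # The second header line names the columns; find it by the "swpd" field.
    header = next(l.split() for l in lines if "swpd" in l)
    idx = {name: i for i, name in enumerate(header)}
    samples = []
    for l in lines:
        f = l.split()
        # Data rows have the same field count as the header and start digit-first.
        if len(f) == len(header) and f[0].isdigit():
            samples.append((int(f[idx["swpd"]]),
                            int(f[idx["si"]]),
                            int(f[idx["so"]])))
    return samples
```

Plotting the first element of each tuple against time reproduces the swap-usage curve; si/so give the swap-in/swap-out traffic Ed overlays later in the thread.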

To explain the better performance of Linus' kernel, Rik van Riel (author of Alan's VM) said, "I think this is because in safemode's test, the swap space gets exhausted. My VM works better when there is lots of swap space available but degrades in the (rare) case where swap space is exhausted. Testing corner cases always gives interesting results ;)" . But Ed replied:

I think the answer to why AA's kernel beat Rik's has nothing to do with how much swap Rik's VM is using or how much swap is being swapped back in. It has to do with how Rik's VM decides what to swap. Apparently the algorithm Rik uses to play with memory takes seriously too much CPU and leaves little for the actual process to work. Thus AA's less CPU-intensive code allows the program to actually run, and despite making errors in what to swap out, the process finishes well before it does under Rik's more intelligent code.

Unfortunately, the trailing columns in my AA vmstat somehow got lost during the paste from terminal buffer to file. This means I'm going to have to redo it all in order to get an accurate measurement to compare system CPU time to the Rik VM. But for now I think the Rik VM system graph is sufficient. And there are some numbers from the AA vmstat, and those alone show a much lower CPU usage than in Rik's. MUCH.

I made an overlay of Rik's system (kernel) CPU usage on top of the so and si graphs to illustrate this. The bottom is 0%, the top is 100% usage.

Here we see that after every major write-out, there is major kernel CPU usage. This is serious usage, and this is the reason why Rik's VM loses the race even though it swapped out and in the right things on the first try more often than AA's.

Of course, after each major write-out in Rik's VM there is a minor read-in. These happen to be directly under the CPU spikes, so this could be the cause of the CPU usage, perhaps determining where the page is? I don't know enough about what's going on in the code to figure out whether the VM does something after writing out that could be using all that CPU, or whether it happens whenever it needs to read in. Although now that I look at it, I'm tending to lean towards some bad code dealing with swap -> ram.

This is truly where the simple VM design conquers the complex. Less CPU being used by the kernel means more for the program, and sometimes the time gained by not using a lot of CPU greatly outweighs the time lost by having to correct mistakes in deciding what gets swapped in and out.

Maybe I'm wrong as to the cause of the kernel CPU usage, but the numbers I do have from AA's vmstat are much higher in Rik's VM than in Andrea's. That, and the fact that Rik's VM seems to be doing the right thing whereas Andrea's is having to fix mistakes, yet Rik's still loses, seems to tell you that I'm not wrong in thinking that the VM's CPU usage is the culprit.

Rik replied:

Note that this is likely to be a side effect of running completely out of swap, because that means many of the "obvious candidates" of what to swap out cannot be swapped out, meaning we have to scan more pages until we find something which already has swap backing.

Before you draw conclusions like the one above, please test again with more swap.

Ed replied that he'd do another test after work, but that there was no denying that the code written by Rik used more swap to do the same thing than the code in Linus' kernel (written by Andrea Arcangeli). Rik replied:

Uhhh ... this is nothing but a classical speed/size tradeoff.

The fact that under my VM swap space stays reserved for the program on swapin means that if the page isn't dirtied, we can just drop it without having to write it to disk again.

In situations where there is enough swap available, this should be a win (and it has traditionally been a big win).

Andrea's VM always frees swap space on swapin, so even if the process doesn't write to its memory at all, the data still needs to be written out to disk again.

Only in the one corner case where my VM runs out of swap space and Andrea's VM doesn't yet run out of swap will you find situations where the tactic used by Andrea's VM has its advantages, but I consider this to be a rare situation.
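Rik's speed/size tradeoff can be put as a toy model: under his policy a swapped-in page keeps its swap slot reserved, so a page that was never dirtied can simply be dropped (the on-disk copy is still valid); under the free-on-swapin policy, evicting the page always costs another write. The sketch below is an illustration of the tradeoff as he describes it, not kernel code.

```python
# Toy model of the swap-slot tradeoff.  Every page here is assumed to have
# been swapped in earlier.  keeps_swap_slot=True models the slot-reserving
# policy (Rik's, per his description); False models freeing the slot on
# swap-in (Andrea's).  Illustrative only, not kernel code.

def swapout_writes(pages, keeps_swap_slot):
    """Count disk writes needed to evict previously-swapped-in pages.

    pages: list of dicts with a 'dirty' flag."""
    writes = 0
    for page in pages:
        if keeps_swap_slot and not page["dirty"]:
            pass              # on-disk copy still valid: drop page, no I/O
        else:
            writes += 1       # must write the page out (again)
    return writes
```

With mostly-clean pages the slot-reserving policy wins on I/O; the price is that those reserved slots count against total swap, which is exactly the exhaustion corner case Ed's small-swap test hit.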

Ed added more swap and redid his tests, and found that Linus' kernel still performed better, in fact much better. End Of Thread (tm)

4. Solaris Making Use Of Linux

1 Nov 2001 - 2 Nov 2001 (5 posts) Archive Link: "Code from ~2.4.4 going into Solaris 9 Alpha?"

Topics: BSD, Samba

People: Mike Fedyk, Danek Duvall, Chris Ricker

Mike Fedyk noticed, on a graph of the history of UNIX, a line going from Linux to Solaris 9 Alpha. He asked, "Does anyone know what code they copied, and if they're now making solaris GPL compatible?" Danek Duvall speculated, "That might simply be the inclusion of various "freeware" packages -- shells, gzip, apache, samba, and so forth, not necessarily kernel code. All of those packages come with full source as well, so they should be compliant with the GPL if that's how they happen to be licensed."

Mike had been flamed off-list, and replied to Danek, "I didn't mean to start a flame thread (like someone accused me of doing), it just looked interesting to me, and I don't remember anything in kernel traffic (which I was reading at the time) or on lkml (which I have been reading more recently)... so I figured someone here would know more." Chris Ricker replied in technical terms:

Solaris 9/ia32 includes software called lxrun (actually slip-streamed during Solaris 8, as Sun is so fond of doing for some brain-dead reason) which implements the Linux/ia32 ABI on Solaris/ia32. It's much like the Linux compatibility layer all the *BSDs have these days.

Solaris 9 on both Intel and Sparc also implements more of the Linux (really primarily GNU glibc) APIs. The idea is that Linux apps are now just a recompile away from running on Solaris (assuming they're sane and don't have 32-bit / 64-bit or endian issues to be sorted out), with no portage necessary....

I'd imagine these two features are what the line reflects. No code theft has taken place, and Solaris is definitely not GPL'ed.

5. Comparing The 2.2 And 2.4 Virtual Memory Subsystems

1 Nov 2001 - 2 Nov 2001 (5 posts) Archive Link: "Linux 2.2 and 2.4 VM systems analysed"

Topics: Virtual Memory

People: Derek Glidden, Rik van Riel

Derek Glidden reported:

I've been following the 2.4 VM issues since the early 2.4-pre days. As a "power user" and someone who uses Linux at work, the kernel's stability is of great interest to me. Finally, I got sick of trying to interpret the data from various sources on how well the 2.4 VM systems perform overall and in comparison with each other and other systems. So I ran my own tests against 2.4.12-ac6, 2.4.13, and 2.2.19 and wrote up the results:

"An analysis of three Linux kernel VM systems"

The conclusion in a nutshell is that yes, the 2.4 kernel VM systems still have a few quirks to work out, but overall they are so significantly better than the 2.2 VM that there really is no comparison.

However, this "significantly better" conclusion is for certain high-stress situations where the 2.2 VM apparently fails entirely, while 2.4 chugs along with barely a notice.

For overall end-user experience, 2.2 still "feels" better overall with better interactive responsiveness under a varying set of loads even though 2.4 really is faster at doing the actual work.

Rik van Riel really liked the document, and was very happy that Derek had taken the time and put in the work to do it. There was no other real discussion.

6. Bootmem For 2.5

2 Nov 2001 - 8 Nov 2001 (8 posts) Archive Link: "[RFC] bootmem for 2.5"

People: William Lee Irwin III, Robert Love, Tony Luck

William Lee Irwin III announced:

A number of people have expressed a wish to replace the bitmap-based bootmem allocator with one that tracks ranges explicitly. I have written such a replacement in order to deal with some of the situations I have encountered.

The following patch features space usage proportional only to the number of distinct fragments of memory, tracking available memory at address granularity up until the point of initializing per-page data structures, and the use of segment trees in order to support efficient searches on those rare machines where this is an issue. According to testing, this patch appears to save somewhere between 8KB and 2MB on i386 PC's versus the bitmap-based bootmem allocator.

The following patch has been tested on i386 PC's, IA64 Lions, and IBM IA64 NUMA hardware with sparse memory, and debugged without the help of logic analyzers or in-target probes. I would like to thank the testers of #kernelnewbies (reltuk and asalib) and my co-workers for their help in making this work, and Tony Luck and Jack Steiner for their assistance in profiling the existing bootmem.

I am now especially interested in feedback regarding its design, and also the results of wider testing.
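The core idea in the RFC — tracking free memory as explicit ranges so bookkeeping cost scales with the number of fragments rather than with total RAM, as a per-page bitmap does — can be sketched as a toy allocator. The real patch also keeps segment trees for efficient search; this sketch uses a plain sorted list and is purely illustrative, not the posted code.

```python
# Toy extent-based boot allocator: free memory is a sorted list of
# (start, end) half-open ranges.  Space usage is proportional to the number
# of distinct fragments, which is the property the RFC emphasizes, while a
# bitmap allocator pays one bit per page regardless of fragmentation.

class RangeBootmem:
    def __init__(self, ranges):
        self.free = sorted(ranges)          # list of (start, end) tuples

    def alloc(self, size):
        """First-fit: carve `size` bytes out of the first range big enough."""
        for i, (start, end) in enumerate(self.free):
            if end - start >= size:
                if end - start == size:
                    del self.free[i]        # range exactly consumed
                else:
                    self.free[i] = (start + size, end)
                return start
        raise MemoryError("no free range of %d bytes" % size)

    def release(self, start, size):
        """Return a range to the allocator (no coalescing, for brevity)."""
        self.free.append((start, start + size))
        self.free.sort()
```

For a machine with a handful of memory fragments this needs a handful of tuples, which is where the reported 8KB-2MB savings over a bitmap would come from on sparse-memory hardware.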

Robert Love was very impressed, and replied, "The patch is without problem on 2.4.13-ac7. Free memory increased by about 100K: free and dmesg both confirm 384292k vs 384196k. This is a P3-733 on an i815 with 384MB. Very nice." He also added, "Note that the patch and UP-APIC do not get along. Some quick debugging with William found the cause. APIC does indeed touch bootmem. The above is thus obviously with CONFIG_X86_UP_APIC unset." William was thrilled that Robert had tested the patch, and promised to investigate the problem. A couple days later he posted again to the list, saying he'd managed to reproduce the bug; and posted a new patch. Robert tried the patch, reporting, "No problem on any system -- no difference, in fact, except the gain in total system memory. Most importantly, however, the new design is quite nice. :>"

7. Status Of Matrox G550 Framebuffer Support

3 Nov 2001 - 4 Nov 2001 (4 posts) Archive Link: "Support for Matrox G550 framebuffer?"

Topics: Framebuffer

People: Petr Vandrovec, Alan Cox, Dave Jones

Jordan Breeding asked if there was or would be support for the Matrox G550 framebuffer. Dave Jones replied that Petr Vandrovec had sent a pretty good patch to the linuxfb-devel mailing list a few weeks before; Petr also replied to Jordan, saying, "I sent patches to Alan on Friday. I do not know whether he'll apply them or not. But for using G550 you must download matroxset from, as if you are connecting a VGA monitor to the card, you are 90% likely using the secondary output." And Alan Cox confirmed, "They are in my working tree and will be in the next -ac." End Of Thread (tm)

8. Faster File Creation And Deletion For ext2

3 Nov 2001 - 7 Nov 2001 (16 posts) Archive Link: "Ext2 directory index, updated"

Topics: Big Memory Support, FS: ext2

People: Daniel PhillipsChristian Laursen

Daniel Phillips announced a new version of his patch to allow faster file creation and deletion. A performance graph is also available. He warned that the code should only be used on test machines, and said:

This update mainly fixes a bug, a one-in-a-million occurrence on an untested code path. This bug resulted in rm -rf deleting all files but one from a million-file directory. I believe that's the last untested code path, and otherwise it's been very stable.

I didn't expect highmem to work properly, and it didn't. It's on my to-do list, but for now highmem has to be off or you will oops on boot.

I elaborated the dx_show_buckets debug output to dump the full index tree instead of just one level. This function now serves as a capsule summary of the index tree structure, and as you can see, it's simple.

I've done quite a bit more testing, including stress testing on a real machine and I find that everything works quite comfortably up to about 2 million files, turning in an average time of about 50 microseconds/create and 300 microseconds/delete (1 GHz PIII). In the 4 million file range things go pear-shaped, which I believe is not due to the index patch, but to rd. The runs do complete, but require exponentially more time, with cpu 98% idle and block throughput in the 300/second range. I'll look into that more later.

I did run into some bad mm behavior on 2.4.13. The icache seems to be too severely throttled, resulting in delete performance being less than it should be. I also find I am rarely able to complete a million-file test run on uml (2.4.13) without oom-ing. In my experience, such problems are not due to uml, but to the kernel's memory manager. These issues may have been addressed in recent pre-patch kernels, but it seems there is still some room for improvement in mm stability.

The patch is available at:

Christian Laursen tried the patch, having never examined any earlier version, and reported:

I must say that the first impression is very good indeed.

I took a real world directory (my linux-kernel MH folder containing roughly 115000 files) and did a 'du -s' on it.

Without the patch it took a little more than 20 minutes to complete.

With the patch, it took less than 20 seconds. (And that was inside uml)

But he added, "However, when I accidentally killed the uml, it left me with an unclean filesystem which fsck refuses to touch because it has unsupported features. Even the latest version does this." He asked if there was a patch to fix this, and Daniel replied:

Ted Ts'o volunteered to do that but I failed to support him with proper documentation so it hasn't been done yet.

However, it's very easy to get around this, just comment out the part of the patch that sets the incompat flag. Then the indexed directories will magically turn back into normal directories the next time you write to them (it would be very good to give this feature a real-life test :-)
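The kind of speedup reported in this thread comes from hashing names into buckets so a lookup scans one small bucket instead of the whole directory linearly. The sketch below illustrates just that structure; the hash function and bucket layout are invented stand-ins, not the actual dx index format from Daniel's patch.

```python
# Toy sketch of a hashed directory index: create/lookup touch one bucket of
# roughly n/nbuckets entries instead of scanning all n directory entries.
# The hash and layout are illustrative assumptions, not the patch's dx code.

def name_hash(name, nbuckets):
    """Simple multiplicative string hash (stand-in for the patch's hash)."""
    h = 0
    for ch in name:
        h = (h * 131 + ord(ch)) & 0xFFFFFFFF
    return h % nbuckets

class IndexedDir:
    def __init__(self, nbuckets=256):
        self.buckets = [[] for _ in range(nbuckets)]

    def create(self, name):
        self.buckets[name_hash(name, len(self.buckets))].append(name)

    def lookup(self, name):
        # Scan only the one bucket the name hashes to.
        return name in self.buckets[name_hash(name, len(self.buckets))]
```

A linear-scan directory makes an n-file 'du -s' cost O(n) per lookup, O(n^2) overall, which is why Christian's 115000-file folder went from minutes to seconds once lookups became bucket-local.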

9. Regression Testing

3 Nov 2001 - 5 Nov 2001 (9 posts) Archive Link: "Regression testing of 2.4.x before release?"

People: Ted Deppner, Dan Kegel, Alan Cox, Linus Torvalds

Dan Kegel felt sure that Alan Cox stress-tested his kernels more than Linus Torvalds tested his. He suggested Linus adopt Alan's stress-tests before putting out releases. Ted Deppner replied, "It would be a better idea if everyone (including you and me) stress test those pre and final kernels." He added in a later post, "Linus and others have said in the past though, that YOUR usage is the testing they want... So it's best if you install the kernel and use it normally, whatever you'd use a kernel to do." At one point Dan said:

I'm not saying Linus should do the testing.

It's good that Linus is asking others to test with cerberus, as he did in

It would be even better if Linus came out and stated that he would refuse to call a kernel final if there is an outstanding report of it failing an agreed-upon set of stress tests.

And it would be *even better* if OSDL's STP were used to do stress testing in a nice, automated way on 1, 4, 8, and 16-cpu machines on release candidates.

Almost none of this requires any work by Linus. All Linus has to do is say "The 2.4.x kernels will pass stress tests before release", and recruit someone to run his kernels through OSDL's STP in a timely manner.

10. Status Of ext3

6 Nov 2001 - 7 Nov 2001 (6 posts) Archive Link: "ext3-0.9.15 against linux-2.4.14"

Topics: Access Control Lists, Extended Attributes, FS: ext3, SMP

People: Andrew Morton, Steven N. Hirsch

Andrew Morton announced:

Download details and documentation are at

Changes since ext3-0.9.13 (which was against linux-2.4.13):

Steven N. Hirsch reported that he'd hit Andrew's "hard-to-hit" problem thousands of times, but just hadn't realized it was a problem with ext3. Andrew asked for more details about Steven's system and setup, but there was not much more discussion.







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.