Kernel Traffic #169 For 2 Jun 2002

By Zack Brown

Mailing List Stats For This Week

We looked at 1315 posts in 6084K.

There were 352 different contributors. 178 posted more than once. 159 posted last week too.

1. Multithreaded Core Dump Support For 2.5 And 2.4

13 May 2002 - 23 May 2002 (33 posts) Subject: "PATCH Multithreaded core dump support for the 2.5.14 (and 15) kernel."

Topics: Debugging, Scheduler, Virtual Memory

People: Mark Gross, Vamsi Krishna S., Pavel Machek, Andi Kleen, Erich Focht

Mark Gross posted a patch to implement multithreaded core dump support in 2.5.14 and 2.5.15 kernels. He explained, "This work has been tested on the 2.5.14 kernel using a few pthread applications to dump core, from SIGQUIT and SIGSEV. This unit test has been done on both 2 and 4 way systems. Further, some stress testing has been done where, the core files have been created while the system is under schedule stress from the chat room benchmark running while creating the core files. This implementation seems to be quit stable under a busy scheduler, YMMV."

Erich Focht asked how the patch would handle the case in which the suspended thread happened to be in kernel mode, a possibility in the 2.5 kernels. A couple of posts later, Vamsi Krishna S. replied, "if a thread happens to be in kernel mode when some other thread is dumping core (capturing register state of other threads, to be more accurate) then we would capture the _user mode_ register of that thread from the bottom of it's kernel stack. GDB will show back trace untill the thread entered kernel (int 0x80), eip will be pointing to the instruction after the system call (return address)." Pavel Machek thought he had found a problem with this. He described a possible deadlock: "Thread 1 is in kernel and holds lock A. You need lock A to dump state. When you move 1 to phantom runqueue, you loose ability to get A and deadlock." But Mark replied:

Any pending tasklet / bottom half + top half get processes by the real CPU's even thought the I/O bound process may have been moved to the phantom run queue. Its just that for the suspended processes sitting on the phantom queue this processing stops with the call to try_to_wake_up, until the process is moved back onto a run queue with a CPU.

The only way I can see what your talking about happening is for some kernel code (or driver) to grab a lock and then hold it across a call to one of the sleep_on functions pending some I/O.

Any driver that holds a lock across any sleep_on call I think is abusing locks and needs adjusting.

Nothing prevents someone writing a driver that abuses locks.

If you know of such a case I need to worry about or there is another way for this design to get into trouble please let me know.
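Pavel's scenario can be reduced to a toy model: a task is moved to the phantom run queue while it still owns a sleeping lock that the dumping task needs. The sketch below is only a userspace illustration of that condition. The names `task`, `sleeping_lock`, `suspend_task`, and `can_acquire` are invented for this example and do not come from the tcore patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names come from the tcore patch. */
enum task_state { TASK_RUNNABLE, TASK_SUSPENDED };

struct task {
    enum task_state state;
};

struct sleeping_lock {
    const struct task *owner; /* NULL when the lock is free */
};

/* Model of moving a task to the "phantom run queue": it stops being
 * scheduled but keeps whatever locks it already holds. */
static void suspend_task(struct task *t)
{
    assert(t != NULL);
    t->state = TASK_SUSPENDED;
}

/* Returns true if `dumper` could eventually take `lock`.
 * Returns false when the owner is suspended, because a suspended owner
 * never runs again to release the lock: the deadlock Pavel describes. */
static bool can_acquire(const struct sleeping_lock *lock,
                        const struct task *dumper)
{
    assert(lock != NULL && dumper != NULL);
    if (lock->owner == NULL || lock->owner == dumper)
        return true;
    return lock->owner->state == TASK_RUNNABLE;
}
```

In this model the dumper is safe only if no suspended task owns a lock it needs, which is why the discussion turns on whether kernel code may sleep while holding a lock.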

Pavel and Andi Kleen both took exception to Mark's statement that no driver should hold a lock across a sleep_on() call. As Andi put it, "That's true for spinlocks, but not for semaphores. The mm layer and the vfs layer both use semaphores extensively and sleep with them hold, also some other subsystems (like networking) use sleeping locks." He and Mark and a couple of other folks went back and forth on this for a few posts. At first, Mark was unconvinced there was a problem, but after a while he did start to see some areas that would need to be fixed. He posted a new patch, and said:

After some investigations I concluded that the down_write(current->mm->mmap_sem) in elf_core_dump was to protect crashing multithreaded applications from dumping corrupted and possibly illegal mm data due to the actions of the other, still running, thread processes.

As my patch has these thread processes suspended on the phantom run queue we don't need to grab this semaphore in elf_core_dump any more.

However; we did see another potential issue on 4+ way systems with 3 or more processes of the same thread group entering suspend threads at about the same time. In tcore_suspend_threads between the release of the spin locks and the calls to set_cpus_allowed, one of the other, crashing, thread processes could move this task, currently in set_cpus_allowed, to the phantom queue before it returns. (a bad thing)

I've put in a fix for this possibility by doing down_write/up_write on current->mm->mmap_sem for the scope of tcore_suspend_threads. This also as the benefit of stopping VM operations for the thread group until the thread group process are suspended.

This updated tcore patch has been tested on 2 and 4 way i386 systems, dumping core for pthread applications with 300+ thread process, while running the chat room benchmark. It "seems" stable.

This apparently resolved the objections folks had raised. A brief discussion followed about integration and about other problems that might arise.

At one point Mark remarked that he hoped to have the patch working on 2.4 soon as well.

2. Improving Virtual Memory Balancing

17 May 2002 - 29 May 2002 (8 posts) Subject: "[RFC][PATCH] using page aging to shrink caches"

Topics: Virtual Memory

People: Ed Tomlinson, Benjamin LaHaise

Ed Tomlinson posted a patch, and explained, "I have never been happy with the way slab cache shrinking worked. This is an attempt to make it better." Benjamin LaHaise clapped Ed on the back, and said, "Thank you! This is should help greatly with some of the vm imbalances by making slab reclaim part of the self tuning dynamics instead of hard coded magic numbers. Do you have any plans to port this patch to 2.5 for inclusion? It would be useful to get testing in the 2.5 before merging in 2.4." Over the next few days, Ed replied with updated versions of the patch.

3. Backward Compatibility

20 May 2002 - 24 May 2002 (28 posts) Subject: "Quota patches"

Topics: Backward Compatibility, Executable File Format

People: Linus Torvalds, Alan Cox, Martin Dalecki, Christoph Hellwig, Jan Kara

In the course of discussing some new patches for disk quotas, Linus Torvalds suggested removing some specific code that only provided backward compatibility to older kernels. He asked, "Are there _any_ reasons to use the old stuff, if the fix is just to upgrade to a newer quota tool?" And Alan Cox added, "Most people use 2.4 with quota tools and 32bit uid quota already, so its not much of a breakage at all. The 2.4 quota base code is unusable in the real world so the problem got settled by the vendor trees."

Jan Kara said he'd send a patch to Linus to remove the extra code, and Martin Dalecki suggested, "If we can do it for quota - we could possible remove the IPC_OLD variant away as well. It's looong overdue by now, becouse the IPC_OLD was not standard conformant anyway." But Alan replied that it was:

More code that takes almost no space, ensures old systems still work and old XFree86 still runs on new kernels. Why remove it ?

If you want to design a mathematically elegant and small ultra clean OS go do it. Linux however has to work in the real world not in the happy clueless world of pure mathematical elegance.

Martin replied, "It is an illusion to think that you can actually run *that old* a.out binaries on a modern kernel I think." And Christoph Hellwig rejoined, "Of course you can. Even the latest OpenLinux release (shipping 2.4.13-ac) uses a libc4/a.out based installer fo space reasons. Not to forget the old quake1 binary from some redhat 4.x CD I run from time to time :)" Martin was impressed to hear that these things actually worked, and Alan told him he should have tested the idea before making his suggestion. Alan added, "It btw goes beyond Libc4. Currently we have almost 100% compatibility back to libc 2.2.2. The dated libc before that doesn't work because we dropped some very very early obscure versions of a few syscalls."

At this point Christoph remarked, "For 2.5 I have some plans to make obsolete syscalls depend on CONFIG_COMPAT_*, this allows to compile big and bloated kernel for compatiblity and smaller kernels without that (e.g. for embedded devices). And in fact we have quite a loft of cruft that can go away for setups only having very modern userspace.." Martin and Alan both approved of this idea, and Alan added, "For embedded you also want config options to remove the block layer and so forth. I'd been thinking about a set of options buried in a config menu item like "Fine tune configuration for small/embedded devices" CONFIG_SMALL."
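Christoph's idea of making obsolete entry points depend on a configuration option can be sketched with ordinary conditional compilation. This is a minimal userspace illustration, not kernel code: the option name `CONFIG_COMPAT_OLD_CALL` and the functions `new_call` and `old_call` are hypothetical, since the post only names the `CONFIG_COMPAT_*` prefix.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical names throughout; the post only mentions CONFIG_COMPAT_*.
 * Build with -DCONFIG_COMPAT_OLD_CALL to compile the compatibility path in. */

/* The current interface, always present. */
static long new_call(long arg)
{
    return arg * 2;
}

#ifdef CONFIG_COMPAT_OLD_CALL
/* Compatibility build: the obsolete entry point forwards to the new one. */
static long old_call(long arg)
{
    return new_call(arg);
}
#else
/* Small build: the obsolete entry point is reduced to a stub that reports
 * "function not implemented", as an unsupported syscall would. */
static long old_call(long arg)
{
    (void)arg;
    return -ENOSYS;
}
#endif
```

A kernel built without the option drops the forwarding code entirely, while callers that rely on the old entry point get a clear error instead of undefined behavior.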

4. Status Of /dev/port

20 May 2002 - 27 May 2002 (153 posts) Subject: "Linux-2.5.17"

Topics: Development Strategy, FS, Kernel Release Announcement

People: Martin Dalecki, Alan Cox, Linus Torvalds, Pete Zaitcev, Paul Mackerras, David S. Miller, Paul Rusty Russell

Linus Torvalds announced 2.5.17 and there was a ton of discussion about it. In one subthread, Martin Dalecki posted a patch to completely get rid of the /dev/port interface. He argued:

  1. It is not usable with ports which require 4 byte access.
  2. The same can be achieved by using capabilities and su bits and so on.
  3. __m68000__ doesn't even implement it and most other non i386 archs "implement" it but apparently don't even care about endianess issues.
  4. It's not standard.
  5. seek() + port access is "racy" with respect to multiple usage.
  6. Nothing is using it.

... and so on and so on ...

And finally, kernel size with it:

   text    data     bss     dec     hex filename
1480587  243280  259628 1983495  1e4407 vmlinux

kernel size without it:

[root@kozaczek linux]# size vmlinux
   text    data     bss     dec     hex filename
1480229  243184  259628 1983041  1e4241 vmlinux

Which means a saving of 454 bytes :-).
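The 454-byte figure can be checked from the two `size` listings, since the `dec` column is the sum of `text`, `data`, and `bss`. The helper below is a small sketch of that arithmetic; the struct and function names are made up for this example.

```c
#include <assert.h>

/* The three columns reported by size(1) that add up to the "dec" column. */
struct section_sizes {
    long text;
    long data;
    long bss;
};

/* Total image size in bytes: the "dec" column. */
static long total_bytes(struct section_sizes s)
{
    return s.text + s.data + s.bss;
}

/* Bytes saved by going from the `with` build to the `without` build. */
static long bytes_saved(struct section_sizes with,
                        struct section_sizes without)
{
    long saved = total_bytes(with) - total_bytes(without);
    /* This sketch assumes removing code never grows the image. */
    assert(saved >= 0);
    return saved;
}
```

With the numbers from the listings, `text` shrinks by 358 bytes, `data` by 96 bytes, and `bss` is unchanged, which gives the 454 bytes Martin quotes.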

Paul Mackerras and David S. Miller both thought this was a fantastic idea, but there was some speculation that Martin would be flamed to Hell and back for making the suggestion. Several hours later folks started expressing surprise at the absence of the expected flame-war. But at one point Alan Cox did point out, "The /dev/port interface is used by various apps and its a traditional x86 in paticular unix thing. For platforms like ARM its poorly implemented since it ought to turn into a fraction of /dev/mem and support mmap for speedier user space in/out emulation.." Martin raised an eyebrow at this; he'd thought /dev/port was entirely Linux-specific. But Alan went on, "The /dev/port interface is in a whole variety of older Unixen for x86, and also in systems like Minix." And elsewhere he came down more firmly against the whole idea. At this point the temperature did start to rise slightly, until Paul Rusty Russell suggested that this entire issue would be better discussed at the rapidly approaching Kernel Summit. That pretty much ended that subthread.

However, elsewhere, Linus Torvalds weighed in on the issue, saying he was OK with getting rid of /dev/port. He explained, "It was done purely because Minix did it that way, and it wasn't even compatible with Minix (I think Minix actually supoorted 2- and 4-byte accesses by just doign 2- and 4-byte read/write calls, the Linux code never did)." He added, "Anybody: if you've ever used /dev/ports, holler _now_." Alan replied:

Holler. I posted a list of examples to linux-kernel already. iopl and ioperm are not portable in the way /dev/port is. ioperm/iopl also doesnt work with most scripting languages, java tools trying to avoid JNI etc

I've seen it used in tools written in java, python, perl, even tcl

Other examples include libieee1284, the pic 16x84 programmer, hwclock, older kbdrate, /sbin/clock on machines that don't have /dev/rtc.

Not everything in the world is an x86, and not every app wants to be Linux/x86 specific or use weird syscalls
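Both sides of this argument rest on the same property: /dev/port is accessed with plain file calls, where the file offset selects the port. That is what makes it reachable from scripting languages, and it is also why Martin calls `seek()` plus a separate read "racy" when a descriptor is shared. The sketch below contrasts the two-call form with `pread()`, which passes the offset in the same call. The helper names are invented for this example, and whether `pread()` behaves this way on /dev/port for a given kernel is an assumption; the code works on any seekable descriptor.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Read one byte at offset `port` with a single call. pread() takes the
 * offset as an argument and leaves the descriptor's file position alone.
 * Returns 0 on success, -1 on error or short read. */
static int port_read_byte(int fd, off_t port, uint8_t *value)
{
    assert(value != NULL);
    return pread(fd, value, 1, port) == 1 ? 0 : -1;
}

/* The same read as two calls. Another user of the same descriptor can
 * move the file position between lseek() and read(), so the byte may
 * come from the wrong offset. Returns 0 on success, -1 on error. */
static int port_read_byte_seek(int fd, off_t port, uint8_t *value)
{
    assert(value != NULL);
    if (lseek(fd, port, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, value, 1) == 1 ? 0 : -1;
}
```

On a real system the descriptor would come from something like `open("/dev/port", O_RDONLY)`, which normally requires root privileges.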

Pete Zaitcev also said to Linus:

I often use it as an alternative to #include <asm/io.h>, which you decreed illegal. I understand <sys/io.h> is a legal alternative, but a bunch of platforms forget to include <sys/io.h>, for instance Jes cried bloody murder when asked to add it to ia-64. But if you decide to drop /dev/port I can tough it out. Solaris lives without it, and so can we.

I saw this whining about outl not implemented for write(fd, &my_int, 4), and I think the guy had a little point. Though if he wanted it, he ought to submit a patch.

Martin replied, "if someone want's to use /dev/port for developement on some slow control experimental hardware for example. Why doesn't he just" [...] "compile it as a *separate* character device module ? That's linux - you have the source, so use it. You wan't to cheat around the OS abstractions - do it for yourself! There is no requirement that it has to be permanently in the mainline kernel where it tends to attract people who shouldn't have used it in first place for generic stuff like kbd rate settings and clock device manipulation." But Linus said:

That's not a productive approach, Martin.

Yes, with open source you can do whatever you want.

HOWEVER, there is a huge amount of advantage to having a common base that is big enough to matter: why do you think MS does well commercially?

It's important to _not_ have to force people to do site-specific (or problem-specific) hacks, even if they could do so. Because having to have site-specific hacks detracts from the general usability of the code.

So when simplifying, it's not just important to say "we could do without this". You have to also say "and nobody can reasonably expect to need it".

Which doesn't seem to be the case with /dev/ports. So it stays.

5. Status Of ext3 And RAID In 2.2

21 May 2002 - 23 May 2002 (11 posts) Subject: "2.2 kernel - Ext3 & Raid patches"

Topics: Disk Arrays: RAID, FS: ext3, Version Control

People: Andreas Dilger, Stephen C. Tweedie, Mike Fedyk, David S. Miller, Tomas Szepe

Jon Hedlund seemed to remember hearing some warnings not to use ext3 with RAID in 2.2 kernels; but he'd been using ext3 and RAID 1 with almost no problems for over nine months. He asked if this was normal, or had he just been lucky? Andreas Dilger replied, "You've just been lucky. I forget the exact scenario, but it is something like if journal replay is happening while the RAID is being reconstructed after a crash you can get garbage written to your disk." And Stephen C. Tweedie added:

Right --- the raid resync code in 2.2 uses the normal buffer cache, which results in writes being scheduled for clean buffers, behind ext3's back. That's not allowed --- it violates the write ordering requirements that make ext3 work, and trips up debugging assert failures in the ext3 write checking code.

You might get away with it, but a raid resync on ext3 on 2.2 is basically not safe. If you wait until after the resync before mounting the ext3 filesystem, you'll be OK.

It should work on 2.4.

Elsewhere, Mike Fedyk took credit for warning people against using those two patches in combination, and suggested:

If I were you, I'd just test a 2.4 kernel on the configuration you want. Unless there is some binary driver that use that doesn't support 2.4 there isn't much use staying with 2.2.

This configuration is unsafe for 2.2, and I've used raid1 and raid5 with ext3 without any trouble, even on degraded arrays (for as short a period as possible of course).

But Stephen put in, "Actually, you just need to renumber one of the conflicting #defines to something unused, and it will work fine. Soft raid0 or linear mode will work quite happily with ext3 on 2.2 after you do that, it's only the resync after a crash that you get with raid1 or raid5 that is dangerous." Later, he reiterated that this fix would only work for RAID 0.

Also in reply to Mike, Tomas Szepe objected that the recommendation to just use 2.4 was not feasible on a Sparc 32 system because of some bugs that had surfaced recently. But David S. Miller replied, "There have been several patches posted to deal with that problem, you can apply them yourself or grab Marcelo's current 2.4.x BK tree." After some work, Tomas also offered:

Here comes for all sparc people who can't install BK:

All sparc32/sparc64 related changes since 2.4.19-pre8 in one diff copied and fixed up by hand from

All I can claim as to the patched kernel's functionality -- it has compiled for me on sparc32. I'm going to try to boot it next week when I'm changing disks in my server.

6. LVM Cleanup

22 May 2002 - 26 May 2002 (4 posts) Subject: "[RFC/PATCH] lvm sanitation in 2.5"

Topics: Disk Arrays: LVM, FS, Ioctls

People: Anders Gustafsson, Joe Thornber, Alexander Viro

Anders Gustafsson announced, "I have started cleaning up lvm. The following patch contains the first steps. It disables a lot of functionallity but the basic things are there, I'm actually running a kernel with this patch right now, with /home and /var on lvm. The vg_t/lv_t..-structures are now available in to versions, one exported to userspace (and that should remain constant through versions) and one used in kernelspace containing stuff that should not be exposed to userspace (struct block_device, kdev_t and such). (this also allows more flexibillity making changes in the driver without changing the userspace interface)." Alexander Viro was very pleased to see this, and gave some advice for the ongoing work. And Joe Thornber also said to Anders:

I started a similar process last summer, if you want to pick up on my work you can find it in cvs under that tag 'experimental' (cvs co -d -r experimental LVM). There are a *lot* of changes in there, particualarly I factored out the ioctl interface into a file of its own and rewrote a lot of it. I think I tidied up the mapping functions a lot too.

However it soon became apparent that the end result would still be poor due to the appalling ioctl interface. Hence the LVM2 project, which the team has been working on for the last 9 months. So maybe the question should be 'is it time to switch from LVM1 to LVM2 in 2.5?'.

Just so that you are aware that no matter how much you tidy up LVM1 people are not going to be happy with it - you have to compete against flawed design as well as bad code.

There was no reply.
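The split Anders describes, with one structure exported to userspace and another kept inside the kernel, can be illustrated roughly as follows. This is a sketch under assumptions, not the layout from his patch: `lv_user`, `lv_kernel`, `lv_export`, and all field names are hypothetical, and `void *bdev` merely stands in for kernel-only members such as `struct block_device`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LV_NAME_LEN 32

/* Hypothetical userspace-visible layout: fixed-width fields only, meant
 * to stay constant across driver versions. */
struct lv_user {
    char     name[LV_NAME_LEN];
    uint64_t size_sectors;
    uint32_t flags;
};

/* Hypothetical in-kernel layout: the public fields plus private state
 * that userspace should never see and that may change freely. */
struct lv_kernel {
    char         name[LV_NAME_LEN];
    uint64_t     size_sectors;
    uint32_t     flags;
    void        *bdev;       /* stand-in for struct block_device * */
    unsigned int open_count; /* driver bookkeeping */
};

/* Copy only the public fields into the userspace structure. */
static void lv_export(const struct lv_kernel *k, struct lv_user *u)
{
    assert(k != NULL && u != NULL);
    memset(u, 0, sizeof(*u)); /* do not leak padding or stale data */
    memcpy(u->name, k->name, LV_NAME_LEN);
    u->name[LV_NAME_LEN - 1] = '\0';
    u->size_sectors = k->size_sectors;
    u->flags = k->flags;
}
```

Because only `lv_export` knows both layouts, private members can be added to or removed from the in-kernel structure without changing what userspace tools receive.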

7. BitKeeper Repository Downtime

24 May 2002 (1 post) Subject: " downtime"

Topics: Version Control

People: Larry McVoy

Larry McVoy announced:

Hi, we're working on an upgrade for and when we have it ready, we'll want to switch the drives from one machine to another. We're aiming to do this later today, so please update your trees now. Linus hasn't pushed anything since yesterday so this is probably a good time.

One side effect of the upgrade is that we're going to get an online hot spare out of the deal, so in the future, we'll be able to do this sort of thing behind your back and you'll never know.

There was no reply.

8. Status Of I2O In 2.5 And 2.4

24 May 2002 (1 post) Subject: "Linux I2O Status"

Topics: Disks: SCSI, I/O, I2O, PCI

People: Alan Cox

Alan Cox announced:

I asked folks to avoid touching the I2O stuff in 2.5 because major surgery and work was needed in 2.4 before even tackling 2.5

The 2.4.19pre8-ac5 status is:

On x86 32bit it is all stable again and now works on my DPT and on the AMI Megaraid as well as the cards it handled before. Block caching strategy is now configurable.

I still have to do the pci mapping and try and find the rest of the 64bit bogons, however the core code is now in a shape where it ought to be possible to move it forward into 2.5 if anyone with i2o kit feels the urge.

I'll look at the SCSI 64bit cleanness and PCI mapping over time. They are not priority items to me right now (at least until AMD Hammer hits the mass market)

There was no reply.

9. BitKeeper Discussion

27 May 2002 - 29 May 2002 (21 posts) Subject: "2.4 SRMMU bug revisited"

Topics: Version Control

People: David S. Miller, David Woodhouse, Tomas Szepe

In the course of discussion, Tomas Szepe was unable to find evidence of a patch he'd been certain had been applied. He could not find it using the web interface to the BitKeeper tree. David S. Miller replied:

The BK repository to use has the URL:


The web stuff is updated still by hand and is as a result chronically out of date.

But David Woodhouse at one point gave a link to his own BitKeeper web interface, and said, "That web stuff is updated by cron and is as a result never more than an hour out of date (w.r.t. bk// unless something breaks."







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.