Kernel Traffic #239 For 1 Nov 2003

By Zack Brown

Mailing List Stats For This Week

We looked at 1173 posts in 6062K.

There were 434 different contributors. 207 posted more than once. 158 posted last week too.

1. Status Of Software Suspend

19 Oct 2003 - 25 Oct 2003 (11 posts) Subject: "Wow. Suspend to disk works for me in test8. :)"

Topics: Software Suspend

People: Rob Landley, Nigel Cunningham, Marek Habersack

Rob Landley was surprised and pleased to see suspend-to-disk actually working under 2.6.0-test8. Voicu Liviu asked how long it took to do the suspend, and Rob estimated about 15 seconds. Rob added, "A couple of down sides I've noticed: I have to run "hwclock --hctosys" after a resume because the time you saved at is the time the system thinks it is when you resume (ouch). And because of that, things that should time out and renew themselves (like dhcp leases) have to be thumped manually." Marek Habersack reported no success resuming after a suspend; and Rob speculated that he (Rob) might have a particularly amenable set of hardware.

Elsewhere, Rob happened to mention, "I'm still subscribed to the swsusp list, but stopped reading it some time ago because it was all about 2.4 and I haven't run 2.4 in months..." Nigel Cunningham remarked in reply, "That's about to change. I've just gotten a port of the current 2.4 code going tonight. A bit more testing and tweaking, and I'll post a version for others to try."

2. New srfs Distributed Filesystem

20 Oct 2003 - 24 Oct 2003 (18 posts) Subject: "srfs - a new file system."

Topics: FS: Coda, POSIX, Version Control

People: Nir Tzachar, Pavel Machek, Erik Andersen, Daniel Egger, Eric Sandall

Nir Tzachar announced:

We're proud to announce the availability of a _proof of concept_ file system, called srfs. a quick overview: [from the home page]

srfs is a global file system designed to be distributed geographicly over multiple locations and provide a consistent, high available and durable infrastructure for information.

Started as a research project into file systems and self-stabilization in Ben Gurion University of the Negev Department of Computer Science, the project aims to integrate self-stabilization methods and algorithms into the file (and operation) systems to provide a system with a desired behavior in the presence of transient faults.

Based on layered self-stabilizing algorithms, provide a tree replication structure based on auto-discovery of servers using local and global IP multicasting. The tree structure is providing the command and timing infrastructure required for a distributed file system.

The project is basically divided into two components:

  1. a kernel module, which provides the low level functionality, and disk management.
  2. a user space caching daemon, which provide the stabilization and replication properties of the file system.

these two components communicate via a character device.

more info on the system architecture can be find on the web page, and here:

We hope some will find this interesting enough to take for a test drive, and wont mind the latencies ( currently, the caching daemon is a bit slow. hopefully, we will improve it in the future. ) anyway, please keep in mind this is a very early version that only works, and keeps the stabilization properties. no posix compliance whatsoever...

Eric Sandall said this sounded very similar to existing Coda work, but Nir replied, "not at all. coda is not self stabilizing at all. srfs is also a totally distributed file system -> see the doc." Pavel Machek remarked, "perhaps differences can be localized to userspace daemon, having same kernel part for coda and srfs? That would be *good*." And Nir replied, "in essence, ur correct. we would have taken that approach, if we were not aiming at building a file system on top of an object storage. this approach simplifies things a bit, and the kernel part is reduced." Elsewhere, Daniel Egger remarked that Coda was an ugly mess (or was when he'd last looked at it), and any new alternatives were certainly welcome, as far as he was concerned. Eric agreed with this.

Elsewhere, Erik Andersen posed a problem:

Suppose I install srfs on both my laptop and my server. I then move the CVS repository for my pet project onto the new srfs filesystem and I take off for the weekend with my laptop. Over the weekend I commit several changes to file X. Over the weekend my friend also commits several changes to file X.

When I get home and plug in my laptop, presumably the caching daemon will try to stabalize the system by deciding which version of file X was changed last and replicating that latest version.

Who's work will the caching daemon overwrite? My work, or my friends work?

Of course, this need not involve anything so extreme as days of disconnected independent operation. A rebooting router between two previously syncd srfs peers seems sufficient to trigger this kind of data loss, unless you make the logging daemon fail all writes when disconnected.

Nir replied:

i want to apologize if my explanation was not clear enough: self stabilization (original idea by Dijkstra) - A self stabilizing system is a system that can automatically recover following the occurrence of ( transient ) faults. The idea is to design a system which can be started in an arbitrary state and still converge to a desired behavior.

Our file system behaves like this: lets say you have several servers, with different file system trees on them. If (and when ...) you connect these file systems with an srfs framework, all servers will display the same file system tree, which is somewhat of a union between them all. if you wish to talk in coda terms, you can say all servers operated disconnectedly, and then were connected at the same time. the conflict resolving mechanism we use, is by majority.

We differ from coda in the sense we don't have a main server, which pushes Volumes to sub-servers (im not sure what the coda terminology is... ), and data is served in a load-balanced way. In Srfs, all the data resides on all servers (hosts) and is replicated between them. replication takes place at two levels: tree view (plus meta data) and the actual data.

all replication is lazy, and happens only on access to dirs / files (and on successful writes - when the file is being closed.)

Thus, the following behavior can be achieved: lets say you have 2N+1 hosts, all with coherent file system trees. now, take N of them offline, change the tree, put those N back online, and their tree will be the same as the other N+1 other hosts.

The main goal of the file system is self stabilization, over long periods of time and long distances. you can use it as a SAN, or as a data farm, using system like LinuxVirtualServer to balance the load between nodes.
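Nir's 2N+1 example can be illustrated with a small Python sketch of majority-based conflict resolution. This is only a toy model under stated assumptions: the function names, the dictionary of per-host versions, and the strict-majority rule are illustrative choices, not srfs code, and the announcement does not say how ties or a missing majority are handled.

```python
from collections import Counter


def resolve_by_majority(versions):
    """Return the version held by a strict majority of hosts, or None.

    `versions` maps a host name to the version of a file (or tree) it holds.
    Hypothetical helper: srfs's real resolution logic is not shown in the post.
    """
    if not versions:
        return None
    value, count = Counter(versions.values()).most_common(1)[0]
    return value if count > len(versions) / 2 else None


def stabilize(versions):
    """Make every host converge on the majority version, if one exists."""
    winner = resolve_by_majority(versions)
    if winner is None:
        # Assumption: without a strict majority, nothing is changed.
        return dict(versions)
    return {host: winner for host in versions}


# 2N+1 = 5 hosts; N = 2 of them ("d" and "e") changed the tree while offline.
hosts = {"a": "v1", "b": "v1", "c": "v1", "d": "v2", "e": "v2"}
converged = stabilize(hosts)
```

Under this model the N hosts that were offline lose whatever they changed, which is the kind of overwrite Erik was asking about.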

3. Developer Discussion Of POSIX Capabilities

21 Oct 2003 - 25 Oct 2003 (19 posts) Subject: "posix capabilities inheritance"

Topics: POSIX

People: Michael Glasgow, Albert Cahalan

Michael Glasgow reported:

I wrote a simple setuid-root wrapper which sets some capabilities, gives up all other privs, and and then execs a shell. I was hoping to use this wrapper as a login shell so that I could have a user log in interactively with a small subset of elevated privileges.

Unfortunately after looking over the capabilities code in the 2.4 kernel, it would appear that this is not currently possible, and my wrapper cannot work without filesystem support for capabilities. And even then, I'd have to set each file's inheritable flag for the capabilities I want on every executable that I am likely to run, including the shell. Am I mising something, or is this an accurate description?

I think I understand the rationale behind this behavior; the draft posix 1003.1e specification states:

The purpose of assigning capability states to files is to provide the exec() function with information regarding the capabilities that any process image created with the program in the file is capable of dealing with and have been granted by some authority to use.

So, the lack of an inheritable flag on a file can serve to prevent that file from executing with the corresponding capability enabled.

Fine, but what about my semi-superuser shell situation? How can I force the retention of a capability set across exec() for all executables? It would seem that neither the spec nor the current implementation in the 2.4 kernel allow for this, but it strikes me as a pretty reasonable and useful thing to do in some cases.

As an interim workaround, how about assuming all capabilities are inheritable in fs/exec.c:prepare_binprm, i.e. instead of cap_clear(bprm->cap_inheritable), call cap_set_full() ??? I don't think this would break anything, and it would make capabilities a lot more useful until we get fs support merged in.
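The exec-time behavior Michael describes can be modeled with a few bitmask operations. The sketch below is a simplified Python model, not kernel code: the formula is an assumption about how the 2.4 kernel combines the process and file capability sets at exec(), the function and parameter names are invented for illustration, and special cases such as the handling of root are left out.

```python
# Bit positions follow the Linux capability numbering.
CAP_NET_BIND_SERVICE = 1 << 10
CAP_SYS_TIME = 1 << 25
CAP_FULL = 0xFFFFFFFF


def caps_after_exec(p_inheritable, f_permitted, f_inheritable, f_effective,
                    bounding=CAP_FULL):
    """Model the permitted and effective sets of a process after exec().

    Assumed rule (simplified):
        permitted' = (file permitted & bounding set)
                     | (file inheritable & process inheritable)
        effective' = permitted' & file effective
    """
    new_permitted = (f_permitted & bounding) | (f_inheritable & p_inheritable)
    new_effective = new_permitted & f_effective
    return new_permitted, new_effective


wanted = CAP_NET_BIND_SERVICE | CAP_SYS_TIME

# No filesystem support: all file sets are cleared for a non-root exec.
current_behavior = caps_after_exec(wanted, 0, 0, 0)

# Proposed workaround: treat every capability as inheritable on the file.
with_workaround = caps_after_exec(wanted, 0, CAP_FULL, 0)
```

In this model the proposed change keeps the wrapper's capabilities in the permitted set, but the effective set stays empty unless the file's effective set is also treated as full, so a program would presumably still have to raise the capabilities it needs.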

There was a bit of discussion, and at one point Albert Cahalan said:

The people who wrote the code were working from two different drafts of the spec. I think some people used draft 16, while others used draft 17. (or 15 and 16, or 17 and 18 -- a difference of 1) Between these two drafts there had been BIG changes. Well, a critical equation changed.

People at SGI, mindlessly cloning the IRIX code, stuck us with the half-ass set of capability bits we have today. They ignored the DG-UX implementation using 256 bits and slightly different equations. They ignored the fact that the security model will be terribly inconsistent if you still have apps making UID-based decisions -- that is, you need to allocate bits for glibc, XFree86, Linux vendors, admin tools, various databases, and local site usage. Yes it's yucky, but it's required. Covering ears and burying the head won't make this go away.

Nobody thought to have half the bits default to "on" for stuff currently allowed for regular users. For example, the right to listen for incoming network connections could be limited if this had been given a default-enabled bit.

Then there's the emergency hack done to patch a security hole that the capability bits introduced. I think that was back in the early 2.4.x days.

People like to ignore the fact that apps tend to answer "Do I need setuid-style precautions?" by examining UID.

People like to ignore the fact that privileged code, written with setuid in mind, can lead to all sorts of mayhem if 42% of the privileged operations are prohibited. Yeah, you'd hope that a setuid app has great error checking and can cope... but "hope" shouldn't satisfy you. We really need a way for app authors to mark a binary as "always block rights P, Q, and R" and "block all rights unless given V, W, and X", with the assumption that an unmarked app requires an all-or-none situation.

Probably there should be two worlds on the system. Apps with "funny" rights should be kept away from UID 0 and setuid apps, while apps with UID 0 or setuid should be kept away from "funny" rights. Give the init process a special ability to cross worlds.

The authors of our code seem to have given up and moved on. Nobody cleaned up the mess. Is it any wonder the POSIX draft didn't ever make it beyond the draft state?

(and damn, WTF is with !capable(...) meaning that you are capable of performing something?)

One final horror: just imagine trying to write up some sane documentation for the average admin. Poorly-understood security mechanisms are a hazard. BTW, don't forget to imagine documenting all the interactions with UID, filesystems, etc.

Face it: admins will think in terms of assigning rights to users, never minding that there are some weird equations, UID interactions, and perhaps per-executable bits.

4. Status Of BitKeeper Changeset Numbers

21 Oct 2003 - 24 Oct 2003 (7 posts) Subject: "cset #'s stable?"

Topics: Version Control

People: David Woodhouse, Theodore Ts'o, Larry McVoy, Frank Cusack, Chris Wright

Frank Cusack asked if BitKeeper's changeset numbers were stable, because he noticed that a patch he'd submitted under one changeset number seemed to have been incorporated under another. Chris Wright explained that no, the changeset numbers themselves were not stable, but the key (obtained by 'bk changes -k -r<rev>') was stable. And David Woodhouse remarked that, "This is in the X-BK-ChangeSetKey: header of the mails sent to the mailing lists." Theodore Ts'o also explained to Frank, "Changeset numbers are subject to change when you merge in other changesets which depend on earlier changesets. So older changeset numbers tend to be more stable compared to newer changeset numbers, and changeset numbers won't change unless you have done a pull (or someone else has done a push) to your repository." At one point Larry McVoy also said:

In general, we're moving towards a BK version where keys (internal revisions, sort of like mail message id's) are useable anywhere a rev is useable.

One place we'll be using this is on BK/Web so that you guys can have URLs that don't change out from underneath you.

We should fix that at the same time that we turn on the GNU patch server so you can get any changeset as a patch. The dual T1's are due in at the end of this month.

There may be some delay, I'm away dealing with family stuff that is way higher in priority than this but I'll try and get someone else to do it if it takes longer than the end of the month before I'm back.
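The difference between unstable changeset numbers and stable keys can be shown with a toy Python model. It is not BitKeeper's actual numbering scheme: the key strings, the `time` field, and the rule that numbers are assigned by position in time order are all assumptions made for illustration.

```python
def number_changesets(changesets):
    """Assign positional revision numbers, keyed by each changeset's stable key.

    Toy rule (an assumption): changesets are ordered by time and numbered
    1.1, 1.2, ... so a number depends on everything that sorts before it.
    """
    ordered = sorted(changesets, key=lambda cs: cs["time"])
    return {cs["key"]: f"1.{i}" for i, cs in enumerate(ordered, start=1)}


# Local repository before a pull (hypothetical keys).
mine = [
    {"key": "alice|0001", "time": 10},
    {"key": "alice|0003", "time": 30},
]
before = number_changesets(mine)

# After pulling someone else's changeset that sorts in between.
pulled = mine + [{"key": "bob|0002", "time": 20}]
after = number_changesets(pulled)
```

Here the newer local changeset moves from 1.2 to 1.3 after the pull while its key stays the same, and the older one keeps its number, which matches Ted's remark that older changeset numbers tend to be more stable.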

5. UML For 2.6.0-test8; Known UML 2.6 Loadable Module Problems

21 Oct 2003 - 25 Oct 2003 (7 posts) Subject: "uml-patch-2.6.0-test8"

Topics: User-Mode Linux, Version Control

People: Jeff Dike, Brice Goglin

Jeff Dike announced:

This patch updates UML to 2.6.0-test8.

The 2.6.0-test5 UML patch is available at

BK users can pull my 2.5 repository from

For the other UML mirrors and other downloads, see

Other links of interest:

The UML project home page :
The UML Community site :

Brice Goglin reported good success with the patch, "except when enabling loadable module support (CONFIG_MODULES)." Jeff replied that this was a "Known problem, I haven't got around to implementing the changes needed for modules in 2.6."

6. udev 005 Released

22 Oct 2003 - 24 Oct 2003 (8 posts) Subject: "[ANNOUNCE] udev 005 release"

Topics: FS: devfs, FS: sysfs, Hot-Plugging, Klibc, Version Control

People: Greg KH, Lars Marowsky-Bree, Chris Friesen, Giuliano Pochini, Robert Love

Greg KH announced:

This release is done in advance of a talk about it for the CGL meeting tomorrow at OSDL.

I've released the 005 version of udev. It can be found at:

rpms are available at:
with the source rpm at:

udev is a implementation of devfs in userspace using sysfs and /sbin/hotplug. It requires a 2.6 kernel to run.

The major changes since the 004 release are:

The biggest stuff is the klibc integration. If you build with klibc, the 453K binary shrinks to 45K. Nothing like a power of ten decrease :)

The rpms are still built with debugging enabled, using glibc, so they do not get any size savings yet...

Again, many thanks to Dan Stekloff, Kay Sievers, and Robert Love for their help with patches for this release. I really appreciate it.

The full ChangeLog can be found below.

The udev FAQ can be found at:

Development of udev is done in a BitKeeper tree available at:

I have the initial framework of some regression tests in the bk tree, but there is a libsysfs bug that is keeping these tests from working properly right now. The libsysfs people are working on fixing this.

If anyone ever wants a snapshot of the current tree, due to not using BitKeeper, or other reasons, is always available at any time by asking.
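The mechanism Greg describes (sysfs plus /sbin/hotplug) can be sketched in a few lines of Python. This is a hedged illustration, not udev's code: it assumes the hotplug event arrives as ACTION and DEVPATH environment variables and that the device's sysfs directory holds a `dev` file of the form "major:minor"; the function name and the rule of naming the node after the kernel device name are invented for the example. It only plans the operation and never creates a node.

```python
import os
import tempfile


def plan_device_node(env, sysfs_root="/sys", dev_root="/dev"):
    """Decide what a hotplug handler would do for one event.

    Returns ("add", node, major, minor), ("remove", node), or None.
    Hypothetical helper for illustration only.
    """
    action = env.get("ACTION")
    devpath = env.get("DEVPATH")
    if action not in ("add", "remove") or not devpath:
        return None
    # Assumption: the node is simply named after the kernel device name.
    node = os.path.join(dev_root, os.path.basename(devpath))
    if action == "remove":
        return ("remove", node)
    dev_file = os.path.join(sysfs_root, devpath.lstrip("/"), "dev")
    try:
        with open(dev_file) as f:
            major, minor = (int(x) for x in f.read().strip().split(":"))
    except (OSError, ValueError):
        return None  # no usable "dev" file, so no node to create
    return ("add", node, major, minor)


# Exercise the sketch against a fake sysfs tree instead of the real /sys.
with tempfile.TemporaryDirectory() as fake_sys:
    os.makedirs(os.path.join(fake_sys, "block", "hda"))
    with open(os.path.join(fake_sys, "block", "hda", "dev"), "w") as f:
        f.write("3:0\n")
    added = plan_device_node({"ACTION": "add", "DEVPATH": "/block/hda"},
                             sysfs_root=fake_sys)

removed = plan_device_node({"ACTION": "remove", "DEVPATH": "/block/hda"})
```

A real handler would go on to create or delete the node and apply naming rules; those steps are deliberately left out here.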

Giuliano Pochini asked why devfs had been considered unfixable. Lars Marowsky-Bree replied, "Well, one of the bugs seems to be that people just didn't like the approach, while udev's approach is lean and mean and people do seem to approve of it. That's a matter of taste." Chris Friesen urged folks not to get into a devfs bashing party, but to google around for it first.

7. Status Of UMSDOS In 2.6

23 Oct 2003 (2 posts) Subject: "umsdos and kernel 2.6"

Topics: FS: UMSDOS

People: Alexander Viro

Someone asked if Linux 2.6 would support UMSDOS, and Alexander Viro replied, "Not unless it gets very massive fixes."

8. Driver Update For MPT Fusion

24 Oct 2003 (11 posts) Subject: "[PATCH] 2.4.23-pre8 driver udpate for MPT Fusion (2.05.10)"

Topics: Backward Compatibility, Disks: SCSI, Hot-Plugging, PCI

People: Eric Dean Moore, James Bottomley, Greg KH, Matthew Wilcox, Christoph Hellwig

Eric Dean Moore of LSI Logic announced:

Here's a patch for 2.4.23-pre8 kernel for MPT Fusion driver, coming from LSI Logic.

This patch is large, so I have placed it on the LSI ftp site at:

A new email address is setup for directing any MPT Fusion questions:

James Bottomley remarked, "The policy for driver updates into 2.4 is that they should be backports from 2.6 (for things like mpt fusion that have similar drivers) so that the newer driver gets into 2.6 first. If you want to send the 2.6 patches, I can queue them up for when the "bugfix only" freeze is relaxed." Eric replied:

I'm clear on the policy, however the MPT driver in 2.6 kernel is *NOT* compatible with what is shipping in 2.4 kernel which is 2.05.05+ driver. The driver in 2.6 has had most of the backward compatibility stripped out, such as Old Error Handling, and many other changes to make it work with new kernel structures and functions, however doesn't make it backward compatible to 2.4 kernel.

Our focus has been 2.4 kernel version of the driver as that is what is shipping in all Linux distributions, and our customers have been asking for RPM driver updates to the latest driver fix bugs and enhancements for their shipping systems out in the field. One major OEM player has requested we update as to reduce their dependency on LSI for RPM driver updates. I wish that these updates make their way into the 2.4 kernel. I will begin porting these changes over the driver in 2.6 immediately. Also one thing is that there have been change on Maintainership of this driver from Pam Delaney to myself and Larry Stephens, so things are about getting back to normal.

Elsewhere, Matthew Wilcox also asked if a 2.6 version would become available, and Eric said yes, but he wasn't sure exactly when. Greg KH asked, "How about support for all of the pci hotplug systems on 2.4 that are shipping today?" Matthew replied, "The SCSI system isn't really capable of supporting hotplug PCI in 2.4." And Greg acknowledged, "Yeah, but some drivers almost do (Adaptec comes to mind.) It will work in a pci hotplug system, while other scsi drivers will not work at all." But Christoph Hellwig came back with, "No, it won't work. calling scsi_register outside ->detect on 2.4 will just get you a dead Scsi_Host."

9. More Status Of Software Suspend

25 Oct 2003 - 26 Oct 2003 (8 posts) Subject: "Announce: Swsusp-2.0-2.6-alpha1"

Topics: Software Suspend, Version Control

People: Nigel Cunningham

Nigel Cunningham said:

I'm pleased to be able to announce the first test release of a port of the current 2.0 pre-release Software Suspend code to 2.6. This is now available from ( and bk://

Release notes:

Apart from the above, and the normal problems with incomplete driver support will continue. In addition, you may see freezing failures. If the process hangs at 'Freezing processes: Waiting for activity to finish' or 'Syncing remaining I/O', try pressing escape once. If the process doesn't abort, try a second time (which tries harder to restart things). All going well, you should be able to cancel the suspend. A log of what went wrong will be stored in /var/log/messages. Run it through ksymoops if necessary and send it to me, and I should be able to address the issue.

Please send feedback via the Software Suspend mailing list on Sourceforge. See for FAQs, mailing list details and so on. Because the code is essentially the same as the 2.4 version, many of the solutions to issues will be the same.

Iain D. Broadfoot was very excited about this patch, and posted a quick one-liner that made it compile for him. But later he reported that actually suspending to disk did not work in his tests.

10. New DevFS Replacement uSDE, Similar To udev

27 Oct 2003 - 28 Oct 2003 (15 posts) Subject: "ANNOUNCE: User-space System Device Enumeration (uSDE)"

Topics: Disk Arrays: LVM, Disks: IDE, Disks: SCSI, FS: devfs, FS: sysfs, Hot-Plugging, Networking, Serial ATA, USB

People: Mark Bellon, Patrick Mochel, Greg KH

Mark Bellon announced:

Initial availability of User-Space System Device Enumeration (uSDE) package, version 0.74, can be found at

The uSDE provides an open framework for the enumeration (specification) of system devices in a dynamic environment. Device handling is implemented via plug-in programs known as policy methods. Policy methods are free to handle their devices in any way, from trivial to complex - anything from providing LSB device nodes to persistent device name handling with replacement and relocation strategies.

The uSDE depends on /sbin/hotplug (for dynamic insertions and removals), sysfs (for device information) and /proc (various pieces of information). It is not dependent on initrd - it explicitly scans sysfs upon system startup to determine the initial device ensemble.

Part of the uSDE release is a collection of sample polices:

disk-ide-policy - handles IDE, EIDE, SATA and USB-EIDE disks. Implements persistent device naming, automatic device replacement and automatic device relocation features.

disk-scsi-policy - handles SCSI, IEEE-1394, FibreChannel and USB-SCSI disks including multiported devices. Iplements persistent device naming, automatic device replacement and automatic device relocation features.

multipath-policy - handles the automatic provisioning of multipathing for multiported storage devices.

ethernet-policy - handles ethernet interefaces. Implements persistent interface naming, interface anchoring, automatic device replacement and automatic device relocation features.

floppy-policy - handles internal floppy disks.

simple-device-policy - a "catch all" policy for block and character devices.

devfs-policy - provides devfs device names.

lsb-policy - provides LSB device names.

Mailing list:

Patrick Mochel asked:

How does uSDE relate to udev? You do not mention it in your email, though it claims to implement similar, if not identical functionality. Is it related? Is it built on top of it?

If not, are you planning on merging your efforts with udev in the future?

Are you using the libsysfs library for accessing sysfs data? If not, I highly recommend it.

I would also recommend sending email to the linux-hotplug list, as most of the hotplug-related applications are discussed and developed via that list.

Mark said he'd look into the mailing list, libsysfs, and merging with udev; he also explained:

The uSDE is not built on top of udev.

The uSDE and udev are similar in some respects. They both create device nodes. There is a lot more to handling devices than managing device nodes.

Some differences between uSDE and udev that come to mind as I type (a good deal of this is part of the INTRO in the uSDE tarball):

Devices are classified and an explicit, ordered list of policies are invoked on behalf of the devices based on that classification.

Policies are implemented as open plug-ins that have complete control (e.g. naming, configuration, special needs) over a device.

Multiple policies can be executed concurrently; they can be independent or cooperative.

All device types are embraced - ethernet, disks, cdroms, floppies, MD, LVM and so on. Policies can analyze data and handle complex situations such as ethernet interface anchoring, multiported disk handling and automatic multipath device management.

The concept of service agents who provide critical information to the enumeration framework allowing policies to handle extremely diverse hardware situations such as multiple chassis and geographical addressing.

The uSDE sample policies implement basic device replacement and relocation strategies, something that the community has been asking about for some time.

If you want to learn more about that differences, download the tarball and try it out...

The uSDE was built in response to a set of telco and embedded community requirements. We found it difficult to express our ideas. Everyone wanted to see code and documentation. Here is the code and the initial documentation. This is a starting point...
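The design Mark outlines, where a device is classified and an ordered list of policies is then run for it, might look roughly like the following Python sketch. Nothing here comes from the uSDE tarball: the classification rule, the policy functions, the device dictionaries, and the naming patterns are all hypothetical, and the policies run one after another rather than concurrently.

```python
def lsb_name(device):
    """Hypothetical policy: a plain node named after the kernel device name."""
    return "/dev/" + device["kernel_name"]


def persistent_name(device):
    """Hypothetical policy: a stable name derived from a serial number."""
    serial = device.get("serial")
    return f"/dev/disk/by-serial/{serial}" if serial else None


# Ordered list of policies per device class (an assumed configuration).
POLICY_TABLE = {
    "disk": [lsb_name, persistent_name],
    "other": [lsb_name],
}


def classify(device):
    """Assumed classification rule: block devices are disks."""
    return "disk" if device.get("subsystem") == "block" else "other"


def enumerate_device(device):
    """Run every policy for the device's class, in order, and collect names."""
    names = []
    for policy in POLICY_TABLE[classify(device)]:
        name = policy(device)
        if name is not None:
            names.append(name)
    return names


disk = {"subsystem": "block", "kernel_name": "sda", "serial": "XYZ123"}
tty = {"subsystem": "tty", "kernel_name": "ttyS0"}
disk_names = enumerate_device(disk)
tty_names = enumerate_device(tty)
```

Keeping the policies as separate callables in a table is one way to get the "open plug-in" property Mark mentions: a new device class only needs a new table entry.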

A number of people criticized Mark for starting a project from scratch that was so similar to an existing project such as udev. And Greg KH said at one point, "I have a few emails from you lying around here from back in Feb and March of this year in which you detailed this project. And you have been aware of udev from at least April, as it's code has been public since then." He also added his voice to those asking why the new project had been started instead of contributing to the existing udev project. Mark replied:

The two packages take philosophically different approaches and arrive with (largely) overlapping and some non-overlapping capabilities - after all they are both trying to do "the same thing". The uSDE has strengths and weaknesses just as udev or any program does. It is certainly possible to discuss changes (and make patches) to udev to incorporate the key issues addressed in the uSDE implementation.

The uSDE is an encapsulation of ideas and techniques. It is "complete" enough for those ideas to be discussed in a community setting and we can see how/what to move things together. Think of it as the projects "resting place" from which to confidently discuss techniques and implementions.

He also went into more detail on what his requirements were, and why udev didn't meet them:

The requirements were collected from the OSDL CGL requirements specification version 1.0 and 1.1 ratified September 2002. They come from extensive discussions with the OSDL members as part of the definition of these requirements, expounding on them:

Greg had a lot of points to make in response, but the discussion petered out immediately, with nothing conclusive emerging.







Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.