Kernel Traffic #306 For 11 Apr 2005

By Zack Brown

Mailing List Stats For This Week

We looked at 2041 posts in 12MB. See the Full Statistics.

There were 727 different contributors. 262 posted more than once. The average length of each message was 94 lines.

The top posters of the week were:

69 posts in 377KB by Ingo Molnar
65 posts in 420KB by Andrew Morton
50 posts in 413KB by Evgeniy Polyakov
46 posts in 208KB by Steven Rostedt
43 posts in 166KB by Herbert Xu

The top subjects of the week were:

97 posts in 395KB for "Can't use SYSFS for "Proprietry" driver modules !!!."
62 posts in 272KB for "[patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07"
51 posts in 201KB for "forkbombing Linux distributions"
50 posts in 227KB for "[PATCH] API for true Random Number Generators to add entropy"
45 posts in 395KB for "NFS client latencies"

These stats generated by mboxstats version 2.2

1. Real-Time Preemption Updates And Bug Hunting

19 Mar 2005 - 7 Apr 2005 (89 posts) Archive Link: "[patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-00"

Topics: FS: ext3, Real-Time, SMP

People: Ingo Molnar, Steven Rostedt, Paul E. McKenney, Lee Revell

Ingo Molnar said:

i have released the -V0.7.41-00 Real-Time Preemption patch (merged to 2.6.12-rc1), which can be downloaded from the usual place:

the biggest change in this patch is the merge of Paul E. McKenney's preemptable RCU code. The new RCU code is active on PREEMPT_RT. While it is still quite experimental at this stage, it allowed the removal of locking cruft (mainly in the networking code), so it could solve some of the longstanding netfilter/networking deadlocks/crashes reported by a number of people. Be careful nevertheless.

there are a couple of minor changes relative to Paul's latest preemptable-RCU code drop:

to create a -V0.7.41-00 tree from scratch, the patching order is:

Lee Revell reported system hangs with this code, but no serious debugging took place. Elsewhere, Paul E. McKenney offered a few cleanup patches, and Ingo updated his tree. Magnus Naeslund confirmed that system hangs were occurring about 20 minutes after bootup with this kernel, exactly as Lee had reported. Ingo tried doing some code revisions that turned out to blow up in his face, leaking dentries all over the place. He abandoned that entire tree and returned to his earlier version. It didn't help that while these lockups occurred on single-processor systems, there was another crash bug showing up -- a race condition between two CPUs -- that only affected SMP systems.

A lengthy debugging process followed, with more work being done to solve the SMP problem than the other. It seemed that progress was being made, with folks like Steven Rostedt working round the clock on patch implementations; but ultimately the solution was not as forthcoming as folks would have liked. To give an example of some of the troubles they had during this thread, at one point Ingo asked for a cleaner patch, in hopes of migrating it eventually into the official tree, and Steven remarked, "that's the main problem. Without changing the design of the ext3 system, I don't think there is a clean patch."

2. Linux 2.6.12-rc1-mm3 Released

25 Mar 2005 - 1 Apr 2005 (56 posts) Archive Link: "2.6.12-rc1-mm3"

Topics: Kernel Release Announcement, Serial ATA, USB, Version Control

People: Andrew Morton

Andrew Morton announced Linux 2.6.12-rc1-mm3, saying:

3. JFFS3 Cleanups And An Attempt At Additional APIs

25 Mar 2005 - 3 Apr 2005 (44 posts) Archive Link: "[RFC] CryptoAPI & Compression"

Topics: Compression

People: Artem B. Bityuckiy, David Woodhouse

Artem B. Bityuckiy said:

I'm working on cleaning-up the JFFS3 compression stuff. JFFS3 contains a number of compressors which actually shouldn't be there as they are platform-independent and generic. So we want to move them to the generic part of the Linux kernel instead of storing them in fs/jffs2/. And we were going to use CryptoAPI to access the compressors.

But I've hit on a problem - CryptoAPI's compression method isn't flexible enough for us.

CryptoAPI's compress method is:

int crypto_compress(struct crypto_tfm *tfm, const u8 *src, unsigned int slen, u8 *dst, unsigned int *dlen);

*src - input data;
slen - input data length;
*dst - output data;
*dlen - on input - output buffer length, on output - the length of the compressed data;

The crypto_compress() API call either compresses all the input data or returns error.

In JFFS2 we need something more flexible. The following is what we want:

int crypto_compress_ext(struct crypto_tfm *tfm, const u8 *src, unsigned int *slen, u8 *dst, unsigned int *dlen);

*src - input data;
*slen - on input - input data length, on output - the amount of data that were actually compressed.
*dst - output data;
*dlen - on input - output buffer length, on output - the length of the compressed data;

This would allow us (and we really need this) to provide a large input data buffer, a small output data buffer and to ask the compressor to compress as much input data as it can to fit (in the compressed form) to the output buffer. To put it differently, we often have a large input, and several small output buffers, and we want to compress the input data to them.

I offer to extend the CryptoAPI and add an "extended compress" API call with the above mentioned capabilities. We might as well just change the crypto_compress() and all its users.

Alternatively, we may create some kind of "Compression API" but I don't like this idea...

Herbert Xu was happy to see the generic code moved into the generic part of the kernel, but he felt the specific API suggested by Artem would be slow and unwieldy. While he was willing to add the interface, he felt it would be best to keep the existing API as well. The two of them powwowed over an implementation that would solve the various requirements; and David Woodhouse also chimed in with his own ideas; but ultimately Artem had to give up on his idea, when he couldn't convince Herbert of its value.

As Artem put it, "OK. No problems, that was just RFC. :-)"
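The calling convention Artem proposed can be made concrete with a small sketch in C. It is only an illustration under stated assumptions: a toy run-length encoder stands in for a real compressor, and toy_compress_ext() and compress_into_chunks() are hypothetical names, not part of the CryptoAPI or of any JFFS code. The point is the in/out behaviour of slen and dlen, which lets one large input be spread across several small output buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy run-length "compressor" standing in for a real algorithm.
 * Hypothetical: not the CryptoAPI and not JFFS2/JFFS3 code.
 *
 * On entry: *slen = bytes available in src, *dlen = capacity of dst.
 * On exit:  *slen = bytes of src consumed, *dlen = bytes written to dst.
 * Returns 0 on success, -1 if dst cannot hold even one (count, byte) pair.
 */
static int toy_compress_ext(const uint8_t *src, unsigned int *slen,
                            uint8_t *dst, unsigned int *dlen)
{
    unsigned int in = 0, out = 0;

    assert(src && slen && dst && dlen);

    /* Emit (run length, byte) pairs while both input and room remain. */
    while (in < *slen && out + 2 <= *dlen) {
        uint8_t byte = src[in];
        unsigned int run = 1;

        while (in + run < *slen && src[in + run] == byte && run < 255)
            run++;

        dst[out++] = (uint8_t)run;
        dst[out++] = byte;
        in += run;
    }

    if (in == 0 && *slen > 0)
        return -1;      /* no progress: output buffer too small */

    *slen = in;
    *dlen = out;
    return 0;
}

/*
 * Spread one large input across several fixed-size output buffers.
 * bufs holds nbufs buffers of buf_size bytes each, back to back.
 * Returns the number of buffers used, or -1 on error; used[i]
 * receives the number of bytes written to buffer i.
 */
static int compress_into_chunks(const uint8_t *src, unsigned int slen,
                                uint8_t *bufs, unsigned int buf_size,
                                unsigned int nbufs, unsigned int *used)
{
    unsigned int done = 0, n = 0;

    while (done < slen && n < nbufs) {
        unsigned int in = slen - done;
        unsigned int out = buf_size;

        if (toy_compress_ext(src + done, &in, bufs + n * buf_size, &out))
            return -1;
        used[n++] = out;
        done += in;
    }
    return done == slen ? (int)n : -1;
}
```

With the existing crypto_compress() signature, where slen is passed by value, the caller has no way to learn how much input was consumed, so a loop like compress_into_chunks() cannot be written on top of it.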

4. Using New And Old Megaraid Drivers

25 Mar 2005 - 31 Mar 2005 (10 posts) Archive Link: "megaraid driver (proposed patch)"

Topics: Version Control

People: Bruno Cornec, James Bottomley

Bruno Cornec reported:

I've noticed that since recent kernel versions, it's not possible anymore to use simultaneously new and old megaraid driver.

It seems to have been introduced by that changeset: drivers/scsi/megaraid/Kconfig.megaraid

It particularly makes life of people developing kernel for distro difficult as it forces them to drop support for legacy hardware which is working just fine with 2.6, or to patch their own kernel build. As well it prevents simultaneous usage of new and old cards in the same system.

Would you consider to apply the following patch proposed by Thierry Vignaud as a solution for the MandrakeSoft kernel in the mainstream 2.6 kernel?

James Bottomley suggested talking to the megaraid maintainers and the linux-scsi mailing list. There was a brief discussion of possible solutions, but the thread petered out with no clear resolution.

5. Reorganizing Network Configuration Options

30 Mar 2005 - 5 Apr 2005 (18 posts) Archive Link: "[RFC/PATCH] network configs: disconnect network options from drivers"

Topics: Modems, Networking

People: Randy Dunlap, Jamal Hadi Salim, Sam Ravnborg, David S. Miller, Chris Friesen, Andrew Morton

Randy Dunlap said:

RFC: This is a work-in-progress (WIP), not yet completed.

A few people dislike that the Networking Options menu is inside the Device Drivers/Networking menu. This patch moves the Networking Options menu to immediately before the Device Drivers menu, renames it to "Networking options and protocols", & moves most protocols to more logical places (IMHOOC).

The reasons that it is still WIP are:

Jamal Hadi Salim said, "About time someone brave did this." And Chris Friesen and David S. Miller said the work looked sensible, as did Thomas Graf. Randy was happy for the kudos, but asked specifically for feedback on some key issues.

For one, he asked if the best thing would be "leaving IrDA and Bluetooth subsystem (with drivers) where they are, which is under "Network options and protocols" (I really don't want to split their drivers away from their subsystem, just to put them under Network driver support.)" Sam Ravnborg replied, "Agreed. All IrDA / Bluetooth stuff belongs together. Leave them where they are for now."

Randy also wanted to know if "leaving SLIP, PPP, and PLIP where they are under Network driver support, even though they say that they are "protocols"" would be a good idea. Sam replied, "SLIP and PLIP is not that common. PPP is more common for cable-modem/ADSL I suppose. But still it would make sense to create an Misc protocols menu, like we have a misc filesystems menu." But on further reflection, Randy noticed that "SLIP, PLIP, and PPP depend on NETDEVICES, and they use some netdev interfaces, so they appear to be more like net devices than protocols even though they are called protocols in Kconfig text, so I am leaving them alone for now."

Sam posted his own additions to Randy's work, and they continued to discuss what would be best. Sam said:

The new Networking menu looks unstructured. And the net/Kconfig file contains a lot of config snippets that do not belong there. So I took a stab at it with focus on:

The patch became much larger. The win is that the top-level net/Kconfig contains much less cruft.

Randy replied, "Nice job overall. Especially nice to move ATM, bridge, DECNET, ECONET, etc., to their own Kconfig files so that they are more manageable." But he added, "I still prefer Networking to come before Device Drivers FWIW. Just makes some kind of hierarchical sense to me." Sam fixed this location; and after further digging, Randy said:

Here are a few more suggestions for you to consider.

Sam implemented both of these suggestions, and added, "I thought of creating a Kconfig.netfilter for the common netfilter stuff. But in the end did not do it - felt there was plenty of new small files being created already." Randy replied, "It would make sense to isolate the netfilter options, but that can be done later. But you are right about "plenty of new small files."" He added, "I would move Frame Diverter (NET_DIVERT) from the end of the net/core/Kconfig file to the top of the same file.... and then ship it. :)" Sam committed his changes and pushed them along to Andrew Morton's -mm tree.

6. Linux 2.6.12-rc1-mm4 Released

31 Mar 2005 - 4 Apr 2005 (6 posts) Archive Link: "2.6.12-rc1-mm4"

Topics: FS: ReiserFS, Kernel Release Announcement

People: Andrew Morton

Andrew Morton announced Linux 2.6.12-rc1-mm4, saying:

7. FUSE Breaks Backward Compatibility

31 Mar 2005 (3 posts) Archive Link: "[PATCH] FUSE: 0/3 update kernel ABI"

Topics: Backward Compatibility

People: Miklos Szeredi, Franco Broi

Miklos Szeredi said:

The following 3 patches change the userspace interface. Backward compatibility is not retained, the library must be upgraded to 2.3-pre1 or later. The library will support both the old and the new ABI versions. Filesystems dynamically linked with libfuse don't need to be recompiled.

The main reason for the change is that the current interface was not compatible between 32bit and 64bit modes of dual architectures.

The patches are:

1/3 - Add padding to structures to make sizes the same on 32bit and 64bit archs

2/3 - Add offset to fuse_dirent structure. This will make the readdir interface more flexible

3/3 - Change ABI major version from 5 to 6, and check if userspace supports the new interface

In a subsequent post, he added that for the first patch, "Initial testing and test machine generously provided by Franco Broi."
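To see why explicit padding matters for a kernel/userspace ABI, consider the following C sketch. The structures are hypothetical and are not the actual FUSE message layouts; they only show the general technique. On i386 a 64-bit integer inside a struct is aligned to 4 bytes, while on x86-64 it is aligned to 8, so the same declaration can have different sizes in a 32-bit process and a 64-bit kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical message layouts for illustration only; these are not
 * the real FUSE structures.
 */

/* sizeof is 12 on i386 but 16 on x86-64 (4 bytes of tail padding). */
struct msg_unpadded {
    uint64_t offset;
    uint32_t size;
};

/* The explicit filler makes sizeof 16 on both ABIs. */
struct msg_padded {
    uint64_t offset;
    uint32_t size;
    uint32_t padding;
};

/* Compile-time checks that the padded layout is what we expect. */
static_assert(sizeof(struct msg_padded) == 16, "padded size must be 16");
static_assert(offsetof(struct msg_padded, size) == 8, "size at offset 8");
static_assert(offsetof(struct msg_padded, padding) == 12,
              "padding at offset 12");
```

Because every field in struct msg_padded sits at the same offset under both ABIs, a 32-bit userspace library and a 64-bit kernel can exchange it without translation.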

8. New ConfigFS Filesystem For Userspace-Driven Kernel Object Configuration

3 Apr 2005 - 5 Apr 2005 (6 posts) Archive Link: "[PATCH] configfs, a filesystem for userspace-driven kernel object configuration"

Topics: FS: procfs, FS: sysfs, Ioctls

People: Joel Becker, Matt Mackall, Zach Brown

Joel Becker said:

I humbly submit configfs. With configfs, a configfs config_item is created via an explicit userspace operation: mkdir(2). It is destroyed via rmdir(2). The attributes appear at mkdir(2) time, and can be read or modified via read(2) and write(2). readdir(3) queries the list of items and/or attributes.

The lifetime of the filesystem representation is completely driven by userspace. The lifetime of the objects themselves are managed by a kref, but at rmdir(2) time they disappear from the filesystem.

configfs is not intended to replace sysfs or procfs, merely to coexist with them.

An interface in /proc where the API is:

# echo "create foo 1 3 0x00013" > /proc/mythingy

or an ioctl(2) interface where the API is:

        struct mythingy_create {
                char *name;
                int index;
                int count;
                unsigned long address;
        };

        do_create {
                mythingy_create = {"foo", 1, 3, 0x0013};
                return ioctl(fd, MYTHINGY_CREATE, &mythingy_create);
        }

becomes this in configfs:

        # cd /config/mythingy
        # mkdir foo
        # echo 1 > foo/index
        # echo 3 > foo/count
        # echo 0x00013 > foo/address

Instead of a binary blob that's passed around or a cryptic string that has to be formatted just so, configfs provides an interface that's completely scriptable and navigable.

Patch is against 2.6.12-rc1-bk3.

Matt Mackall asked, "How does the kernel know when to actually create the object?" And Zach Brown replied:

"actually create", huh? :)

In the trivial case Joel describes, the item is almost certainly allocated during "# mkdir foo" when the subsystem will get a ->make_item() call for the 'mythingy' group it registered. The various attribute writes then find the item by following their configfs_attribute argument to the item that it's embedded in.

But I bet you're not really asking about creation. I bet you're wondering how the kernel will know when enough attributes have been filled and that it's safe to use the object. Misguided items could assign magical ordering to the attribute filling such that once a final attribute is set, and others have been set, the item goes live. That's what ocfs2 does now, sadly, but certainly not as a long-term solution.

The missing piece is the 'commit_item' group operation that is yet to be implemented. The intent is to have a directory of pending items that can have their attributes filled before being rename()ed into a directory of items that are in active use. The commit_item() call that hits at rename() would give the kernel the chance to refuse the item because attributes haven't been filled in or conflict with existing items, or whatever.
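The shell session Joel quotes maps directly onto mkdir(2), open(2) and write(2), so the same configuration can be driven from a program. The following C sketch assumes a configfs group directory such as /config/mythingy exists; create_mythingy() and write_attr() are hypothetical helper names, and the attribute names are taken from Joel's example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write a string to dir/attr. Returns 0 on success, -1 on failure. */
static int write_attr(const char *dir, const char *attr, const char *value)
{
    char path[4096];
    size_t len = strlen(value);
    ssize_t n;
    int fd;

    assert(dir && attr && value);
    if (snprintf(path, sizeof(path), "%s/%s", dir, attr) >= (int)sizeof(path))
        return -1;

    /*
     * Same flags a shell uses for "echo value > file".  On configfs the
     * attribute file already exists after mkdir(2); O_CREAT only matters
     * when this sketch is tried against an ordinary directory.
     */
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    n = write(fd, value, len);
    if (close(fd) != 0 || n != (ssize_t)len)
        return -1;
    return 0;
}

/*
 * Create an item called name under group_dir and fill in its
 * attributes, mirroring the quoted shell session.
 */
static int create_mythingy(const char *group_dir, const char *name,
                           int index, int count, unsigned long address)
{
    char item[2048], buf[64];

    if (snprintf(item, sizeof(item), "%s/%s", group_dir, name)
        >= (int)sizeof(item))
        return -1;
    if (mkdir(item, 0755) != 0)     /* on configfs: creates the item */
        return -1;

    snprintf(buf, sizeof(buf), "%d\n", index);
    if (write_attr(item, "index", buf))
        return -1;
    snprintf(buf, sizeof(buf), "%d\n", count);
    if (write_attr(item, "count", buf))
        return -1;
    snprintf(buf, sizeof(buf), "0x%05lx\n", address);
    return write_attr(item, "address", buf);
}
```

A call such as create_mythingy("/config/mythingy", "foo", 1, 3, 0x13) would then correspond to the five shell commands above. Note that nothing here tells the kernel when the item is complete, which is exactly the gap Zach describes.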

9. Linux 2.4.30 Released

3 Apr 2005 (1 post) Archive Link: "linux-2.4.30 released"

People: Marcelo Tosatti

Marcelo Tosatti announced Linux 2.4.30, saying that it was identical to 2.4.30-rc4. No changes at all.

10. New Hardware For

4 Apr 2005 (6 posts) Archive Link: " replaced"

People: H. Peter Anvin, Alessandro Suardi

H. Peter Anvin said:

HP has most graciously donated a pair of DL585 quad Opteron servers with 24 GB of RAM and 10 TB of disk using a pair of MSA-30 arrays for each server. The first of these servers was officially put in service today; the next one will be put in service next week. Each server is in a different ISC colo, connected to the Internet via gigabit fiber links.

Consequently, we should now see incredibly much better performance from Huge thanks to HP for the new hardware, and huge thanks to ISC for letting us quadruple our rack space requirements from 5U to 2x10U. We'll be saturating those links in no time :)

He added in a later post:

A few additional notes:

Alessandro Suardi was thrilled about these developments, though he did point out that "2.6.12-rc2 has been announced a few hours ago on, still the patch isn't there.. it will be hard to saturate links that way ;)" H. Peter looked into this and reported it "Fixed. It was uploaded while I was still in the process of getting the upload system set up, and it apparently got recorded as already uploaded."

11. inotify Version 0.22 Released

4 Apr 2005 - 5 Apr 2005 (8 posts) Archive Link: "[patch] inotify 0.22"

People: Robert Love, Dale Blount, Martin Schlemmer

Robert Love said:

Below, find inotify 0.22, against 2.6.12-rc1.

This release introduces a conversion in our primary locking from spinlocks to semaphores. Semaphores are a more natural fit for our code, which synchronizes with user-space, thus we clean up a bit of code with a net reduction of 63 lines. Also, I was able to remove the GFP_ATOMIC allocation.

I did this as a bit of an experiment, not to fix any specific problem, and I now think it is the right way to go.

This release also fixes a small bug in the coalescing code, which could have mistakenly dropped a move event. We now verify that the cookies match before coalescing.

Dale Blount asked, "Will inotify watch directories recursively?" And Robert replied:

No, inotify does not support watching directories recursively. I would love to add it, but it would be a mess to do inside of the kernel.

Making it easy and efficient to watch a full tree, however, was a goal of inotify. Beagle, a personal indexing infrastructure, watches the user's entire home directory.

You could never do this in dnotify because you would run out of file descriptors and pin every file.

In inotify, it is not hard to write a simple recursive loop to add a watch to each directory starting at a given path. It can even be done in an atomic fashion. See

wherein I publish such an algorithm.

Martin Schlemmer asked if the new inotify code would appear in the official kernel any time soon, but there was no reply to this. Robert did release two more updates to the patch during the course of the thread, however.

12. Refining The Master Abort Mode Flag For PCI Bridge Chips

5 Apr 2005 - 6 Apr 2005 (4 posts) Archive Link: "[RFC/Patch 2.6.11] Take control of PCI Master Abort Mode"

Topics: Disks: IDE, Networking, PCI

People: Ross Biro, Randy Dunlap, Daniel Egger

Ross Biro said:

Currently Linux 2.6 assumes the BIOS (or firmware) sets the master abort mode flag on PCI bridge chips in a coherent fashion. This is not always the case and the consequences of getting this flag incorrect can cause hardware to fail or silent data corruption. This patch lets the user override the BIOS master abort setting at boot time and the distro maintainer to set a default according to their target audience.

The comments in the patch are probably a bit too verbose, but I think it is a good patch to start discussions around. If it is decided that something should be done about this problem, this patch could be included in a -mm release and migrate into Linus's kernel as appropriate.

This incarnation of the patch has had minimal testing. For our internal kernels, we always force the master abort mode to 1 and then let the device drivers for hardware we know can't handle target aborts switch the master abort mode to 0. This does not seem appropriate for general release.

Some background for those who do not spend most of their waking hours exploring buses and what can go wrong.

The master abort flag tells a PCI bridge what to do when a bus master behind the bridge requests the bus and the bridge is unable to get the bus. With the flag clear, for master reads the bridge returns all 0xff's (hence silent data corruption) and for master writes, it throws the data away. With the bit set, the bridge sends a target abort to the master. This can only happen when the system is heavily loaded.

The problem with always setting the bit is that some PCI hardware, notably some Intel E-1000 chips (Ethernet controller: Intel Corporation: Unknown device 1076) cannot properly handle the target abort bit. In the case of the E-1000 chip, the driver must reset the chip to recover. This usually leads to the machine being off the network for several seconds, or sometimes even minutes, which can be bad for servers.

I even have a single motherboard with both a device that cannot handle the target abort and an IDE controller that can handle the target abort behind the same bridge. For this motherboard, I have to choose the lesser of two evils, network hiccups or potential data corruption. For the record, I have seen both occur. Other people may wish to make a different choice than we did, hence this patch allows the user to choose the mode at runtime.

Randy Dunlap went over the patch with a fine tooth comb and offered some suggestions; and Daniel Egger thought Ross's patch might solve a problem he'd been having with his own system; but there was no further discussion and the thread ended.
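The flag Ross describes is a single bit in the 16-bit Bridge Control register of a PCI-to-PCI bridge. The C sketch below shows how a three-way policy (leave the firmware setting, force clear, force set) could be applied to a register value. It is not Ross's patch: the enum and apply_master_abort_policy() are hypothetical, and only the two #define values are taken from the Linux kernel's PCI register definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in the Linux kernel's PCI register header. */
#define PCI_BRIDGE_CONTROL          0x3e  /* config-space offset, 16 bits */
#define PCI_BRIDGE_CTL_MASTER_ABORT 0x20  /* master abort mode bit */

/* Hypothetical policy, e.g. chosen by a boot parameter. */
enum master_abort_policy {
    MA_LEAVE_BIOS = 0,  /* keep whatever the firmware programmed */
    MA_FORCE_CLEAR,     /* reads return 0xff, writes are dropped */
    MA_FORCE_SET,       /* bridge signals a target abort instead */
};

/*
 * Return the Bridge Control value that should be written back for the
 * given policy.  All other bits of the register are preserved.
 */
static uint16_t apply_master_abort_policy(uint16_t bridge_ctl,
                                          enum master_abort_policy policy)
{
    assert(policy == MA_LEAVE_BIOS || policy == MA_FORCE_CLEAR ||
           policy == MA_FORCE_SET);

    switch (policy) {
    case MA_FORCE_SET:
        return bridge_ctl | PCI_BRIDGE_CTL_MASTER_ABORT;
    case MA_FORCE_CLEAR:
        return bridge_ctl & (uint16_t)~PCI_BRIDGE_CTL_MASTER_ABORT;
    default:
        return bridge_ctl;
    }
}
```

In a kernel context the value would be read from and written to offset PCI_BRIDGE_CONTROL of the bridge's configuration space; a per-driver override of the kind Ross mentions for the E-1000 would amount to applying MA_FORCE_CLEAR to the bridge above that device.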







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.