Kernel Traffic

Kernel Traffic #248 For 20 Jan 2004

By Zack Brown

In response to Issue #247, Section #3  (18 Dec 2003: Real-Time Maintainership; Nanokernel Maintainership; Patent Policy) , Karim Yaghmour (author of Building Embedded Linux Systems) sent me an email saying:

I was going through KT247 and noticed the summary of the real-time maintainership discussion. Most of the summary is right on the spot and I have nothing to add about it. The only thing I'd like to bring to your attention is Linus' statement on the issue.

As I pointed out in my response to him, Linus is making major mistakes on this issue. For example, the mail you quote has him saying: "It's doubly disgusting with some of the people who were trying to spread all the FUD and mis-information were doing so because they were themselves doing a non-GPL microkernel, and they complained about how the patents were somehow against the GPL and wanted to get community support by trying to make out the situation to be somehow different from what it was."

Yet, neither I nor anyone involved in this debate has ever asked for a non-GPL microkernel. In fact, I was quoted by KT in June 2002 as saying: "We don't want a non-GPL real-time executive or a non-GPL OS. All we want is the right to develop applications using our licenses as others are for Linux."

In addition, Linus' claim that I'm somehow spreading FUD is contradicted by Victor Yodaiken's own statements on the issues, as I explained in my reply to Linus (see above URL.)

I do understand that given Linus' community stance, his word is worth reporting. That being said, Linus' stance doesn't mean he's infallible and I think KT readers would profit from having the full facts presented to them, even when such facts show that a prominent figure of the community is mistaken.

I'm including Karim's comments (with permission) because it seems like a relevant part of the debate. Further discussion should of course go to linux-kernel.

Mailing List Stats For This Week

We looked at 1221 posts in 5976K.

There were 389 different contributors. 189 posted more than once. 144 posted last week too.

The top posters of the week were:

1. Researching SCO's Infringement Claims

22 Dec 2003 - 2 Jan 2004 (54 posts) Archive Link: "SCO's infringing files list"

Topics: Ioctls, POSIX

People: Tom Felker, Erik Andersen, Linus Torvalds, J.W. Schultz, Stan Bubrouski, Florian Weimer, Giacomo A. Catenazzi, Alexander Viro, Xose Vazquez Perez, Mitchell Blank Jr, H. J. Lu, Andries Brouwer, Alan Cox

Stan Bubrouski posted a list of files SCO claimed were infringed by Linux. Among these was 'include/asm-i386/errno.h'. Tom Felker said, "The original errno.h, from linux-0.01, says it was taken from minix, and goes up to 40. Between linux-0.96c and linux-0.97, that file was replaced with the present version, which includes the error strings and goes up to 121. Where did the 0.97 to present version come from?" Erik Andersen replied, "For errno.h, according to this: I got Linus to add ENOMEDIUM and EMEDIUMTYPE into the kernel for 2.0.32, as part of my work on (what was to become) the Uniform cdrom driver, based on original work from David van Leeuwen. I then helped push these defines into glibc and libc5. So at least those two defines are clearly not SCO derived..." Close by, Linus Torvalds also said to Stan:

Good eyes - I only analysed the ctype.h thing, and didn't look up errno.h in the original sources. errno.h has a _big_ comment saying where the numbers came from (and some swearwords about POSIX ;)

Looking at signal.h, those numbers also seem to largely match minix. Which makes sense - I actually had access to them.

In both cases it's only the numbers that got copied, though. And not all of them either - for some reason I tried to make the signal numbers match (probably lazyness - not so much that I cared about the numbers themselves, but about the list of signal names), but for example the SA_xxxx macros - in the very same file - bear no relation to the minix ones.

J.W. Schultz said, "And for the names, perhaps they would care to sue The Open Group? And that probably applies to the rest of these header files." Stan replied:

Just to shore up what jw was talking about:

and for a listing of all the defines the open group has (dir listing):

Florian Weimer also replied to J. W., saying that the comments were more of a problem than the names. For 'errno.h', he said:

The comments were added in Linux 0.99.1, and I'm not sure what was the source. For example, Linux has:

#define ENOTTY          25      /* Not a typewriter */


#define ENOTTY  25      /* Inappropriate ioctl for device       */

Current POSIX:

        Inappropriate I/O control operation.

I couldn't find any historic Minix header files. Minix 2 has:

#define ENOTTY        (_SIGN 25)  /* inappropriate I/O control operation */

Giacomo A. Catenazzi replied:


/* Extended support for using errno values.
   Written by Fred Fish.
   This file is in the public domain.  --Per Bothner.  */
#if defined (ENOTTY)
  ENTRY(ENOTTY, "ENOTTY", "Not a typewriter"),

FYI there was a proposed patch to change "Not a typewriter" to "Inappropriate ioctl for device". Check the interesting thread of lkml:

Linus replied:

Something like that may well be the source of the string. Fred Fish was active long before this timeframe (if it's the same Fred Fish - he used to do freeware collections for the Amiga in the '80's).

But there were multiple libc's around (estdio, libc5, glibc..), and it could be any of them.

Trying to find the kernel list archives from that timeframe would likely clarify the issue. There were several lists back then: "linux-activists" mailing list, and of course the "comp.os.linux" newsgroup (this was before it split into multiple newsgroups).

I've found some archives for linux-activists, but no newsgroup archives going that far back.. Anybody?

Alexander Viro replied:

They'd done what DN had promised and never did - merged old usenet archives into their database, all way back to '81.

The earliest they have on comp.os.linux is March 1992 and results of voting on newgroup had been posted by tytso on Mar 25 1992. There's a bogus crosspost into (still not existing) c.o.l on Mar 21 and regular postings starting on Mar 31. IOW, archive goes all way back to the group creation.

alt.os.linux archives there start on Jan 19 1992 (newgrouped somewhere around Jan 17, took several days to propagate). Before that it's c.o.minix, but by the time you are looking for migration from c.o.m should've been over.

Xose Vazquez Perez also said that Alan Cox "used to keep some archives at:" . Mitchell Blank Jr. also pointed out:

the linux-activists messages for that period would be in:

Specifically, the files "digest51[789] digest52[012]" are the neighborhood around july 25th. I did a quick grep for "errno" and a few strings from the errno.h file and didn't see anything relevant pop up, though.

Andries Brouwer and others found a relevant post from that era (July 1992), in which H. J. Lu apparently authored or collected some errno.h changes that might have been relevant. As Mitchell pointed out close by, "this is back when gcc and libc were distributed together, both by H J Lu." Mitchell also noticed that in H. J.'s announcement long ago, the file list of his patches included 0.96bp2inc.tar.Z, "the kernel header files for 0.96b patch level 2" . Mitchell said that the file "seems to be a modified version of the 0.96bp2 header files needed in order to work with the new gcc release (searching for that filename turns up a message discussing it a little) So I'm guessing that the July 25, 1992 errno.h in the linux tree is a merge from this code. Now, does anyone have a copy of "0.96bp2inc.tar.Z" lying around?" Linus replied:

Ok, this is the source.

In particular, I can re-create _exactly_ the linux-0.97 "errno.h" file by using the "sys_errlist[]" contents from "libc-2.2.2". In particular, this trivial loop will generate the exact (byte-for-byte) list that is in the kernel:

        int i;

        for (i = 1; i < 122; i++) {
                const char *name = names[i];
                int n = strlen(name);
                char *tabs = "\t\t"+(n > 7);
                const char *expl = libc222_errlist[i];
                printf("#define\t%s%s%2d\t/* %s */\n",
                        name, tabs, i, expl);
        }

here, the "names[]" array was filled in with the error names, ie

        const char *names[] = { "none",
        "EPERM", "ENOENT", "ESRCH", "EINTR", "EIO", "ENXIO", "E2BIG",

and the "libc222_errlist[]" array was filled in with the strings found by just downloading the old "libc-2.2.2" binary that can still be found at

and then just doing a "strings - libc-2.2.2" and "sys_errlist[]" will be obvious:

        static char *libc222_errlist[] = {
                "Unknown error",
                "Operation not permitted",

This was literally a five-minute hack (I wrote the silly loop yesterday to see what it does with the current "strerror()" - there is very good correlation even today, but using the libc-2.2.2 sys_errlist[] you get _exactly_ the same result).

So this is definitely the source of the kernel error header. It's either a file from the libc sources, or it is literally auto-generated like the above (I actually suspect the latter - now that I did the auto-generation it all felt very familiar, but that may just be my brain rationalizing things. Humans are good at rationalizing reality.).

Can anybody find the actual libc _sources_? Not the kernel headers that hjl mentions (those are the old ones from _before_ the change), but the file "libc-2.2.2.tar.Z"?

Anyway, we know where the kernel header comes from. Let's figure out where the libc data comes from.

Elsewhere, and still regarding 0.96bp2inc.tar.Z, Mitchell said:

BTW, a few more details on this file - the linux GCC 2.2.2 release was originally announced 28-Jun-1992. The 0.96bp2inc.tar.Z file originally lived on the then-primary linux ftp site in directory pub/Linux/GCC.

banjo stopped being an FTP server a couple months later - however, Jonathan Magid announced on 13-Aug-1992 that the entire banjo site was being reincarnated at host in directory ftp/pub/pc-stuff/Linux. Here's a copy of the announcement:

My understanding is that reggae.oit morphed at some point into (which is now, of course, Jonathan still appears to be there, so I'm cc:ing him on this (apologies in advance if its an intrusion, Jonathan) on the off-chance that there might still be a 1992-era archive of the linux files once hosted by banjo.

The only other person likely to have access to a copy is H J Lu himself (also cc:'ed although I'm 99% sure he's still on lkml :-)

Linus replied:

Note that we really don't care about that "0.96bp2inc.tar.Z" file: that's just the kernel headers, and 0.96b-pl2 did _not_ contain the comments yet. But libc used to use the kernel headers for other things (for things like system call numbers etc).

It's almost certainly the "libc-2.2.2.tar.Z" file that we want - that's the one that is going to contain the sys_errlist[] lists etc. Note how this libc-2.2.2 announcement predates the merging of the kernel header by almost a month - the kernel header information came from libc, not the other way around.

He went on:

Does anybody have old CD-ROM's lying around?

In particular, the Yggdrasil Linux/GNU/X alpha CD-ROM was apparently released just a few months later. It would quite possibly contain the libc-2.2.2 sources... Adam Richter is still active, and I added him to the cc..

Who else was doing CD's back then? SLS? If nobody has the thing on a web-site any more, maybe they exist in physical format on somebodys bookshelf? The only reason that the really historic kernel archives still exist is that people saved them, and even so we're missing versions 0.02 and 0.03, but by the latter half of -92 there were already CD-ROMs being manufactured...

Of course, maybe the CD's are unreadable by now.

Andries said, "I just uploaded a copy" [of libc-2.2.2.tar.gz] "to" Linus replied:

Yup, and I can confirm two things:

that, together with the timing, pretty much proves that the kernel header was indeed just auto-generated from sys_errlist[] of that timeframe, with a program very much like the one I already posted.

Now, the libc file just says

        /* This is a list of all known signal numbers.  */

(which is obviously just a cut-and-paste from siglist.c in the same directory). But it shouldn't much matter, since I don't think SCO really is going to try to claim copyright ownership of the result of standard C library interactions like using "sys_errlist[]".

(I take that back - _of_course_ they are going to try to claim ownership. After all, they already claimed ownership of code I provably wrote).

2. Some Discussion Of Process Load Balancing And Priority Handling

22 Dec 2003 - 2 Jan 2004 (31 posts) Archive Link: "[PATCH] 2.6.0 batch scheduling, HT aware"

Topics: Hyperthreading

People: Con KolivasNick PigginJun NakajimaBill Davidsen

Con Kolivas said:

I've done a resync and update of my batch scheduling that is also hyper-thread aware.

What is batch scheduling? Specifying a task as batch allows it to only use cpu time if there is idle time available, rather than having a proportion of the cpu time based on niceness.

Why do I need hyper-thread aware batch scheduling?

If you have a hyperthread (P4HT) processor and run it as two logical cpus you can have a very low priority task running that can consume 50% of your physical cpu's capacity no matter how high priority tasks you are running. For example if you use the distributed computing client setiathome you will effectively be running at half your cpu's speed even if you run setiathome at nice 20. Batch scheduling for normal cpus allows only idle time to be used for batch tasks, and for HT cpus only allows idle time when both logical cpus are idle.

This is not being pushed for mainline kernel inclusion, but the issue of how to prevent low priority tasks slowing down HT cpus needs to be considered for the mainline HT scheduler if it ever gets included. This patch provides a temporising measure for those with HT processors, and a demonstrative way to handle them in mainline.

Patch available here:

Nick Piggin replied:

I wonder how does Intel suggest we handle this problem? Batch scheduling aside, I wonder how to do any sort of priorities at all? I think POWER5 can do priorities in hardware, that is the only sane way I can think of doing it.

I think this patch is much too ugly to get into such an elegant scheduler. No fault to you Con because it's an ugly problem.

How about this: if a task is "delta" priority points below a task running on another sibling, move it to that sibling (so priorities via timeslice start working). I call it active unbalancing! I might be able to make it fit if there is interest. Other suggestions?

Jun Nakajima replied:

Today utilization of execution resources of a logical processor is around 60% as you can find in public papers, and it's dependent on the processor implementation and the workload. It could be higher in the future, and their relative priority could be much higher then. So I don't think it's a good idea to hard code such an implementation-specific factor into the generic scheduler code.

Regarding H/W-based priority, I'm not sure it's very useful especially because so many events happen inside the processor and a set of the execution resources required changes very rapidly at runtime, i.e. the H/W knows what it should do to run faster at runtime, and imposing priority on those logical processor could make them run slower.

Nick said his idea would not involve hard-coding implementation-specific stuff into the generic scheduler code; "The mechanism would be generic, but the parameters would be arch specific" . He added that he agreed with Jun, that a hardware-based priority system would not be entirely sufficient, if only because hardware that didn't provide the means for a hardware-based solution would still need software to do the same thing; so the software would need to be written regardless.

Elsewhere, Con replied to Nick's specific 'active unbalancing' idea, saying, "I discussed this with Ingo and that's the sort of thing we thought of. Perhaps a relative crossover of 10 dynamic priorities and an absolute crossover of 5 static priorities before things got queued together. This is really only required for the UP HT case." Nick disagreed, and over the course of a little back-and-forth, said that the multi-CPU situation was "the same problem. A nice -20 process can still lose 40-55% of its performance to a nice 19 process, a figure of 10% is probably too high and we'd really want it <= 5% like what happens with a single logical processor." And Con agreed.

Close by, Bill Davidsen also said:

There are two goals here. Not having a batch process on one sibling makes sense, and I'm going to try Con's patch after I try Nick's latest. Actually, if they play nicely I would use both, batch would be very useful for nightly report generation on servers.

But WRT the whole HT scheduling, it would seem that ideally you want to schedule the two (or N) processes which have the lowest aggregate cache thrash, if you had a way to determine that. I suspect that a process which had a small iterative inner loop with a code+data footprint of 2-3k would coexist well with almost anything else. Minimizing the FPU contention also would improve performance, no doubt. I don't know that there are the tools at the moment to get this information, but it seems as though until it's available any scheduling will be working in the dark to some extent.

Con said the patches would never play nicely, but they might be able to live together in some way. Regarding the existence of tools to get the information Bill wanted, Con added, "Impossible with current tools. Only userspace would have a chance of predicting this and the simple rule we work off is that userspace can't be trusted so this does not appear doable in the foreseeable future." Bill thought it would be hard to make much headway without a solution to this problem.

3. SysFS Class Patches To Ease Driver Support

23 Dec 2003 - 29 Dec 2003 (13 posts) Archive Link: "[PATCH] sysfs class patches - take 2 [0/5]"

Topics: FS: sysfs

People: Greg KHJeff Garzik

Greg KH said:

Here are the sysfs class patches reworked against a clean 2.6.0 tree. I've created a class_simple.c file that contains a "simple" class device interface. I've then converted the tty core to use this interface (the combo of these two patches makes for no extra code added).

Then there are 3 patches, adding class support for misc, mem, and vc class devices. As the interface to add simple class support for devices is now so low, I feel that we do need to have mem class support as to not special case any char device.

With these patches, it's now much easier for others to implement class support for remaining char drivers/subsystems that do not have it yet.

Jeff Garzik remarked, "Interesting... I bet that will be useful to the iPAQ folks (I've been wading through their patches lately), as they have created a couple ultra-simple classes for SoC devices and such." Greg KH replied, "I bet it will. I've ported my old frame buffer patch to use it, and it saved a lot of code."

4. udev 011 Released

24 Dec 2003 - 29 Dec 2003 (4 posts) Archive Link: "[ANNOUNCE] udev 011 release"

Topics: FS: devfs, FS: sysfs, Hot-Plugging, Klibc, Version Control

People: Greg KH

Greg KH announced:

I've released the 011 version of udev. It can be found at:

rpms built against Red Hat FC1 are available at: with the source rpm at:

udev is an implementation of devfs in userspace using sysfs and /sbin/hotplug. It requires a 2.6 kernel to run. Please see the udev FAQ for any questions about it:

The major changes since the 010 release are:

Thanks again to everyone who has sent me patches for this release; a full list of everyone, and their changes, is below.

udev development is done in a BitKeeper repository located at: bk://

Daily snapshots of this tree used to be found at: But that box seems to be down now. Hopefully it will be restored someday. If anyone ever wants a tarball of the current bk tree, just email me.

5. Increasing PAGE_SIZE For 2.7

26 Dec 2003 - 29 Dec 2003 (28 posts) Archive Link: "Page Colouring (was: 2.6.0 Huge pages not working as expected)"

Topics: Executable File Format, Forward Port, Virtual Memory

People: Linus TorvaldsWilliam Lee Irwin IIIMike FedykDaniel PhillipsRusty RussellHugh DickinsEric W. BiedermanAndrew Morton

In the course of discussion, Eric W. Biederman suggested increasing PAGE_SIZE in the kernel. Linus Torvalds replied:

Yes. This is something I actually want to do anyway for 2.7.x. Dan Phillips had some patches for this six months ago.

You have to be careful, since you have to be able to mmap "partial pages", which is what makes it less than trivial, but there are tons of reasons to want to do this, and cache coloring is actually very much a secondary concern.

William Lee Irwin III said, "I've not seen Dan Phillips' code for this. I've been hacking on something doing this since late last December." Mike Fedyk said, "I remember his work on pagetable sharing, but haven't heard anything about changing the page size from him. Could this be what Linus is remembering?" William said, "Doubtful. I suspect he may be referring to pgcl (sometimes called "subpages"), though Dan Phillips hasn't been involved in it. I guess we'll have to wait for Linus to respond to know for sure." And Linus said:

I didn't see the patch itself, but I spent some time talking to Daniel after your talk at the kernel summit. At least I _think_ it was him I was talking to - my memory for names and faces is basically zero.

Daniel claimed to have it working back then, and that it actually shrank the kernel source code. The basic approach is to just make PAGE_SIZE larger, and handle temporary needs for smaller subpages by just dynamically allocating "struct page" entries for them. The size reduction came from getting rid of the "struct buffer_head", because it ends up being just another "small page".

He asked Daniel Phillips for the details, and Daniel said:

Your description is accurate. Another reason for code size shrinkage is getting rid of the loops across buffers in the block IO library, e.g., block_read_full_page.

Subpages only make sense for file-backed memory, which conveniently lets the page cache keep track of subpages. Each address_space has pages of all the same size, which may be smaller, larger or the same as PAGE_CACHE_SIZE. The first case, "subpages", is the interesting one.

An address_space with subpages has base pages of PAGE_CACHE_SIZE for its "even" entries and up to N-1 dynamically allocated struct pages for the "odd" entries where N is PAGE_CACHE_SIZE divided by the subpage size. Base pages are normal members of mem_map. Subpages are not referenced by mem_map, but only by the page cache. They are created by operations such as find_or_create_page, which first creates a base page if necessary. A counter field in the page flags of the base page keeps track of how many subpages share a base page's physical memory; when this field goes to zero the base page may be removed from the page cache.

Subpages always have a ->virtual field regardless of whether mem_map pages do. This is used for virt_to_phys and to locate the base page when a subpage is freed.

Page fault handling doesn't change much if at all, since the faulting address is rounded down to a physical page, which will be a base page.

Most of the changes for subpages are in the buffer.c page cache operations and are largely transparent to the VMM, though PAGE_CACHE_SHIFT becomes mapping->page_shift, which touches a lot of files. As you noted, buffer_head functionality can be taken over by struct page and buffers become expendable. However it is not necessary to cross that bridge immediately; page buffer lists continue to work though the buffer list is never longer than one.

With a little more work, subpages can be used to shrink mem_map: implement a larger PAGE_CACHE_SIZE then use subpages to handle ABI problems. In this case faults on subpages are possible and the fault path probably needs to know something about it. With a larger-than-physical PAGE_CACHE_SIZE we can finally have large buffers, though the kernel would have to be compiled for it. Some more work to allowing mapping->page_shift to be larger than PAGE_CACHE_SIZE would complete the process of generalizing the page size. My impression is, this isn't too messy, most of the impact is on faulting. Bill and others are already familiar with this I think. The work should dovetail.

I took a stab at implementing subpages some time ago in 2.4 and got it mostly working but not quite bootable. I did find out roughly how invasive the patch is, which is: not very, unless I've overlooked something major. I'll get busy on a 2.6 prototype, and of course I'll listen attentively for reasons why this plan won't work.
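Daniel's base-page/subpage layout reduces to simple index arithmetic: every Nth page-cache entry is a base page backed by mem_map, and the N-1 entries after it are dynamically allocated struct pages sharing that base page's memory. The sketch below is an assumed illustration of that arithmetic only; the function names and the 16KB figure are hypothetical, not from Daniel's patch.

```c
/* Hypothetical sketch of the subpage index math Daniel describes:
 * base pages of PAGE_CACHE_SIZE sit at every Nth page-cache entry,
 * where N = PAGE_CACHE_SIZE / subpage size; the other N-1 entries
 * are dynamically allocated struct pages sharing the base page. */
#include <stdbool.h>

#define PAGE_CACHE_SIZE 16384UL   /* assumed enlarged page size */

/* page-cache entries per base page for a given subpage size */
static unsigned long subpages_per_base(unsigned long subpage_size)
{
        return PAGE_CACHE_SIZE / subpage_size;
}

/* is this index one of the "even" (base page) entries? */
static bool is_base_entry(unsigned long index, unsigned long subpage_size)
{
        return index % subpages_per_base(subpage_size) == 0;
}

/* which base entry does a subpage at this index share memory with?
 * (Daniel notes the real code finds it via the subpage's ->virtual
 * field; here we just round the index down.) */
static unsigned long base_entry_of(unsigned long index,
                                   unsigned long subpage_size)
{
        unsigned long n = subpages_per_base(subpage_size);
        return index - index % n;
}
```

With 16KB base pages and 4KB subpages, N is 4: indices 0, 4, 8, ... are base pages, and index 7, say, shares physical memory with base entry 4. The counter Daniel mentions in the base page's flags would track how many of those N slots are live.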

Close by, William said to Linus, "I did get a positive reaction from you at KS, and I've also been slaving away at keeping this thing current and improving it when I can for a year. Would you mind telling me what the Hell is going on here? I guess I already know I'm screwed beyond all hope of recovery, but I might as well get official confirmation." Linus replied:

I haven't even _looked_ at any 2.7.x timeframe patches, and I'm not even going to for the next few months.

I don't care what does it, I want a bigger PAGE_CACHE_SIZE, and working patches are the only thing that matters. But for now, I have my 2.6.x blinders on.

William still felt discouraged, though he supposed there was still some hope. Linus replied:

Well, I don't even know what your approach is - mind giving an overview?

My original plan (and you can see some of it in the fact that PAGE_CACHE_SIZE is separate from PAGE_SIZE), was to just have the page cache be able to use bigger pages than the "normal" pages, and the normal pages would continue to be the hardware page size.

However, especially with mem_map[] becoming something of a problem, and all the problems we'd have if PAGE_SIZE and PAGE_CACHE_SIZE were different, I suspect I'd just be happier with increasing PAGE_SIZE altogether (and PAGE_CACHE_SIZE with it), and then just teaching the VM mapping about "fractional pages".

What's your approach?

William replied:

I presented on this at KS. Basically, it's identical to Hugh Dickins' approach from 2000. The only difference is really that it had to be forward ported (or unfortunately in too many cases reimplemented) to mix with current code and features.

Basically, elevate PAGE_SIZE, introduce MMUPAGE_SIZE to be a nice macro representing the hardware pagesize, and the fault handling is done with some relatively localized complexity. Numerous s/PAGE_SIZE/MMUPAGE_SIZE/ bits are sprinkled around, along with a few more involved changes because a large number of distributed changes are required to handle oddities that occur when PAGE_SIZE changes from 4KB. The more involved changes are often for things such as the only reason it uses PAGE_SIZE is really that it just expects 4KB and says PAGE_SIZE, or that it wants some fixed (even across compiles) size and needs updating for more general PAGE_SIZE numbers, or sometimes that it expects PAGE_SIZE to be what a pte maps when this is now represented by MMUPAGE_SIZE. I have a bad feeling the diligence of the original code audit could be bearing against me (and though I'm trying to be equally diligent, I'm not hugh).

The fact merely elevating PAGE_SIZE breaks numerous things makes me rather suspicious of claims that minimalistic patches can do likewise.

The only new infrastructures introduced are the MMUPAGE_SIZE and a couple of related macros (defining numbers, not structures or code) and the fault handler implementations. The diff size is not small. The memory footprint is, and demonstrably so (c.f. March 27 2003).

My 2.6 code has been heavily leveraging the pfn abstraction in its favor to represent physical addresses measured in units of the hardware pagesize. Generally, my maintenance approach has been incrementally advancing the state of the thing while keeping it working on as broad a cross section of i386 systems as I can test or get testers on. It has been verified to run userspace on Thinkpad T21's and 16x/32GB and 32x/64GB NUMA-Q's at every point release it's been ported to, which since 2.5.68 or so has been every point release coming out of
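The macro split William describes can be shown in miniature. This is an assumed illustration of the relationship only: the shift values and the `PAGE_MMUSHIFT` name are hypothetical choices for the sketch, not necessarily what his patch uses.

```c
/* Hypothetical sketch of the PAGE_SIZE / MMUPAGE_SIZE split:
 * MMUPAGE_SIZE stays the fixed hardware page size (what a pte maps),
 * while PAGE_SIZE becomes a larger software unit built on top of it.
 * The shift values here are illustrative. */

#define MMUPAGE_SHIFT   12                      /* 4KB hardware pages on i386 */
#define MMUPAGE_SIZE    (1UL << MMUPAGE_SHIFT)

#define PAGE_MMUSHIFT   2                       /* 4 hardware pages per PAGE */
#define PAGE_SHIFT      (MMUPAGE_SHIFT + PAGE_MMUSHIFT)
#define PAGE_SIZE       (1UL << PAGE_SHIFT)     /* 16KB software pages */

/* One struct page now covers PAGE_SIZE / MMUPAGE_SIZE hardware pages,
 * which is how mem_map shrinks by the same factor. */
```

Under this split, the sweep William mentions is deciding, for each existing use of PAGE_SIZE, whether the code really means the hardware unit (and must become MMUPAGE_SIZE) or the software unit (and may stay PAGE_SIZE).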

Rusty Russell asked, "Can you give an example? One approach is to simply present a larger page size to userspace w/ getpagesize(). This does break ELF programs which have been laid out assuming the old page size (presumably they try to mprotect the read-only sections). On PPC, the ELF ABI already insists on a 64k boundary between such sections, and maybe for others you could simply round appropriately and pray, or do fine-grained protections (ie. on real pagesize) for that one case." And William replied:

Apps must, of course, be relinked for that, but that's userspace. This ABI change is largely out of the picture due to legacy binaries, user virtualspace fragmentation (most likely an issue for 32-bit threading), and so on. The choice of PAGE_SIZE in such schemes is also restricted to no larger than whatever choice used for userspace linking, which is a relatively ugly dependency. There's also a question of "smooth transition": the only way to "incrementally deploy" it on a mixture "ready" userspace and "unready" userspace is to turn it off. I suppose it has the minor advantage of being trivial to program.

I had in mind pure kernel internal issues, not ABI.

The issues from raising PAGE_SIZE alone are things like interpreting hardware descriptions in arch code, some shifts underflowing for things like hashtables, certain drivers doing ioremap() and the like either filling up vmallocspace or getting their math wrong, and some other drivers doing calculations on physical addresses getting them wrong, or using PAGE_SIZE to represent some 4KB or other fixed-size memory area interpreted by hardware, and filesystems that assume blocksize == PAGE_SIZE or assume PAGE_SIZE is less than some particular value (e.g. short offsets into pages, worst of all being signed shorts), and tripping BUG()'s in ll_rw_blk.c when 512*q->max_sectors < PAGE_SIZE.

These issues are the bulk of the work needing to be done for the driver and fs sweeps. Actual concerns about MMUPAGE_SIZE in drivers/ and fs/ are rather limited in scope, though drivers/char/drm/ was somewhat painful to get going (Zwane actually did most of this for me, as I have no DRM/DRI -capable graphics cards at my disposal).

Mike asked if there were some way for William to split his patch into smaller chunks, and merge into Andrew Morton's -mm tree; and William said:

I talked about this for a little while. Basically, there is only one concept in the entire patch, despite its large size. The vast bulk of the "distributed changes" are s/PAGE_SIZE/MMUPAGE_SIZE/.

At some point I was told to keep the whole shebang rolling out of tree or otherwise not answered by akpm and/or Linus, after I sent in a split-up (this is actually very easy to split up file-by-file) version of what just some of the totally trivial arch/i386/ changes would look like. The nontrivial changes are stupid in nature, but touch "fragile" or otherwise "scary to touch" code, and so sort of relegate them to 2.7. This is not entirely unjustified, as changes of a similar code impact wrt. the GDT appear to have affected some APM systems' suspend ability (I know for a fact my changes do not have impacts on APM suspend, but other, analogous support issues could arise after broader testing.)

Basically, the MMUPAGE_SIZE introductions didn't interest anyone a while ago, and I suspect people probably just want them all at once, since it's unlikely people want to repeat the pain analogous to PAGE_CACHE_SIZE (I should clarify later how this is different) where the incremental introduction never culminated in the introduction of functionality.

6. Adaptec/DPT I2O Gone In 2.6; Replacement Sought

26 Dec 2003 - 2 Jan 2004 (15 posts) Archive Link: "Adaptec/DPT I2O Option Omitted From Linux 2.6.0 Kernel Configuration Tool"

Topics: I2O, Ioctls, PCI

People: Samuel Flory, Go Taniguchi

Leon Toh asked whatever became of the Adaptec/DPT I2O option in the 2.6 kernel. It was gone. Samuel Flory replied, "The DPT I2O driver was never converted to the new driver model. The driver from what I can see is a mess. It doesn't even compile in 2.4 for a number of archs like amd64. A while back a bunch of people (including myself) raised the concern through various channels with adaptec. In theory someone at adaptec is working on it, but there was not an ETA." Leon started tinkering, trying to get the thing working again under 2.6; and Samuel said, "You might want to hold off on doing a lot of work for a bit. I think there was a beta driver that was being passed around." He found it in his own archives, and said he'd try to find out the current status. Go Taniguchi replied:

This is my patch.

Worked fine for me on quad xeon with 4G mem and 64bit PCI. It include.

However, it may differ from the Adaptec policy (linux-scsi ML).

7. Some RAID Recommendations

28 Dec 2003 - 30 Dec 2003 (18 posts) Subject: "Best Low-cost IDE RAID Solution For 2.6.x? (OT?)"

Topics: Disk Arrays: MD, Disk Arrays: RAID, Disks: IDE, Disks: SCSI, I2O, PCI, Serial ATA

People: Johannes Ruscheinski, Joel Jaeggli, Johannes, Bert Hubert, Arjan van de Ven, H. Peter Anvin, Wakko Warner, Samuel Flory, Tomas Szepe

Johannes Ruscheinski said, "We're looking for a low-cost high-reliability IDE RAID solution that works well with the 2.6.x series of kernels. We have about 1 TB (8 disks) that we'd like to access in a non-redundant raid mode. Yes, I know, that lack of redundancy and high reliability are contradictory. Let's just say that currently we lack the funding to do anything else but we may be able to obtain more funding for our disk storage needs in the near future." Joel Jaeggli replied:

well if you currently have 1tb in 8 non-redundant drives then you're using 160GB disks... no?

the biggest p-ata disks right now are ~320GB so you can do a ~1TB software raid 5 stripe on a single 4 port ata controller such as a promise tx4000 using regular software raid rather than the promise raid. that would end up being fairly inexpensive and buy you more protection.

linux software raid has been as reliable as anything else we've used over the years; the lack of reliability in your situation will come entirely from failing disks: lose one and your filesystem is toast.

Johannes asked, "why not use the hardware raid capability of the Promise tx4000? and if we'd use software raid instead, what would be the CPU overhead?" Bert Hubert said:

For the cost differential between linux native RAID and an external device of similar capabilities, outfit yourself with an additional CPU. I don't use RAID5 a lot but to a modern CPU, checksumming dozens of megabytes/second is child's play:

raid5: measuring checksumming speed
   8regs     :  1479.600 MB/sec
   32regs    :   744.400 MB/sec
   pIII_sse  :  1649.200 MB/sec
   pII_mmx   :  1806.000 MB/sec
   p5_mmx    :  1915.200 MB/sec
raid5: using function: pIII_sse (1649.200 MB/sec)

This is on a 800MHz Celeron, so a recent >2Ghz system will do lots better still.

Arjan van de Ven also offered Johannes a word of advice, saying, "be careful, almost all ata raid controllers out there are *software raid* hidden in a binary only driver. Also generally the on-disk format of these is quite unfortunate, resulting in slower access than linux software raid can do..." H. Peter Anvin added, "Not to mention, well, *proprietary*. Consider this: with Linux swraid, you don't have to worry about your manufacturer discontinuing your product or going out of business; as long as you can connect your disks to a CPU using any kind of controller you can recover your data. If a proprietary RAID controller croaks, and you can't get another one of the same brand/model, you might have no more data..." Wakko Warner also said:

Speaking of which, since most DIE^WIDE RAID controllers are really software driven, it may be possible to get the sw module to read the disks off of any controller

My machine at work has an onboard promise raid controller but I run linux sw raid0 on 2 of the disks (4 total, other 2 are not in a raid)

I'm not sure about real hardware raid. Like the Mylex or Adaptec (SCSI)

Elsewhere, Samuel Flory also replied to Johannes' initial inquiry, saying:

It really depends on what you mean by low cost. The only ide raid controller that does 8 PATA drives well under linux is the 3ware controller. For SATA drives you have the 3ware and adaptec controllers.

In theory the highpoint 8 port sata card would be a good candidate for software raid, but highpoint has yet to cough up an open source driver.

If you want to go the software raid route and have 2 spare pci slots, you can go with either the high point rocket raid 454 (PATA) or the promise SATA150 TX4.

I really don't recommend any of promise's cards that use the i2o driver, or any sort of binary only driver.

PS- Why not at least run software raid 5? It takes far less cpu than you'd think, and can save your ass.

Tomas Szepe said, "Absolutely. With eight low-cost IDE disks, you'd be nuts to go raid0 or linear." Johannes replied that he'd probably go with raid5 and the Promise tx4000 card Joel had recommended, adding, "It looks like I'll have the funding to buy another box and another 1 TiB of disk space."

Samuel offered some additional advice, saying, "Be sure to run badblocks on all the disks before creating your array. Software raid isn't as nice about bad sectors as most hardware raid controllers. On the other hand the md driver kicks the ass of nearly every raid controller I've tried." Tomas mentioned that bad-sector-niceness was only worse with initial bad sectors, not necessarily sectors that went bad during the course of operation.
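Samuel's advice can be sketched as a couple of commands (a sketch only, assuming mdadm is installed; the /dev/hd* device names are hypothetical placeholders, and mdadm --create destroys any data on the named disks):

```shell
# Non-destructive read test for existing bad sectors on each disk.
for d in /dev/hde /dev/hdf /dev/hdg /dev/hdh; do
    badblocks -sv "$d"
done

# If the disks come back clean, build the software RAID5 array with mdadm.
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
    /dev/hde /dev/hdf /dev/hdg /dev/hdh

# Watch the initial resync, then make a filesystem on the array.
cat /proc/mdstat
mke2fs -j /dev/md0
```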

A few posts down the road, Wakko remarked, "One thing that keeps me from using the linux raid sw is the fact it can't be partitioned. I thought about lvm/evms, but I'm unwilling to make an initrd to set it up (mounting root). Unfortunately boot loaders don't seem to support anything other than raid1. (Mostly lilo, but I'm not sure grub would do this either)" Samuel replied, "You're thinking of it the wrong way. You just create a bunch of partitions and make them into raid devices. You shouldn't be using the entire disk or you will break autodetection." And added, "Lilo deals well with raid 1 devices. I typically create a small raid 1 mirror as /boot. Just be sure to install your bootloader on to all drives. Newer versions of lilo will do the right thing if told to use /dev/mdwhatever."
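Samuel's partition-based scheme might look like this (a sketch; device names are hypothetical, and each partition is assumed to have been given type fd, "Linux raid autodetect", in fdisk):

```shell
# Small first partitions mirrored as /boot (RAID1, so lilo can read it):
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda1 /dev/hdc1

# Large second partitions combined into the data array:
mdadm --create /dev/md1 --level=5 --raid-devices=4 \
    /dev/hda2 /dev/hdc2 /dev/hde2 /dev/hdg2
```

In /etc/lilo.conf, boot=/dev/md0 together with raid-extra-boot (in newer lilo versions) can then install the boot loader on every drive in the mirror, matching Samuel's "install your bootloader on to all drives" advice.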

8. Status Of ATARAID In 2.6

28 Dec 2003 - 29 Dec 2003 (11 posts) Archive Link: "ataraid in 2.6.?"

Topics: Device Mapper, Disk Arrays: LVM, Disk Arrays: RAID, FS: initramfs, FS: ramfs

People: Nicklas Bondesson, Arjan van de Ven, Christophe Saout

Nicklas Bondesson asked, "Is the ataraid framework planned to be ported to 2.6.x? If so, when could one expect it?" Arjan van de Ven replied, "the plan is to have a userspace device mapper app take its place. As for the timeframe; I'm looking at it but the userspace device mapper code is still a bit of a mystery to me right now." A few posts down the road, Nicklas replied, "How do you set this (device mapping) up using the 2.6 kernel? I like the ease of using ataraid in 2.4.x. Why not have both alternatives as options (both ataraid and devicemapper)?" Christophe Saout replied:

I think the reason is to avoid unnecessary code duplication. device-mapper provides a generic method to do such things. Also the developers are heading towards removing code from the kernel that can be done in userspace. There are plans to remove partition detection from the kernel in 2.7 and move the detection and setup code to a userspace program (using device-mapper) which can be placed in the initramfs (so that the user won't notice any difference). Ataraid detection and setup could also be placed there later.

If someone writes an ataraid detection and setup program in userspace it could be placed on an initrd.

You can find the dmsetup tool in Sistina's device-mapper package.

Arjan agreed that code duplication was the big reason not to have both options; he added, "The outcome is to be a /sbin/ataraid binary or some such that will do all the magic to detect the raid and tell the kernel device mapper to set it all up."

Nicklas then said, "I'm planning to go with the device mapping in 2.6.0 to setup RAID1 using my Promise TX2000 card. How do I know which device name it will choose? I have looked in devices.txt but there is neither an ide-raid nor a specific Promise device name mapping." And Christophe replied, "If you cannot wait for Arjan or anybody else to write an ataraid setup tool, you can go with dmsetup and choose any name you want (see the dmsetup man page). The only restriction is that dmsetup creates devices under /dev/mapper; you can though use symlinks to it like LVM2 does if you need it under /dev/ataraid or something (in your initscript). If you don't have a better idea you can just give it the same name as before."
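A two-disk RAID1 mapping via dmsetup might be set up roughly like this (heavily hedged: the table line uses the device-mapper "mirror" target, whose exact argument syntax should be checked against your device-mapper version's documentation; the device names, size, and mapping name are hypothetical):

```shell
# Table format: <start> <length-in-512-byte-sectors> mirror <log args> <#devs> <dev> <offset>...
echo "0 40000000 mirror core 2 1024 nosync 2 /dev/hde 0 /dev/hdg 0" | \
    dmsetup create pdcraid0

# The node appears as /dev/mapper/pdcraid0; symlink it if you want an
# old-style name, as LVM2 does for its volumes:
mkdir -p /dev/ataraid
ln -s /dev/mapper/pdcraid0 /dev/ataraid/d0
```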

9. Problems With Accusys Drives

28 Dec 2003 - 29 Dec 2003 (8 posts) Subject: "ide: "lost interrupt" with 2.6.0"

Topics: Disks: IDE

People: Andre Hedrick, Mike Fedyk, Andrew Miller

Andrew Miller had some problems with his Accusys ACS7500 ATA disk drive (firmware C5VL), and Andre Hedrick said:

Accusys is a stolen technology from Dupli-Disk; there are lawsuits pending, the last time I heard.

If it was based on the original firmware of the Dupli-Disk, it has totally wrong state machines for executing hdparm style callers.

I know the FSM's are wrong because I fixed them for Dupli-Disk. How they operate, I can not disclose. But Accusys can not handle correct settings for FSM to Taskfile.

Mike Fedyk asked, "Does that mean that if you use taskfile on Dupli-Disk controllers that they will fail, and that disabling taskfile access might help (is that still an option in 2.6?)" And Andre replied:

Dupli-Disk requires taskfile access and will correctly operate.

Accusys is total CRAP, unless they licensed the technology proper.

Elsewhere, Andrew said he contacted Accusys, and that they had "provided me with a firmware upgrade tool and the firmware version CDVL(as opposed to C5VL). It seems they fixed the problem with their firmware in the latest version. I'm not sure whether or not it is worth fixing the 2.6.x kernels to support the broken firmware, as they seem to offer the upgrade to all customers anyway."

10. Status Of 2.6 Maintainership

29 Dec 2003 - 30 Dec 2003 (27 posts) Archive Link: "2.6.0-mm2"

Topics: Disks: SCSI, Kernel Release Announcement

People: Andrew Morton, Stef van der Made, Mike Fedyk, Linus Torvalds

Andrew Morton announced 2.6.0-mm2:

Stef van der Made replied, "Is it possible to use the old schema of pre1, pre2, and so on releases, so that we can use the incremental patch sets again?" Someone explained that Andrew did not intend to use the -mm tree as the official sources. Mike Fedyk added:

I think Linus will be releasing the 2.6-pre kernels, and things will continue like that until 2.7 opens up.

I think Andrew is trying to get all of the after 2.6.0 fixes in -mm and tested before syncing up with Linus.

And Linus Torvalds said:

Indeed. The fact is, we do need a "testing ground" for some experimental fixes to stuff that needs to be fixed, and that's one of the things the -mm kernels used to do for 2.5.x.

Since everybody was pretty comfy with that setup, we'll just continue that way. By the time 2.7.x opens up, things should have calmed down, and commercial vendors have their support trees in shape etc, but for now the -mm tree is a testing ground, and I'll make -pre trees that should be fairly stable, and then we'll do the 2.6.1 etc "real releases" based off them.

11. Patch Submission Policies

29 Dec 2003 - 1 Jan 2004 (13 posts) Subject: "[CFT/PATCH] give sound/oss/trident a holiday cleanup for 2.6"

People: Linus Torvalds, Muli Ben-Yehuda, Jeff Garzik

Muli Ben-Yehuda posted a patch, and Linus Torvalds said:

When doing things like this, can you split up the patches into two separate things: one that _only_ does whitespace changes, and that is guaranteed not to change anything else, and another that does the rest.

It's a total b*tch to try to figure out which change resulted in some difference, if the changes are intermixed with large whitespace cleanups.
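For what it's worth, whether a change really is whitespace-only can be checked mechanically with diff's whitespace-ignoring mode (a sketch; the file names and contents are made up for illustration):

```shell
# Two versions of a file that differ only in whitespace.
printf 'int main()  {\n\treturn 0;  }\n' > trident.c.orig
printf 'int main() {\n\treturn 0; }\n'   > trident.c.new

# A normal diff shows the whitespace churn (exit status 1 means "differs")...
diff -u trident.c.orig trident.c.new || true

# ...but with -w (ignore all whitespace) a pure-cleanup change produces no
# hunks at all, so anything left in semantic.diff is a real semantic change.
if diff -uw trident.c.orig trident.c.new > semantic.diff; then
    echo "whitespace-only change"
else
    echo "contains semantic changes"
fi
```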

Muli replied, "You're 100% right. Internally, the patch I sent is composed of 30 different patches. The reason I didn't separate it into two patches is that the changes were interleaved and inter-dependent, and separating them was a b*tch." Jeff Garzik said, "Thirty separate patches is OK. We have scripts to handle "patchbombs"." And Linus said:

Yes and no.

Thirty separate patches make sense if they are independent and really do conceptually different things. Then it makes sense to have them as separate checkins, and be able to tell people "ok, try undoing that one, maybe that's the problem".

However, if they are all just "fix silly bugs in xxx", then I'd much rather see it as one big patch. Having it split up into "fix bug on line 50" and "fix bug on line 75" just doesn't make any sense - it only makes the patch history harder to follow.

So "many small patches" aren't automatically any better than one big one. The thing that matters is to keep things _conceptually_ separated. If one patch fixes whitespace, and another one fixes bugs, then that's good.

Jeff replied, "There's certainly a middle ground. For drivers I generally request that bug fixes for separate bugs be split up, since inevitably one bug fix out of twenty breaks for somebody on that somebody's weird hardware."

12. BitKeeper Usage Policy And Advice

29 Dec 2003 - 31 Dec 2003 (16 posts) Archive Link: "[PATCH] 2.6.0 - Watchdog patches"

Topics: Disks: SCSI, Version Control

People: Linus Torvalds, Jeff Garzik, Paul Jackson, Matthias Andree, Ed Tomlinson, Andy Isaacson, Wim Van Sebroeck, Andrew Morton

Wim Van Sebroeck wanted Linus Torvalds and Andrew Morton to pull some changes from his BitKeeper tree, but Linus said:

This tree has 38 deltas, all just merges.

The end result is a horribly messy revision tree, for a few one-liners.

I'm going to take the patch as a patch instead, and hope that you'll throw your BK tree away.

Please don't follow the release tree in your development trees, it makes it impossible to see how the revision history happened.

Jeff Garzik added:

Agreed. Several BK developers do this, forgetting that one of things that makes BK so useful is its merge technology.

I recommend (assuming no patches outstanding),

Pulling the latest, just to be up-to-date, just obfuscates things and needlessly increases the size of the master ChangeSet file.

Paul Jackson said:

Another possibility I like is to recreate my changes (what few so far ...) against a clean bk tree, before sending. Hide all my internal iterations and changes from others.

I will pull frequently and liberally into the bk clones that I use to track 2.6, 2.6-mm and whatever else I am based on. These in turn I pull into my main working bk tree, along with pulling in the various changes I have in progress, each from their own bk clone.

Then when it comes time to send out a patch, I:

  1. Generate an old fashioned patch (bk export -tpatch), containing just the revisions relevant to what I will send.
  2. Clone a fresh bk tree that is closest to whatever the recipient of my patch would like to work with.
  3. Apply the patch to the fresh clone, generating a clean history of one change for just that patch.
  4. Double-check that that builds and boots.
  5. Then send that change out, usually by exporting it as a -second- old fashioned patch, since for reasons not relevant here, I end up sending patches, not bk pulls, downstream.

The objective being:

My final "published work" is that patch - it should be as clean as practical.

By going into and back out of old fashioned patches, I isolate the anal history that bk kept of all my interim changes from the rest of the world.
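Paul's five steps might look roughly like this as bk commands (a sketch only; the revision range, paths, and repository URL are hypothetical, and the exact bk options should be checked against the BitKeeper documentation of the era):

```shell
# 1. Export just my changes from the messy work tree as a plain patch.
cd ~/bk/work
bk export -tpatch -r<first>..<last> > /tmp/mychange.patch   # hypothetical revs

# 2. Clone a fresh tree matching what the recipient works against.
bk clone bk://linux.bkbits.net/linux-2.6 /tmp/clean          # hypothetical URL

# 3. Apply the patch there, yielding a clean one-changeset history.
cd /tmp/clean
bk import -tpatch /tmp/mychange.patch .

# 4. Build and boot-test the clean tree, then
# 5. export the final patch to send downstream.
bk export -tpatch -r+ > /tmp/final.patch
```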

Elsewhere, Wim explained that his messy BK tree was the result of postponing those patches until 2.6.0 had come out. Linus replied:

Yeah, I know. It's one of the downsides of having anal revision control, and BK is more anal than most.

I do end up taking patches that have this syndrome if it looks like the pain of not taking the messy revision history is larger than the pain of just fixing it. Sometimes it's hard to avoid.

But most of the time the proper thing to do is to just not merge unnecessarily - if something is pending for a while, Bk does the merge correctly anyway, so you can just leave it pending and have me pull from an old tree (after you have verified in your own tree that the pull will succeed and do the right thing).

That way it ends up being trivial to see where/when the changes happened.

Matthias Andree asked:

Not being very used to BK, does that mean I have several trees around:

  1. the official release tree
  2. an "old tree" with my local change that I'm forwarding
  3. a temporary test tree to see if the merge would succeed, which I'll get by cloning (1) and then pulling from (2)?

Well, talk about FAAAAAAST drives (10,025/min SCSI kind) unless you have time to waste on all those BK consistency checks (which are, of course, what #3 is all about).

Or am I missing some obvious short cut?

To the idea of keeping several trees around, Linus replied:

The answer to that question is always a resounding "yes".

BK really thrives on having several trees around. You don't _have_ to have them, but basically the default rule should be: use separate trees for everything you can think of that isn't directly dependent on stuff in another tree.

That does _not_ mean that you should necessarily create "temporary trees". I actually do that a lot of the time, because I tend to create a totally new clone when I start applying long series of patches or do anything even half-way strange: it's just a lot easier to throw failures away, than it is to try to sort it out later.

But most people probably do _not_ want to have that kind of "temporary tree" mentality in general. People should realize that it's ok, and in particular that if you're doing something experimental it's fine to just create a new tree and later on decide that it was crap and just do a full "rm -rf" on it, and realize that the only thing you lost was some time.

But to Matthias' temporary tree described in item 3 of his list, Linus also said, "The tree doesn't really need to be temporary per se. It can be your "work tree" - the tree where you merge all the different sources of BK input. I realize that a lot of people only really have two sources of input (the standard tree and their own development tree), but if you get that concept early, you'll find it trivial to merge in other peoples trees into your "work tree", and keep track of many different development trees at the same time and just let BK do the merging for you." And regarding any available short-cuts, Linus said:

Basically, the obvious shortcut is to keep your work-tree around, so you don't have to clone and re-pull it all the time.

After a while, your work-tree is really messy (especially if you pull from multiple different development trees), but the point should be that no actual development gets done there, so you don't care: you can always just flush it entirely, and re-create it anew.

But you don't have to flush and re-create it _all_ the time. That would just be wasteful. Although if you have the hardware, it isn't that painful..

Elsewhere, Ed Tomlinson asked if it were possible to tell BK to defer its consistency checks, since they typically took 15 to 20 minutes each time. Andy Isaacson was surprised by this, saying the checks shouldn't take more than about 30 seconds, depending on the hardware. Eric D. Mudama reported 2 to 3 minute consistency checks. Andy turned around and agreed that perhaps 30 seconds was too optimistic for all machines, and that RAM size would have a powerful influence on consistency check times. Ed reported his box as "an old K6-III 400 with 512MB with UDMA2 harddrives."

13. Status Of BitKeeper Snapshots For 2.6

29 Dec 2003 (1 post) Archive Link: "Daily 2.6.x BK snapshots going again"

Topics: Version Control

People: Jeff Garzik

Jeff Garzik said:

Updated my snapshot script for 2.6.0 release, and snapshots are once again flowing into

The snapshots for the 2.6.0-testX series were moved to

14. USB Updates For 2.6

29 Dec 2003 (1 post) Archive Link: "[BK PATCH] USB patches for 2.6.0"

Topics: USB, Version Control

People: Greg KH

Greg KH said:

Here are some USB patches for 2.6.0. There are a number of different fixes and a few new drivers added. Some of the highlights are:

Please pull from: bk://

Patches will be posted to linux-usb-devel as a follow-up thread for those who want to see them.

15. Experimental Net Driver Updates For 2.6

29 Dec 2003 - 30 Dec 2003 (2 posts) Archive Link: "[BK PATCHES] 2.6.x experimental net driver updates"

Topics: Networking, Version Control

People: Jeff Garzik

Jeff Garzik said:

Summary of new changes:

Summary of patchkit:

Patch: (NOTE: _requires_ 2.6.0-bk2 snapshot, or IOW Linus-latest from BK, in order to apply successfully)

Full changelog:

BK repo:


16. SCSI Updates

30 Dec 2003 (1 post) Archive Link: "[BK PATCH] SCSI updates"

Topics: Disks: SCSI

People: James Bottomley

James Bottomley said, "This represents the driver updates and other fairly stable changes that have been floating around in the SCSI trees for a while. The only controversial element is updating the aic7xxx/79xx drivers. I've been receiving reports that the 1.3.9 version of the aic79xx was non-functional in 2.6, so I updated it to 1.3.11 based on Justin's tree (after confirming with the reporters that this fixed their problems)."

17. Prototype For BitKeeper 'Undo Changeset' Improvement

30 Dec 2003 (1 post) Archive Link: "[BK] cset -x improvement (prototype)"

Topics: Version Control

People: Larry McVoy

Larry McVoy said:

Some of you use the cset -x feature in BK (which is a way of undoing the effects of a particular changeset).

The documented behaviour of cset -x is that it only undoes content changes, not renames, creates, permissions, etc. This interface is unique in that respect, all the other interfaces in BK operate on all attributes, not just contents.

I've prototyped a version of the interface which works on contents, names, and permissions, and in addition also uses the BK merge tools to merge any conflicting changes as a result of the undo of the changeset. This is JUST A PROTOTYPE and has a lot of limitations but I'd like some feedback on whether this is a good direction and we should productize this or if this is not helpful to you.

If you use cset -x and want a better version of that interface, drop me an email and I'll send you the shell script (please make sure to send mail to me directly, not to the lists, because (a) they don't need to see all the noise and (b) I'm no longer subscribed to the Linux kernel mailing list).

18. Status Of

30 Dec 2003 (2 posts) Archive Link: " is up"

Topics: Version Control

People: Larry McVoy, Nigel Cunningham

Larry McVoy said that was up, adding, "If you don't have an account on this and you want one (it's mostly for BK users but has sort of turned into a public machine for any of the kernel developers, DaveM describes it as "a friendly place") send me or davem a ssh2 key and we'll set you up." Nigel Cunningham asked what the benefit of an account would be, but there was no reply.

19. Linux 2.6.1-rc1 Released; Unanswered Questions About 2.6 Release Policy

31 Dec 2003 - 2 Jan 2004 (22 posts) Archive Link: "2.6.1-rc1"

Topics: Disks: SCSI, I2C, Kernel Release Announcement, USB

People: Linus Torvalds, Mike Fedyk

Linus Torvalds announced 2.6.1-rc1, saying:

Ok, I've merged a lot of pending patches into 2.6.1-rc1, and will now calm down for a while again, to make sure that the final 2.6.1 is ok.

Most of the updates is for stuff that has been in -mm for a long while and is stable, along with driver updates (SCSI, network, i2c and USB).

Mike Fedyk pointed out, "Well there goes the -pre series. ;)" And asked, "Are we going to have 2.6.1-rc1-mm1? :-D" , but there was no reply.

20. Linux 2.4.24-pre3 Released

31 Dec 2003 (4 posts) Archive Link: "Linux 2.4.24-pre3"

Topics: Disk Arrays: LVM, I2C

People: Marcelo Tosatti

Marcelo Tosatti released 2.4.24-pre3, saying, "It contains a PPC32/SPARC update, some i2c cleanups, LVM update, network update, a new WAN driver, amongst others."

21. udev 012 Released

31 Dec 2003 (1 post) Archive Link: "[ANNOUNCE] udev 012 release"

Topics: FS: devfs, FS: sysfs, Hot-Plugging, Version Control

People: Greg KH

Greg KH announced:

I've released the 012 version of udev. It can be found at:

rpms built against Red Hat FC1 are available at:

with the source rpm at:

udev allows users to have a dynamic /dev and provides the ability to have persistent device names. It uses sysfs and /sbin/hotplug and runs entirely in userspace. It requires a 2.6 kernel with CONFIG_HOTPLUG enabled to run. Please see the udev FAQ for any questions about it:

For any udev vs devfs questions anyone might have, please see:

The major changes since the 011 release are:

Thanks again to everyone who has sent me patches for this release; a full list of everyone, and their changes, is below.

udev development is done in a BitKeeper repository located at:


Daily snapshots of this tree used to be found at:

But that box seems to be down now. Hopefully it will be restored someday. If anyone ever wants a tarball of the current bk tree, just email me.

22. Trouble With udev Removable Media Handling

1 Jan 2004 - 3 Jan 2004 (8 posts) Archive Link: "removable media revalidation - udev vs. devfs or static /dev"

Topics: FS: devfs, Hot-Plugging

People: Andrey BorzenkovGreg KH

Andrey Borzenkov reported:

udev names are created when the kernel detects the corresponding device. Unfortunately, for removable media the kernel rescans for partitions only when I try to access the device. Meaning: because the kernel does not know the partition table, it did not send a hotplug event, so udev did not create device nodes. But without device nodes I have no way to access the device in Unix :(

specifically I have now my Jaz and I have no (reasonable) way to access partition 4 assuming device nodes are managed by udev.

devfs solved this problem by

static /dev simply has all nodes available and does not suffer from this problem at all.

unfortunately there are no lookup events in the case of udev ... meaning at this moment the user must manually rescan partitions after inserting new media. I do not see any way to solve this problem at all given the current implementation. The closest is to blindly create nodes for all partitions as soon as the block device is available.

Greg KH said, "Doesn't the kernel always create the main block device for this device? If so, udev will catch that. If not, there's no way udev will work for this kind of device, sorry. You could make a script that just creates the device node in /tmp, runs dd on it, and then cleans it all up to force partition scanning." Andrey agreed that the kernel did make the main block device /dev/sda, but he didn't see how this was helpful, since he needed /dev/sda4 specifically. He could write a script to do what Greg suggested, but he could see no way to cause that script to execute at the proper time. He said, "There is no event when you just insert Jaz disk; nor is there any way to trigger revalidation on access to non-existing device like is the case without udev." But Greg pointed out that udev "does provide that mechanism. See the CALLOUT rule. It can run any program or script when a new device is seen by the kernel."
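A CALLOUT rule of that sort might look like this in udev 012's rule format (a sketch; the helper script /sbin/jaz-rescan and the matching keys are hypothetical and should be checked against the udev.rules documentation shipped with udev):

```
# /etc/udev/udev.rules
# When the kernel announces the Jaz's whole-disk device, run a helper that
# dd's a few blocks from it to force a partition rescan, then name it "jaz":
CALLOUT, BUS="scsi", PROGRAM="/sbin/jaz-rescan %k", ID="jaz*", NAME="jaz"
```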

23. Fair Scheduling In 2.4

2 Jan 2004 (2 posts) Archive Link: "Fair scheduling in 2.4 ?"

People: Robert Mena, Rik van Riel

Robert Mena asked:

Back in the days of kernel 2.2 there was a patch (from Rik van Riel if I recall) to the kernel so no single user could use all available cpu.

I was wondering if this feature (or something like it) has been ported or integrated in 2.4.x series ?

Rik van Riel replied, "There's a few versions for 2.4 on my patches page ;)"







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.