Kernel Traffic #215 For 9 May 2003

By Zack Brown

Sorry about the delays in recent issues, folks. I've been preoccupied with non-KT stuff. Hopefully the schedule anomalies will be easing up soon. I'd like to thank all the folks who emailed their support and encouragement.

Mailing List Stats For This Week

We looked at 3335 posts in 15443K.

There were 734 different contributors. 383 posted more than once. 216 posted last week too.

1. Implementing Fine-Grained Control Over printk() Output

7 Apr 2003 - 24 Apr 2003 (52 posts) Archive Link: "[patch] printk subsystems"

Topics: Disks: SCSI, FS: sysfs, USB

People: Jes Sorensen, Patrick Mochel, Martin Hicks, Karim Yaghmour, Randy Dunlap, Pavel Machek, H. Peter Anvin

Martin Hicks introduced a patch to group printk() calls into categories that could be configured into the kernel, or left out. Each printk() call would identify its category (SCSI, USB, etc), so the user could decide at run-time, via a sysctl interface, if those particular types of messages would be welcome. He figured this was better than the old way, in which every feature just used printk() calls, and the user had to deal with everything, including the possibility that the more verbose drivers and subsystems might overflow the buffer used by printk().
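As a rough illustration of the filtering idea (written in Python for brevity, not kernel C), a message can carry a subsystem and a priority, and a per-subsystem threshold decides whether it reaches a fixed-size buffer. The class name, method names and buffer capacity below are hypothetical and are not taken from Martin's patch; only the priority numbers follow the usual syslog convention.

```python
from collections import deque

# Syslog-style priorities: a lower number means a more urgent message.
KERN_ERR, KERN_WARNING, KERN_INFO, KERN_DEBUG = 3, 4, 6, 7


class CategoryLog:
    """Toy model of a printk() that filters by (subsystem, priority)."""

    def __init__(self, default_threshold=KERN_INFO, capacity=4):
        self.default_threshold = default_threshold
        self.thresholds = {}                  # hypothetical per-subsystem settings
        self.buffer = deque(maxlen=capacity)  # fixed size: oldest entries fall off

    def set_threshold(self, subsystem, level):
        """Stand-in for the run-time sysctl knob."""
        self.thresholds[subsystem] = level

    def printk(self, subsystem, priority, message):
        """Store the message only if it is urgent enough for its subsystem."""
        limit = self.thresholds.get(subsystem, self.default_threshold)
        if priority > limit:
            return False                      # filtered out, buffer untouched
        self.buffer.append((subsystem, priority, message))
        return True


log = CategoryLog()
log.set_threshold("scsi", KERN_ERR)           # quieten a verbose subsystem
log.printk("scsi", KERN_INFO, "probing target 0")   # dropped
log.printk("usb", KERN_INFO, "new device found")    # kept
```

Because filtered messages never enter the buffer, a noisy subsystem can be silenced without losing the messages from everything else.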

Pavel Machek said the proper solution was just to fix the printk()s that overflowed the buffer; but Jes Sorensen replied, "Killing the printk's means they are not around if you have an end user who is running into problems at boot time. Having a feature like this means they can default to 'off' then if a problem arises, whoever is doing the support can ask the user to try and enable printk's for say SCSI and get the input, without having to rebuild the kernel from scratch." Pavel insisted there were two separate problems: overflowing the buffer, and restricting unwanted printk()s. He said no printk() call should overflow the buffer, and all those that did should be fixed. But Jes and H. Peter Anvin disagreed, and said that all the printk()s should be kept, since they might contain valuable information.

Pavel said the proper way to control whether a given driver produced output was to use "#define DEBUG" in the driver code itself, and surround all debugging messages with "#ifdef" directives. He said Martin's system did not clearly define the groupings of printk() calls into "subsystems". Jes pointed out that Pavel's approach would require the user to recompile the kernel each time, which Martin's solution did not. And H. Peter said that the printk() groupings would tend to define themselves along sensible lines.

Randy Dunlap said he was working on a very similar feature that would add a sysfs-controlled debugging flag that could be set or unset on a per-module basis. Patrick Mochel also said:

Something I've pondered in the past is a per-subsystem (as in struct subsystem) debug field and log buffer. When the subsystem is registered, a sysfs 'debug' file is created, from which the user can set the noisiness level.

From there, each subsystem can specify the size of a log buffer, which would be allocated also when the subsystem is registered. Messages from the subsystem, and kobjects belonging to it, would be copied into the local log buffer.

Wrapper functions can be created, similar to the dev_* functions, which take a kobject as the first parameter. From this, the subsystem and log buffer, can be derived (or rather, passed to a lower-level helper).

This all falls under the 'gee-whiz-this-might-be-neat' category, and may inherently suck; I haven't tried it. Doing the core code is < 1 day's work, though there would be nothing that actually used it..
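Patrick's idea can be pictured with a small Python model. This is only a sketch under assumptions: the Subsystem and KObject classes and the kobj_dbg() wrapper are hypothetical stand-ins for the kernel structures he mentions, and the message format is invented for illustration.

```python
from collections import deque


class Subsystem:
    """Owns a noisiness level and a private log buffer of a chosen size."""

    def __init__(self, name, log_size, debug=0):
        self.name = name
        self.debug = debug                  # would be set through a sysfs 'debug' file
        self.log = deque(maxlen=log_size)   # allocated when the subsystem registers


class KObject:
    """Minimal object that knows which subsystem it belongs to."""

    def __init__(self, name, subsystem):
        self.name = name
        self.subsystem = subsystem


def kobj_dbg(kobj, level, message):
    """Wrapper taking the object first; subsystem and buffer are derived from it."""
    subsys = kobj.subsystem
    if level > subsys.debug:
        return False                        # too noisy for the current setting
    subsys.log.append(f"{subsys.name} {kobj.name}: {message}")
    return True


scsi = Subsystem("scsi", log_size=2, debug=1)
disk = KObject("sda", scsi)
kobj_dbg(disk, 1, "spinning up")
```

Each subsystem can only overflow its own buffer in this model, which is the property that distinguishes it from a single shared log.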

Martin replied, "I'm not sure that this addresses the core problem that I'm trying to deal with. The problem is that machines with certain configurations (large number of CPUs, Nodes, or a bunch of SCSI and disks) display far too many messages to the console, resulting in the log buffer being overflowed. The method that I'm proposing simply allows you to decide what gets logged when a printk() happens, depending on the message's priority and which subsystem it originated from." Karim Yaghmour pointed out:

I'm not going to address the "filtering" aspect of the problem, but I would like to point out that this issue of printk overflowing and having multiple streams of printk is already solved by relayfs:

With relayfs, one could easily have multi-channel printks (e.g. one for each "subsystem" and a main one for important messages of all subsystems.) The advantages of relayfs are obvious:

We've already started playing around with printk on relayfs, though we don't have code to offer at this time.

In terms of init-time printk'ing with relayfs, this is the scheme I suggest:

That's it. Thereafter, all statically allocated printk buffers are dropped and all buffer management is left to relayfs.

[The filtering aspect is not taken care of by relayfs because it is not part of its "mandate". relayfs only aims at providing a very reliable lightweight high-speed data transfer mechanism for providing kernel data to user space. Higher-level mechanisms can easily use different relayfs channels to filter/mux data.]

Elsewhere, Martin said, "I don't think relayfs solves the problem either. This just adds an extra dependency for yet another pseudo-filesystem. printk is something that needs to "just work" even if the kernel is in the midst of crashing. Adding the extra complexity of all printk going out through a filesystem/buffer layer is not desirable, IMHO." But Karim replied, "There's a point where we've got to stop saying "oh, this buffering mechanism is special and it requires its own code." relayfs is there to provide a unified light-weight mechanism for transferring large amounts of data from the kernel to user space."

The various advocates continued advocating for a bit, and the discussion petered out inconclusively.

2. SELinux API Changes

8 Apr 2003 - 17 Apr 2003 (14 posts) Archive Link: "[RFC][PATCH] Extended Attributes for Security Modules"

Topics: Capabilities, Extended Attributes, FS: ext3, POSIX

People: Stephen Smalley, Andreas Gruenbacher

Stephen Smalley said:

As part of preparing SELinux for submission to mainline 2.5, the SELinux API is being reworked based on earlier discussions (starting when sys_security was removed from 2.5). As a preliminary step toward submitting SELinux, I'd like to request comments on an extended attribute handler for security modules. This message includes a patch against 2.5.67 (also available from that implements the changes to the base kernel and the LSM framework to support the use of extended attributes by security modules. You can obtain a full SELinux patch against 2.5.67 that includes these changes along with the SELinux code that uses them from, and some relevant userland components from Note that the full SELinux patch also contains some other changes to the base kernel and the LSM framework that will be submitted as separate RFCs.

The patch below implements an extended attribute handler for ext3 (as an initial example, not as an intended limitation) for a attribute that can be used by a security module and by security-aware applications to get and set file security labels. The patch also adjusts the LSM hook in setxattr and adds a post_setxattr hook so that the security module can update the inode security field upon a successful change to the file security label and can ensure atomicity for the security check and the update to the inode security field.

I should note that we will ultimately need such xattr handlers not only for conventional filesystems such as ext3 but also for pseudo filesystems such as devpts, e.g. so that sshd can set the security label properly on the pty that will be used for a user session. The SELinux release includes a patched sshd program that does this using the old SELinux API for setting file security labels, but this will need to be migrated to using setxattr if we are going to use the xattr API for all of our file labeling operations.

Andreas Gruenbacher replied privately:

Could you please try to briefly summarize the intended use of these security labels? Is this for MAC? Also it would be interesting to know what the required privileges would be to access the labels. There are probably some accesses that are allowed in the user's security context, and some others that are performed on behalf of a user process, but within the kernel's security context.

There may be some overlap with trusted extended attributes (see for a manual page that contains a minimal description).

Stephen quoted Andreas' email, and replied:

SELinux implements a flexible MAC architecture that can support many different kinds of MAC security models and includes Type Enforcement, Role-Based Access Control, and optionally Multi-Level Security in the example security server (policy engine). It is not based on POSIX.1e MAC, and POSIX.1e MAC doesn't work so well for non-traditional MAC models like Type Enforcement and Role-Based Access Control. We define a set of permissions that control the ability of a user process to get and set the security label of a file, and the kernel module internally performs get and set operations as appropriate when files are looked up and when new files are created. We originally implemented our own persistent label mapping using some meta-files, but have reworked the SELinux implementation to use xattr if they are available, as you can see in the patch on the NSA site.

However, SELinux is merely one of the possible security modules that might be implemented via LSM, so we didn't want to limit this to just SELinux. It seems preferable to reserve a single index and attribute name that can be used by any security module, and use the first few bytes of the attribute value to indicate the particular security module. Most security modules seem to be implementing some form of non-discretionary access control, but the LSM framework isn't specifically limited to that.

The xattr_security.c code is actually derived from xattr_trusted.c, but I thought that we should have a separate index and name for an attribute that will be used by MAC schemes like SELinux. Also, the xattr_security.c code differs from xattr_trusted.c in the following important respects:

  1. We use a fixed attribute name ( that is not extensible. Every security module would use that name for its attributes (LSM only allows one security module at a time, and any stacking has to be handled by the "principal" security module), and would sanity check the value by checking the first few bytes against some module identifier. Using the "system" prefix seemed appropriate given that this attribute is used internally by the security module and not just by userspace.
  2. Permission checking is handled via the security_inode_setxattr hook in fs/xattr.c:setxattr, and updating of the inode's security field to reflect changes to the attribute is handled by a new security_inode_post_setxattr hook added by the patch. The inode semaphore ensures atomicity for the check and update (note that the down is moved by the patch). There is no permission check embedded in the handler itself, since it will vary depending on the security module and depending on whether the call is made from userspace or from the security module itself.
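The "first few bytes identify the module" convention from point 1 can be sketched as follows. The four-byte identifier width, the b"SELX" identifier and the sample label are all assumptions made for illustration; the thread only says that a module would check "the first few bytes" of the value.

```python
MAGIC_LEN = 4  # assumed width of the module identifier


def pack_label(module_id: bytes, label: bytes) -> bytes:
    """Prefix a security label with the identifier of the module that owns it."""
    if len(module_id) != MAGIC_LEN:
        raise ValueError("module identifier must be exactly MAGIC_LEN bytes")
    return module_id + label


def unpack_label(value: bytes, expected_id: bytes) -> bytes:
    """Sanity-check the prefix and return the label, as a module would on lookup."""
    if value[:MAGIC_LEN] != expected_id:
        raise ValueError("attribute value belongs to a different security module")
    return value[MAGIC_LEN:]


# Hypothetical identifier and label, for illustration only.
stored = pack_label(b"SELX", b"system_u:object_r:etc_t")
label = unpack_label(stored, b"SELX")
```

A module that finds a foreign prefix can refuse the value instead of misreading another module's label, which is the situation Andreas raises below.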

Andreas replied:

LSM only allows one principal security module at a time, but it allows to switch between security modules. I am wondering what will happen if a user switches between multiple security modules that label files. The new module will see labels from the old module. It's a question of policy how to deal with that case. Probably the policy restrictions the old module was implementing should be considered invalid after another module was used, and so the old labels should be ignored/removed.

Another case is stacked modules where more than one module needs file labels. Your proposed API does not support that. I would rather use individual attribute names for each module (e.g., "security.selinux", etc.).

The design of filesystem EAs differentiates rough access policies by attribute namespace ("system.*", "user.*", "trusted.*"). The system namespace is special in that each "system.*" attribute may have different access restrictions. Attributes in the "user.*" namespace are subject to the same restrictions as the contents of the file the attributes are attached to. Attributes in the "trusted.*" namespace are accessible only to users capable of CAP_SYS_ADMIN.

The "security" namespace/attribute you are proposing is quite similar to the "trusted.*" namespace, except that CAP_SYS_ADMIN does not grant any rights there. It is unlikely that security modules will/can remove the powers of the CAP_SYS_ADMIN capability; many areas in the kernel depend on it. I would expect that these modules make sure that no process will be able to attain that capability in the first place. In that light, wouldn't it be possible to use the "trusted.*" namespace for storing LSM file labels instead (e.g., "trusted.selinux")? There's nothing wrong with introducing another namespace if necessary, but we might be able to avoid that.
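The namespace rules Andreas describes can be condensed into a small decision function. This is a simplified model rather than the kernel's logic: the function name and parameters are hypothetical, and the per-attribute "system.*" rules are represented by a plain dictionary supplied by the caller.

```python
def may_access(attr_name, *, can_access_file, has_cap_sys_admin, system_rules=None):
    """Decide access to an extended attribute from its namespace prefix."""
    namespace = attr_name.split(".", 1)[0]
    if namespace == "user":
        # Same restrictions as the contents of the file itself.
        return can_access_file
    if namespace == "trusted":
        # Only processes holding CAP_SYS_ADMIN.
        return has_cap_sys_admin
    if namespace == "system":
        # Each system.* attribute may carry its own restriction.
        return bool((system_rules or {}).get(attr_name, False))
    raise ValueError(f"unknown attribute namespace: {namespace!r}")


# An ordinary user who can read the file, without CAP_SYS_ADMIN:
may_access("user.comment", can_access_file=True, has_cap_sys_admin=False)
may_access("trusted.selinux", can_access_file=True, has_cap_sys_admin=False)
```

In this model a "security" namespace would be a fourth branch whose answer comes from the loaded security module instead of from CAP_SYS_ADMIN.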

Stephen didn't think there was much of a problem regarding switching between security modules. He felt it was much more likely that security modules would be loaded once, early in initialization, and never removed. Switching between multiple security modules seemed even less likely to him. To the problem of module stacking, he replied, "Note that LSM intentionally does not provide any mechanism itself for sharing the security fields of the kernel data structures. Stacking has to be handled by the principal security module. In practice, I would expect that any "stacking" of multiple security modules that use security fields and xattr will actually involve creation of a new module that integrates the logic of the individual modules. This is preferable anyway to ensure that the interactions among the security modules are well understood, that the logic is combined in a sensible manner, and that the individual logics can not subvert one another. Given this view, using an individual attribute name for each module would seem to serve no purpose. An integrated module that combines logic of several modules can store all of the necessary security data as a single attribute value. Note that SELinux already does this for the set of security models implemented by its policy engine."

Richard Offer said that regardless of which modules were loaded, an attribute was a permanent quality. If he rebooted his system and loaded a different LSM, he wanted the attributes set during the first run to be preserved. But Stephen felt this was not a realistic scenario. Even rebooting in order to go into a different security "environment", he said, was a fundamentally flawed idea. The problem was always that, in order to actually provide security, a security module had to be able to exert strong control over the system. There would never arise a case in which security modules would be swapped in and out, or a system would be rebooted with a different idea of security. Being truly secure, Stephen said, required a single security module loaded early, with no other shenanigans.

3. Multilingual Kernel Messages; Linus On Documentation

8 Apr 2003 - 25 Apr 2003 (140 posts) Archive Link: "kernel support for non-english user messages"

People: Oliver Neukum, Alan Cox, Andreas Dilger, Linus Torvalds, Frank Davis

Frank Davis suggested that printk() messages should be output in the default language of the machine running that kernel, instead of just in English as they were currently. Oliver Neukum replied, "These messages are for administrators and developers. Everybody needs to be able to read them. They have to be in English." But Alan Cox replied:

Everyone cannot read English. Many non English speakers will be admins of their own desktop boxes.

For the general case I agree. It would be nice to have message catalogues and translation capability within klogd and maybe of a few key messages to console but for most cases it would make things more complex not simpler.

Andreas Dilger also said to Frank, "I don't think you will get support from anyone for non-english messages in the kernel. Some people think there is already too much text segment in the kernel (c.f. tests that show kernel size shrinks by 200kB or whatever when printk is defined to a no-op)."

A number of folks started discussing whether such a thing should be done, and if so, how, until Linus Torvalds said:

This has come up before.

The answer is: go ahead and do it, but don't do it in the kernel. Do it in klogd or similar.

I refuse to clutter the kernel with inane and fragile (and totally unmaintainable) internationalization code. The string lookup can equally well be done in user space where it isn't a stability and complexity issue.

Elsewhere, in a part of the thread discussing the importance of coders documenting their code, Linus gave his ideas on that as well:

Some people care about documentation, some people don't. That's a fact, and spouting platitudes about "improving their work" just doesn't _matter_. The whole open source idea is that people do what they care about and what they are good at, and exactly because they aren't forced to deal with issues they don't have a heart for they take more pride and interest in the stuff they _do_ do.

Personally, I don't write documentation. I don't much even write comments in my code. My personal feeling is that as long as functions are small and readable (and logical), and global variables have good names, that's all I need to do. Others - who do care about comments and docs - can do that part.

And you know what? That _lack_ of comments and documentation improves my work. Not because documentation is bad, but because I DO NOT CARE. So I concentrate on the stuff I do care about.

4. Static Device Numbering Enhancements

13 Apr 2003 - 23 Apr 2003 (24 posts) Archive Link: "[PATCH] kdevt-diff"

Topics: Backward Compatibility

People: Andries Brouwer, Joel Becker, Linus Torvalds, Roman Zippel

Continuing his effort to extend static device number handling, Andries Brouwer posted a new patch and said:

This is the part that changes MAJOR/MINOR/MKDEV, that is, the structure of dev_t.

The structure here is 8+8, except when more bits are present, in which case it is 16+16, except when more bits are present, in which case it is 32+32. Since dev_t is 64-bit the structure of the middle part is not very important, but in some contexts we naturally get 16+16 (e.g. from CDROM) and 16+16 avoids messy conversion.

The macros here are written with the casts and typechecking that is otherwise implicit in the use of inline functions. Because of name clashes they cannot be inline functions. The MKDEV as given is not accepted by gcc in an enum, that is why I changed root_dev.h.

Since MINORBITS disappears I gave md_k.h and dasd_int.h local definitions.
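One way to read the "8+8, else 16+16, else 32+32" layout is sketched below in Python. This is a plausible interpretation of the description above, not a transcription of Andries' macros. In particular, the sketch rejects major 0 combined with a minor that does not fit in 8 bits, because such a value would decode as a narrower split; that restriction is an assumption of this example.

```python
def mkdev(major, minor):
    """Pack (major, minor) using the narrowest split that can hold both."""
    if not (0 <= major < 1 << 32 and 0 <= minor < 1 << 32):
        raise ValueError("major and minor must each fit in 32 bits")
    if major < 1 << 8 and minor < 1 << 8:
        return (major << 8) | minor          # classic 8+8
    if major == 0:
        raise ValueError("major 0 with a wide minor is ambiguous in this sketch")
    if major < 1 << 16 and minor < 1 << 16:
        return (major << 16) | minor         # 16+16
    return (major << 32) | minor             # 32+32


def _split_width(dev):
    """The highest populated bits reveal which split was used."""
    if dev >> 32:
        return 32
    if dev >> 16:
        return 16
    return 8


def major(dev):
    return dev >> _split_width(dev)


def minor(dev):
    return dev & ((1 << _split_width(dev)) - 1)


old_style = mkdev(3, 1)      # fits in 8+8, so old binaries see the same number
wide = mkdev(1, 70000)       # minor needs more than 16 bits, so 32+32 is used
```

The first case shows why the 8+8 form matters for backward compatibility: small device numbers keep exactly the value they always had.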

Joel Becker thought Andries' scheme was too complicated, with 8+8 unless 16+16, unless 32+32. He said, "We'd all have to know about the mess when dealing with userspace." Linus Torvalds replied:

Well, the thing is, we absolutely _do_ need to have the 8+8 split, in order to make old devices look the same old way for old binaries.

And the 32+32 split is what the new maximum would be, so ..

The 16+16 split is not strictly necessary, but Andries pointed out to me that there are filesystems etc external storage that only support a 32-bit opaque dev_t, so we'd need to marshall the device number _some_ way for them anyway, and having a standard way to do that is better than having everybody come up with their own variations.

(My preference for the 32-bit version would be 12+20 bits, but it's not a very strong one, and it doesn't really matter for the kernel proper, so I think Andries who has been tirelessly working on this for five years or more gets the final say on it).

Elsewhere, Roman Zippel replied to Andries' post, ccing Linus. He said, "Linus, if you still want to go for a single block device major, this patch is bad idea (at least in this form). The patch below demonstrates how we can use the space above 0x10000 as one big major. Drivers only have to set disk->major to 0 and it gets a device number. Simply expanding the dev_t number does not solve the problems, e.g. changing the number of partitions is still a problem. Below I added a GENHD_FL_DYNAMIC flag so the upper layer knows, that some values are only a hint and that it can change them (e.g. when the user requests it)." But Linus didn't think Andries' patch interfered with having a single block device major. He said, "I think the single block-device major is a totally separate issue, and has nothing to do with allowing big device_t representations. I do not see why Andries patch would be anything else than infrastructure for future expansion." He and Roman went back and forth on this, with Roman's point being that any extension to the device numbering system would have to be supported long after static device numbers were no longer necessary. He saw no reason to inflict such requirements of backward compatibility when dynamic device numbers were so close. But Linus was not convinced.

5. Passing System Call Parameters

16 Apr 2003 (6 posts) Archive Link: "System Call parameters"

People: Richard B. Johnson, Bruce Harada, H. Peter Anvin

Richard B. Johnson asked:

How does the kernel get more than five parameters?


        eax     = function code
        ebx     = first parameter
        ecx     = second parameter
        edx     = third parameter
        esi     = fourth parameter
        edi     = fifth parameter

Some functions like mmap() take 6 parameters! Does anybody know how these parameters get passed? I have an "ultra-light" 'C' runtime library I have been working on and, so-far, I've got everything up to mmap() (in syscall.h) (89 functions) working. I thought, maybe ebp was being used, but it doesn't seem to be the case.

Maybe after 5 functions, there is a parameter list passed by pointer???? I don't have a clue and I can't figure out the code, it's really obscure...

Bruce Harada gave a link to some documentation in PDF and quoted, "Certain Linux 2.4 calls pass a sixth parameter in EBP. Calls compatible with earlier versions of the kernel pass six or more parameters in a parameter block and pass the address of the parameter block in EBX (this change was probably made in kernel 2.4 because someone noticed that an extra copy between kernel and user space was slowing down those functions with exactly six parameters; who knows the real reason, though)." Richard said this was exactly what he wanted, and added, "FYI, I experimentally found out that the 6th parameter is passed in EBP if I use __NR_mmap2 as the function call instead of __NR_mmap. Thanks -- and I now have that working..."

H. Peter Anvin confirmed that EBP held the sixth parameter; and added, "However, on i386, SYS_mmap is a four-parameter system call where the last parameter is a pointer to a parameter block. SYS_mmap2 is the full six-parameter sane version."
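The two conventions discussed in this thread can be modelled in Python as plain data, without issuing any real system call. The helper names are hypothetical. The block layout (six consecutive 32-bit values) and the assumption that mmap2 takes its offset in 4096-byte units are stated here as assumptions of the sketch, so check them against the kernel sources before relying on them.

```python
import struct

ARG_REGISTERS = ("ebx", "ecx", "edx", "esi", "edi", "ebp")
NR_MMAP, NR_MMAP2 = 90, 192   # i386 system call numbers
PAGE_SHIFT = 12               # assumed: mmap2 offsets are counted in 4096-byte units


def load_registers(number, *args):
    """Model the register convention: eax holds the call number, then up to six arguments."""
    if len(args) > len(ARG_REGISTERS):
        raise ValueError("at most six arguments fit in registers")
    regs = {"eax": number}
    regs.update(zip(ARG_REGISTERS, args))
    return regs


def old_mmap_block(addr, length, prot, flags, fd, offset):
    """Old style: pack all six arguments into a block whose address would go in ebx."""
    return struct.pack("<6I", addr, length, prot, flags, fd & 0xFFFFFFFF, offset)


def mmap2_registers(addr, length, prot, flags, fd, offset):
    """New style: six register arguments, with the sixth (in ebp) given in pages."""
    if offset % (1 << PAGE_SHIFT):
        raise ValueError("offset must be page-aligned")
    return load_registers(NR_MMAP2, addr, length, prot, flags, fd, offset >> PAGE_SHIFT)


regs = mmap2_registers(0, 8192, 3, 0x22, -1, 4096)
block = old_mmap_block(0, 8192, 3, 0x22, -1, 4096)
```

The block form costs the kernel an extra copy from user space, which matches the reason suggested in the quoted documentation for moving the sixth argument into a register.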

6. Megaraid Driver Update

16 Apr 2003 - 26 Apr 2003 (10 posts) Archive Link: "[ANNOUNCE]: version 2.00.3 megaraid driver for 2.4.x and 2.5.67 kernels"

People: Atul Mukker, Christoph Hellwig, Mukker

Atul Mukker said, "New megaraid driver 2.00.3 is now available at For this driver, a patch is also available for 2.5.67 kernel." Christoph Hellwig pointed out that, as 2.00.4beta was already out, it seemed pointless to announce 2.00.3; and he also had some technical comments about the driver. Later, Atul also said, "The patch for kernel 2.5.6[78] for driver 2.00.5 is now available at"

7. Outage

16 Apr 2003 - 24 Apr 2003 (31 posts) Archive Link: " outage"

Topics: Version Control

People: Larry McVoy, Roman Zippel, Ben Collins, Shachar Shemesh, Alan Cox, H. Peter Anvin

Larry McVoy reported:

Some of you use (*) and have noticed that it is off the air. It's looking like it may have a blown power supply and Penguin is on it; ETA for a fix is tomorrow some time (it's colocated and they are going down there to grab the box and swap out parts).

(*) This is a fast machine which was provided by Penguin Computing and BitMover as a place for people who do not have access to a high end machine and could use one. It's used for various BitKeeper tasks (translation: it can run a bk -r check -ac in about 15 seconds on the 2.5 tree). It is also the host for the BK->CVS repositories, so those are off the air as well (we have a mirror here so if you are dying for your bits in CVS let me know).

If you use BK and you need a login on fast x86 box, contact me or and we'll set you up an account when it comes back up. If you appreciate this box, buying some hardware from Penguin Computing (or getting someone else to do so) is a good way to show that appreciation. Penguin deserves a lot of thanks, it's a nice box and they provide the power, bandwidth, and support which are all hidden costs and thankless tasks.

Elsewhere, under the Subject: BK->CVS, Larry said:

It's back up, and the CVS server up to date with the 2.4 2.5 kernels as of a few minutes ago. The CVS server is at

There are linux-2.4/ and linux-2.5/ subdirectories there (should this go in a FAQ someplace or does nobody except Andrea care?).

H. Peter Anvin said this should definitely be in a FAQ; and Larry replied:

OK, so how about this? I assume you manage DNS for, right? How about a DNS entry for -> If you ever find a machine to host this then you already own and you can just reset the address. By the way, I think the bandwidth is pretty darn low, after all that fuss almost nobody seems to use this, it just gives them warm fuzzies to know that the history has been captured in an open format which is worth it if it means no more BK flame wars, eh?

Then whoever maintains the kernel FAQ these days could add something like this:

SCM access to the kernel trees:

Linus started using an SCM (source code management) tool called BitKeeper in February of 2002. Since BitKeeper isn't free software, he does not require that anyone else use BitKeeper, he continues to accept patches just like he always did. The only difference is that information about who did what, and maybe why they did it, is recorded and is useful for learning the source base, tracking down bugs, etc. Many, but not all, of the core developers have switched to using BitKeeper because it makes their life easier in various ways.

Some people haven't switched because BitKeeper isn't free software and they feel uncomfortable using non-free software as part of working on the kernel. That's fine, it's an explicit goal of both Linus and the BitKeeper developers that nobody is required to use BitKeeper to work on the kernel. Some senior developers have decided they'd rather not use BitKeeper, Alan Cox being a good example. That's not a problem, the BitKeeper developers worked with Linus to streamline the importing of traditional patches so that anyone can work in any way they see fit.

If you want to use BitKeeper ( then the official trees are maintained on - to get a particular release try this:

bk clone bk://

There was a fair amount of fuss amongst the free software purists, over the fact that a lot of information that was available in BitKeeper was lost when Linus provided the traditional tarball releases and patch updates. Flame wars happened and when the dust settled, the BitKeeper folks built a BitKeeper to CVS gateway which captures the bulk of the information (as of this writing on April 19th 2003 there are 9,311 snapshots captured). If you would prefer to get your source with 100% God fearing, politically correct, open source, fully buzzword enabled software, then you can do this:

cvs co linux-2.4

As releases progress, the release numbers will change so some day you might say

bk clone bk://
cvs co linux-4.2

Roman Zippel drew attention to Larry's statement, "Some people haven't switched because BitKeeper isn't free software and they feel uncomfortable using non-free software as part of working on the kernel." Roman pointed out, "You forgot to mention that some people are not allowed to use bk (without paying)". Larry accused Roman of trying to start a flame war, and refused to answer.

Elsewhere, Ben Collins said to Larry:

I hate asking this on top of the work you already provide, but would it be possible to allow rsync access to the repo itself? I have at least 6 computers on my LAN where I keep source trees (2.4 and 2.5), and it would be much less b/w on my metered T1 and on your link as well if I could rsync one main "mirror" of the cvs repo and then point all my machines at it.

I do a lot of diff'ing and log reading, so it would help out there too if I didn't have to connect back to bkbits to perform those frequent operations.

Larry replied, "If HPA wants to provide that, that's cool. I think he might already. If not, ping me again, no problem, we'll set something up." And H. Peter Anvin said that rsync:// and rsync:// should work. Ben thanked him. Shachar Shemesh, however, remarked:

There is a better tool (for this particular task), called "cvsup". It does a wonderful job of keeping cvs repositories in synch. I realize I just asked for a THIRD tool, so it should only go in if the admins are willing to take care of it.

The idea is that it uses the full duplexity of the channel to get client side information about the repository on that end while downloading changes, thus increasing the effective bandwidth. It only falls back to rsynch if CVS repository specific updates are not possible. I use it on the Wine repository, and it does, indeed, work very efficiently.

On the negative side - as far as I could tell, neither RedHat nor Mandrake carry it as a standard package (Debian does, at least in unstable).

Later he explained:

"cvsup" is for synching repositories (I was not talking about "cvs up" - the command line). It achieves the exact same end effect as rsync, except it is much more bandwidth efficient when used to sync CVS repositories. Homepage at

As Adam Richter said in private, however, the tool is a bitch to compile. It is written in Modula-3, and most people don't have the development environment to build it. Add to that the fact that most distros don't carry it as a package (a while back I tried, unsuccessfully, to locate an RPM for it, anywhere), and you get something that should be deployed with care.

On the other hand, both Wine (where I got to know it) and KDE seem to offer cvsup for getting the repository, so it can't be THAT difficult. As also noted above, Debian does carry it in easy to deploy .deb, as part of the main distro's archive (confirmed available on stable).

8. New 64-Bit mknod Tool

18 Apr 2003 - 19 Apr 2003 (4 posts) Archive Link: "mknod64(1)"

People: Robert Love, H. Peter Anvin

Robert Love announced:

I wrote a mknod64(1) tool, so we can play with 64-bit device numbers. It is available at:

for testing. And that is really its whole purpose because I see no reason why the mknod in coreutils will not eventually support mknod64(2).

But for now this version works and supports the 64-bit dev_t with a 32:32 split. It is also identical in functionality to mknod(1), except it does not support an initial mode other than the default (i.e., no --mode option).

Installation is simple but RPM packages are also available.

Usage is the same as mknod, except you may specify a 32-bit value for the major and the minor device number.

This currently requires 2.5.67-mm4, but I suspect the 64-bit dev_t work will eventually make its way into Linus's tree.

Note that most utilities cannot see the 64-bit device numbers, i.e. ls(1) only displays 8-bits of each. You can do a homemade stat64() or just trust the code.

With the above kernel and this utility, you can play with 64-bit device numbers.
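For readers who want to inspect a device node without relying on ls(1), a "homemade stat" can be as short as the Python sketch below. It uses the standard os.stat(), os.major() and os.minor() calls, so it shows whatever the C library on the running system can decode; it is not the stat64() test program Robert has in mind, and the helper names are hypothetical.

```python
import os
import stat


def split_rdev(rdev):
    """Decode a raw device number with the platform's own major/minor helpers."""
    return os.major(rdev), os.minor(rdev)


def device_numbers(path):
    """Return (major, minor) for a character or block device node."""
    st = os.stat(path)
    if not (stat.S_ISCHR(st.st_mode) or stat.S_ISBLK(st.st_mode)):
        raise ValueError(f"{path} is not a device node")
    return split_rdev(st.st_rdev)
```

Calling device_numbers() on a node created with mknod64 would show whether the installed C library sees the full major and minor or only a truncated view.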

H. Peter Anvin pointed out, regarding the future of mknod in coreutils:

actually, once glibc is updated to call SYS_mknod64 and have the right MAJOR() and MINOR() macros, it shouldn't require any changes to mknod(1).

What would probably be useful for mknod(1), if it doesn't already, is to allow the major/minor to be specified in any of the standard bases, i.e. using strtoul(...,...,0).

I believe HP/UX (which has had 32-bit minors for a long time) actually had ls -l display hexadecimal minors. I am not advocating that, however; it probably would break too many scripts.
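Peter's strtoul(...,...,0) suggestion relies on the standard C behavior that a base of 0 selects decimal, octal (leading 0), or hexadecimal (leading 0x) from the text itself. A minimal sketch of such a parser follows; `parse_dev_field` is a hypothetical helper written for this illustration and is not taken from mknod(1).

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse one major or minor number written in
 * decimal, octal (leading 0) or hexadecimal (leading 0x).
 * Returns 0 on success, -1 on malformed or out-of-range input. */
static int parse_dev_field(const char *text, uint32_t *out)
{
    char *end;
    unsigned long value;

    /* Insist on a leading digit: strtoul() itself would also accept
     * leading whitespace and a sign. */
    if (text == NULL || !isdigit((unsigned char)text[0]))
        return -1;

    errno = 0;
    value = strtoul(text, &end, 0);   /* base 0: detect the base from the prefix */
    if (errno == ERANGE || *end != '\0')
        return -1;
    if (value > UINT32_MAX)           /* must fit a 32-bit major or minor */
        return -1;

    *out = (uint32_t)value;
    return 0;
}
```

With this helper, "254", "0xfe" and "0376" all parse to the same value. One consequence of base detection is that a leading zero changes the meaning, so "010" is 8 rather than 10.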

9. Linux 2.5.68 Released

19 Apr 2003 - 23 Apr 2003 (17 posts) Archive Link: "Linux 2.5.68"

Topics: Digital Video Broadcasting, FS: devfs

People: Linus Torvalds, Ben Collins

Linus Torvalds announced 2.5.68 and said:

Tons of changes all over the map. The diff is large, partly because the s390x support got merged into the s390 port as a 64-bit subset, and the old s390x architecture files thus became irrelevant. And a merge with Alan gave us another architecture instead - h8300.

Lots of dvb updates (digital video), again through Alan. And a major aic79xx driver update.

Oh, and the devfs stuff by Christoph means that devfs users should beware: in particular, devfs users need to mount the pts filesystem like everybody else does, that duplication got killed.

Other than that, just a ton of updates. See changelog for details.

Ben Collins pointed out that the devfs changes required more than just mounting the pts filesystem. He said, "you need to build devpts explicitly now too. Before you could get away with not selecting devpts as an option."

10. More 64-Bit mknod Discussion

20 Apr 2003 - 22 Apr 2003 (43 posts) Archive Link: "[PATCH] new system call mknod64"

Topics: Backward Compatibility, Ioctls

People: Andries Brouwer, Linus Torvalds, Roman Zippel, Alexander Viro, David S. Miller, Christoph Hellwig

Andries Brouwer added a mknod64 system call to go with his recent implementation of larger static device numbers. Christoph Hellwig felt this was putting the cart before the horse, since the data structures that would require the new system call had not yet been changed in a way that made it necessary. Christoph felt there was other work to be done first, before any discussion of a new system call. Andries replied:

Yes, there is a dozen rather uninteresting patches that can be applied any moment. But a new system call is more important, so I show it in public at some earlier stage, so that Linus and others, like you, can comment.

Yesterday or the day before Linus preferred __u32 etc for this loopinfo64 ioctl, so I did it that way. Here, since mknod is a traditional Unix system call, I am still inclined to prefer (unsigned) int above __u32. Of course it doesn't matter much.

David S. Miller started to address this issue, considering the best data type to hold the input value, when Linus Torvalds rejected the entire idea of having a single input value represent both the major and minor device numbers. He said, "The kernel should get major and minor numbers. It's a sad mistake that UNIX uses "dev_t" in the first place, and clearly the glibc interface to user mode will have to be that historical braindamage. But we should realize that the _right_ interface is keeping the <major, minor> tuple explicit, and any new system call interfaces should be of that type." Roman Zippel asked what the advantage of this was. He said, "Everywhere it's just a simple number, only when we present that number to the user, we create some kind of illusion that this split has any meaning." Linus replied:

the split has huge meaning inside the kernel. We split the number every time we open the device, and use that split to look up the result.

There's another issue, though, which may or may not be a good thing. If we split and re-create the device number, that will always force the "dev_t" to be in "canonical form", ie if the major and minor both fit in 8 bits, then we will always fit the whole dev_t in 16 bits.

This shows up as a difference in the two approaches: if you consider the user-supplied number as an unsplit binary "dev_t", then the user can supply a 64-bit number like 0x0000000100000001, and we will actually use that as the dev_t. However, if we split it up, and the user supplies <1,1>, then we will always generate 0x0000000000000101 as the 64-bit dev_t, and there is never any way to generate the "non-canonical" form.

Does it matter? Probably not. I actually think it's slightly preferable to always have things in the canonical form, and the networked filesystems will generally canonicalize it anyway since they usually split it up into major/minor. But it _is_ potentially a user-visible difference.
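Linus's "canonical form" can be made concrete with a small sketch. The decoding rule used here is an assumption chosen to match his <1,1> example: a value that fits in 16 bits is read with the legacy 8:8 layout, and anything larger is read as a 32:32 split. The `dev64_*` names are hypothetical and do not come from the kernel source.

```c
#include <stdint.h>

typedef uint64_t dev64_t;

struct dev_tuple {
    uint32_t major;
    uint32_t minor;
};

/* Assumed decoding rule: small values use the legacy 8:8 layout,
 * everything else uses a 32:32 split. */
static struct dev_tuple dev64_decode(dev64_t dev)
{
    struct dev_tuple t;

    if (dev <= 0xffffu) {
        t.major = (uint32_t)(dev >> 8);
        t.minor = (uint32_t)(dev & 0xffu);
    } else {
        t.major = (uint32_t)(dev >> 32);
        t.minor = (uint32_t)(dev & 0xffffffffu);
    }
    return t;
}

/* Split the number and re-create it: when both fields fit in 8 bits
 * the result always fits in 16 bits, otherwise the value is kept. */
static dev64_t dev64_canonical(dev64_t dev)
{
    struct dev_tuple t = dev64_decode(dev);

    if (t.major <= 0xffu && t.minor <= 0xffu)
        return ((dev64_t)t.major << 8) | t.minor;
    return dev;
}

/* Two numbers name the same device when their canonical forms match. */
static int dev64_same_device(dev64_t a, dev64_t b)
{
    return dev64_canonical(a) == dev64_canonical(b);
}
```

Under these assumptions, the 64-bit value with major 1 in the upper half and minor 1 in the lower half canonicalizes to 0x101, and a 32:32 encoding of <3,1> compares equal to the legacy 0x0301.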

Christoph pointed out that splitting the number was arbitrary, especially since the split no longer existed internally in the block device layer, and would soon be going away for character devices as well, as a result of Alexander Viro and others' work in those areas. But Alexander replied:

Oh, we certainly _do_ split - simply because there are ranges that belong to same driver (or driver and object).

However, the split boundary is not uniform - it depends on driver/object/whatnot. IMO it's a moot point by now, anyway - most of the kernel couldn't care less about device numbers.

But Linus also replied to Christoph with a different take:

Actually, we still do it for both block _and_ character devices.

Look at "nfs*xdr.c" to see what's up.

In other words, that split is definitely not virtual. It's a real thing with real visibility for users.

The fact that the kernel internally has generalized it away doesn't matter. Any kernel virtualization of the number still _has_ to account for the fact that it's a real thing.

Put another way:


_has_ to open the same file as


because otherwise the kernel virtualization is broken (since they will look the same to a user, and they will end up being written to disk the same way).

Thus any code that only looks at 64-bit dev_t without taking this into account is BUGGY.

One way to avoid the bug is to always keep all dev_t numbers in "canonical format". Which happens automatically if the interface is <major, minor> rather than a 64-bit blob.

I personally think that anything that uses "dev_t" in _any_ other way than <major,minor> is fundamentally broken.

Close by, part of the discussion involved implementation details that could potentially break backward compatibility. This was unacceptable to Linus, and he said:

Old and new drivers alike will use the MAJOR() macro, and that macro had better work with old and new kernels. Agreed? (And if you don't agree, don't even bother to answer, I'm not really interested in even discussing something so fundamental).

And regardless of whether a person uses an old or a new library, they had better see the same MAJOR() and MINOR() values for a legacy device, like /dev/hda1. In other words, the library version of MAJOR(),MINOR() _has_ to return the value 3,1, or it can break perfectly valid programs.

Again, if you don't agree, don't even bother sending me email any more about this issue. This is not negotiable. We _will_ have backwards and forwards compatibility, and that's final.

This means that MAJOR() has to look at bits 8..15 if the value is small. No ifs, buts and maybes about it.

HOWEVER, clearly MAJOR() has to look at other bits too, otherwise it wouldn't make any sense to make a bigger dev_t in the first place. The current MAJOR() is the logical extension.

But that _will_ force aliasing, unless you start doing some really funky things (make the dev_t look more like a UTF-8 unicode-like extension, which is obviously possible). In other words, there will be OTHER values for "dev_t" that will _also_ look like the tuple <3,1>.

And my requirements are that

These are not things open for discussion. We know what the behaviour MUST BE. Aliases _have_ to behave identically, anything else is _indisputably_ crap.

And I claim that this means that you have to have a mapping somewhere. You're free to come up with new ideas, but I don't think it will work. Keep the above rules in mind: backwards compatibility and aliases that work identically. That's all it really boils down to.

11. Many Drivers Broken By IRQ API Changes

21 Apr 2003 - 22 Apr 2003 (4 posts) Archive Link: "updates for the new IRQ API"

Topics: User-Mode Linux, Version Control

People: Andrew Morton, Roman Zippel, David S. Miller, Oleg Drokin

Andrew Morton announced:

A change was made today to the kernel's IRQ handlers. See

for details.

The patch at

Is Linus's current bitkeeper tree, plus fixes for 350 files. I got most of it, but various scsi drivers and non-x86 architectures will still need work.

Since this change had an impact on most drivers, requiring source-level changes, Roman Zippel suggested:

Hmm, if we are breaking already every driver, how about getting rid of the pt_regs argument. The timer interrupt is the only real user and it can also be stored in irq_desc_t, from where the timer can get it with the irq number. To preserve compatibility we could add something like this for 2.4:

#define alloc_irq(irq, handler, flags, name, id) \
        request_irq(irq, (irqreturn_t (*)(int, void *, struct pt_regs *))handler, flags, name, id)

David S. Miller replied, "Casting 'handler' is not acceptable, see Linus's comments he added at the top of interrupt.h"
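Whatever the comments in interrupt.h say (they are not quoted here), standard C leaves the behavior undefined when a function is called through a pointer to an incompatible function type, which is what the cast in the macro sets up. One cast-free alternative is a small adapter with the old signature that forwards to a handler with the new one. The user-space sketch below only illustrates that pattern: nobody in the thread proposed it, and every name in it (`pt_regs_demo`, `irq_adapter`, `demo_dispatch` and so on) is made up for the example.

```c
#include <stddef.h>

/* Stand-in for the saved register state; not the kernel's pt_regs. */
struct pt_regs_demo {
    int unused;
};

/* Old-style handlers take the register state, new-style ones do not. */
typedef int (*old_handler_t)(int irq, void *dev_id, struct pt_regs_demo *regs);
typedef int (*new_handler_t)(int irq, void *dev_id);

/* Pairs a new-style handler with the cookie it expects. */
struct irq_adapter {
    new_handler_t handler;
    void *dev_id;
};

/* Trampoline with the old signature: it has exactly the type the
 * dispatcher expects, so no function pointer is ever cast. */
static int irq_trampoline(int irq, void *cookie, struct pt_regs_demo *regs)
{
    struct irq_adapter *adapter = cookie;

    (void)regs;  /* the new-style handler has no use for it */
    return adapter->handler(irq, adapter->dev_id);
}

/* Stand-in for whatever code invokes old-style handlers. */
static int demo_dispatch(old_handler_t handler, int irq, void *cookie)
{
    struct pt_regs_demo regs = { 0 };

    return handler(irq, cookie, &regs);
}

/* Example new-style handler: adds the irq number to a counter. */
static int demo_new_handler(int irq, void *dev_id)
{
    int *count = dev_id;

    *count += irq;
    return 1;
}
```

The cost of the adapter is one extra indirect call and a small structure per registration, in exchange for every call going through a pointer of the correct type.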

Elsewhere, Oleg Drokin posted a patch, saying, "Here is UML's part. I tried it and stuff compiles and works for me."

12. Status Of Hyperthreading Scheduler Enhancements

21 Apr 2003 - 23 Apr 2003 (12 posts) Archive Link: "[patch] HT scheduler, sched-2.5.68-A9"

Topics: Big O Notation, Hyperthreading, Version Control

People: Ingo Molnar, Martin J. Bligh

Ingo Molnar announced:

the attached patch (against 2.5.68 or BK-curr) is the latest implementation of the "no compromises" HT-scheduler. I fixed a couple of bugs, and the scheduler is now stable and behaves properly on a 2-CPU-2-sibling HT testbox. The patch can also be downloaded from:

bug reports, suggestions welcome.

Martin J. Bligh took a look and noticed that, while there were definite performance improvements under high loads, there seemed to be some performance degradation under lower loads. Ingo felt this was most likely just a bug, and not systemic to the scheduler changes.

13. Status Of matroxfb

22 Apr 2003 - 23 Apr 2003 (4 posts) Archive Link: "2.5.68 state of matroxfb"

Topics: Framebuffer

People: Ed Sweetman, Matt Reppert, Petr Vandrovec

Ed Sweetman asked, "I'm just wondering what the state of the matroxfb driver is and why it's an option in the kernel when it's completely uncompilable and has been for many months." Matt Reppert replied, "A lot of FB stuff has been nonworking at various stages, since the whole console layer has been more or less rewritten over the course of 2.5. (Of course, a lot of kernel internals have changed, so entire *CPU architectures* have been uncompilable for significant periods of time.)" He added that the configuration option was still offered in spite of the breakage, because "support is planned but there hasn't been time to get all the issues hammered out. A lot of kernel devs still are volunteers that have to do this in whatever free time they have, between actual jobs or, sometimes more importantly, between searching for jobs, so things don't always happen instantly." At some point, Petr Vandrovec also said, "I'll try to put stripped down matroxfb into kernel sometime in future... Unfortunately as 2.5.x fbdev infrastructure does not fulfill my needs, it will be stripped down version (which fits into current fbdev, and which will not support text mode & fbset -fb /dev/tty*), and I'll not test it a lot, as it is hard to test something you cannot use in daily work..."

14. Device Class Rework

22 Apr 2003 - 23 Apr 2003 (10 posts) Archive Link: "[RFC] Device class rework [0/5]"

Topics: FS: sysfs, Hot-Plugging

People: Greg KH, Mike Anderson

Greg KH said:

Here's a set of patches that rework the current class support in the kernel today into something that works a bit better, and is simpler to use.

Currently, classes are assigned to drivers at compile time (or at the latest, at driver_register() time) and enforce a 1 to 1 relationship between class devices and struct device objects. This is not practical in the kernel, as there are a number of physical devices that correspond to multiple "class" devices. It's also unwieldy to bind classes to devices so early. They should be explicitly done later when the class device is registered with that subsystem.

So with that in mind, here's some changes. The rework of the driver core is all done right now in one big patch, but I'll split it up smaller for inclusion later on.

This patch gets rid of struct device_class and struct device_interface, and replaces them with struct class[1], struct class_device, and struct class_interface. struct class is much like struct device_class used to be, but is much smaller, and not bound to any drivers. This makes the driver core a lot smaller, as we get hotplug events for free now (struct class_device is a kobject), and is more flexible.

A struct class_device is registered with a class when that device is registered within the kernel. As an example of this, see the tty patch later on in this email thread.

A struct class_interface is used to get a callback whenever a struct class_device is registered or unregistered with a class. This can be used to attach files to the class_device, or do more complicated things. The patch that changes the cpufreq code in this email thread shows how this can be used (although the cpufreq code can be further simplified based on these changes, I've not done it yet.)

If there are no major objections to this, I'll split it up into smaller pieces for inclusion in the kernel tree.

I'll follow this message up with 5 patches that do the following things:

Oh, and I didn't touch the pcmcia code yet either, with these patches, that code will not compile properly.
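The callback arrangement Greg describes, where an interface is told whenever a class device is registered or unregistered with a class, can be sketched in user space. The `demo_*` types and functions below are invented for this illustration and only mirror the roles of struct class, struct class_device and struct class_interface; the real definitions and signatures are in his patches, which are not reproduced here.

```c
#include <stddef.h>

#define DEMO_MAX_INTERFACES 4

/* Plays the role of a class device; the counter stands in for
 * files that interfaces attach to it. */
struct demo_class_device {
    const char *name;
    int attached_files;
};

/* Plays the role of a class interface: a pair of callbacks. */
struct demo_class_interface {
    void (*add)(struct demo_class_device *dev);
    void (*remove)(struct demo_class_device *dev);
};

/* Plays the role of a class: it is bound to interfaces, not to drivers. */
struct demo_class {
    const char *name;
    struct demo_class_interface *interfaces[DEMO_MAX_INTERFACES];
    size_t interface_count;
};

static int demo_class_interface_register(struct demo_class *cls,
                                         struct demo_class_interface *intf)
{
    if (cls->interface_count >= DEMO_MAX_INTERFACES)
        return -1;
    cls->interfaces[cls->interface_count++] = intf;
    return 0;
}

/* Registering a device notifies every interface of the class. */
static void demo_class_device_register(struct demo_class *cls,
                                       struct demo_class_device *dev)
{
    size_t i;

    for (i = 0; i < cls->interface_count; i++)
        if (cls->interfaces[i]->add)
            cls->interfaces[i]->add(dev);
}

/* Unregistering a device notifies every interface as well. */
static void demo_class_device_unregister(struct demo_class *cls,
                                         struct demo_class_device *dev)
{
    size_t i;

    for (i = 0; i < cls->interface_count; i++)
        if (cls->interfaces[i]->remove)
            cls->interfaces[i]->remove(dev);
}

/* Example callbacks: attach and detach one file per device. */
static void demo_attach_file(struct demo_class_device *dev)
{
    dev->attached_files++;
}

static void demo_detach_file(struct demo_class_device *dev)
{
    dev->attached_files--;
}
```

In this toy version the device is bound to the class only at registration time, which is the late binding the rework argues for, and a single physical device could register several class devices with different classes.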

15. IDE Maintainership And Licensing Changes

23 Apr 2003 - 24 Apr 2003 (2 posts) Archive Link: "ide-2.5.68.patch"

Topics: Disks: IDE, Serial ATA

People: Andre Hedrick, Shane Shrybman, Alan Cox

Andre Hedrick said:

This patch is to promote Bartlomiej to a well earned position as the Global IDE Maintainer, as I have stepped to the side to handle the SATA and vendor chipset issues. My time here was to clean up the transport to allow the simplicity of the protocol to be expressed.

If there is an issue where one believes I need to be included in the thread please CC me as I limit my reading of lkml directly.

This patch also addresses some text whose intent may be a restriction to GPL; however, GPL itself is flawed as it relates to concatenation or appending additional information to a binary program, be it a kernel or app. whose current license status is GPL or LGPL. With this in mind, I will formally announce that all of my contributions to the kernel over the past 5 years or so are to be dual licensed in OSL/GPL format. I would strongly suggest any person who has a stake in the kernel to review OSL and consider.

If anyone has an issue with this change, get over it. If you can not get over it, you can pay for the legal opinion for review.

Shane Shrybman replied:

I didn't see any replies to this post so I felt compelled to say something.

Thanks Andre!! Thanks for being the linux-ide guy for so long and sticking with it through the tougher times. And for trying (and doing) the right thing by us. ( and for not frying any of my data :)

And just as big a thanks to the linux-ide coordinator, Alan Cox!

And a big hearty welcome to the new linux-ide guy, Bartlomiej!

16. IDE Power Management

24 Apr 2003 - 25 Apr 2003 (14 posts) Archive Link: "[RFC/PATCH] IDE Power Management try 1"

Topics: Disks: IDE

People: Benjamin Herrenschmidt

Benjamin Herrenschmidt said:

This patch is a first try at implementing proper IDE power management. It's not intended to be merged as-is, there is some ugliness here or there, it's here for comments and discussion.

The point is to pipe the power management requests through the request queue for proper locking. Since those requests involve several operations that have to be tied together with the queue being locked for further 'user' requests, they are implemented as a state machine with specific callbacks in the subdrivers.

The guts of the patch is the core changes. I did the actual implementation for ide-disk & ide-cd as a way to quickly validate it (it works), but I'm sure we want to do more than just what I implemented here. At least it's ok for PPC, but I believe some x86 will want you to do more (Alan suggested reverting to PIO for example, to please some BIOSes, though that would suck badly for suspend-to-disk, there may be more involved regarding recovering from errors in flush too).

One thing that should probably be cleaned up is the difference between the suspend and the resume request. I didn't want to implement 2 different request bits to avoid using too much of that bit-space, and because most of the core handling is the same. So right now, I carry in the special structure attached to the request, 2 fields. An int indicating if we are doing a suspend or a resume op, and an int that is the actual state machine step.

However, for convenience, my ide-cd and ide-disk implementation implement resume just as different step number in the same routine...

So we could either get rid of the "int suspend" field completely and just define 2 different ranges for "step". Or we could keep "suspend", but then it may make sense to split the sub-driver ops in 2 (suspend/suspend_completion & resume/resume_completion).
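The first option Benjamin mentions, dropping the "int suspend" field and using two ranges of step numbers, might look roughly like the sketch below. The step names and values are hypothetical (the text does not list the patch's real steps), and the loop in `demo_pm_run` merely stands in for requests completing one after another on the queue.

```c
/* Hypothetical steps: suspend steps occupy one range, resume steps
 * another, so the range alone says which operation is in progress. */
enum demo_pm_step {
    DEMO_PM_SUSPEND_FLUSH = 0,
    DEMO_PM_SUSPEND_STANDBY,
    DEMO_PM_SUSPEND_DONE,

    DEMO_PM_RESUME_BASE = 100,
    DEMO_PM_RESUME_WAKE = DEMO_PM_RESUME_BASE,
    DEMO_PM_RESUME_RESTORE_MODE,
    DEMO_PM_RESUME_DONE
};

/* Replaces the separate "suspend" flag. */
static int demo_pm_is_resume(int step)
{
    return step >= DEMO_PM_RESUME_BASE;
}

static int demo_pm_is_done(int step)
{
    return step == DEMO_PM_SUSPEND_DONE || step == DEMO_PM_RESUME_DONE;
}

/* Completion callback: once the current command has finished,
 * move to the next step of the same operation. */
static int demo_pm_complete_step(int step)
{
    if (demo_pm_is_done(step))
        return step;
    return step + 1;
}

/* Drives one whole operation and returns the number of commands
 * that would have been issued to the drive. */
static int demo_pm_run(int first_step)
{
    int issued = 0;
    int step = first_step;

    while (!demo_pm_is_done(step)) {
        issued++;  /* a real sub-driver would queue a drive command here */
        step = demo_pm_complete_step(step);
    }
    return issued;
}
```

The trade-off is the one he names: a single routine can serve both directions, but the suspend/resume distinction becomes implicit in the numbering instead of being carried in its own field.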

17. Speeding Up Boot-Time

24 Apr 2003 (3 posts) Subject: "[ANNOUNCE] OSDL Whitepaper: "Reducing System Reboot Time With Kexec""

Topics: Kexec

People: Andy Pfiffer, Timothy D. Witham

Andy Pfiffer announced:


Reducing System Reboot Time With kexec

kexec is a developing feature for Linux 2.5.x that allows an x86 Linux kernel to load and run another kernel instead of the platform BIOS and bootloader. By skipping the platform BIOS during a reboot, kexec can reduce downtime in enterprise class systems, and reduce turn-around time for Linux kernel developers. This paper presents measurements of boot time reduction through the use of kexec.

Timothy D. Witham replied:

So questions and comments are being accepted?

The actual values from the measurements in an appendix would be helpful. Including the boot time breakdown for the 8 way.

On your chart, the time saved column is distracting to me as it is extra data. On the relative percentage column if it could be kexec/full boot that would make it so that I wouldn't have to go back to the text to understand the column. Also on the kernel boot time, I think that you are talking about the kernel init time. So why not call it that?

On future and ongoing work.

The crash dump seems to be orthogonal to fast booting. I would like to see future and ongoing work that applies to fast booting.

Andy thanked him for the feedback. To Timothy's objection to the organization of the data columns, Andy replied, "Fair enough. The 8-way measurements aren't currently available because that specific system has been assigned to another OSDL Lab Associate." Also on the question of 'boot time' versus 'init time', Andy said, "I was hoping to avoid dancing around the semantics and interpretation of "booted" vs. "initialized". I guess that didn't work as well as I had hoped. ;^)"

18. Finding Patches Using ChangeLog Information

25 Apr 2003 - 26 Apr 2003 (16 posts) Archive Link: "ChangeLog suggestion"

Topics: USB, Version Control

People: Zack Brown, Linus Torvalds, Alan Cox, Arjan van de Ven, John Bradford, Andrew Morton

Zack Brown (I) asked:

Linus, Marcelo,

In each changelog entry, it would be really useful to include the Message-ID of that email in a regex-parsable location. This way, if the email was cced to lkml it would be possible for folks to track down the actual patch.

I'm not familiar with your scripts, but I'd be surprised if this were very difficult to implement. At the same time, there are many cases of changelog entries that read only 'USB' or something equally unhelpful, where there is little chance that anyone could track down the corresponding patch. Having the Message-ID in those cases would make all the difference in the world.
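To show what a "regex-parsable location" could mean in practice, here is a minimal sketch using the POSIX regex API. It assumes a convention in which the changelog entry carries a line of the form "Message-ID: <...>"; that convention and the `extract_message_id` helper are hypothetical, since no format was specified in the thread.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical convention: the changelog entry contains a line such as
 *     Message-ID: <id@host>
 * Copies the id (without the angle brackets) into out.
 * Returns 0 on success, -1 if no id is found or out is too small. */
static int extract_message_id(const char *entry, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t match[2];
    size_t len;
    int result = -1;

    if (regcomp(&re, "Message-ID: <([^>]+)>", REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, entry, 2, match, 0) == 0) {
        /* match[1] is the parenthesized group: the id itself. */
        len = (size_t)(match[1].rm_eo - match[1].rm_so);
        if (len < out_size) {
            memcpy(out, entry + match[1].rm_so, len);
            out[len] = '\0';
            result = 0;
        }
    }

    regfree(&re);
    return result;
}
```

An entry that reads only "USB" yields -1, while the same entry followed by a Message-ID line yields an id that can be looked up in a list archive.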

Linus Torvalds replied:

Well, the scripts can take it, but quite frankly I'd rather not clutter the changelogs up with crud that really doesn't matter.

The thing is, the stuff that already _is_ in the changelog is certainly enough to identify the email if you just have a reasonable search engine. You have author, comments and diff, and if that isn't enough to identify the thing then something is wrong.

Also, _most_ of the patches by far end up coming as personal emails, and while they have often shown up on linux-kernel in _some_ way, it won't be the same email that got sent to me. The email that showed up on the mailing list will often have been of the type "please test this out and if it works for you I'll send it to Linus", or it will have been posted by the original author and then the actual patch made it to me either through somebody else's BK tree _or_ through a person like Andrew Morton or Alan Cox.

In other words, what you ask for is ugly (yes, I actually look at the output of "bk changes") _and_ not very useful.

Elsewhere and before Linus' reply, John Bradford said to Zack that instead of a Message-ID, it might be better to have a URL linking directly to the archived patch. This seemed feasible to him, since BitKeeper generated the changelogs itself. Arjan van de Ven pointed out that connecting patches to changelog entries was already fairly trivial, via the bk-commits mailing lists for 2.4 and 2.5 development. Linus also replied to John, saying that actually, it would be quite difficult to isolate a URL corresponding to a particular patch. He said:

yes, the changelogs are generated by BitKeeper, but what gets fed into bitkeeper is controlled by some scripts I wrote, which are the ones that take the email and munge it into a readable format etc. So by the time the thing hits my BK repository, the email headers will all have been thrown away, except for "From: " and "Subject: ". So BK never sees the full email.

(Even my scripts don't see the full email a large percentage of the time: I end up prettifying the emails for actual application by first removing things like "Hi Linus, please apply this" etc which are pointless in the changelog).

This and Linus' other post made sense to John, and he agreed that there seemed to be no sensible way to provide a URL to each patch in the changelogs.

19. Open POSIX Test Suite 1.0.0

30 Apr 2003 (1 post) Archive Link: "Open POSIX Test Suite 1.0.0"

Topics: POSIX, Scheduler

People: Rolla N. Selbak, Robert Williamson

Rolla N. Selbak said:

Release 1.0.0 of the Open POSIX Test Suite is now available at

This release contains complete core POSIX conformance interface tests for: Signals, Message queues, Semaphores, Timers, <sched.h> process scheduling. Threads core tests 95% complete.

It also contains bug fixes from 0.9.0. The release notes that appear on download describe how to compile and run these tests. The file QUICK-START is a quick and practical way to get started building and running the tests. The README page and the Open POSIX Test Suite website (above) will give more information on the project goals and progress as well as information on how to contribute or contact us if you are interested.

Many thanks to Jerome Marchand, Robert Williamson and other members of the POSIX testing community for their bug fixes, patches, and suggestions on how to improve the 0.9.0 suite.

The Open POSIX Test Suite is an open source test suite with the goal of creating conformance test suites, as well as potentially functional and stress test suites, to the functions described in the IEEE Std 1003.1-2001 System Interfaces specification. Initial work is focusing on timers, threads, semaphores, signals, and message queues. Feel free to contact if you would like further information.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.