Kernel Traffic #278 For 19 Oct 2004

By Zack Brown

Table Of Contents

Mailing List Stats For This Week

We looked at 2931 posts in 16130K.

There were 545 different contributors. 307 posted more than once. 243 posted last week too.

The top posters of the week were:

1. SmartLink Almost GPLs Modem Driver Code

9 Sep 2004 - 22 Sep 2004 (31 posts) Archive Link: "GPL source code for Smart USB 56 modem (includes ALSA AC97 patch)"

Topics: Modems, PCI, Sound: ALSA, USB, Version Control

People: Luke Kenneth Casson Leighton, Theodore Ts'o, Sasha Khapyorsky, Mikael Pettersson, Erik Mouw, Jaroslav Kysela

Luke Kenneth Casson Leighton was thrilled to discover that SmartLink had published a GPLed driver for their smart USB 56K modem. They also provided a PCI version, as well as an AC97 ALSA driver, all GPLed. He remarked, "this PCI ALSA driver is based on the i8x0 / MX 440 modem driver, by Jaroslav Kysela." He also added, "the swansmart usb 56k modem is dirt cheap (it was available in the uk six months ago for about $39), and is extremely popular in australia and the far east." Theodore Ts'o remarked:

It's mostly GPL'ed, but there are binary-only objects both in the user-mode daemon (modem/dsplibs.o) and in the kernel driver (drivers/amrlibs.o).

The good news is that there is a completely GPL'ed, source-complete driver already in the 2.6 kernel, sound/pci/intel8x0m.c, which will work with the user-mode daemon found in the distribution. This driver doesn't have all of the functionality of the slamr driver (which requires the proprietary, binary-only object file) --- most notably, ATM1 doesn't work when using the completely open-source intel8x0m driver. However, it does work just fine, and so as long as you don't mind using the proprietary object file in user-space, it's a great solution. I've been using the smlink daemon with both the open-source and partial-proprietary driver, and both work just fine on my T40 laptop.

Erik Mouw echoed this, and Luke was deeply disappointed. He sent some email to SmartLink, asking if they'd be willing to license the full source under the GPL. Sasha Khapyorsky from SmartLink replied, saying that "The final goal is to replace proprietary slamr driver completely." Mikael Pettersson said, "I hope you succeed with open-sourcing all of slmodem's driver code. My Targa Athlon64 laptop has the AMR thingy and the 32-bit x86 binary only slmodem driver prevents me from using the modem while running a 64-bit kernel." Sasha replied, "You mean to GPL user-space program slmodemd? I think it is good idea, but unfortunately this code is not just my, and final decision was 'no'." But Mikael explained, "No, I meant the 'slamr' kernel driver module, which is built from a big binary-only library (amrlibs.o) and a small amount of kernel glue source code. As long as amrlibs.o is distributed only as a 32-bit x86 binary, I won't be able to use it with a 64-bit amd64 kernel. slmodemd is not the problem since an amd64 kernel can support 32-bit x86 user-space binaries." Sasha replied, "This is exactly that was discussed - 'slamr' is going to be replaced by ALSA drivers. I don't know which modem you have, but recent ALSA driver (CVS version) already supports ICH, SiS, NForce (snd-intel8x0m), ATI IXP (snd-atiixp-modem) and VIA (snd-via82xx-modem) AC97 modems."

2. Real-Time LSM (Linux Security Module)

11 Sep 2004 - 20 Sep 2004 (22 posts) Archive Link: "[PATCH] Realtime LSM"

Topics: Real-Time

People: Lee Revell, Jack O'Quin

Lee Revell said, "The realtime-lsm Linux Security Module, written by Torben Hohn and Jack O'Quin, selectively grants realtime capabilities to specific user groups or applications. The typical use for this is low latency audio, and the patch has been extensively field tested by Linux audio users. The realtime LSM is a major improvement in security over the 2.4 capabilities patch and other workarounds like jackstart, which rely on CAP_SETPCAP." A bunch of folks dove in, with various technical comments and criticisms.

3. Status Of BKL (Big Kernel Lock); Some Comparison With FreeBSD

15 Sep 2004 - 18 Sep 2004 (32 posts) Archive Link: "[patch] remove the BKL (Big Kernel Lock), this time for real"

Topics: BSD: FreeBSD, FS: procfs, SMP

People: Ingo Molnar, Linus Torvalds, William Lee Irwin III, Andi Kleen, Bill Davidsen, Bill Huey, David S. Miller, Andrew Morton

The quest to remove the BKL (big kernel lock) has been ongoing for quite some time. Ingo Molnar said:

the attached patch is a new approach to get rid of Linux's Big Kernel Lock as we know it today.

The trick is to turn the BKL spinlock + depth counter into a special type of cpu-affine, recursive semaphore, which gets released by schedule() but not by preempt_schedule().

this gives the following advantages:

code using lock_kernel()/unlock_kernel() will see very similar semantics as they got from the BKL, so correctness should be fully preserved. Per-CPU assumptions still work, locking exclusion and lock-recursion still works the same way as it did with the BKL.

non-BKL code sees no overhead from this approach. (other than the slightly smaller code due to the uninlining of the BKL APIs.)

(the patch is against vanilla 2.6.9-rc2. I have tested it on x86 UP+PREEMPT, UP+!PREEMPT, SMP+PREEMPT, SMP+!PREEMPT and x64 SMP+PREEMPT.)

But Linus Torvalds replied:

I really think this is wrong.

Maybe not from a conceptual standpoint, but that implementation with the scheduler doing "reacquire_kernel_lock()" and doing a down() there is just wrong, wrong, wrong.

If we're going to do a down() and block immediately after being scheduled, I don't think we should have been picked in the first place.

Yeah, yeah, you have all that magic to not recurse by setting lock-depth negative before doing the down(), but it still feels fundamentally wrong to me. There's also the question whether this actually _helps_ anything, since it may well just replace the spinning with lots of new scheduler activity.

And you make schedule() a lot more expensive for kernel lock holders by copying the CPU map. You may have tested it on a machine where the CPU map is just a single word, but what about the big machines?

Spinlocks really _are_ cheaper. Wouldn't it be nice to just continue removing kernel lock users and keeping a very _simple_ kernel lock for legacy issues?

In other words, I'd _really_ like to see some serious numbers for this.

William Lee Irwin III remarked of Ingo's patch, "One thing I like is that this eliminates the implicit dropping on sleep as a source of bugs (e.g. it was recently pointed out to me that setfl() uses the BKL to protect ->f_flags in a codepath spanning a sleeping call to ->fasync()), where the semaphore may be retained while sleeping. I originally wanted to make sleeping under the BKL illegal and sweep users to repair when it is, but maybe that's unrealistic, particularly considering that the sum of my BKL sweeps to date is one removal from procfs "protecting" its access of nr_threads." Ingo also defended his approach, agreeing with Linus that numbers would be a good measure of the patch's value.

Elsewhere, Andi Kleen said of Ingo's original post, "Interesting approach. Did you measure what it does to context switch rates? Usually adding semaphores tends to increase them a lot." Bill Davidsen replied:

Is that (necessarily) a bad thing? If it results in less time waiting for BKL, and/or more time doing user work, that may drive throughput and responsiveness up. It depends if the time for two ctx is greater or less than the spin time on BKL.

It would be nice to have the best of both worlds, use the semaphore if there is a process on the run queue, and spin if not. That sounds complex, and hopefully not worth the effort.

Bill Huey replied, "FreeBSD-current uses adaptive mutexes. However they spin on that mutex only if the thread owning it is running across another CPU at that time, otherwise it sleeps, maybe priority inherited depending on the circumstance." And David S. Miller remarked, "This is how Solaris MUTEX objects work too." Bill H. replied:

FreeBSD can be considered a Solaris style kernel. In contrast, I think the Linux community has a few things up on FreeBSD/Solaris style SMP. Specifically, the FreeBSD community has ignored a lot of the really hard work of pushing down locks in favor of "getting fancier locks", which only abuses thread priorities and the scheduler. A large part of it is because they have created a very complicated SMP infrastructure that less than a handful of their kernel engineers really know how to use, 2-3, it seems.

Judging from how the Linux code is done and the numbers I get from Bill Irwin in casual conversation, the Linux SMP approach is clearly the right track at this time with its hand-honed per-CPU awareness of things.

And David replied, "This is what Linus proclaimed 6 or 7 years ago when people were trying to convince us to do things like Solaris and other big Unixes at the time." Bill H. said, "FreeBSD's SMPng project is stalled for the most part and developers that disagree with that approach have moved on to the DragonFly BSD community. It has a much more top-down driven locking system that's conceptually CPU local called tokens, effectively deadlock free and difficult to misuse. It's already been able to multi-thread the networking stack using lock-less techniques, while the FreeBSD-current tree had to retract their "all or nothing" approach with threading their network stack. Jeffrey Hsu is the main developer pushing that subsystem."

Completely elsewhere, after Ingo had posted several revisions of his original patch, Linus admitted that, although he didn't "love it to death," he recommended putting it into Andrew Morton's -mm tree and seeing if anything shook out.

4. inotify 0.9 Released Against Kernel

15 Sep 2004 - 20 Sep 2004 (21 posts) Archive Link: "[RFC][PATCH] inotify 0.9"

Topics: Big O Notation, Ioctls, Real-Time

People: John McCutchan, Robert Love

John McCutchan said:

I am releasing a new version of inotify. Attached is a patch for

I am interested in getting inotify included in the mm tree.

Inotify is designed as a replacement for dnotify. The key differences are that inotify does not require the file to be opened to watch it; that what you are watching can go away (if the path is unmounted), in which case you will be sent an event telling you it is gone; and that events are delivered over a fd, not by using signals.

New in this version:

Driver now supports reading more than one event at a time
Bump maximum number of watches per device from 64 to 8192
Bump maximum number of queued events per device from 64 to 256


I have been asked what the complexity of inotify is. Inotify has 2 code paths where complexity could be an issue:

Adding a watcher to a device
This code has to check if the inode is already being watched by the device, this is O(1) since the maximum number of devices is limited to 8.

Removing a watch from a device
This code has to do a search of all watches on the device to find the watch descriptor that is being asked to remove. This involves a linear search, but should not really be an issue because it is limited to 8192 entries. If this does turn in to a concern, I would replace the list of watches on the device with a sorted binary tree, so that the search could be done very quickly.

The calls to inotify from the VFS code have a complexity of O(1), so inotify does not affect the speed of VFS operations.


The inotify data structures are light weight:

inotify watch is 40 bytes
inotify device is 68 bytes
inotify event is 272 bytes

So assuming a device has 8192 watches, the structures are only going to consume 320KB of memory. With a maximum number of 8 devices allowed to exist at a time, this is still only 2.5 MB

Each device can also have 256 events queued at a time, which sums to 68KB per device. And only .5 MB if all devices are opened and have a full event queue.

So approximately 3 MB of memory are used in the rare case of everything open and full.

Each inotify watch pins the inode of a directory/file in memory; the size of an inode is different per file system, but let's assume that it is 512 bytes.

So assuming the maximum number of global watches are active, this would pin down 32 MB of inodes in the inode cache. Again not a problem on a modern system.

On smaller systems, the maximum watches / events could be lowered to provide a smaller foot print.

Older release notes: I am resubmitting inotify for comments and review. Inotify has changed drastically from the earlier proposal that Al Viro did not approve of. There is no longer any use of (device number, inode number) pairs. Please give this version of inotify a fresh view.

Inotify is a character device that, when opened, offers 2 IOCTLs. (It actually has 4, but the other 2 are used for debugging.)

The first, INOTIFY_WATCH, takes a path and event mask and returns a unique (to the instance of the driver) integer (wd [watcher descriptor] from here on) that is a 1:1 mapping to the path passed. What happens is inotify gets the inode (and refs the inode) for the path and adds an inotify_watcher structure to the inode's list of watchers. If this instance of the driver is already watching the path, the event mask will be updated and the original wd will be returned.

The second takes an integer (that you got from INOTIFY_WATCH) representing a wd that you are no longer interested in watching. This will:

send an IGNORE event to the device
remove the inotify_watcher structure from the device and from the inode
unref the inode

After you are watching 1 or more paths, you can read from the fd and get events. The events are struct inotify_event. If you are watching a directory and something happens to a file in the directory the event will contain the filename (just the filename not the full path).

Aside from the inotify character device driver, the changes to the kernel are very minor.

The first change is adding calls to inotify_inode_queue_event and inotify_dentry_parent_queue_event from the various vfs functions. This is identical to dnotify.

The second change is more serious: it adds a call to inotify_super_block_umount inside generic_shutdown_superblock. What inotify_super_block_umount does is:

find all of the inodes that are on the super block being shut down
send each watcher on each inode the UNMOUNT and IGNORED events
remove the watcher structures from each instance of the device driver and each inode
unref the inode

I have tested this code on my system for over three weeks now and have not had problems. I would appreciate design review, code review and testing.

Robert Love added:

I want to expand on why dnotify is awful and why inotify is a great replacement, because dnotify's limitations are really showing up on modern desktop systems.

Some technical issues with dnotify and why inotify solves the problem:

I have been going over the code for awhile now, and it looks good. I would really like to hear Al's opinion so we can move on fixing any possible issues that he has.

There was not a universally positive response to the patch. Very little constructive discussion took place, but it was clear that some folks consider John's approach a bit bloated.

5. Stricter I/O Typechecking In 2.6

15 Sep 2004 - 18 Sep 2004 (35 posts) Archive Link: "Being more anal about iospace accesses.."

Topics: PCI, Serial ATA

People: Linus Torvalds, Jörn Engel, Roland Dreier, Deepak Saxena, David Woodhouse, Jeff Garzik

Linus Torvalds said:

This is a background mail mainly for driver writers and/or architecture people. Or others that are just interested in really low-level hw access details. Others - please feel free to ignore.

[This has been discussed to some degree already on the architecture mailing lists and obviously among the people who actually worked on it, but I thought I'd bounce it off linux-kernel too, in order to make people more aware of what the new type-checking does. Most people may have seen it as only generating a ton of new warnings for some crufty device drivers.]

The background for this iospace type-checking change is that we've long had some serious confusion about how to access PCI memory mapped IO (MMIO), mainly because on a PC (and some non-PC's too) that IO really does look like regular memory, so you can have a driver that just accesses a pointer directly, and it will actually work on most machines.

At the same time, we've had the proper "accessor" functions (read[bwl](), write[bwl]() and friends) that on purpose dropped all type information from the MMIO pointer, mostly just because of historical reasons, and as a result some drivers didn't use a pointer at all, but some kind of integer. Sometimes even one that couldn't _fit_ a MMIO address in it on a 64-bit machine.

In short, the PCI MMIO access case was largely the same as the user pointer case, except the access functions were different (readb vs get_user) and they were even less lax about checking for sanity. At least the user access code required a pointer with the right size.

We've been very successful in annotating user pointers, and that found a couple of bugs, and more importantly it made the kernel code much more "aware" of what kind of pointer was passed around. In general, a big success, I think. And an obvious example for what MMIO pointers should do.

So lately, the kernel infrastructure for MMIO accesses has become a _lot_ more strict about what it accepts. Not only do the MMIO access functions want a real pointer (which is already more type-checking than we did before, and causes gcc to spew out lots of warnings for some drivers), but as with user pointers, sparse annotations mark them as being in a different address space, and building the kernel with checking on will warn about mixing up address spaces. So far so good.

So right now the current snapshots (and 2.6.9-rc2) have this enabled, and some drivers will be _very_ noisy when compiled. Most of the regular ones are fine, so maybe people haven't even noticed it that much, but some of them were using things like "u32" to store MMIO pointers, and are generally extremely broken on anything but an x86. We'll hopefully get around to fixing them up eventually, but in the meantime this should at least explain the background for some of the new noise people may see.

Perhaps even more interesting is _another_ case of driver, though: one that started warning not because it was ugly and broken, but because it did something fairly rare but something that does happen occasionally: it mixed PIO and MMIO accesses on purpose, because it drove hardware that literally uses one or the other.

Sometimes such a "mixed interface" driver does it based on a compile option that just #defines 'writel()' to 'inl()', sometimes it's a runtime decision depending on the hardware or configuration.

The anal typechecking obviously ended up being very unhappy about this, since it wants "void __iomem *" for MMIO pointers, and a normal "unsigned long" for PIO accesses. The compile-time option could have been easily fixed up by adding the proper cast when re-defining the IO accessor, but that doesn't work for the dynamic case.

Also, the compile-time switchers often really _wanted_ to be dynamic, but it was just too painful with the regular Linux IO interfaces to duplicate the code and do things conditionally one way or the other.

To make a long story even longer: rather than scrapping the typechecking, or requiring drivers to do strange and nasty casts all over the place, there's now a new interface in town. It's called "iomap", because it extends the old "ioremap()" interface to work on the PIO accesses too.

That way, the drivers that really want to mix both PIO and MMIO accesses can very naturally do it: they just need to remap the PIO space too, the same way that we've required people to remap the MMIO space for a long long time.

For example, if you don't know (or, more importantly - don't care) what kind of IO interface you use, you can now do something like

void __iomem * map = pci_iomap(dev, bar, maxbytes);
status = ioread32(map + DRIVER_STATUS_OFFSET);

and it will do the proper IO mapping for the named PCI BAR for that device, regardless of whether the BAR was an IO or MEM mapping. Very convenient for cases where the hardware might expose its IO window in either (or sometimes both).

Nothing in the current tree actually uses this new interface, although Jeff has patches for SATA for testing (and they clean up the code quite noticeably, never mind getting rid of the warnings). The interface has been implemented by yours truly for x86 and ppc64, and David did a first-pass version for sparc64 too (missing the "xxxx_rep()" functions that were added a bit later, I believe).

So far experience seems to show that it's a very natural interface for most non-x86 hardware - they all tend to map both PIO and MMIO into one address space _anyway_, so the two aren't really any different. It's mainly just x86 and its ilk that actually have two different interfaces for the two kinds of PCI accesses, and at least in that case it's trivial to encode the difference in the virtual ioremap pointer.

The best way to explain the interface is to just point you guys at the <asm-generic/iomap.h> file, which isn't very big, has about as many comments as code, and contains nothing but the necessary function declarations. The actual meaning of the functions should be pretty obvious even without the comments.

Feel free to flame or discuss rationally,

Jörn Engel was a bit alarmed by Linus' use of void pointer arithmetic in his code example. He said, "C now supports pointer arithmetic with void*? I hope the width of a void is not architecture dependent, that would introduce more subtle bugs." Jeff Garzik and others pointed out that this was a GCC extension and had been used in the kernel for a long time. Roland Dreier also said, "However, I somewhat agree -- it's ugly for drivers to rely on this and do arithmetic on void *. It should be OK for a driver to use char __iomem * for its IO base if it needs to add in offsets, right?" Linus replied:

"char __iomem *" will certainly work - all the normal pointer conversions are ok. Some people in fact use pointers to structures in MMIO space, and this is quite reasonable when working with a chip that uses "mailboxes" for commands.

However, I disagree with "void *" arithmetic being ugly. It's a very nice feature to have a pointer that can be validly cast to any other type, and that is the whole _point_ of "void *". The fact that C++ got that wrong is arguably the worst failing of the language, causing tons of unnecessary casts that can silently hide real bugs (maybe the thing you cast wasn't a "void *" in the first place, but you'll never know - the compiler will do the cast for you).

For example, to go back to the mailbox example, let's say that your hardware has an IO area that is 8kB in size, with the last 4kB being mailboxes.

The _sane_ way to do that is to do

void __iomem *base_io = ioremap(...);
struct mailbox __iomem *mbox = base_io + MAILBOX_OFFSET;

and then just work on that.

In contrast, having to cast to a "char *" in order to do arithmetic, and then casting back to the resultant structure type pointer is not only ugly and unreadable, it's a lot more prone to errors as a result.

In other words, think of "void *" as a pointer to storage. Not "char" (which is the C name for a signed byte), but really, it's the pointer to whatever underlying memory there is. And a _fundamental_ part of such memory is the fact that it is addressable. Thus "pointer to storage arithmetic" really does make sense on a very fundamental level. It has nothing to do with C types, and that also explains why "void *" silently converts to anything else. It's a very internally consistent world-view.

Now, I disagree with gcc when it comes to actually taking the "size" of void. Gcc will silently accept

void *x;
x = malloc(sizeof(*x));

which I consider to be an abomination (and the above _can_ happen, quite easily, as part of macros for doing allocation etc - nobody would write it in that form, but if you have a "MEMALLOC(x)" macro that does the sizeof, you could end up trying to feed the compiler bogus code).

The fact that you can do arithmetic on typeless storage does _not_ imply that typeless storage would have a "size" in my book.

So sparse will say:

warning: cannot size expression

and refuse to look at broken code like the above. But hey, the fact that I have better taste than anybody else in the universe is just something I have to live with. It's not easy being me.

Elsewhere, Deepak Saxena asked, "Since we are on the subject of io-access, I would like a clarification/opinion on the read*/write* & in*/out* accessors (and now the ioread/write equivalents). Are these functions only meant to be used for PCI memory-mapped devices or _any_ memory mapped devices? Same with ioremap(). I ask because there are bits of code in the kernel that use these on non-PCI devices and this sometimes causes some complication in platform-level code." Linus replied:

It really depends on the bus architecture.

At some point, if the bus is different enough from a "normal" setup, you should just use your own accessor functions. Trying to overload "readl/writel" is just too painful.

However, at that point you should also realize that you can't re-use _any_ of the existing chip drivers, and you'll have to write your own. If the bus is exotic enough, that's not a problem, and you'd have to do that anyway. But there really aren't all that many "exotic" buses around any more.

Quite frankly, of your two suggested interfaces, I would select neither. I'd just say that if your bus is special enough, just write your own drivers, and use

cookie = ixp4xx_iomap(dev, xx);
ixp4xx_iowrite(val, cookie + offset);

which is perfectly valid. You don't have to make these devices even _look_ like a PCI device. Why should you?

Deepak replied, "some of those devices are not that special. For example, the on-board 16550 is accessed using readb/writeb in the 8250.c driver. I don't think we want to add that level of low-level detail to that driver and instead should just hide it in the platform code. I look at it from the point of view that the driver should not care about how the access actually occurs on the bus. It just says, write data foo at location bar regardless of whether bar is ISA, PCI, on-chip, RapidIO, etc and that writing of the data is hidden in the implementation of the accessor API." Linus did not reply to this.

Elsewhere, Roland asked, "while we're on the subject of new sparse checks, could you give a quick recap of the semantics of the new __leXX types (and what __bitwise means to sparse)? I don't think I've ever seen this stuff described on LKML." Linus replied:

[The bitwise checks are actually by Al Viro, but I'll explain the basic idea. Al is Cc'd so that he can add any corrections or extensions.]

Sparse allows a number of extra type qualifiers, including address spaces and various random extra restrictions on what you can do with them. There are "context" bits that allow you to use a symbol or type only in certain contexts, for example, and there are type qualifiers like "noderef" that just say that a pointer cannot be dereferenced (it looks _exactly_ like a pointer in all other respects, but trying to actually access anything through it will cause a sparse warning).

The "bitwise" attribute is very much like the "noderef" one, in that it restricts how you can use an expression of that type. Unlike "noderef", it's designed for integer types, though. In fact, sparse will refuse to apply the bitwise attribute to non-integer types.

As the name suggests, a "bitwise" expression is one that is restricted to only certain "bitwise" operations that make sense within that class. In particular, you can't mix a "bitwise" class with a normal integer expression (the constant zero happens to be special, since it's "safe" for all bitwise ops), and in fact you can't even mix it with _another_ bitwise expression of a different type.

And when I say "different", I mean even _slightly_ different. Each typedef creates a type of its own, and will thus create a bitwise type that is not compatible with anything else. So if you declare

int __bitwise i;
int __bitwise j;

the two variables "i" and "j" are _not_ compatible, simply because they were declared separately, while in the case of

int __bitwise i, j;

they _are_ compatible. The above is a horribly contrived example, as it shows an extreme case that doesn't make much sense, but it shows how "bitwise" always creates its own new "class".

Normally you'd always use "__bitwise" in a typedef, which effectively makes that particular typedef one single "bitwise class". After that, you can obviously declare any number of variables in that class.

Now apart from the classes having to match, "bitwise", as its name suggests, also restricts all operations within that class to a subset of "bit-safe" operations. For example, addition isn't "bit-safe", since clearly the carry-chain moves bits around. But you can do normal bit-wise operations, and you can compare the values against other values in the same class, since those are all "bit-safe".

Oh, as an example of something that isn't obviously bit-safe: look out for things like bit negation: doing a ~ is ok on a bitwise "int" type, but it is _not_ ok on a bitwise "short" or "char". Why? Because on a bitwise "int" you actually stay within the type. But doing the same thing on a short or char will move "outside" the type by virtue of setting the high bits (normal C semantics: a short gets promoted to an "int", so doing a bitwise negation on a short will actually set the high bits).

So as far as sparse is concerned, a "bitwise" type is not really so much about endianness as it is about making sure bits are never lost or moved around.

For example, you can use the bitwise operation to verify the __GFP_XXX mask bits. Right now they are just regular integers, which means that you can write

kmalloc(GFP_KERNEL, size);

and the compiler will not notice anything wrong. But something is _seriously_ wrong: the GFP_KERNEL should be the _second_ argument. If we mark it to be a "bitwise" type (which it is), that bug would have been noticed immediately, and you could still do all the operations that are valid of GFP_xxx values.

See the usage?

In the byte-order case, what we have is:

typedef __u16 __bitwise __le16;
typedef __u16 __bitwise __be16;
typedef __u32 __bitwise __le32;
typedef __u32 __bitwise __be32;
typedef __u64 __bitwise __le64;
typedef __u64 __bitwise __be64;

and if you think about the above rules about what is acceptable for bitwise types, you'll likely immediately notice that it automatically means


In short, "bitwise" is about more than just byte-order, but the semantics of bitwise-restricted ops happen to be the semantics that are valid for byte-order operations too.

Oh, btw, right now you only get the warnings from sparse if you use "-Wbitwise" on the command line. Without that, sparse will ignore the bitwise attribute.

David Woodhouse replied:

Yeah right, that latter case is _so_ much more readable, and makes it _so_ easy for the compiler to optimise precisely when it wants to do the byte-swapping, especially if the back end has load-and-swap or store-and-swap instructions. :)

It's even nicer when it ends up as:

sum = cpu_to_le16(le16_to_cpu(a) + le16_to_cpu(b)); /* Ok */
sum |= c;
sum = cpu_to_le16(le16_to_cpu(sum) + le16_to_cpu(d));

I'd really quite like to see the real compiler know about endianness, too. I dare say I _could_ optimise the above (admittedly contrived but not _so_ unlikely) case, but I don't _want_ to hand-optimise my code -- that's what I keep a compiler _for_.

Linus replied:

It's not about readability.

It's about the first case being WRONG!

You can't add two values in the wrong byte-order. It's not an operation that makes sense. You _have_ to convert them to CPU byte order first.

I certainly agree that the first version "looks nicer".

Regarding David's posted snippet, Linus went on:

This is actually the strongest argument _against_ hiding endianness in the compiler, or hiding it behind macros. Make it very explicit, and just make sure there are tools (ie 'sparse') that can tell you when you do something wrong.

Any programmer who sees the above will go "well that's stupid", and rewrite it as something saner instead. You can certainly rewrite it as

cpu_sum = le16_to_cpu(a) + le16_to_cpu(b);
cpu_sum |= le16_to_cpu(c);
cpu_sum += le16_to_cpu(d);
sum = cpu_to_le16(cpu_sum);

which gets rid of the double conversions.

But if you hide the endianness in macros, you'll never see the mess at all, and won't be able to fix it.

And regarding the compiler having knowledge of endianness, he added:

I would have agreed with you some time ago. Having been bitten by too damn many compiler bugs I've become convinced that the compiler doing things behind your back to "help" you just isn't worth it. Not in a kernel, at least. It's much better to build up good typechecking and the infrastructure to help you get the job done.

Expressions like the above might happen once or twice in a project with several million lines of code. It's just not worth compiler infrastructure for - that just makes people use it as if it is free, and impossible to find the bugs when they _do_ happen. Much better to have a type system that can warn about the bad uses, but that doesn't actually change any of the code generated.
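For readers following along at home, the le16 helpers used in the snippets above can be sketched portably in plain C. This is a hedged stand-in, not the kernel's implementation (the real versions live in the asm/byteorder headers and compile away entirely on little-endian machines); the __bitwise annotation is dropped so the sketch builds with any C compiler.

```c
/* Portable sketch of the le16 conversion helpers; not kernel code.
 * "le16" here would be "__u16 __bitwise __le16" in the kernel. */
#include <stdint.h>

typedef uint16_t le16;

/* Read a little-endian 16-bit value byte-by-byte, so the result is
 * correct regardless of the host CPU's own byte order. */
static inline uint16_t le16_to_cpu(le16 v)
{
    const uint8_t *p = (const uint8_t *)&v;
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Store a CPU-order value with the low byte first. */
static inline le16 cpu_to_le16(uint16_t v)
{
    le16 out;
    uint8_t *p = (uint8_t *)&out;
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
    return out;
}
```

With these, the pattern Linus recommends round-trips as expected: convert to CPU order once, do the arithmetic, and convert back once at the end.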

6. New Maintainers Sought For kbd, man, man-pages, And util-linux

19 Sep 2004 - 22 Sep 2004 (44 posts) Archive Link: "OOM & [OT] util-linux-2.12e"

People: Andries Brouwer

Andries Brouwer said:

Just released (in /pub/linux-local/utils/util-linux): util-linux-2.12e.

The reason for this release were complaints that mount and umount OOM the kernel when the number of mounts is large. And indeed - I tried with 30000 mounts and the OOM-killer killed everything in sight, including X's console, making X exit, killing all remaining processes.

The new versions have been polished a little bit so as not to waste too much memory, and now survive the 30000 mount/umount test for me. Further polishing is needed for the case of large numbers of mounts; when /etc/mtab is not a symlink to /proc/mounts then umount -a has quadratic behaviour (it updates mtab after each unmount) and that gets terribly slow.

About OOM: I am still of the opinion that the default state of the kernel must be one where OOM does not occur and malloc() tells us that we are out of memory. A system that suddenly decides to kill all processes is really very poor and unreliable. Users can enable other behaviours if they don't care about reliability.

About mount: I wondered whether I should rewrite [u]mounts's handling of /etc/mtab so as to be a bit faster. But it seems a waste of time - /proc/mounts has many advantages: automatically up-to-date, correct also when namespaces are used, much faster. On the other hand, /etc/mtab contains mount options that are sometimes needed later. If it were possible to store the mount options in the kernel, making them visible in /proc/mounts, then we could forget /etc/mtab altogether.

People have asked repeatedly for a way to mark lines in /etc/fstab so as to make clear that such lines are managed by some GUI or other external program. Labels like "kudzu". In this release I added a comment convention for /etc/fstab: options can have a part starting with ';' - that part is ignored by mount but can be used by other programs managing fstab.

If we would put the mount options in /proc/mounts, and introduced a comment convention (say, the part starting with ':' is ignored by the kernel but can be used by programs reading /proc/mounts), then /etc/mtab can die. Comments? Better solutions?

About util-linux and stuff: I have maintained various packages for ten years or so - it may be time to pass things on to someone else. Write to if you are interested in taking over or co-maintaining kbd or man or man-pages or util-linux.

There was a medium-to-lengthy technical discussion, but no one responded publicly to his request for new maintainers.

7. ACPI SysFS Interface And Documentation

20 Sep 2004 - 21 Sep 2004 (15 posts) Archive Link: "[PATCH/RFC] exposing ACPI objects in sysfs"

Topics: FS: sysfs, Power Management: ACPI, Version Control

People: Alex WilliamsonPavel MachekAndi KleenAndrew Morton

Alex Williamson said:

I've lost track of how many of these patches I've done, but here's the much anticipated next revision ;^) The purpose of this patch is to expose ACPI objects in the already existing namespace in sysfs (/sys/firmware/acpi/namespace/ACPI). There's a lot of information currently available in ACPI namespace, but no way to get at it from userspace. What's new in this version:

Changes to existing kernel code are pretty trivial now. The major change is adding open() and release() functions to the sysfs bin_file support. This allows backing store on a per-open basis, and eliminates multiple reader/writer problems. Besides, it seems reasonable for a file entry to be able to have a little more control over its private_data structure.

The other generic kernel change is to export acpi_os_allocate(). This is because I chose to use acpi_buffers for internal management and wanted a consistent alloc/free interface for them. I'd be happy to separate these into individual patches if they're acceptable.

I'll try to make my debug utility available shortly so people can poke around on their systems and see what's available. For a lot of things, using xxd to dump the object provides some info and is sufficient for _ON/_OFF type methods. Let me know if you have any feedback or bug reports. Patch is against current bitkeeper, but should apply against almost anything recent. Thanks,

Pavel Machek suggested adding some stuff to the /Documentation directory, and Alex agreed with this. A couple of hours later, he posted a documentation patch to /Documentation/acpi/acpi_sysfs, explaining the ACPI interface through SysFS. Andi Kleen and Pavel offered some technical criticisms, and the three of them probably went to private email with Andrew Morton to hash out some details.

8. hotplug Scripts Version 20040920 Released

20 Sep 2004 (1 post) Archive Link: "[ANNOUNCE] 2004-09-20 release of hotplug scripts"

Topics: Backward Compatibility, Hot-Plugging

People: Greg KH

Greg KH said:

I've just packaged up the latest Linux hotplug scripts into a release, available for download as a gzip tarball or, for those who like them, as a bz2 package.

It contains a lot of little bug fixes, and the addition of the isapnp.rc support.

The main web site for the linux-hotplug project contains lots of documentation on the whole linux-hotplug process.

The release is still backwards compatible with 2.4, so there is no need to worry about upgrading.

9. Year 9223372034708485227 Problem

22 Sep 2004 (5 posts) Archive Link: "year 9223372034708485227 problem"

Topics: FS: ReiserFS

People: Pavel Machek

Pavel Machek's brain exploded one day, and as he was picking up the broken shards, he noticed that the 2.4 Linux kernel has a 'year 9223372034708485227 problem'. According to his tests, on January 1, 9223372034708485227 all 2.4 systems will cease to process commands, and just give segfaults. He said, "I wonder how much damage it will do to my filesystems: touch foo seems to store the right year into reiserfs. I wonder if it is still there after reboot? No, it is not. That looks like a kernel bug :-)."







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.