Kernel Traffic

Kernel Traffic #245 For 14 Dec 2003

By Zack Brown


I'm going to be on vacation in Manhattan from December 19th until January 5th. If anyone knows of some nice geeky events in Manhattan during that time, please let me know.

This issue of Kernel Traffic is dedicated to Pat McGovern, the head of SourceForge, as well as to Chris Conrad, the SourceForge site code maintainer, and Adi Alurkar, the SourceForge DBA. These are some nice guys, and if you like SourceForge, you should let them know, since it will make them happy. I'd like to also dedicate this issue to Allan Cruse, for a really enjoyable evening we had once with Pat.

Special thanks go out to Vadim Lebedev for writing the script I use to create the Subject links from each summary to the mailing list discussion. The script as it stands looks like this:

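# Percent-encode spaces so the subject and author can be used in a URL.
SUBJ=`echo $SUBJ | sed -e "s/ /%20/g"`
AUTHOR=`echo $AUTHOR | sed -e "s/ /%20/g"`
# Escape ampersands as &amp; so the URL can be embedded in HTML.
URL=`echo $URL | sed -e "s/\\&/\\&amp;/g"`
echo $URL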

The script isn't perfect, but it is still very useful. If you try running it on some of the data from the summaries below, you'll see that the links I ended up using are different from the links the script returned. If anyone sees a way to directly generate the URLs I've used (or an equivalent), please let me know. Or if you're interested in working with Vadim to improve this version, let me know and I'll put you in touch with him.

Mailing List Stats For This Week

We looked at 1389 posts in 6841K.

There were 496 different contributors. 227 posted more than once. 162 posted last week too.

The top posters of the week were:

1. Status Of 2.4; Some Discussion Of Interface Stability In All Kernels

1 Dec 2003 - 6 Dec 2003 (203 posts) Archive Link: "Linux 2.4 future"

Topics: Backward Compatibility, Binary-Only Modules, FS: XFS, FS: autofs, Networking, SMP

People: Marcelo Tosatti, David S. Miller, Christoph Hellwig, Ian Kent, Peter C. Norton, Arjan van de Ven, Haris Peco, Jan-Benedict Glaw, Linus Torvalds

In keeping with the discussions covered in Issue #244, Section #5 (28 Nov 2003: Linux 2.4.23 Released; 2.4 Series Enters Deep Freeze) and Issue #244, Section #7 (30 Nov 2003: Status Of XFS In 2.4; More Evidence Of 2.4 Deep Freeze), Marcelo Tosatti announced:

The intention of this email is to clarify my position on 2.4.x future.

2.6 is becoming more stable each day, and we will hopefully see a 2.6.0 release during this month or January.

Having that mentioned, I intend to:

David S. Miller replied, "I think this is fine, 2.4.x really needs to go into super-maintenance mode whilst 2.6.x is being brought on stage."

Ian Kent asked if his AutoFS4 patches stood a chance of getting into the 2.4 tree. Christoph Hellwig asked what the patches were all about, and remarked, "if they aren't in 2.6 yet I don't think it makes sense trying to get them into 2.4 anymore at all." Ian acknowledged this, and said, "I have just finished porting them to 2.6 and will be attempting to get the help of autofs list inhabitants for initial testing in the next few days." Peter C. Norton also clarified, "Ian has lots of bugfixes and feature patches (like direct mounts) going to the autofs mailing list. Autofs4 has always had stability issues in 2.4.x, and it's been lacking in features. This makes myself and others run a bastard combination of amd, autofs and editing /etc/fstab to get "automounter" features even close to the solaris automounter. If these can go into 2.4, which will be "stable" and in use in lots of places for the next couple of years it could help by encouraging the distros to get behind autofs4 (hint hint, redhat, hint)." Christoph replied that it was a bit late for this stuff to get into 2.4, adding, "As for Red Hat: I'll bet the next Red Hat product will be based on a 2.6 kernel, as is fedora as their public beta testing community whizbang version." Arjan van de Ven echoed this, saying to Peter, "I suspect you'll have a really hard time finding ANY distro that still wants to actively develop new products on a 2.4 codebase."

Elsewhere, in a different train of thought, Haris Peco asked, "Is there linux-abi for 2.6 kernel?" Christoph said he didn't think so, and Jan-Benedict Glaw also put in, "Nobody really cares about ABI (at least, not enough to keep one stable) while there's a good API. That requires sources, though, but that's a good thing..." Linus Torvalds came in at this point, with:

People care _deeply_ about the user-visible Linux ABI - I personally think backwards compatibility is absolutely _the_ most important issue for any kernel, and breaking user-land ABI's is simply not done.

Sometimes we tweak user-visible stuff (for example, removing truly obsolete system calls), but even then we're very very careful. Like printing out warning messages for several _years_ before actually removing the functionality.

The one exception tends to be "system management" ABI's, ie stuff that normal programs don't use. So kernel updates do sometimes require new utilities for doing things like firewall configuration, hardware setup (ethernet tools, ifconfig etc), or - in the case of 2.6 - module loading and unloading. Even that is frowned upon, and there has to be a good reason for it.

At times, we've modified semantics of existing system behaviour subtly: either to conform to standards, or because of implementation issues. It doesn't happen often, and if it is found to break existing applications it is not done at all (and the thing is fixed by adding a new system call with the proper semantics, and leaving the old one broken).

You are, however, correct when it comes to internal kernel interfaces: we care not at all about ABI's, and even API's are fluid and are freely changed if there is a real technical reason for it. But that is only true for the internal kernel stuff (where source is obviously a requirement anyway).

Jan-Benedict replied, "Whenever The ABI Question (TM) comes up, it seems to be about claiming a (binary compatible) interface - mostly for modules. But I think it's widely accepted that there isn't much work done to have these truly binary compatible (eg. UP/SMP spinlocks et al.)." Linus replied:

Absolutely. It's not going to happen. I am _totally_ uninterested in a stable ABI for kernel modules, and in fact I'm actively against even _trying_. I want people to be very much aware of the fact that kernel internals do change, and that this will continue.

There are no good excuses for binary modules. Some of them may be technically legal (by virtue of not being derived works) and allowed, but even when they are legal they are a major pain in the ass, and always horribly buggy.

I occasionally get a few complaints from vendors over my non-interest in even _trying_ to help binary modules. Tough. It's a two-way street: if you don't help me, I don't help you. Binary-only modules do not help Linux, quite the reverse. As such, we should have no incentives to help make them any more common than they already are. And we do have a lot of dis-incentives.

2. Real-Time Kernel-Based Mutexes

3 Dec 2003 - 6 Dec 2003 (12 posts) Archive Link: "[RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2"

Topics: Big O Notation, POSIX, Real-Time

People: Inaky Perez-Gonzalez, Jamie Lokier, Scott Wood

Inaky Perez-Gonzalez proposed, "This code proposes an implementation of kernel based mutexes, taking ideas from the actual implementations using futexes (namely NPTL), intending to solve its limitations (see doc) and adding some features, such as real-time behavior, priority inheritance and protection, deadlock detection and robustness." He added, "We have a site at with references to all the releases, test code and NPTL modifications (rtnptl) to use this code. As well, the patch is there in a single file, in case you don't want to paste them manually." Jamie Lokier replied:

Here's my first thoughts, on reading Documentation/fusyn.txt.

Inaky replied:

I thought that initially, and my first tries (last year?) went into that direction, but there are many holes (unless I am wrong). For example:

Jamie also made a bunch of other points in his initial reply to Inaky's proposal. He said, "Priority inheritance is ok _when_ you want it. Sometimes if task A with high priority wants a resource which is locked by task B with lower priority, that should be an error condition: it can be dangerous to promote the priority of task B, if task B is not safe to run at a high priority." Inaky replied, "That's why it is enabled only on request; it bothers me that having it forces some things, like having to do wait_cancel from interrupt contexts and stuff like that. Fortunately, chprio also requires that, so serves as a justification for having it. I still need to quantify the overall effects of that, btw." Inaky also pointed out that it was really the responsibility of the system designer to safely allow lock-sharing. He added, "I have requests from some vendors to extend this behavior to even SCHED_OTHER tasks, so that a FIFO task could promote it to FIFO. I personally shiver when thinking about this, but it makes sense in some environments (some real time tasks doing important things and a normal task doing low priority cleanups, for example)."

In his initial reply to Inaky's first post, Jamie also said, "The data structures and priority callbacks which are used to implement priority inheritance, protection and highest-priority wakeup are fine. But highest-priority wakeup (at least) shouldn't be just for fuqueues: it should be implemented at a lower level, directly in the kernel's waitqueues. Meaning: wake_up() should wake the highest priority task, not the first one queued, if that is appropriate for the queue or waker." Inaky replied:

That was the first thing I thought of; however, it is not an easy task--for example, now you have to allocate a central node that has to live during the life of the waitqueue (unlike in futexes), and that didn't play too well -- with the current code, unlike my previous attempt with rtfutexes, it is not that much of a deal and could be done, but I don't know how much of the interface I could emulate.

As well, supporting the priority change while waiting requires some more work...

It is in my todo list to add some more bits to the fuqueue layer so it can do everything waitqueues do with the priority based interface.

It'd be interesting to experiment in some subsystem by changing the usage of waitqueues for fuqueues, see what happens.

Jamie asked what the central node was for, that Inaky had mentioned in the first paragraph of his reply, and Inaky explained:

In futexes we have that each hash chain has all the waiters, in FIFO arrival order, and we just wake as many as needed. In the fuqueues, this wake up has to be by priority. We could walk that chain pedaling back and forth to wake them up in the correct order, but that would be anything but O(1), and being "real-time" like, or predictable, is one of the requirements.

So the wait list has a head, the fuqueue->wlist, and the waiters are appended there in their correct position (so addition is O(N) now, will change to O(1) IANF, and removal of the highest priority guy is O(1)-- take the guy in the head).

Now, on ufuqueues (the ones that are associated to user space addresses, the true futex equivalent) that means you can't do the trick of the futex chain lists, so you have on each chain a head per ufuqueue/user space address. That ufuqueue cannot be declared on the stack of one of the waiters, as it would disappear when it is woken up and might leave others dangling.

So we have to allocate it, add the waiters to it and deallocate it when the wait list is empty. This is what complicates the whole thing and adds the blob of code that is vl_locate() [the allocation and addition to the list, checking for collisions when locks are dropped]. As the whole thing is kind of expensive, we better cache it for a few seconds, as chances are we will have some temporal locality (in fact, it happens, it improves the performance a lot), so that leads to more code for the "garbage collector" that cleans the hash chains of unused queue heads every now and then. All this is what the vlocator code does.
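As an illustration only (this is not code from the patch), the priority-ordered wait list Inaky describes can be modelled in a few lines of Python. The class and method names are made up for this sketch, and it assumes that a larger number means a more urgent priority. Adding a waiter costs O(N) because later entries have to shift, while waking the most urgent waiter is O(1).

```python
import bisect
from itertools import count


class PriorityWaitList:
    """Toy user-space model of a wait list kept sorted by priority."""

    def __init__(self):
        # Sorted ascending, so the most urgent waiter sits at the END of
        # the list; that end plays the role of the "head" in the text.
        self._waiters = []
        self._arrival = count()  # tie-breaker: equal priorities stay FIFO

    def add(self, name, priority):
        # O(N): the slot is found by binary search, but inserting shifts
        # every later entry. Negating the arrival number makes the
        # earliest arrival sort last among waiters of equal priority.
        key = (priority, -next(self._arrival), name)
        bisect.insort(self._waiters, key)

    def wake_one(self):
        # O(1): take the waiter at the head (the end of the Python list).
        if not self._waiters:
            return None
        return self._waiters.pop()[2]
```

Waiters of equal priority are woken in arrival order, which matches the FIFO behaviour of futexes within a single priority level.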

However, Scott Wood suggested:

However, instead of allocating the memory on demand, you can keep a pool of available queues. Every time a task is created, allocate one and add it to the queue; every time a task dies, retrieve one and free it. Since a task can only wait on one queue at a time, you won't run out of queues (unless you want to implement some sort of wait for multiple objects; however, in such a case you could allocate the extra queues on demand without affecting the normal single-object case).

Thus, it would be a simple linked-list operation plus a spinlock to acquire and release a queue whenever something blocks. It would be slower than the current waitqueue implementation, but not by much (and it could be made configurable for those who want every last cycle and don't care about real-time wait queues).

This would be beneficial for userspace usage as well, as blocking on a queue would no longer be subject to a return value of -ENOMEM (which is generally undesirable in what's supposed to be a predictable real-time application).

He added, "If the pool is kept as a stack, you keep the cache benefits, as well as allowing re-use of the queue across different locks." Inaky replied:

I like the idea, specially because the guarantees, but I don't know how accepted it will be:

  1. the wait queue is just the base type, fulock builds on top of it, so a fulock is bigger than a fuqueue. So, you would have to allocate one of each for each task for it to make sense (if some time we add rwlocks, something similar would happen).

    We could also say: ok, we split them, but then you'd have to allocate the extra stuff somewhere else and are back in square 1 (not to mention all the complications introduced for doing that).

  2. Many tasks are not going to use locks in user space, so they will not need at all that associated queue/lock, thus the space would be wasted for every task that does not use them.
  3. It'd slow down task creation time.

It really comes up to be a balancing decision: are we willing to put up with the wasted space (2) and (3) to get rid of the -ENOMEM problem?

I think I will implement a configurable proof of concept see how it works.

As a side note, the best thing for user space would be to call back if it sees an -ENOMEM (something akin to -EAGAIN).

On the other side, we could somehow force a hi/lo watermarks in the number of readily available fulocks/fuqueues in the kmem caches for each.
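Scott's pool idea, and the trade-off Inaky weighs against it, may be easier to see in a small model. The following Python sketch is hypothetical throughout (the names `QueuePool`, `task_created` and so on do not come from the patch). It pays for one queue head at task creation and keeps the idle heads on a stack, so that blocking never has to allocate.

```python
class QueuePool:
    """Toy model: one spare wait-queue head per live task, kept as a stack."""

    def __init__(self):
        self._free = []    # LIFO stack of idle queue heads
        self._created = 0  # used only to give each head a readable id

    def task_created(self):
        # The allocation cost is paid here, which is Inaky's point (3):
        # task creation gets slower, and tasks that never block on a
        # user-space lock still carry a head, his point (2).
        self._created += 1
        self._free.append({"id": self._created, "waiters": []})

    def task_exited(self):
        # An exiting task is not blocked, so at least one head is idle.
        self._free.pop()

    def acquire(self):
        # Called when a task blocks on an address that has no head yet.
        # It cannot fail while each task waits on at most one queue.
        return self._free.pop()

    def release(self, head):
        # Called when the last waiter leaves. The most recently used head
        # goes on top of the stack, so it is the next one handed out.
        head["waiters"].clear()
        self._free.append(head)
```

Because the most recently released head is reused first, the model also shows the cache-friendliness Scott mentions for a pool kept as a stack.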

Way back in Jamie's original reply to Inaky's first post, Jamie asked, "Is there a method for passing the locked state to another task? Compare-and-swap old-pid -> new-pid works when there isn't any contention, but a kernel call is needed in any of the kernel-controlled states." Inaky replied, "That can be done, because if you are in non-KCO mode (ie: pid), the kernel by definition knows nil about the mutex, so just do the compare and swap in user space and you are ready. No need to add any special code." Jamie said, "The question asks what to do in the KCO state. I.e. when you want to transfer a locked state and there are waiters." Inaky replied:

Ah, ah, ah ... makes sense. Ok, so this is like an unlock operation but "unlock to this guy". Well, same thing, but extended. You need to try it first in user space, if that fails because it is KCO (locked and there are waiters), then go to the kernel and ask it to transfer ownership in there.

Piece of cake, more or less, and can be done O(1) by checking dest_task->fuqueue_wait. Why the interest for this? I am curious to see what could it be used for.

I will give it a shot as soon as I have a minute (unless your question was purely academic and plan no uses for it).
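The two-step transfer Inaky outlines (try in user space first, go to the kernel only in the KCO state) can be sketched as follows. This is a toy Python model under stated assumptions: the `KCO` sentinel value, the `LockWord` and `ToyKernel` classes and the function name are all invented here, and a real implementation would rely on an atomic compare-and-swap instruction and a system call.

```python
UNLOCKED = 0
KCO = -1  # hypothetical sentinel meaning "the kernel controls ownership"


class LockWord:
    """Stands in for the user-space word that holds the owner's pid."""

    def __init__(self):
        self.value = UNLOCKED

    def compare_and_swap(self, expected, new):
        # Models an atomic CAS; here it is ordinary, non-atomic Python.
        if self.value == expected:
            self.value = new
            return True
        return False


class ToyKernel:
    """Hypothetical kernel side: tracks the owner while the word says KCO."""

    def __init__(self):
        self.owner = {}  # id(lock word) -> owning pid

    def transfer(self, word, old_pid, new_pid):
        if self.owner.get(id(word)) != old_pid:
            raise PermissionError("caller does not own the lock")
        self.owner[id(word)] = new_pid


def transfer_ownership(word, kernel, old_pid, new_pid):
    # Fast path: no waiters, so the kernel knows nothing about the lock.
    if word.compare_and_swap(old_pid, new_pid):
        return "user-space"
    # Slow path: locked with waiters (KCO), so ask the kernel to do it.
    if word.value == KCO:
        kernel.transfer(word, old_pid, new_pid)
        return "kernel"
    raise PermissionError("caller does not own the lock")
```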

Also in Jamie's initial reply to Inaky's first post of the thread, Jamie said, "It's very unpleasant that rwlocks enter the kernel when there is more than one reader. Hashed rwlocks can be implemented in userspace to reduce this (readers take one rwlock from a hashed set; writers take them all in order), but it isn't wonderful." Inaky agreed with this assessment, and said he'd keep it in mind, though he hadn't thought much about it at that time.
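For readers who have not met the hashed scheme Jamie mentions, here is a rough Python sketch. It simplifies his description by using a plain mutex for each slot instead of an rwlock, so two readers that hash to the same slot serialize; the class name and the default slot count are assumptions made for the example.

```python
import threading
from contextlib import contextmanager


class HashedRWLock:
    """Readers take one lock from a hashed set; writers take them all."""

    def __init__(self, slots=8):
        self._locks = [threading.Lock() for _ in range(slots)]

    @contextmanager
    def read(self, key=None):
        # Each reader hashes to a single slot. By default the key is the
        # calling thread's id, so different threads tend to spread out.
        if key is None:
            key = threading.get_ident()
        with self._locks[hash(key) % len(self._locks)]:
            yield

    @contextmanager
    def write(self):
        # A writer takes every slot, always in index order, so that two
        # writers cannot deadlock against each other.
        for lock in self._locks:
            lock.acquire()
        try:
            yield
        finally:
            for lock in reversed(self._locks):
                lock.release()
```

The cost Jamie alludes to is visible in `write`: a writer pays for every slot, which is why the approach only suits read-mostly data.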

Also in Jamie's initial reply to Inaky's initial proposal, Jamie said, "For architectures which can't do compare-and-swap, a system call which does the equivalent (i.e. disables preemption, does compare-and-swap, enables preemption again) would be quite useful. Not for maximum performance, but to allow architecture-independent locking strategies to be written portably." Inaky replied, "But the minute you are doing a system call you are better off calling directly the kernel for it to arbitrate the mutex in pure KCO mode. I think the overhead saving is worth an #ifdef in the source code for the lock operation..." And Jamie said, "If it is as simple as just keeping the mutex in KCO mode all the time on archs which don't have compare-and-swap, or those that do if an application doesn't have explicit asm code for that arch, that would be very convenient. I haven't thought through whether keeping a mutex in KCO mode has this capability. Perhaps you have and can state the answer?" Inaky replied:

First I should make a distinction that is causing way too much confusion, one thing is KCO for the vfulock (telling the user that it has to go to the kernel) and the other is using it always in KCO mode by passing a yet-to-be-implemented-flag FULOCK_FL_KCO to the sys_ufulock_*() system calls (so it skips all the ugly synchronization code).

This FULOCK_FL_KCO is needed for priority protection anyway, so it will be there no matter what; thus, in arches without atomic compare-and-swap, it becomes a matter of ops-I-don't-have-the-fast-path so I just call the syscall with that bit set.

Also in Jamie's long first reply to Inaky's initial post, Jamie said, "It's a huge patch. A nice thing about futex.c is that it's relatively small (your patch is 9 times larger). The original futex design was more complicated, and written specifically for mutexes. Then it was made simpler and I think smaller at the same time. Perhaps putting some of the RT priority capabilities directly into kernel waitqueues would help with this." Inaky said:

I agree with that, but think about the pieces. The only part that is strictly equivalent to futexes is the ufuqueues, so that's ufuqueue.c, fuqueue.c and vlocator.c. The splitting is necessary to allow parts and pieces to be shared by the fulocks.

Asides from the comments, it adds the most complex/bloating part, the priority-sorted thingie and chprio support vs not having the FUTEX_FD or requeue comes to be, more or less equivalent, considering all the crap that has to be changed for the prioritization to work respecting the incredibly stupid POSIX semantics for mutex lifetimes.

Jamie asked, "Are there specified POSIX semantics for prioritisation and mutex interaction?" Inaky replied, "Yep, they state different things on all that, and how you don't need to necessarily destroy non-shared mutexes (vs shared) and a few more things that made my life a little bit more exciting...I think I still need to tweak a few bits more on the priority inheritance stuff to get it completely up to the spec, but it should be pretty good already."

A few other interesting points were discussed, but this summary is already too complex.

3. Status Of Andrea's VM Contributions In 2.4

3 Dec 2003 - 6 Dec 2003 (10 posts) Archive Link: "2.4.23 includes Andrea's VM?"

Topics: Big Memory Support, Virtual Memory

People: Mike Fedyk, Andrea Arcangeli, Stephan von Krawczynski, Ian Soboroff, Bill Davidsen

Ian Soboroff noticed that in the 2.4.23 ChangeLog, at least some of Andrea Arcangeli's Virtual Memory Subsystem code had been merged. He asked for some clarification on that, and Mike Fedyk replied, "A good amount of the VM was merged into 2.4.23-pre3, so the -aa patches against pre6 should show you what is missing." Andrea Arcangeli also said that the latest 2.4 kernels would probably run much better in some cases, "However I'd still recommend to use my tree, the last two critical bits you need from my tree are inode-highmem and related_bhs. Those two are still missing, and you probably need them with 12G. I'm going to release a 2.4.23aa1 btw, that will be the last 2.4-aa." Mike asked if Andrea would start releasing 2.6-aa branches, but there was no reply. Stephan von Krawczynski and Bill Davidsen took the opportunity to thank Andrea for all his work on the VM subsystem.

4. Filesystem Encryption And Compression

3 Dec 2003 - 9 Dec 2003 (50 posts) Archive Link: "partially encrypted filesystem"

Topics: BSD, Big O Notation, Compression, FS: ext2, Virtual Memory

People: Richard B. Johnson, Linus Torvalds, Phillip Lougher, Erez Zadok, Matthew Wilcox, David Woodhouse, Kallol Biswas, Joern Engel, Bill Davidsen

Kallol Biswas wanted a way to have a filesystem store some data encrypted and some in the clear; Richard B. Johnson said that this really should be handled by the application, not the filesystem; and Bill Davidsen agreed. As Richard put it, "The file-systems are a bunch of inodes. Every time you want to read or write one, something has to decide if it's encrypted and, if it is, how to encrypt or decrypt it. Even the length of the required read or write becomes dependent upon the type of encryption being used. Surely you don't want to use an algorithm where a N-byte string gets encoded into a N-byte string because to do so gives away the length, from which one can derive other aspects, resulting in discovering the true content. So, you need variable-length inodes --- what a mess. The result would be one of the slowest file-systems you could devise."

Elsewhere, Joern Engel suggested that it might be possible to add optional encryption to an existing filesystem like JFFS2. But Linus Torvalds replied:

Encryption is not that easy to just tack on to most existing filesystems for one simple reason: for performance (and memory footprint) reasons, most of the filesystems out there are doing "IO in place". In other words, they do IO directly into and directly from the page cache.

With an encrypted filesystem, you can't do that. Or rather: you can do it if the filesystem is read-only, but you definitely CANNOT do it on writing. For writing you have to marshall the output buffer somewhere else (and quite frankly, it tends to become a lot easier if you can do that for reading too).

And that in turn causes problems. You get all kinds of interesting deadlock scenarios when write-out requires more memory in order to succeed. So you need to get careful. Reading ends up being the much easier case (doesn't have the same deadlock issues _and_ you could do it in-place anyway).

So encryption per se isn't hard. But adding the extra indirect buffer layer _can_ be pretty nasty, and makes it nontrivial to retrofit later.


If you don't need to mmap() the files, writing becomes much easier. Because then you can make rules like "the page cache accesses always happen with the page locked", and then the encryption layer can do the encryption in-place.

So it is potentially much easier to make encrypted files a special case, and disallow mmap on them, and also disallow concurrent read/write on encrypted files. This may be acceptable for a lot of uses (most programs still work without mmap - but you won't be able to encrypt demand-loaded binaries, for example).
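To make the "extra indirect buffer" concrete, here is a deliberately tiny Python model. It is not kernel code: the XOR "cipher" is a stand-in that offers no security, and the function names and the dictionary used as a disk are invented for the sketch. The cached page stays in plaintext for readers and mmap() users, so the ciphertext has to be built in a second buffer, and that second buffer is the allocation at write-out time that Linus warns about.

```python
def toy_encrypt(data, key):
    # XOR with a one-byte key stands in for a real cipher. NOT secure.
    # Applying it twice with the same key restores the input.
    return bytes(b ^ key for b in data)


def write_out(page, key, disk, block_no):
    # `page` is the cached plaintext that other users can still see, so
    # it must not be overwritten with ciphertext.
    bounce = toy_encrypt(page, key)  # the extra buffer, allocated here
    disk[block_no] = bounce


def read_in(disk, key, block_no):
    # Reading is the easier case: decrypt into a fresh page.
    return bytearray(toy_encrypt(disk[block_no], key))
```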

Joern pointed out that some compressed filesystems handled all these problems, and still provided read-write support. And Linus replied:

Yes, compression and encryption are really the same thing from a fs implementation standpoint - they just have different goals. So yes, any compressed filesystem will largely have all the same issues.

And compression isn't very easy to tack on later either.

Encryption does have a few extra problems, simply because of the intent. In a compressed filesystem it is ok to say "this information tends to be small and hard to compress, so let's not" (for example, metadata). While in an encrypted filesystem you shouldn't skip the "hard" pieces..

(Encrypted filesystems also have the key management issues, further complicating the thing, but that complication tends to be at a higher level).

Joern replied that in this case, JFFS2 might be the best solution after all; and Phillip Lougher replied, "Considering that Jffs2 is the only writeable compressed filesystem, yes. What should be borne in mind is compressed filesystems never expect the data after compression to be bigger than the original data. In the case where the compressed data is bigger, the original data is used instead, which is hardly ideal for an encrypted filesystem, and so more than a direct substitution of compression function for encrypt function is needed - this is of course only relevant if the encryption algorithm used could return more data..." Erez Zadok replied:

Part of our stackable f/s project (FiST) includes a Gzipfs stackable compression f/s. There was a paper on it in Usenix 2001 and there's code in the latest fistgen package. See

Performance of Gzipfs is another matter, esp. for writes in the middle of files. :-)

A couple of posts later, he elaborated:

We compress each chunk separately; currently chunk==PAGE_CACHE_SIZE. For each file foo we keep an index file foo.idx that records the offsets in the main file of where you might find the decompressed data for page N. Then we hook it up into the page read/write ops of the VFS. It works great for the most common file access patterns: small files, sequential/random reads, and sequential writes. But, it works poorly for random writes into large files, b/c we have to decompress and re-compress the data past the point of writing. Our paper provides a lot of benchmarks results showing performance and resulting space consumption under various scenarios.

We've got some ideas on how to improve performance for writes-in-the-middle, but they may hurt performance for common cases. Essentially we have to go for some sort of O(log n)-like data structure, which'd make random writes much better. But since it may hurt performance for other access patterns, we've been thinking about some way to support both modes and be able to switch b/t the two modes on the fly (or at least let users "mark" a file as one for which you'd expect a lot of random writes to happen).

If anyone has some comments or suggestions, we'd love to hear them.
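A rough model of the packed-chunks-plus-index layout Erez describes might look like this in Python. It is not Gzipfs code: the function names are invented, the index is kept in memory rather than in a foo.idx file, and the sketch shifts the tail of the file instead of recompressing it. Either way, the work done on a write in the middle grows with the amount of data that follows it.

```python
import zlib

CHUNK = 4096  # stands in for PAGE_CACHE_SIZE


def pack(data):
    """Compress each chunk separately; return (packed bytes, index)."""
    packed = bytearray()
    index = []  # index[n] = offset in `packed` where chunk n starts
    for start in range(0, len(data), CHUNK):
        index.append(len(packed))
        packed += zlib.compress(data[start:start + CHUNK])
    index.append(len(packed))  # end marker, so chunk n ends at index[n + 1]
    return bytes(packed), index


def read_chunk(packed, index, n):
    # Random reads are cheap: one index lookup, one decompression.
    return zlib.decompress(packed[index[n]:index[n + 1]])


def rewrite_chunk(packed, index, n, new_chunk):
    """Replace chunk n; every chunk after it moves by `delta` bytes."""
    new_blob = zlib.compress(new_chunk)
    delta = len(new_blob) - (index[n + 1] - index[n])
    packed = packed[:index[n]] + new_blob + packed[index[n + 1]:]
    index = index[:n + 1] + [off + delta for off in index[n + 1:]]
    return packed, index
```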

And Matthew Wilcox said:

Sure. I've described it before on this list, but here goes:

What Acorn did for their RISCiX product (4.3BSD based, ran on an ARM box in late 80s/early 90s) was compress each 32k page individually and write it to a 1k block size filesystem (discs were around 50MB at the time, 1k was the right size). This left the file full of holes, and wasted on average around 512/32k = 1/64 of the compression that could have been attained, but it was very quick to seek to the right place.

Now 4k block size filesystems are the rule, and page size is also 4k so you'd need to be much more clever to achieve the same effect. Compressing 256k chunks at a time would give you the same wastage, but I don't think Linux has very good support for filesystems that want to drop 64 pages into the page cache when the VM/VFS only asked for one.

If it did, that would allow ext2/3 to grow block sizes beyond the current 4k limit on i386, which would be a good thing to do. Or perhaps we just need to bite the bullet and increase PAGE_CACHE_SIZE to something bigger, like 64k. People are going to want that on 32-bit systems soon anyway.
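The fixed-slot layout Matthew attributes to RISCiX can be modelled in Python as below. A dictionary of blocks plays the part of a sparse file, the names are made up for this sketch, and pages that do not compress are simply rejected rather than stored raw. Page n always starts at block n * 32, so finding it needs no index. On average the last block of each page is half full, which is where his 512/32k = 1/64 wastage figure comes from.

```python
import zlib

PAGE = 32 * 1024                 # logical unit that is compressed on its own
BLOCK = 1024                     # block size of the underlying filesystem
BLOCKS_PER_PAGE = PAGE // BLOCK  # 32


def store_page(blocks, n, page):
    """Write page n into its fixed slot; return how many blocks it uses."""
    blob = zlib.compress(page)
    used = -(-len(blob) // BLOCK)  # round up to whole blocks
    if used > BLOCKS_PER_PAGE:
        raise ValueError("incompressible page: not handled in this sketch")
    first = n * BLOCKS_PER_PAGE
    for i in range(used):
        blocks[first + i] = blob[i * BLOCK:(i + 1) * BLOCK]
    for i in range(used, BLOCKS_PER_PAGE):
        blocks.pop(first + i, None)  # the rest of the slot stays a hole
    return used


def load_page(blocks, n):
    """Seek straight to slot n; no index lookup is needed."""
    first = n * BLOCKS_PER_PAGE
    blob = b"".join(blocks.get(first + i, b"") for i in range(BLOCKS_PER_PAGE))
    return zlib.decompress(blob)
```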

Close by, Phillip said, "FYI, Acorn's scheme was described in "Compressed Executables: An Exercise in Thinking Small" by Mark Taunton, in the Usenix Spring '91 conference, it doesn't seem to be online, but a search on google groups for "group:comp.unix.internals taunton compressed executables" brings up a description. I used to work with Mark Taunton at Acorn."

Erez replied to Matthew, "Thanks for the info, Matthew. Yes, clearly a scheme that keeps some "holes" in compressed files can help; one of our ideas was to leave sparse holes every N blocks, exactly for this kind of expansion, and to update the index file's format to record where the spaces are (so we can efficiently calculate how many holes we need to consume upon a new write)." And Matthew said, "But the genius is that you don't need to calculate anything. If the data block turns out to be incompressible (those damn .tar.bz2s!), you just write the block in-place. If it is compressible, you write as much into that block's entry as you need and leave a gap. The underlying file system doesn't write any data there. There's no need for an index file -- you know exactly where to start reading each block." Phillip replied:

Of course this is all being done at the file level, which relies on proper support of holes in the underlying filesystem (which Acorn's BSD FFS filesystem did). FiST's scheme is much more how it would be implemented without hole support, where you *have* to pack the data, otherwise the "unused" space would physically consume disk blocks. In this case an index to find the start of each compressed block is essential.

I'm guessing that FiST lacks support for holes or data insertion in the filesystem model, which explains why on writing to the middle of a file, the entire file from that point has to be re-written.

Of course, all this is at the logical file level, and ignores the physical blocks on disk. All filesystems assume physical data blocks can be updated in place. With compression it is possible a new physical block has to be found, especially if blocks are highly packed and not aligned to block boundaries. I expect this is at least partially why JFFS2 is a log structured filesystem.

David Woodhouse replied:

Not really. JFFS2 is a log structured file system because it's designed to work on _flash_, not on block devices. You have an eraseblock size of typically 64KiB, you can clear bits in that 'block' all you like till they're all gone or you're bored, then you have to erase it back to all 0xFF again and start over.

Even if you were going to admit to having a block size of 64KiB to the layers above you, you just can't _do_ atomic replacement of blocks, which is required for normal file systems to operate correctly.

These characteristics of flash have often been dealt with by implementing a 'translation layer' -- a kind of pseudo-filesystem -- which pretends to be a block device with the normal 512-byte atomic-overwrite behaviour. You then use a traditional file system on top of that emulated block device.

JFFS2 was designed to avoid that inefficient extra layer, and work directly on the flash. Since overwriting stuff in-place is so difficult, or requires a whole new translation layer to map 'logical' addresses to physical addresses, it was decided just to ditch the idea that physical locality actually means _anything_.

Given that design, compression just dropped into place; it was trivial.
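The flash behaviour David describes, where bits can be cleared at will but only a whole-block erase brings them back, can be captured in a few lines of Python. This is a toy model with invented names, not driver code; the 64KiB default mirrors the typical size he quotes.

```python
ERASEBLOCK = 64 * 1024  # typical eraseblock size quoted above


class EraseBlock:
    """Toy model of one flash eraseblock."""

    def __init__(self, size=ERASEBLOCK):
        self.cells = bytearray(b"\xff" * size)  # erased flash reads as 0xFF

    def program(self, offset, data):
        # Programming can only clear bits (1 -> 0), never set them, so
        # the new contents are the AND of the old contents and the data.
        for i, byte in enumerate(data):
            self.cells[offset + i] &= byte

    def erase(self):
        # The only way to get 1 bits back is to erase the whole block.
        self.cells[:] = b"\xff" * len(self.cells)
```

Writing 0xF0 and then 0x0F to the same cell leaves 0x00 rather than 0x0F, which shows why a filesystem cannot simply overwrite a block in place on this kind of device.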

5. Status Of Storing .config Data In Compiled Kernels

4 Dec 2003 - 5 Dec 2003 (8 posts) Archive Link: "Where'd the .config go?"

People: Randy Dunlap, Robert L. Harris

Robert L. Harris noticed that the option to save the .config data within the compiled kernel was not present in 2.4.23-bk3; he asked if the feature had been removed. Randy Dunlap replied, "It's never been merged in 2.4.x. Marcelo didn't want it. It's in 2.6.x. There's a 2.4.22-pre patch in this dir that you can try:" Lucio Maciel suggested including the feature in 2.4, but Randy said, "It's Marcelo's" [Tosatti] "decision and he's trying to reduce 2.4.x patches."

6. Status Of OOM Killer In 2.4

4 Dec 2003 - 9 Dec 2003 (19 posts) Archive Link: "oom killer in 2.4.23"

Topics: Forward Port, OOM Killer, Virtual Memory

People: Guillermo Menguez Alvarez, Andrea Arcangeli, Maciej Zenczykowski

Peter Bergmann noticed that the out-of-memory (OOM) killer had been removed from 2.4.23, and said the results were really bad. Maciej Zenczykowski also suggested putting it back in, though perhaps making it a configuration option. Guillermo Menguez Alvarez pointed out:

As I see in the ChangeLog:

aa VM merge: page reclaiming logic changes: Kills oom killer

OOM Killer has been removed due to AA VM changes, so maybe it can't be cleanly enabled again.

Andrea Arcangeli said:

it can be re-enabled without too much pain if you can accept the desktop behaviour of 2.4.22 and previous not suitable for servers.

the oom killer had deadlocks and it was relying on very inaccurate accounting, so it had a number of corner cases where it was killing tasks by mistake (it's fooled by shm/mlock/noswap etc..), read also the bugreports for 2.4.22 with tasks being killed because there was no swap in the box (or just try to run your machine w/o swap, swap is not a must, it's a wish). Fixing those in 2.4 sounds too complicated, and now it's too late to even hope to make a proper oom killer for 2.4.

For the record 2.2 was capable of checking iopl to defer a few times the killing of the X server, that wasn't forward ported to 2.4. For 2.6 we can do something better than all the past oom killers at least. 2.6 gets fooled by mlock too btw, random kill tasks etc.. so it's not much better than 2.4.22 was in oom killing respect.

Elsewhere, Andrea also said:

it's that simple to reenable it in 2.4.22 status, so if you're ok to deadlock. 2.4.23 can't deadlock, it can live lock if you're unlucky with timings yes (think if you add 32G of swap and your ram runs at 1k/sec instead of 1G/sec), but not deadlock and it won't random kill tasks even if it shouldn't to. deadlock is a bug, killing task despite there's ram free is a bug, livelock is something you can avoid by dropping all swap. if you drop all swap with 2.4.22 it'll go nuts killing tasks (see the bugreports).

Since doing it right wasn't possible in 2.4, I dropped it years ago, -aa users are w/o an oom killer for years and I never heard a single complain. somebody asked why yes, but they were happy afterwards. I don't think I asked Marcelo to merge it, I explained why I dropped it, people sent him bugreports about the oom killer going nuts, and he agreed my solution was the best short term w/o adding lots of effort to make the oom killer right. Note the oom killer goes nuts in 2.6 too, nobody did it right yet, that's why I don't think it's a 2.4 issue.

Marcelo asked me to make it configurable at runtime so you could go in the deadlock prone status of 2.4.22 on demand, but I'm not going to add more features to 2.4 today unless they're blocker bugs (even if that would be simple to implement), actually it's not even my choice so don't ask me for that sorry.

7. Patents Affecting FAT Support

5 Dec 2003 - 8 Dec 2003 (17 posts) Archive Link: "Large-FAT32-Filesystem Bug"

Topics: FS: FAT, FS: VFAT, Microsoft, Patents

People: Joanne Dow, Helge Hafting, Tomasz Torcz

Torsten Scheck reported a problem with the VFAT filesystem, but Joanne Dow replied, "This all may be moot. Microsoft is about to charge a royalty for use of the FAT file system. Software patents are the death of innovation and competition." Helge Hafting said, "They claim some patents, but aren't FAT so old that they have expired?" Tomasz Torcz replied, "Patents for storing long names of files (which Microsoft is charging for) are from 1995 or something."

8. Hand-Off Of 2.6 To Andrew Still In Progress

5 Dec 2003 (3 posts) Archive Link: "[BK PATCHES] libata fixes"

People: Linus Torvalds, Jeff Garzik

Jeff Garzik posted a few fixes, and Linus Torvalds said, "Right now, I'm accepting one-liners that I think are "obvious" and also "very important" (ie fixes for oopses that anybody can trigger, rather than for example updates to one particular driver). So it sounds like I might accept _one_ of these." He added, "Andrew is still off, and he can make a decision independently, but right now I'm not going to apply anything bigger."

9. Status Of XFS In 2.4

8 Dec 2003 (6 posts) Archive Link: "XFS merged in 2.4"

Topics: FS: XFS

People: Marcelo Tosatti, Dan Yocum

Marcelo Tosatti said, "Christoph reviewed XFS patch which changed generic code, and it was stripped down later to a set of changes which dont modify the code behaviour (except for a few bugfixes which should have been included separately anyway) and are pretty obvious. So its that has been merged, along with fs/xfs/." Dan Yocum thanked him heartily for this, and folks were generally happy.

10. kgdb 1.7 Released

9 Dec 2003 (1 post) Archive Link: "kgdb 1.7"

People: Amit S. Kale

Amit S. Kale said:

I have integrated some of the several enhancements submitted by TimeSys Corporation into mainline kgdb at The kgdb version containing these features is 1.7. It's available for kernel 2.4.23. Here is a brief description of the enhancements.

  1. Hasslefree gdb detach reconnect: You can now detach gdb from a kgdb stub by using gdb "detach" command. Reconnection later is as simple as typing "target remote" command from gdb.
  2. Restructured source files: Several kgdb source files have been restructured to separate architecture dependent and independent code into respective directories. It's a move towards making unification of kgdb sourcecode from different architectures.

11. Linux 2.6 Code Freeze In Full Effect

9 Dec 2003 (4 posts) Archive Link: "[PATCH 2.4.23, 2.6.0-test11] fix d_type in readdir in isofs"

Topics: Code Freeze

People: Linus Torvalds, Domen Puncer

Domen Puncer posted a fix for ISOFS, and Linus Torvalds replied, "Looks ok, but I can't convince myself to apply this at this point: there's just no way I can call this a major stability fix ;). Can somebody keep this around for later?"

12. Post Halloween Document New Location

9 Dec 2003 (1 post) Archive Link: "post halloween document moved.."

People: Dave Jones

Dave Jones said, "The box hosting died dramatically recently, resulting in the post halloween 2.6 document being offline. I've had a few requests for me to put this someplace else whilst I get things fixed up, so for the time being, you can find the last version I was able to find in backups at"

13. Software Suspend 2.0rc3 For 2.4 And 2.6

9 Dec 2003 - 11 Dec 2003 (3 posts) Archive Link: "Announce: Software Suspend 2.0rc3 for 2.4 and 2.6."

Topics: Compression, FS: NFS, FS: sysfs, SMP, Software Suspend

People: Nigel Cunningham

Nigel Cunningham announced:

This is to announce 2.0-rc3, now being uploaded to

A number of small but significant user-visible changes have been made with this release, so please read these notes carefully.

  1. New format. There is now one 'core' patch which should be applied regardless of your kernel version. In addition to the core patch, a version-specific patch should be applied. These are available for both 2.4 and 2.6 series kernels. Core and version-specific patches should be able to be updated independently.

  2. Changed kernel command line parameters. Instead of resume=, resume_block= and resume_blocksize=, there is now a single resume2= command line parameter. Note that that's RESUME2, not RESUME. The format for this parameter is:

    At the moment, there is only one method of storing images - the swapwriter. It is envisaged, that NFS support will be implemented sometime in the future. (After I do the work of merging with Patrick). For now, then, you will want to replace

    resume=/dev/hda1 resume_block=0x560 resume_blocksize=1024

    Later, you'll hopefully end up being able to have

  3. /proc/sys/kernel/swsusp is now deprecated. It is still in this version, but I'd appreciate it if scripts could be changed to use the new /proc/swsusp/all_settings entry instead. The functionality is exactly the same. Only the location has changed.

    In addition, a ton of user-invisible changes have been made. This accounts for the size of the patch. A new internal API implements two new kinds of 'plugins', designed to make adding new methods of transforming the data to be stored ('transformers') and saving the data ('writers') easier to implement. This has allowed me to separate out the swap specific code and the compression code as part of the big cleanup I've also done. The /proc code has also been enhanced, so that plugins can dynamically register new entries. This will also form a foundation for kobject support in the 2.6 kernel. (That is to say, 2.6 swsusp will soon stop using proc, and will use sysfs instead).

  4. Compatibility with other 2.6 implementations.

    This version should play nicely with the existing software suspend implementations in the 2.6 kernel. Patrick's pmdisk implementation can be activated as always using the sysfs interface, and Pavel's using echo 4 > /proc/acpi/sleep. This patch does replace the freezer implementation those versions use, and Pavel's suspend will initialise but not use the nice display. Apart from these minor changes, no differences should be seen.
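The command line change in item 2 lends itself to a quick check of an existing boot configuration. The sketch below only detects the three old-style parameters named in the announcement. It does not build a resume2= value, since the format of that parameter is not reproduced in this summary. The function names and the root= entry in the sample are hypothetical.

```python
from typing import Dict, List

# Parameter names taken from the announcement above.
OLD_PARAMS = ("resume", "resume_block", "resume_blocksize")
NEW_PARAM = "resume2"


def parse_cmdline(cmdline: str) -> Dict[str, str]:
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def stale_resume_params(cmdline: str) -> List[str]:
    """Return the old-style parameters still present on the command line."""
    params = parse_cmdline(cmdline)
    return [name for name in OLD_PARAMS if name in params]


old_style = "root=/dev/hda2 resume=/dev/hda1 resume_block=0x560 resume_blocksize=1024"
for name in stale_resume_params(old_style):
    print(f"{name}= has been replaced by {NEW_PARAM}=")
```

A command line that already uses resume2= and none of the old names produces an empty list, so the loop prints nothing.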

For those who simply wish to upgrade from rc2, an incremental patch is also available from Sourceforge. I've put it there rather than attaching it because of its size.

Apart from the kobject changes mentioned above, this should be the last set of big changes to the code base. Unless something has slipped my mind, I believe I've just about implemented all the functionality we need. From now on, then, I'll only be looking to update/improve the documentation and clean and further document the code, to implement kobject support and perhaps also SMP support (which should be a minor changeset).

14. XFS Updates For 2.4

11 Dec 2003 (1 post) Archive Link: "Announce: XFS split patches for 2.4.24-pre1"

Topics: Access Control Lists, FS: XFS

People: Keith Owens

Keith Owens said:

2.4.24-pre1 now contains the bulk of the XFS code. These split patches only contain XFS updates from 2.4.24-pre1, plus add on code such as ACL, DMAPI and KDB. If you do not need those features and you are not seeing any bugs in the base 2.4.24-pre1 XFS code then you do not need to apply any of these patches.

Read the README in each directory very carefully, the split patch format has changed over a few kernel releases. Any questions that are covered by the README will be ignored. There is even a 2.4.24/README for the terminally impatient :).


Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.