Kernel Traffic #81 For 21 Aug 2000

By Zack Brown

Introduction

I'd like to thank Robert Szokovacs for finding a serious bug with the new indexing feature, in which index links from printer-friendly issues all gave 404s. That's been fixed, and many thanks for the report!

Thanks also go to Robert Casties for feedback on how the index links should display in the printer-friendly versions. They now don't expand to show the full path, the way other links do in printer-friendly pages. Thanks for the suggestion, Robert!

Joe Buehler reported a silly typo in one of the section titles last week (I'd used the nonword "Consistancy"). Thanks Joe! ;-)

Mailing List Stats For This Week

We looked at 1891 posts in 8585K.

There were 551 different contributors. 249 posted more than once. 184 posted last week too.

1. Ramdisks, Compression, Embedded Systems, Loopback, And The VM Situation

12 Jul 2000 - 11 Aug 2000 (81 posts) Archive Link: "Do ramdisk exec's map direct to buffer cache?"

Topics: BSD: FreeBSD, Compression, FS: ReiserFS, FS: XFS, FS: ext2, FS: ext3, FS: ramfs, Small Systems, Virtual Memory

People: Graham Stoney, Linus Torvalds, Theodore Y. Ts'o, Bjorn Wesen, Jim Gettys, Paul Rusty Russell, Chris Wedgwood, Steve Dodd, Alan Cox, Rik van Riel, Andrea Arcangeli, David Woodhouse, Pavel Machek, Jeff Garzik, Stephen Tweedie, Mike Galbraith, Rusty Russell

A serpentine discussion.

RAM-Based Filesystems

Graham Stoney started it off with a question on how to minimize RAM requirements on embedded systems. He asked:

I know the Linux ramdisk uses the buffer cache, but when the kernel exec's a file from the ramdisk, is it smart enough to map the virtual address space for .text and .data directly into the buffer cache without copying?

Can it do a similar job when "loading" a shared library? And if so, what impact do shared library fixups have on the memory space used by the code of a dynamically linked executable? Are these likely to cause a significant number of pages to be copied-on-write?

David Woodhouse was dubious about using ramdisks for embedded systems. He felt embedded systems programmers resorted to ramdisks by default, because there was no support yet for flash chips or flash filing systems. But he felt a ramdisk actually wasted space when preloaded by the bootloader from any place accessible to the kernel. He recommended JFFS on flash chips, with ramfs for /tmp and possibly /var. To answer Graham's question, he added that he was pretty sure ramfs did share pages directly with the buffer cache. Linus Torvalds replied:

ramfs does indeed share pages, and does more than that: it shares the directory structure with the directory cache directly, so there is not any wasted memory even in meta-data. That was one of the design goals (along with extreme simplicity - it's the smallest and fastest filesystem around).

That said, ramfs isn't perfect. It's not a "tmpfs" in that it cannot page anything out to disk (a non-issue in embedded devices, but it would make ramfs more useful in general use), and it's new code that isn't used by all that many people - so it could have (and has had) problems. As 99% of the functionality of ramfs is actually just using VFS code directly the basics of ramfs are very solid, but the devil is in the details..

If you want to pre-populate ramfs (the way initrd does the old-style ramdisk), I would suggest using a compressed tar-file approach. In the long run I definitely want to get rid of the old-style ramdisk, as it has some serious problems both from a design standpoint and from a maintenance angle (the mm games it plays are much worse than the simple page-pinning stuff that ramfs can do - ramfs plays at a much higher level and has access to better abstractions through that).

The only problem with JFFS is that it doesn't do compression, so there's a lot to be said for using cramfs if you have space constraints (and in embedded devices, if you don't have space constraints you're doing something wrong ;). So mixture of ramfs, cramfs and JFFS can be a good thing.

Compressed Filesystems

Bjorn Wesen (one of the JFFS developers) replied that compression was already on their To Do list and would be fairly easy to do, in spite of needing a few special hacks.

Pavel Machek was happy to hear about cramfs, and asked if there were a read/write compressed filesystem driver available. Linus replied that as far as he knew, there wasn't. He said, "cramfs is _wonderful_ as long as you don't have to write. Very simple, and very efficient." But he went on:

All the read-write compressions tend to also compress much less than something like cramfs, because the metadata requirements for a read-write filesystem are usually a lot stricter and more complicated.

A compressing JFFS sounds wonderful, although I suspect that even then it might be useful to have a read-only side that compresses even better.

Compressed ext2 is horrible compression-wise. The metadata takes up a large amount of space..

Theodore Y. Ts'o gave his own take on the problems with read/write compressed filesystems:

It's not the metadata that takes a lot of space, it's the fact that it's using cluster cluster. That is, ext2compr takes a chunk of 8 1k blocks (for example), compresses it down to 3100 bytes which takes 4 1k blocks to store, and then stores it in 4 blocks, leaving a "hole" of 4 blocks. Repeat for the next 8k chunk.

There are two major problems with this approach.

  1. Every 8k (or whatever your compression cluster size is), you end up resetting the compression algorithm. This makes for very lousy compression ratios.
  2. Because of the compression clusters, it means that you suffer internal fragmentation and lose an average of 512 bytes (half the 1k block size) for every 8k of compressed data.

Why is it done this way? So that random-access reads (and especially writes) work efficiently. If you are willing to live with two constraints:

  1. Files are written sequentially once, and never written to again. (appending is possible, but will be *slow*)
  2. You have enough ram that you can afford to keep the entire compressed file in memory at once --- or be willing to suffer nasty performance penalties if you do random access seeks into the file.

It's possible to have much better compression ratios, since you don't have to be the compression clusters game. However, many people would find these constraints untenable, especially for a general purpose filesystem. Life is full of tradeoffs.....
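The arithmetic in Ted's example can be checked with a short sketch. The Python below is illustrative only and is not taken from ext2compr: the function name cluster_storage is hypothetical, and the rule that an incompressible cluster simply occupies all of its blocks is an assumption made here.

```python
def cluster_storage(compressed_bytes, block_size=1024, cluster_blocks=8):
    """Account for one compression cluster.

    Returns (blocks_used, hole_blocks, slack_bytes):
      blocks_used  - whole blocks needed to hold the compressed data
      hole_blocks  - blocks of the cluster left unused (the "hole")
      slack_bytes  - unused bytes in the last block (internal fragmentation)
    """
    cluster_bytes = block_size * cluster_blocks
    # Assumption: a cluster that does not shrink just fills every block.
    stored = min(compressed_bytes, cluster_bytes)
    # Round up to whole blocks using integer arithmetic.
    blocks_used = (stored + block_size - 1) // block_size
    hole_blocks = cluster_blocks - blocks_used
    slack_bytes = blocks_used * block_size - stored
    return blocks_used, hole_blocks, slack_bytes


# Ted's example: 8 1k blocks compressed down to 3100 bytes.
used, hole, slack = cluster_storage(3100)
print(used, hole, slack)  # 4 blocks stored, 4-block hole, 996 bytes of slack
```

The slack in this one example is 996 bytes. The 512-byte figure Ted quotes is the average over many clusters, since the tail of the compressed data lands at an effectively random point in its last 1k block.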

Linus continued:

Note that cramfs shares the compression algorithm side: everything is compressed as a 4kB block, because of the random-access issues. Going to bigger blocks is not _that_ much of a win, and gets painful on a small machine (and small machines is where this usually matters the most).

However, where cramfs shines is: _no_ fragmentation. Forget about block device issues, it does data on 4 byte boundaries. That, together with basically having very minimalistic meta-data (who needs meta-data anyway, when it's all read-only: _just_ enough to find stuff and no more) is the biggest win.

But you can't basically do these things if you want to be read-write. A truly log-based approach (ie not just meta-data journalling) might work out ok, actually, but most log-based stuff seem to want to have fairly large caches in order to work well. Which in embedded spaces isn't exactly a good idea.

Theodore continued:

The large caches are needed because log-based filesystems very quickly tend to fragment files all over hell-and-gone. But if you're using flash memory in your embedded device, this is much less of an issue, since you aren't impacted by the seek times that you have when you have to move heads over spinning media.

The real trick is being able to allocate on non-block boundaries, and dealing with fragmentation issues as you delete files and create irregularly shaped holes. Making a read/write filesystem that is optimized for the characteristics of flash memory would certainly be "interesting".

One potential problem with log-based schemes is that they tend to rewrite many more blocks (for example, normally you have to rewrite every single directory up to the root every time you so much as touch an inode to update the atime). For flash memory, this is non-optimal since you have a limited number of write cycles. Although modern flash memories they've extended the number of write cycles significantly, it's still an issue.

Of course, if you don't care about backing store, and want a pure memory-based compressed ram disk, life is much easier --- but writing to it is much less interesting, since it won't survive a reboot.

Bjorn replied that JFFS was already a working read/write filesystem optimized for flash memory - though he acknowledged that the 2.4 port was "early alpha". The 2.0 version, he said, was the one that really worked. Theodore was thrilled to hear this, and asked, "So does it do compression as well? If not, please consider adding it, as the iPAQ folks would love JFFS to pieces if it had that. (They're very space limited on the amount of flash they have --- which is not surprising if you're trying to run a complete Linux operating system plus XFree86 on something that consumers can afford to buy.....)" Bjorn replied, "It does not do compression yet although it would be simple to add it. Probably we'll add it in the 2.0 branch in parallel with the 2.4 version getting mature." He also recommended that the iPAQ folks "run cramfs for all the read-only stuff and just use the current JFFS for configuration/params."

Embedded Systems

Theodore agreed, and Jim Gettys also replied:

We suspect that a combination of cramfs and jffs will serve handhelds very well... I can say from first hand experience that compressed ramdisks and/or cramfs gets very old, very quickly: we really want a writable file system, with (read) compression.

We have both cramfs and jffs cranking over right now on the iPAQ, but haven't cut over to using either quite yet (but probably will in the next few weeks sometime. We initially tried using cramfs, but had a flash driver bug we didn't know about at the time so did ramdisk to get the iPAQ to Usenix. But the current state is not ideal.

NOTE: we don't want/need general compression of data being written. Most data being written (in terms of volume) is likely to be in already compressed formats (e.g. note taking via audio which will then be compressed before being written). You don't want to pay the joules or performance to compress the data twice. Most of the flash is likely to be executables or shared libraries.

A scheme Keith Packard proposed which would work very well for read only data is somewhat similar to cramfs: just compress on 4K boundaries and store an offset table, and mark the file as compressed when you are done; do the obvious uncompression on the marked files on read. My intuition is that this might be a performance win even on vanilla machines today.

So we argue that in fact full general compression of files automatically behind the application's back will in fact be highly counter productive on handheld devices. The stuff we will write the most of will already be compressed...

Ted says Keith's scheme can be done with a stacking file system: this would get it for all file system types, which strikes me as a win. We'll do it someday, if no one gets to it in the meanwhile (it will probably still be months before we get to that point).
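The scheme Jim attributes to Keith Packard (compress on 4K boundaries, store an offset table, uncompress on read) can be sketched in a few lines of user-space Python. This is only an illustration of the idea, not code from any of the projects discussed: the function names, the in-memory layout, and the choice of zlib as the compressor are all assumptions.

```python
import zlib

CHUNK = 4096  # compress on 4K boundaries, as in the description above


def compress_chunked(data, chunk=CHUNK):
    """Compress each chunk independently; return (offset table, payload).

    offsets[i] is where compressed chunk i starts in the payload, and
    offsets[-1] is the total payload length.
    """
    offsets = [0]
    parts = []
    for start in range(0, len(data), chunk):
        blob = zlib.compress(data[start:start + chunk])
        parts.append(blob)
        offsets.append(offsets[-1] + len(blob))
    return offsets, b"".join(parts)


def read_range(offsets, payload, pos, length, chunk=CHUNK):
    """Random-access read: decompress only the chunks that overlap the range."""
    if length <= 0:
        return b""
    end = pos + length
    first = pos // chunk
    last = min((end - 1) // chunk, len(offsets) - 2)
    out = bytearray()
    for i in range(first, last + 1):
        plain = zlib.decompress(payload[offsets[i]:offsets[i + 1]])
        lo = max(pos - i * chunk, 0)
        hi = min(end - i * chunk, len(plain))
        out += plain[lo:hi]
    return bytes(out)


data = bytes(range(256)) * 50            # 12800 bytes, so 4 chunks
offsets, payload = compress_chunked(data)
assert read_range(offsets, payload, 4000, 200) == data[4000:4200]
```

A read that spans a chunk boundary, as in the last line, touches only the two chunks involved. The price is the one Ted and Linus both describe: each chunk restarts the compressor, so the ratio is worse than compressing the file as a whole.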

Paul Rusty Russell had a lower opinion of cramfs, and said, "If you've already got a filesystem, may I recommend you drop cramfs, and use jffs over readonly compressed loopback? That way you don't need the cramfs code (or its in-built limitations), and you get much better compression (in fact, despite its cuteness, I believe cramfs is the wrong solution for everything). I hacked together a compressed loopback one afternoon for the Linuxcare Bootable Business Card (http://www.linuxcare.com/bootable_cd/index.epl) : I think the source is on the LC site somewhere."

But David pointed out that JFFS wouldn't run on block devices, only memory devices like flash chips. And Linus remarked, "At least cramfs works. I have about ten reports of loopback not working lately, and I'm likely to disable it completely unless somebody steps in to maintain the damn thing."

Loopback

Steve Dodd felt that loopback should really be maintained alongside the block device/buffer cache/page cache layers, since it was a fairly special case, and likely always to be fragile. Jeff Garzik pointed out that Mandrake relied on loopback for various things, but agreed it should be disabled if it was broken - as long as it would be fixed eventually. Chris Wedgwood pointed out, "If it is broken -- then it is less so than the 2.3.99-pre kernels. Back then most certainly I couldn't use it, these days I use it all the time -- and I've yet to had it fail on a recent kernel." And Steve Dodd added, "Before .99-pre4 (ish), there was a deadlock which kicked in more or less instantly (related to tq_disk). Disabling plugging on loop cured that, but there are still ways to make it deadlock pretty quickly. Booting with mem=8m and running iozone -a on an ext2-backed loop device dies pretty quickly for me."

Mike Galbraith posted a patch which seemed to fix the deadlock in loopback he'd been seeing. Linus replied:

This is exactly the kind of patch that the loopback device has always needed, and is exactly the reason why I would prefer to kill loopback as soon as possible.

Either loopback is a block device driver, or it isn't. If it is, then it has absolutely no reason to start messing with fs/buffers.c and add special case logic for itself. And if it isn't, then the whole point of loopback is gone.

I'm inclined to mark loopback DANGEROUS because there apparently still isn't a maintainer for it. And the next person who suggests using it instead of a real filesystem (ramfs, cramfs, JFFS) should be forced to actually make it work right first!

Alan Cox explained, "Several folks have tried fixing it. The idea of replacing it with a raid layer equivalent was also kicked around at UKUUG and other places. The theory being that loopback is better done as a block remapping algorithm at the block layer, thus killing the double caching problem, sorting out the lack of read ahead and more." Mike agreed it should be marked dangerous until seriously rewritten, but he said he didn't know how to fix it himself. Linus took another look at loopback and noticed what appeared to be a bug. He gave a fix and there was a bit of technical discussion until Rik van Riel burst through the door, yelling, "NOOOOOOOOOO!!!!!"

Virtual Memory Redesign

Rik said the fixes they were discussing had problems, and proposed a more invasive one of his own. But Linus replied:

Ehh..

We're close to 2.4.x

We need to fix this bug.

We're not adding new untested code. We're fixing bugs.

Rik protested:

But with this "fix" you'd be adding another one in the process.

Admitted, it's only a performance bug, but I found it to grind the machine to an absolute halt when doing IO intensive stuff or running large programs...

Stephen Tweedie, Andrea Arcangeli and me have been looking at this bug and others and have found there's pretty much NO WAY to fix this without some bigger changes in the VM code. Performance will suck in the earlier 2.4 kernels, but I hope to have some new VM code ready later on for a more readable, better maintainable, more stable VM subsystem with somewhat higher performance.

He said he'd write up the new design and post it soon, but Linus replied, "Performance bugs are definitely secondary," and added:

Quite frankly, nobody has convinced me that there any way to fix VM balancing issues even _if_ people were to re-write the VM.

Yes, I've seen a lot of hot air.

The fact is that I suspect that it is fundamentally impossible to balance the VM so that everybody is always happy. People should realize that making more changes in the hope of finally reaching some elusive goal is not always worthwhile.

Strive for a good, stable system that avoids _most_ of the bad performance under normal load. And be prepared to live with the fact that there will always be things you can do to make it behave in nasty ways.

Right now I want things to _work_. Big VM changes are for 2.5.x anyway.

(See 2.2.x for how playing with the VM can cause untold stability woes. I think Alan learned that the hard way).

To this last point, Alan put in:

Yes. Its taken from 2.1.121 or so to 2.2.17pre to get the VM acceptable, and it'll take another 2 or 3 releases doing only gradual tested changes to verify the final few bits to get it to be almost as good as 2.0.

For just about every load I've tested 2.0 is the best stable VM we ever had, late 2.1 was better, 2.2 was bad, 2.4 I can get to the point the box stalls for 45 seconds - as any user.

Most of the post 2.4 proposals look good, because we know they work well and the overhead looks like it can be no worse than in 2.4 for the light load cases. FreeBSD is a very nice test suite for that.

No argument - we cannot do major VM work for 2.4, it'll just have to be tuned to try and get it as good as 2.2. Post 2.4 I'll take a look at doing a 2.4.x-ac with the newer VM work and whatever else escapes for later folding in, providing 2.2 is rock solid by then.

Rik pointed out that Linus had been CCed on most of the emails discussing the new VM system, and Linus replied:

I've seen a lot of discussion, yes.

I haven't seen any really convincing arguments that any of the rewrites would really make things all that better.

Yes, they'll probably fix the thing that you try to fix. And they'll introduce new cases where _they_ work badly, and the old code happened to work fine.

For example, the "dd if=/dev/zero of=file" thing can be made to be very nice on interactive behaviour, and you can obviously design a VM subsystem that does that on purpose. Fine. I bet you that such a VM subsystem has serious problems with some other workloads..

Or the old idea to start writebacks early in order to try to minimize having dirty pages in memory that are hard to get rid of. It's wonderful. For certain loads. And it really sucks on others that have big temp-files that will get deleted (like bench).

The thing that is dangerous about designing a new VM is that you can design it so that it avoids the current pitfalls. But you won't even be aware of the things that the current thing does well, and you may not design it to do as well on those.

And in the end, reality always tends to hit theory hard in the face when you least expect it. That's why I'm not holding my breath for some magical VM rewrite that will fix all performance problems. No matter _how_ much people talk about it..

Alan pointed out that for this very reason, "the fact these folks are understanding why the FreeBSD VM works (something not all the freebsd folks seemed to know) and are working from a known good VM implementation is promising. 2.5 will tell." Rik explained:

Yup, we know why FreeBSD VM works and what its weak points could be. We've also had some help from SGI and Sequent/IBM people as to what the scalability problems of our new VM design could be.

The new VM will be heavily based on FreeBSD VM, which we know works, with some small tweaks where we've tried to come up with scenarios where they'd break (and we failed, so we'll try those tweaks).

Linus replied, "The new VM _will_ be explained to me before anything else." To which Rik agreed, and reiterated that he planned to post the description later that day. Linus also compared the situation to ext3 of 4 years before, in which similar "hot air" was released without patches. Alan pointed out that ext3 had been out and working for a while already, and Linus replied:

yes, within the last four months or so ext3 has actually become reality.

In large part, I suspect, because it became so painfully obvious that ReiserFS was getting quite a lot of attention.

THAT is what I'm complaining about. Not the last couple of months. But the years that preceded it.

I hope the MM thing doesn't turn into that. We need incremental improvements, not grand schemes that get talked about.

Alan replied that it had been a good bit longer than 4 months (10 seemed more accurate), and had the last word:

You've been reading too many conspiracy theories. Ext3 and Reiserfs are not competitors. Ext3 is a tool to journal ext2fs. Its still slow on huge directories and its still got every other ext2 feature good and bad

Reiserfs has fast handling of large file trees, efficient packing of small files and a whole pile of stuff which puts it and XFS as the competitors.

Really its

versus

and more researchy stuff like the tree-phase ext2 which offers a whole pile of interesting future paths that journalling doesn't handle well

Thus endeth the thread.

2. DTR/DSR Handshaking Deferred To 2.5; Linus Firm On Code Freeze

1 Aug 2000 - 10 Aug 2000 (23 posts) Archive Link: "[patch] DTR/DSR hardware handshake support for 2.0/2.2/2.4"

Topics: Code Freeze

People: Martin Schenk, Theodore Y. Ts'o, H. Peter Anvin

Martin Schenk posted small patches against 2.0, 2.2 and 2.4 and explained, "As I needed support for DTR/DSR (http://metalab.unc.edu/pub/Linux/docs/HOWTO/other-formats/html_single/Text-Terminal-HOWTO.html#s10) hardware handshake to communicate with a serial printer (and the "wire RTS/CTS to DTR/DSR" tip from the serial-HOWTO works fine for a demo, but is not applicable for a few thousand POS terminals *gg*), I implemented this functionality." Theodore Y. Ts'o was nonplussed, and replied:

Yilch. This is specialized enough that I'd much rather this *not* go into the kernel. Next thing we know, someone will want a DTR/CD handshaking mechanism, etc.,etc.

This should probably be a private kernel patch, or (much more strongly suggested) that you just get specially wired RS-232 cables. This is in fact a very standard thing to do, and you can order cables from Black Box or some other company specializing selling cables which connect crufty legacy hardware.

But H. Peter Anvin objected, "DTR/DSR is the most common handshake mechanism for RS-232-B (as opposed to RS-232-C and RS-232-D) devices. This feature has been requested on and off for the last five years. I think it's worthwhile." Martin added:

A lot of supermarket cash registers are in fact PCs with special hardware, which for some unclear reason typically knows only about DTR/DSR handshaking (even if it is from other vendors: SNI or EPSON does not make a difference: only DTR/DSR).

Sending people to a thousand stores around the country putting special cables between printers and computers is simply not acceptable (if you know POS service people, you know that about half of the cables would be put on the wrong serial ports).

Frank da Cruz also got requests for DTR/DSR handshaking in kermit, and recommended putting the patch in. Theodore was not dead-set against the patch, though he still said, "I really do wonder how many people *really* need this. Maybe I'll set up a survey on serial.sourceforge.net to determine whether or not there's enough people to want this kind of feature bloat..... in any case, I'm not terribly inclined to consider this before 2.4, especially since Linus finally seems to be serious about the code freeze." Later he confirmed that he'd only consider adding the patch after 2.5 started up.

3. Status Of Dual Athlon Support

2 Aug 2000 - 8 Aug 2000 (44 posts) Archive Link: "Dual athlon support?"

Topics: SMP

People: Stephen Frost, Alan Cox, Pavel Machek, Tom Leete, Dan Hollis

Pavel Machek asked about the status of dual Athlon support. Since AMD sold boxes with AMD-760 and SMP Athlons, he was keen to find out if anyone was working on this, and how far they'd gotten. Stephen Frost remarked, "Athlons aren't really SMP but operate more like Alphas with a P-T-P architecture, from my understanding." He asked for a URL since he hadn't seen any Athlon motherboards for sale anywhere, but there was no reply. Tom Leete also replied to Pavel, saying he'd previously posted a patch to compile Athlon SMP, though he'd only tested it on UP systems.

Alan Cox had also not seen any dual Athlon boards, and didn't know the status of any work on them, but he did say, "I've always understood from AMD that because of the way the apic appears that SMP will just work although the hardware behind the apparent APIC is unrelated." Dan Hollis mentioned here that as far as he knew, dual Athlons were not scheduled for release until the 4th quarter of 2000 at the earliest.

At one point in the course of discussion, Pavel added, "AMD now considers us pretty important, which seems like good news for linux community."

4. VM Design Dispute

2 Aug 2000 - 13 Aug 2000 (52 posts) Archive Link: "RFC: design for new VM"

Topics: BSD: FreeBSD, Clustering, Ottawa Linux Symposium, Virtual Memory

People: Rik van Riel, Linus Torvalds, Andrea Arcangeli, Ben LaHaise, Chris Wedgwood, Stephen Tweedie

Rik van Riel proposed (quoted in full):

here is a (rough) draft of the design for the new VM, as discussed at UKUUG and OLS. The design is heavily based on the FreeBSD VM subsystem - a proven design - with some tweaks where we think things can be improved. Some of the ideas in this design are not fully developed, but none of those "new" ideas are essential to the basic design.

The design is based around the following ideas:

  1. center-balanced page aging, using
    1. multiple lists to balance the aging
    2. a dynamic inactive target to adjust the balance to memory pressure
  2. physical page based aging, to avoid the "artifacts" of virtual page scanning
  3. separated page aging and dirty page flushing
    1. kupdate flushing "old" data
    2. kflushd syncing out dirty inactive pages
    3. as long as there are enough (dirty) inactive pages, never mess up aging by searching for clean active pages ... even if we have to wait for disk IO to finish
  4. very light background aging under all circumstances, to avoid half-hour old referenced bits hanging around

Center-balanced page aging:

  1. goals
    1. always know which pages to replace next
    2. don't spend too much overhead aging pages
    3. do the right thing when the working set is big but swapping is very very light (or none)
    4. always keep the working set in memory in favour of use-once cache
  2. page aging almost like in 2.0, only on a physical page basis
    1. page->age starts at PAGE_AGE_START for new pages
    2. if (referenced(page)) page->age += PAGE_AGE_ADV;
    3. else page->age is made smaller (linear or exponential?)
    4. if page->age == 0, move the page to the inactive list
    5. NEW IDEA: age pages with a lower page age
  3. data structures (page lists)
    1. active list
      1. per node/pgdat
      2. contains pages with page->age > 0
      3. pages may be mapped into processes
      4. scanned and aged whenever we are short on free + inactive pages
      5. maybe multiple lists for different ages, to be better resistant against streaming IO (and for lower overhead)
    2. inactive_dirty list
      1. per zone
      2. contains dirty, old pages (page->age == 0)
      3. pages are not mapped in any process
    3. inactive_clean list
      1. per zone
      2. contains clean, old pages
      3. can be reused by __alloc_pages, like free pages
      4. pages are not mapped in any process
    4. free list
      1. per zone
      2. contains pages with no useful data
      3. we want to keep a few (dozen) of these around for recursive allocations
  4. other data structures
    1. int memory_pressure
      1. on page allocation or reclaim, memory_pressure++
      2. on page freeing, memory_pressure-- (keep it >= 0, though)
      3. decayed on a regular basis (eg. every second x -= x>>6)
      4. used to determine inactive_target
    2. inactive_target == one (two?) second(s) worth of memory_pressure, which is the amount of page reclaims we'll do in one second
      1. free + inactive_clean >= zone->pages_high
      2. free + inactive_clean + inactive_dirty >= zone->pages_high + one_second_of_memory_pressure * (zone_size / memory_size)
    3. inactive_target will be limited to some sane maximum (like, num_physpages / 4)

The idea is that when we have enough old (inactive + free) pages, we will NEVER move pages from the active list to the inactive lists. We do that because we'd rather wait for some IO completion than evict the wrong page.

Kflushd / bdflush will have the honourable task of syncing the pages in the inactive_dirty list to disk before they become an issue. We'll run balance_dirty over the set of free + inactive_clean + inactive_dirty AND we'll try to keep free+inactive_clean > pages_high .. failing either of these conditions will cause bdflush to kick into action and sync some pages to disk.

If memory_pressure is high and we're doing a lot of dirty disk writes, the bdflush percentage will kick in and we'll be doing extra-aggressive cleaning. In that case bdflush will automatically become more aggressive the more page replacement is going on, which is a good thing.

Physical page based page aging

In the new VM we'll need to do physical page based page aging for a number of reasons. Ben LaHaise said he already has code to do this and it's "dead easy", so I take it this part of the code won't be much of a problem.

The reasons we need to do aging on a physical page are:

  1. avoid the virtual address based aging "artifacts"
  2. more efficient, since we'll only scan what we need to scan (especially when we'll test the idea of aging pages with a low age more often than pages we know to be in the working set)
  3. more direct feedback loop, so less chance of screwing up the page aging balance

IO clustering

IO clustering is not done by the VM code, but nicely abstracted away into a page->mapping->flush(page) callback. This means that:

  1. each filesystem (and swap) can implement their own, isolated IO clustering scheme
  2. (in 2.5) we'll no longer have the buffer head list, but a list of pages to be written back to disk, this means doing stuff like delayed allocation (allocate on flush) or kiobuf based extents is fairly trivial to do

Misc

Page aging and flushing are completely separated in this scheme. We'll never end up aging and freeing a "wrong" clean page because we're waiting for IO completion of old and to-be-freed pages.

Write throttling comes quite naturally in this scheme. If we have too many dirty inactive pages we'll write throttle. We don't have to take dirty active pages into account since those are no candidate for freeing anyway. Under light write loads we will never write throttle (good) and under heavy write loads the inactive_target will be bigger and write throttling is more likely to kick in.

Some background page aging will always be done by the system. We need to do this to clear away referenced bits every once in a while. If we don't do this we can end up in the situation where, once memory pressure kicks in, pages which haven't been referenced in half an hour still have their referenced bit set and we have no way of distinguishing between newly referenced pages and ancient pages we really want to free. (I believe this is one of the causes of the "freeze" we can sometimes see in current kernels)

Over the next weeks (months?) I'll be working on implementing the new VM subsystem for Linux, together with various other people (Andrea Arcangeli??, Ben LaHaise, Juan Quintela, Stephen Tweedie). I hope to have it ready in time for 2.5.0, but if the code turns out to be significantly more stable under load than the current 2.4 code I won't hesitate to submit it for 2.4.bignum...
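As a reading aid, the page-aging rules and the memory_pressure counter from the proposal above can be modelled in a few lines of Python. This is a toy model, not kernel code and not Rik's implementation. The values of PAGE_AGE_START and PAGE_AGE_ADV, the choice of exponential (halving) decay where the proposal leaves "linear or exponential?" open, the clearing of the referenced bit on each scan, and the single Zone class holding all the lists (the proposal keeps the active list per node and the inactive lists per zone) are all assumptions made for illustration.

```python
from dataclasses import dataclass

PAGE_AGE_START = 2  # assumed value; the proposal names the constant only
PAGE_AGE_ADV = 3    # assumed value; the proposal names the constant only


@dataclass
class Page:
    name: str
    age: int = PAGE_AGE_START
    referenced: bool = False
    dirty: bool = False


class Zone:
    def __init__(self):
        self.active = []          # pages with age > 0
        self.inactive_dirty = []  # old pages that must be written out first
        self.inactive_clean = []  # old pages that can be reused immediately
        self.memory_pressure = 0

    def add_page(self, page):
        """Allocate a page: it starts on the active list and adds pressure."""
        self.active.append(page)
        self.memory_pressure += 1

    def age_active_pages(self):
        """One aging pass over the active list."""
        still_active = []
        for page in self.active:
            if page.referenced:
                page.age += PAGE_AGE_ADV
                page.referenced = False  # assumption: bit is cleared per scan
            else:
                page.age //= 2           # assumption: exponential decay
            if page.age > 0:
                still_active.append(page)
            elif page.dirty:
                self.inactive_dirty.append(page)
            else:
                self.inactive_clean.append(page)
        self.active = still_active

    def decay_pressure(self):
        """Periodic decay, x -= x >> 6, as given in the proposal."""
        self.memory_pressure -= self.memory_pressure >> 6


zone = Zone()
hot, cold, scratch = Page("hot"), Page("cold"), Page("scratch", dirty=True)
for page in (hot, cold, scratch):
    zone.add_page(page)
for _ in range(2):
    hot.referenced = True  # only the hot page is touched between passes
    zone.age_active_pages()
print([p.name for p in zone.active])          # ['hot']
print([p.name for p in zone.inactive_clean])  # ['cold']
print([p.name for p in zone.inactive_dirty])  # ['scratch']
```

With these assumed constants an untouched page leaves the active list after two passes, while a page referenced between passes keeps gaining age. Dirty and clean pages end up on separate inactive lists, which is what lets flushing proceed without disturbing the aging of active pages.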

There was some documentation discussion: since Rik had based his proposal on the FreeBSD design, Chris Wedgwood asked if the differences between the two could be clarified, so that performance differences etc., could be identified with different parts of the design when appropriate. Rik agreed that the differences should be clearly indicated, and went on to bemoan, "The amount of documentation (books? nah..) on VM is so sparse that it would be good to have both systems properly documented. That would fill a void in CS theory and documentation that was painfully there while I was trying to find useful information to help with the design of the new Linux VM..." Matthew Dillon had also found it difficult to find anything that didn't focus on only single aspects of VM design.

Linus Torvalds came down pretty hard on Rik's design, saying that using a multi-list approach would be more difficult, and wouldn't help balancing. He acknowledged that it would help avoid the overhead of walking extra pages, but this seemed beside the point. To Linus, Rik's attitude that the current VM was irreparably broken didn't jibe with the old and new designs being, as he saw it, functionally equivalent. He also accused Rik of "selling" his design, rather than putting it forward on technical merits, and asked Rik to explain why the new design was so much better than the old. He summarized at length:

The reason I'm unconvinced about multiple lists is basically:

To make a long story short, I'd rather see a proof-of-concept thing. And I distrust your notion that "we can't do it with the current setup, we'll have to implement something radically different".

Basically, IF you think that your newly designed VM should work, then you should be able to prototype and prove it easily enough with the current one.

I'm personally of the opinion that people see that page aging etc is hard, so they try to explain the current failures by claiming that it needs a completely different approach. And in the end, I don't see what's so radically different about it - it's just a re-organization. And as far as I can see it is pretty much logically equivalent to just minor tweaks of the current one.

(The _big_ change is actually the addition of a proper "age" field. THAT is conceptually a very different approach to the matter. I agree 100% with that, and the reason I don't get all that excited about it is just that we _have_ done page aging before, and we dropped it for probably bad reasons, and adding it back should not be that big of a deal. Probably less than 50 lines of diff).

Rik countered that basing the lists on the page age made a big difference; there was more to be gained from multiple lists, he said, than just to save time walking pages. He explained:

We need different queues so waiting for pages to be flushed to disk doesn't screw up page aging of the other pages (the ones we absolutely do not want to evict from memory yet).

That the inactive list is split into two lists has nothing to do with page aging or balancing. We just do that to make it easier to kick bdflush and to have the information available we need for eg. write throttling.

He added that the current scheme didn't have enough information available to do proper balancing, but that having multiple lists would automatically provide all needed information. This, he went on, was the difference between his scheme and Linus' counter-proposal of 'scan points'. He added:

If there was any hope that the current VM would be a good enough basis to work from I would have done that. In fact, I tried this for the last 6 months and horribly failed.

Other people have also tried (and failed). I'd be surprised if you could do better, but it sure would be a pleasant surprise...

Finally, he concluded:

While page aging is a fairly major part, it is certainly NOT the big issue here...

The big issues are:

Linus exhorted Rik to go back and reread Linus' previous email, and

Realize that your "multiple queues" is nothing more than "cached information". They do not change _behaviour_ at all. They only change the amount of CPU-time you need to parse it.

Your arguments do not seem to address this issue at all.

In my mailbox I have an email from you as of yesterday (or the day before) which says:

I will not try to balance the current MM because it is not doable

And I don't see that your suggestion is fundamentally adding anything but a CPU timesaver.

Basically, answer me this _simple_ question: what _behavioural_ differences do you claim multiple queues have? Ignore CPU usage for now.

I'm claiming they are just a cache.

And you claim that the current MM cannot be balanced, but your new one can.

Please reconcile these two things for me.

Rik agreed that his multiple lists were functionally the same as a single list, with, he added, "statistics about how many pages of age 0 there are." He agreed there were other ways to do what he'd proposed, such as having a single list and keeping multiple counters for the stats he felt would enable proper balancing. But he added, "What I fail to see is why this would be preferable to a code base where all the different pages are neatly separated and we don't have N+1 functions that are all scanning the same list, special-casing out each other's pages and searching the list for their own special pages..." Linus replied:

I disagree just with the "all improved, radically new, 50% more for the same price" ad-campaign I've seen.

I don't like the fact that you said that you don't want to worry about 2.4.x because you don't think it can be fixed as it stands. I think that's a cop-out and dishonest. I think I've explained why.

I could fully imagine doing even multi-lists in 2.4.x. I think performance bugs are secondary to stability bugs, but hey, if the patch is clean and straightforward and fixes a performance bug, I would not hesitate to apply it. It may be that going to multi-lists actually is easier just because of some things being more explicit. Fine.

But stop the ad-campaign. We get too many biased ads for presidents-to-be already, no need to take that approach to technical issues. We need to fix the VM balancing, we don't need to sell it to people with buzz-words.
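The disagreement above can be made concrete with a toy userspace model. The sketch below is not kernel code: the aging rule (a fixed bonus for referenced pages, halving for the rest), the constants, and the queue names are all assumptions chosen for illustration. It shows a background aging pass that clears referenced bits, then computes the same page statistics two ways: from separate queues, and by scanning one list.

```python
from dataclasses import dataclass

MAX_AGE = 64      # assumed cap on the age field
AGE_BONUS = 3     # assumed increment for a referenced page


@dataclass
class Page:
    name: str
    age: int = 0
    referenced: bool = False
    dirty: bool = False


def age_pages(pages):
    """One background aging pass: reward referenced pages, decay the rest."""
    for page in pages:
        if page.referenced:
            page.age = min(MAX_AGE, page.age + AGE_BONUS)
            page.referenced = False   # clear the bit so stale references fade
        else:
            page.age //= 2


def split_into_queues(pages):
    """Multi-list view: active pages, plus inactive pages split by dirtiness."""
    queues = {"active": [], "inactive_dirty": [], "inactive_clean": []}
    for page in pages:
        if page.age > 0:
            queues["active"].append(page)
        elif page.dirty:
            queues["inactive_dirty"].append(page)
        else:
            queues["inactive_clean"].append(page)
    return queues


def count_on_single_list(pages):
    """Single-list view: the same statistics, recomputed by a full scan."""
    counts = {"active": 0, "inactive_dirty": 0, "inactive_clean": 0}
    for page in pages:
        if page.age > 0:
            counts["active"] += 1
        elif page.dirty:
            counts["inactive_dirty"] += 1
        else:
            counts["inactive_clean"] += 1
    return counts
```

In this model the queue lengths always match the single-list counts, which is the sense in which Linus calls the queues "cached information". What the queues change is who has to walk which pages, which is the code-organization point Rik raises.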

5. Latest Lowlatency Patch For 2.4

3 Aug 2000 - 14 Aug 2000 (24 posts) Archive Link: "[patch] lowlatency patch for 2.4, lowlatency-2.4.0-test6-B5"

Topics: Assembly, Virtual Memory

People: Ingo Molnar, Andrew Morton, Jamie Lokier

Ingo Molnar announced:

i've ported my 2.2 lowlatency patch to 2.4.0-test6-pre1. The vanilla 2.4 kernel fixed some latencies present in 2.2.16, but it also introduced a few new ones - and it keeps the fundamental latency sources largely unchanged, so the size and scope of the lowlatency patch has not changed much:

http://www.redhat.com/~mingo/lowlatency-patches/lowlatency-2.4.0-test6-B5

this patch is *not* yet intended to be merged into the mainstream kernel. I'd first like to see what kind of latencies and behavior people see, then i'll split the patch up into an 'uncontroversial' and 'controversial' part.

especially due to the VM changes i'd like people to try this, as in my experience it makes the system much 'smoother' during heavy VM load. The stock VM creates latencies up to 200 msec (!) on a 256MB box, and 200 msec can be easily noticed by humans as well.

the patch is a 'take no prisoners' solution, ie. i fixed all latency sources i could identify, no matter what the fix does to code 'beauty'. I strongly disagree with the "it's ok in 99.9% of the cases" approach, because in fact it's very easy to trigger bad latencies under various (common) workloads. And i just do not want Linux to become another Windows: "well you can play music just fine, as long as you dont do this and dont do that, and for God's sake, put enough RAM into your system.".

With this patch applied i was unable to trigger larger than 0.5 msec latencies even under extreme VM load in 100.0% of the cases - with the typical latencies in an unloaded system being around 0.1 msec. The patch fixes some 'scalability latency sources', ie. extreme latencies which show up only if a process has many open files, has lots of VM allocated.

95% of the conditional schedule points the patch adds fix some real latency that caused bigger than 1msec latencies under realistic (and common) workloads.

Main changes:

reports, comments, suggestions welcome!
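The idea behind a conditional schedule point can be sketched with a small latency model. This is only an illustration under stated assumptions, not the patch's mechanism: work is counted in abstract "items", a wakeup request arrives at a given item, and the waiting task can only run at the next schedule point (or when the loop ends). The function name and parameters are hypothetical.

```python
def longest_uninterrupted_run(total_items, resched_requests, check_every=None):
    """Return the longest stretch of work items done with a reschedule request pending.

    resched_requests: set of item indices at which a wakeup arrives.
    check_every: spacing of conditional schedule points (None means no points,
    so a pending request waits until the loop ends).
    """
    pending_since = None
    worst = 0
    for i in range(total_items):
        if i in resched_requests and pending_since is None:
            pending_since = i
        at_schedule_point = check_every is not None and (i + 1) % check_every == 0
        if at_schedule_point and pending_since is not None:
            worst = max(worst, i + 1 - pending_since)
            pending_since = None          # the waiting task gets to run here
    if pending_since is not None:
        worst = max(worst, total_items - pending_since)
    return worst
```

With a 1000-item loop and a wakeup at item 10, the model reports a wait of 990 items without schedule points and 6 items with a point every 16 items. The worst case is bounded by the spacing of the points rather than by the length of the loop.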

Later, he added:

the newest version of the lowlatency patch can be downloaded from:

http://www.redhat.com/~mingo/lowlatency-patches/lowlatency-2.4.0-test6-C4

Changes:

enjoy - reports, comments, suggestions welcome.

Andrew Morton replied:

Comments and testing results:

6. Per-User Resources In 2.4 And 2.5

5 Aug 2000 - 8 Aug 2000 (16 posts) Archive Link: "can't mlockall() more than 128MB, is this a kernel limitiation ?"

People: Alan Cox, Andi Kleen, Robert H. de Vries, Linus Torvalds

In the course of discussion, Alan Cox remarked, "Right now Linux isnt tracking per user resources. You need the beancounter addons to implement per user memory." Andi Kleen modified, "Actually test6-pre* seems to, at least for files and processes. See linux/kernel/user.c" and Robert H. de Vries also replied to Alan, "I think Linus has just (test6-pre series) put this facility in the kernel. See the new kernel/user.c"

Alan replied, "Yep - sort of a toy edition of beancounter. Thats not knocking it - the full beancounter isnt 2.4 material.." And Linus Torvalds also said to Robert:

Yes and no.

The "new" user.c is not actually new at all. It's the same old "struct user_struct" that we've had for a long time, and that tracks the number of processes a specific user has. You'll find the same "struct user_struct" in linux-2.2 too - this is much older than the 2.3.x development tree.

The new thing is that it's just separated out - it used to be in kernel/fork.c, and nothing else really knew about it. But it is basically the same old code in a new location: kernel/user.c.

The only _new_ thing in the code is due to "future expansion" changes: the "struct user_struct" thing has always had a reference counter, and that reference counter was also used as the "nr of processes using this" counter: they were one and the same. For future expansion, I split up the reference counter and the process counter into two: they should currently always be the same, but they won't be forever.

The reason? We can expand it to count more than just processes. And when we do that, we'll need to have the reference counter be independent of the things we count.

But no, it's not really new code, just a re-organization of something we've had for a long time (along with bug-fixes: the stuff in kernel/sys.c are real fixes for cases that could have caused us to ignore the process counts completely under low memory circumstances. The new code will correctly handle the case of not having enough memory to create a new virtual user).
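The split Linus describes, between the counter that keeps the per-user record alive and the counter of what is being limited, can be sketched as a toy model. This is not the code in kernel/user.c: the class and method names are hypothetical, and the `files` counter is an assumed example of the "future expansion" he mentions.

```python
class UserStruct:
    """Toy per-user record with separate reference and process counters."""

    def __init__(self, uid):
        self.uid = uid
        self.refcount = 0     # how many holders keep this record alive
        self.processes = 0    # what is being counted for limit checks
        self.files = 0        # hypothetical future counter, not from the source


class UserTable:
    def __init__(self):
        self._users = {}

    def get(self, uid):
        user = self._users.setdefault(uid, UserStruct(uid))
        user.refcount += 1
        return user

    def put(self, user):
        user.refcount -= 1
        if user.refcount == 0:
            del self._users[user.uid]

    def fork(self, uid, max_processes):
        """Take a reference and count a process, or refuse if over the limit."""
        user = self.get(uid)
        if user.processes >= max_processes:
            self.put(user)
            return None
        user.processes += 1
        return user

    def exit(self, user):
        user.processes -= 1
        self.put(user)

    def open_file(self, uid):
        user = self.get(uid)
        user.files += 1
        return user

    def close_file(self, user):
        user.files -= 1
        self.put(user)

    def __contains__(self, uid):
        return uid in self._users
```

As long as only processes are counted, `refcount` and `processes` move together. Once a second kind of resource holds a reference, the record can outlive the user's last process, which is why the two counters have to be independent.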

7. RAID Docs Out Of Date

7 Aug 2000 - 8 Aug 2000 (5 posts) Archive Link: "RAID questions"

Topics: Disk Arrays: RAID

Adam McKenna complained that documentation for software RAID was way out of date and gave misleading and inaccurate information. Considering that he was working on a stable kernel, this was very surprising to him, and he asked for info on how to get software RAID working on the latest stable kernels. Andrew Pochinsky gave a link to http://people.redhat.com/mingo/raid-patches/raid-2.2.16-A0, which he said worked fine on a stock 2.2.16 tree, with the Red Hat 6.2 RAID tools. Gregory Leblanc also replied to Adam, saying he'd started a FAQ that went monthly to the Linux-RAID mailing list.

8. New Tool For Kernel Configuration

7 Aug 2000 - 8 Aug 2000 (6 posts) Archive Link: "A new config program -- anyone interested?"

People: Michael Elizabeth Chastain, Paul Vojta

Paul Vojta was fed up with the standard kernel configuration tools, since 'make config', 'make menuconfig', and 'make xconfig' were interactive by nature. All he wanted was to be able to transfer the configuration of one kernel to another, so he wrote the 'qconfig' program. Several people pointed out that 'make oldconfig' would have given him the noninteractive configuration he was after, and he later agreed that if he'd remembered 'make oldconfig', he probably wouldn't have bothered with 'qconfig'. But he and Michael Elizabeth Chastain also pointed out that 'qconfig' did have some significant improvements over 'make oldconfig'. As Michael put it, "With qconfig, if a variable is not in qconfig.in, and Linus changes the default value for that variable in arch/$(ARCH)/defconfig, qconfig will incorporate that change. oldconfig won't." And Paul added, "If you diff the qconfig.out files, you find out what questions have disappeared, and what questions have additional options or changed status (e.g., no longer experimental, different text, etc.). You're better able to track how your setup differs from the default."

Michael added that it should be possible to implement 'qconfig' with a lot less code, and posted a brief Makefile recipe:

# Makefile rules
qconfig:
	cat arch/$(ARCH)/defconfig qconfig.in > .config
	$(MAKE) oldconfig # or copy the oldconfig rules here
	diff arch/$(ARCH)/defconfig .config | awk {blah, blah, blah} ... > qconfig.out
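The overlay idea in Michael's recipe can be sketched in a few lines of Python. This is a rough model, not qconfig itself: it assumes qconfig.in uses ordinary .config syntax (as the `cat` in the recipe suggests), it skips the dependency resolution that 'make oldconfig' performs, and the function names are made up for illustration.

```python
def parse_config(text):
    """Parse 'CONFIG_X=value' and '# CONFIG_X is not set' lines into a dict."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(" is not set"):
            options[line[2:-len(" is not set")]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def merge_config(defconfig_text, qconfig_in_text):
    """Start from the defaults and let the user's qconfig.in entries win."""
    merged = parse_config(defconfig_text)
    merged.update(parse_config(qconfig_in_text))
    return merged


def diff_from_default(defconfig_text, merged):
    """Report options whose merged value differs from (or is absent in) the defaults."""
    defaults = parse_config(defconfig_text)
    return {name: value for name, value in merged.items()
            if defaults.get(name) != value}
```

Because the merge starts from defconfig every time, a changed default flows through for any option the user has not pinned in qconfig.in, and the diff stays limited to the user's own deviations.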

9. "Heap Of Bugs" Found In 2.4 Drivers

7 Aug 2000 - 8 Aug 2000 (8 posts) Archive Link: "[PATCH] checking kmalloc, init_etherdev and other fixes"

People: Arnaldo Carvalho de Melo, Jeff Garzik, David S. Miller, Linus Torvalds

Arnaldo Carvalho de Melo posted a patch and explained, "This patch mostly includes checks for kmalloc and init_etherdev in the net drivers, but also fixes some bugs on some drivers, please take a look and consider applying." David S. Miller pointed out some problems with the patch, and Arnaldo posted a corrected patch. Jeff Garzik replied, "You have definitely found a heap of bugs. The patch does need a little work though." He went on to describe various problems with the patch, and apparently Linus Torvalds also worked on it a bit with Arnaldo in private.

10. SGI Starts "Linux Test Project" Testing Suite

8 Aug 2000 - 12 Aug 2000 (22 posts) Archive Link: "[Announce] Linux Test Project"

Topics: SMP, User-Mode Linux

People: Nathan Straz, Jeff Garzik, Jeff Dike, David Mansfield, Andi Kleen, Horst von Brand

Nathan Straz announced:

SGI would like to announce the Linux Test Project. The goal of this project is to create a formalized test system for the Linux kernel.

We have released a set of 96 tests on the project's website (http://oss.sgi.com/projects/ltp/). These tests exercise file systems and system calls and can be used for stress testing or sanity tests.

We would like to discuss the following topics with the community.

  1. The testing philosophy that is most important to the kernel developers. What approach best fits the development process? Regression? Functional? Stress? Performance?
  2. What is needed immediately? Building a test suite for the kernel is going to take time. What tests or tools are most important?
  3. We need to plan a development road map that works with the Linux kernel development road map.

We are hoping to hold an unscheduled BOF at LinuxWorld on Wednesday. Aaron Laffin and Richard Logan will be there to discuss these issues. If you are interested in testing and are attending LinuxWorld, please keep an eye open for our BOF.

A lot of folks cheered this idea, and gave feedback. Jeff Garzik felt that regression testing (testing old features to make sure new ones don't break stuff) and stress testing would be the two most important things to work on, with regression having first priority, adding, "Regression testing provides more stability of interface and code in the long run. Stress testing tools tend to focus on a few specific areas of the code, and be completely inadequate for covering certain cases." He also suggested culling and unifying test suites from the various distribution vendors, since they each seemed to have their own unique set of tests. Horst von Brand added that a good regression suite would pretty much be dependent on an existing functional test suite, so he recommended putting functional tests first in priority.

Jeff Dike also suggested, "Coverage isn't mentioned. If you are interested in doing a coverage test suite, then you should look into using gcov in conjunction with user-mode linux (http://user-mode-linux.sourceforge.net). I've done this in the past, and it works just fine." Nathan felt that a coverage suite wouldn't do much for the Linux kernel, since they wanted to test functionality more than the code itself. But he added that if they did decide to start covering code, user-mode Linux would be the way to go.

Andi Kleen also suggested that a suite to test common system calls in parallel on multiple CPUs could catch locking bugs that might have been introduced by 2.4's SMP scaling work. Nathan replied, "You should check out the tests we just released. They are "quickhitters" which are very simple tests that exercise system calls. You can run quickhitters with a "-c n" option which creates n copies of the test and runs them simultaneously." He added that these tests had not been created with Andi's idea in mind, and tended to interfere with each other, but he felt a fix would be possible, and offered to supply more information to anyone interested in working on it.
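One way to keep simultaneous copies of a test from interfering with each other is to give each copy a private scratch directory. The sketch below only illustrates that idea and is not how the LTP tests work: the function names are hypothetical, and it uses threads for brevity where separate processes would be closer to truly independent copies.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def quick_test(workdir):
    """A tiny system-call exercise: create, write, read back, and unlink a file."""
    path = os.path.join(workdir, "scratch")
    fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_EXCL, 0o600)
    try:
        os.write(fd, b"hello")
        os.lseek(fd, 0, os.SEEK_SET)
        ok = os.read(fd, 5) == b"hello"
    finally:
        os.close(fd)
        os.unlink(path)
    return ok


def run_copies(test, copies):
    """Run `copies` instances of `test` at once, each in its own scratch directory."""
    def one_copy(_):
        with tempfile.TemporaryDirectory() as workdir:
            return test(workdir)

    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(one_copy, range(copies)))
```

Calling `run_copies(quick_test, 4)` returns one pass/fail result per copy. Because every copy uses the same file name inside its own directory, the `O_EXCL` create cannot collide with a sibling copy.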

David Mansfield also asked how the test suite would deal with tests that caused pathological behavior in the kernel, "for example, 'infinite' hangs in the MM system during OOM, or crashes (OOPSes, panics) or deadlocks (process stuck in 'D' state)." All the tests he normally performed on kernels, he went on, involved situations like these, and he felt that any test suite would have to deal with them somehow. Pavel suggested running those tests in user-mode Linux, which would keep the machine up even if the user-mode kernel crashed. Nathan replied to David, "We definitely need to build into the framework a way to recover from problems like this. If this will be some type of automated reboot, or someone walking in and rebooting the machine manually, I don't know. My goals are to get the system as automated as possible. It may turn out that we will include these tests as manual tests for completeness." Andi replied to this, suggesting the "software watchdog", which would reboot the system if its daemon failed to write to a specific file at regular intervals. He went on, "I would recommend configuring the software watchdog before running any critical tests. The test procedure could also use a simple log mechanism (write a START TEST record to a log file, fsync it) and a restart mechanism that tries to figure out any crashes so they can be logged."
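The log-and-restart mechanism Andi describes might look roughly like the following. This is a minimal sketch under assumptions: the record format ("START name" / "END name"), the log path, and the function names are invented here, and the watchdog itself is left out.

```python
import os


def log_record(log_path, kind, test_name):
    """Append one record and force it to disk before the test proceeds."""
    with open(log_path, "a") as log:
        log.write(f"{kind} {test_name}\n")
        log.flush()
        os.fsync(log.fileno())


def run_logged(log_path, test_name, test):
    """Bracket a test with START and END records."""
    log_record(log_path, "START", test_name)
    result = test()
    log_record(log_path, "END", test_name)
    return result


def find_interrupted(log_path):
    """After a restart, list tests with a START record but no matching END."""
    started = []
    if not os.path.exists(log_path):
        return started
    with open(log_path) as log:
        for line in log:
            kind, _, name = line.strip().partition(" ")
            if kind == "START":
                started.append(name)
            elif kind == "END" and name in started:
                started.remove(name)
    return started
```

The fsync matters because a test that hangs or crashes the machine may never let buffered log data reach the disk. With the START record already durable, a restart script can call `find_interrupted` and report which test was running when the watchdog fired.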

11. Linux 2.2.17pre16

10 Aug 2000 - 14 Aug 2000 (13 posts) Archive Link: "Linux 2.2.17pre16"

Topics: Networking, PCI, Sound: i810

People: Alan Cox, Andrew Morton, Paul Mackerras, Jan Harkes, Andrey Savochkin, Pontus Fuchs, Marcelo Tosatti, Bill Nottingham, Dave Jones, Donald Becker, Andi Kleen, Patrick van de Lageweg

Clustering: Beowulf

Alan Cox posted the CHANGELOG for 2.2.17pre16:

2.2.17pre16

  1. Thinkpad hacks and external amp support for CS46xx, also fix mono (Bill Nottingham, me, David Kaiser)
  2. Actually fix i810 audio hangs and other stuff (me)
  3. Dave Jones addr change (Dave Jones)
  4. Fix long standing vm hang bug (Marcelo Tosatti)
  5. Fix irda memory leak (Pontus Fuchs)
  6. Minor further PPC fixes (Paul Mackerras)
  7. Fix PCI id ordering (Paul Mackerras)
  8. 3Ware corrected update (Adam Radford Joel Jacobson)
  9. Fix stale documentation in proc.txt (Paonia Ezrine)
  10. Fix the TCP/vm bug nicely (Andi Kleen)
  11. Add 3c556 support to the 3c59x driver (Andrew Morton)
  12. Switch eepro100 to I/O mode pending investigation (Andrey Savochkin)
  13. Fix 'Donald Duck impressions' in ES1879 audio (Bruce Forsberg)
  14. CODA fs fixes for 2.2.17pre (Jan Harkes)
  15. RIO serial driver update (Patrick van de Lageweg)
  16. Minimal version of the at1700 fix [From Hiroaki Nagoya's original stuff] (Brian S. Julin)
  17. Typo fix in sysctl vm docs (Dave Jones)
  18. DAC960 update to rev 2.2.7 (Leonard Zubkoff)

To item 11 ("Add 3c556 support to the 3c59x driver"), Andrew Morton corrected:

Support is partial because the 3c59x.c in kernel 2.2 does not support power management. A moderate amount of mangling will be needed to make it do so.

The workaround is to add something like the following to your power management `resume' script:

ifdown eth0
rmmod 3c59x
modprobe 3c59x
ifup eth0

The 3c556 is also supported by Donald Becker's driver (http://www.scyld.com). Although that driver does support power management, it does not yet do so for the 3c556.

Another variant of this device has been reported. It has a PCI device ID of 0x6056. It has not yet responded to resuscitation attempts.

Additional details are on Fred Maciel's page at http://www2.neweb.ne.jp/wd/fbm/3c556

12. Linux-2.4.0-test6

9 Aug 2000 - 10 Aug 2000 (6 posts) Archive Link: "Linux-2.4.0-test6"

Topics: Disks: IDE, FS: UMSDOS, FS: ext2, Kernel Release Announcement, Networking, PCI

People: Linus Torvalds

Linus Torvalds announced Linux 2.4.0-test6, saying:

Ok, test6 is there now:

Changes in test6:

  1. speling fixces.
  2. fix drm/agp initialization issue
  3. saner modules installation (*) NOTE! This may/will break some module setups. Files go in different places. Better places.
  4. per-CPU irq count area. Better for caches, simpler code.
  5. "mem_map + MAP_NR(x)" => virt_to_page(x) (*) Purely syntactic change at this point. NUMA memory handling will take advantage of this during 2.5.x
  6. page_address() returns (void *) to make it clearer that it is a virtual address (it's the reverse of "virt_to_page()", see above).
  7. zimage builds should work again.
  8. Make current gcc's able to compile the kernel.
  9. fix irq probing in IDE driver: this caused strange irq problems for other drivers later on (notably PCMCIA, which is one of the few drivers to still probe for ISA interrupts on modern machines).
  10. Intel microcode update update.
  11. mips/mips64/sh/sparc/sparc64/acorn updates
  12. DAC960 driver update
  13. floppy shouldn't scream on open/close
  14. console driver does correct palette setting. No more black screens with XF86-4.x
  15. ISDN updates
  16. PCI layer can assign resources from multiple IO and memory windows
  17. yenta_socket driver no longer oopsable on unload.
  18. flush_dcache_page() for more virtual dcache coherency issues
  19. ext2_get_block() races fixed
  20. jffs bugfixes galore.
  21. user resource tracking infrastructure re-organization.
  22. umsdos works again.
  23. loopback shouldn't deadlock

Tons of small stuff. Holler if there's something bad.

13. NT/HFS-Style Multiple "Resources" In A Single File

11 Aug 2000 - 15 Aug 2000 (332 posts) Archive Link: "NTFS-like streams?"

Topics: Extended Attributes, FS: NFS, FS: NTFS, POSIX

People: Michael Rothwell, Linus Torvalds, Alexander Viro, Pavel Machek

Christopher Vickery suggested an NT-like feature, whereby a single file could have several streams of data, each of which could be operated on as a unique file. A lot of folks were against this, and various alternatives were suggested, such as simply using directories with multiple files. In one of the many subthreads branching off of this post, Michael Rothwell noted that such a thing did not yet exist for Linux, and explained, "There's two different ways of doing it currently; the BeOS way and the NT way. As you said, NT makes a namespace augmentation, using the ":" character to delineate attribute names from file names. This is called "named streams". BeOS does not do that, but provides special accessor functions instead; this is called "extended attributes." They both accomplish the same goal though: keeping extra data about a file with the file." Linus Torvalds replied:

Note that this is a subset of what I wanted to make sure the Linux VFS layer can do: if a filesystem has multiple forks in a file, the VFS layer should be able to handle it by just doing the normal "readdir()" and "lookup()" on such regular files.

Of course, no UNIX filesystem does this, so it has never gotten any testing. But the plan was (and is) that if somebody wants to implement resource forks, then it should be possible without any hackery.

Linux does _not_ use the ":" character, of course. Linux uses the same old "/" that it always uses for delineating names. That's pretty built-in into the VFS layer.

But it definitely should not be impossible to have a file called

~/myfile

and then access the "Icon" resource in it by just doing

xv ~/myfile/Icon

It requires that the low-level FS know what it is doing, and it may require some changes (small) to the VFS layer just because it has never been done before (and I'd be surprised if such resource forks didn't uncover _something_), but it should be entirely doable.

When the "use directories" argument was put forward again, Linus drove his point home:

I'll talk really slowly.

HFS has resource forks. They are not directories. Linux cannot handle them well.

I'm all for handling HFS resource forks. It's called "interoperability".

It's also realizing that maybe, just maybe, UNIX didn't invent every clever idea out there. Maybe, just maybe, resource forks are actually a good idea. And maybe we shouldn't just say "Oh, UNIX already has directories, we don't need no steenking resource forks".

Put this another way: don't think about "directories vs resource forks" at all. Instead, think about the problem of supporting something like HFS or NTFS _well_ from Linux. How would you do it?

Suggestions welcome. What's your interface of choice for a filesystem like HFS that _does_ have resource forks? Whether you like them or not is completely immaterial - they exist.

And usability concerns _are_ real concerns. I'm claiming that the best interface for such a filesystem would be

open("file", O_RDONLY) - opens the default fork
open("file/Icon", O_RDONLY) - opens the Icon fork
open("file/Creator"...
readdir("file") - lists the resources that the file has

and I'm also claiming that the Linux VFS layer actually shouldn't have any fundamental problems with something like this.

Tell me why we shouldn't do it like the above? And DON'T give any crap about whether resource forks are useful or not, because I claim that they exist regardless of their usefulness and that we shouldn't just put our heads in the sand and try to hope that the issue doesn't exist.

At one point in the discussion, Alexander Viro objected, "POSIX has a lot of nasty words about mixing files and directories. And I'm afraid that saying "no, foo is file, it just happens to have children" won't work - that way you are going to screw a lot of userland stuff." To which Linus replied, "Note that NFS isn't strictly a POSIX filesystem. And certainly neither is MSDOSfs or /proc. Not being POSIX doesn't mean that they are useless."

Pavel Machek and others worried that supporting file "resources" in this way would break a lot of userland apps, but Linus countered:

I don't think this is a strong argument. Any program that "knows" that it is handling a POSIX filesystem and simply does part of the work itself is always going to break on extensions. That's just unavoidable. Adding the magic string at the end makes "xv" happy, but might easily make something else that assumes POSIX behaviour unhappy instead (ie somebody else does 'stat("myfile#utar")' and is unhappy because it doesn't exist).

Tough. Whatever we do, complex files are going to act differently from regular files. Even a HFS approach that looks _exactly_ like a UNIX filesystem will confuse programs that get unhappy when the resource files magically disappear when the non-resource file is deleted.

He went on:

I'm personally worried not about individual programs not being able to take advantage of the resources, but about Linux fundamentally not _supporting_ the notion of resources at all.

So what I want to make sure is that Linux supports the infrastructure for people to take advantage of resource forks. The fact that not everybody is going to be able to do so automatically is not my problem.

Put another way: I suspect that we won't support resource forks natively for another few years, and HFS etc will have their own specialized stuff. I don't care all that much. But at the same time I do believe that eventually we'll probably have to handle it. And at _that_ point I care about the fact that our internal design has to be robust. It doesn't have to make everybody happy, but it has to be clean both conceptually and from a pure implementation standpoint. I don't want a "hack that works".

There were a lot of other implementation concerns from various folks, and the discussion continued for a good while, as various issues were hashed out.
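The interface Linus proposes in this thread (open("file"), open("file/Icon"), readdir("file")) can be modeled with a toy in-memory filesystem. This is only a sketch of the naming scheme, not of the VFS: the class names are hypothetical, the namespace is flat (no real directories), and a missing file or fork simply raises KeyError.

```python
class ForkedFile:
    """A regular file with a default data fork plus named resource forks."""

    def __init__(self, data=b"", **forks):
        self.data = data
        self.forks = dict(forks)


class ToyFS:
    def __init__(self):
        self.files = {}

    def open(self, path):
        """Return the bytes of the default fork, or of 'file/Fork' if named."""
        name, _, fork = path.partition("/")
        entry = self.files[name]
        if not fork:
            return entry.data
        return entry.forks[fork]

    def readdir(self, name):
        """List the resource forks that a file carries."""
        return sorted(self.files[name].forks)

    def unlink(self, name):
        """Deleting the file removes every fork with it."""
        del self.files[name]
```

The `unlink` behavior is the case Linus flags as surprising to POSIX-minded programs: names that looked like independent files under "myfile/" vanish when "myfile" itself is removed.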

14. USB Initialization Cleanup

12 Aug 2000 (3 posts) Archive Link: "USB initialisation"

Topics: USB

People: Russell King, Linus Torvalds

Russell King reported:

On one of the ARM platforms, we have encountered a problem with the order of initialisation of USB vs the initial "bus"-type hardware setup.

Since Linus doesn't like new init calls going into init/main.c, the initialisation of a chip which has PCMCIA and USB hardware incorporated is placed at the head of the initcall list.

However, the USB drivers (OHCI) are initialised before this time by an explicit call in init/main.c.

Can we initialise the USB hardware drivers via the initcall method, or is there some reason why its done the way it is?

Linus Torvalds replied, "I would _much_ prefer to have the USB drivers fully initialized with "initcalls()". The only reason it's done like it is right now is that I suspect the USB maintainers didn't realize that you can fully order the initcall sequence by just changing the link order. If I get a patch that removes "usb_init()" from init/main.c, I'll apply it right away." Russell posted a preliminary patch, though he suggested running it by the USB folks first; and the thread ended.
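Linus's point that link order determines initcall order can be illustrated with a small model. Everything here is hypothetical (object names, initcall names, and the requires/provides bookkeeping); the real kernel records no such dependencies and simply runs the calls in the order they were linked.

```python
def collect_initcalls(link_order, initcalls_by_object):
    """Concatenate each object's initcalls in the order the objects are linked."""
    sequence = []
    for obj in link_order:
        sequence.extend(initcalls_by_object.get(obj, []))
    return sequence


def run_initcalls(sequence, ready=None):
    """Run initcalls in order; each entry is (name, requires, provides)."""
    ready = set(ready or ())
    for name, requires, provides in sequence:
        missing = set(requires) - ready
        if missing:
            raise RuntimeError(f"{name} ran before {sorted(missing)} was set up")
        ready.update(provides)
    return ready


# Hypothetical objects: a bus chip that must be set up before the USB host driver.
INITCALLS = {
    "bus_chip.o": [("bus_chip_init", [], ["bus"])],
    "usb_ohci.o": [("ohci_init", ["bus"], ["usb"])],
}
```

Linking `bus_chip.o` ahead of `usb_ohci.o` lets the sequence complete, while the reverse order fails in `ohci_init`. That mirrors the problem Russell reports, where an explicit call in init/main.c ran the USB setup before the bus hardware had been initialised.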

15. 2.4.0-test7-pre3 And ChangeLogs

12 Aug 2000 - 14 Aug 2000 (3 posts) Archive Link: "test7-pre3"

Topics: FS: FAT, Kernel Release Announcement, Networking, PCI, Sound: i810, USB

People: Linus Torvalds, Chris Good

Linus Torvalds announced 2.4.0-test7-pre3, saying:

Trying something new: keeping rudimentary change-logs. I should keep this up until final 2.4.0. Watch me.

test7:

He replied to himself a couple days later with pre4:

Chris Good suggested putting the latest changes at the top of the announcements...

(ed. I think ChangeLogs are a great improvement, especially over what we had before, which was, basically, no announcements at all... I doubt we'll see ChangeLogs once we hit 2.5 though, but hopefully someone will give Linus a nudge to start up again when we get close to 2.6)

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.