Kernel Traffic #25 For 1 Jul 1999

By Zack Brown

Introduction

There were some very thought-provoking criticisms of KT #24. Peter Samuelson pointed out that my interpretation of Alan Cox's stance on the mkdir() problem (covered in Issue #24, Section #14 (16 Jun 1999: mkdir() Problems And Uncertainties)) was not quite right. I've since updated the page. Thanks, Peter!

Cyrille Chepelov wrote to me:

Hi Zack,

thanks for your work -- I find KT pretty useful when I don't have time to directly lurk lkml..

However, the last two week I *had* time to do so, and I find you somehow have totally censored the discussion on devfs issues. Now it's pratically extinct, but there have been quite a lot of shell exchanges between hpa, tytso and rgooch, with bits from Alan and others. Actually, some of the arguments in that discussion were pretty insightful. Was there a reason for you to censor that ? Are you one of the rigid anti-devfs people, and chose to silence that threads for this reason ? Or simply you lacked time for this (maybe in that case, it'd be better to say that you lacked time to follow this or that thread ; maybe that'll bring contributors in ! I understand following lkml is hard, making a good quality summary of it is much harder)

Anyway, thanks for the great work you're putting in KT.

--Cyrille

Although I really, really appreciate the "you do a great job, keep it up" emails, the blunt, critical responses are perhaps even more useful to me. If you think I've missed an important thread, or if there's something else you think I should do to improve KT, please tell me. And if you follow linux-kernel yourself and would like to make sure I pay particular attention to a given thread, tell me about it. I can't guarantee to cover the threads you tell me about, but I will definitely look them over more carefully than I might otherwise. Just FYI, I don't summarize any threads until I think they're over, so long threads like devfs may go on for weeks before I even read them.

Mailing List Stats For This Week

We looked at 1506 posts in 6097K.

There were 474 different contributors. 203 posted more than once. 182 posted last week too.


1. Performance/DoS Patch; Kernel Stabilizes
31 May 1999 - 1 Jun 1999 (4 posts) Archive Link: "[patch] `cp /dev/zero /tmp new interactive feeling"
Topics: FS: ext2, SMP, Security
People: Andrea Arcangeli, Stephen C. Tweedie, Linus Torvalds, Alan Cox, Pavel Machek

Andrea Arcangeli said, "I got a bit bored by how much `cp /dev/zero /tmp' does stall the machine (all reads precisely). So I fixed the problem. But with my approch the speed of writes to disk while there are also read in progress or while there's some buffers that needs to be read, drops. Also scheduling rate increase a lot under such conditions..." He posted a URL to his patch (ftp://e-mind.com/pub/andrea/kernel/pre-2_3_4_andrea1.bz2) against kernel pre-2.3.4-2.

Pavel Machek pointed out that this was really a denial of service attack, and should go into 2.2; and Eloy A. Paris agreed, having been bitten by exactly this problem. Andrea said he'd get the 2.2 patch ready as soon as possible, but would use a different ("saner") way of addressing the problem, which would be only slightly less interactive than the 2.3.4 patch.

Under the Subject: [patch] `cp /dev/zero /tmp' (patch against 2.2.9) (http://kernelnotes.org/lnxlists/linux-kernel/lk_9906_01/msg00649.html), Andrea posted a new patch against 2.2.9, adding, "I think the problem is the current flushing that is not able to share the I/O bandwith with readers." He was a little worried that his patch would look ugly, but he couldn't think of anything better.

However, Stephen C. Tweedie replied, "Unfortunately it does completely the wrong thing if you have more than one disk: it stalls all writes on all disks as soon as any one disk goes into wait-on-read. That's not good. Not good at all." He added, "This is precisely the problem with optimising for one micro-benchmark."

His own analysis of the problem ran, "We really do need per-block-device IO scheduling, in some form or another, to fix this once and for all. We already have the per-device CURRENT queues, so some of the necessary infrastructure is in place today (and the current mechanism does support devices specifying their own queues to let them share queues sensibly)."
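
Stephen's point about per-device queues can be pictured with a toy model. The sketch below is not kernel code: the class, the device names and the request strings are all invented for illustration. It only shows the property he is asking for, namely that a device stuck waiting on a read does not hold up writes queued for other devices.

```python
from collections import OrderedDict, deque


class PerDeviceQueues:
    """Toy model: one request queue per block device (all names hypothetical)."""

    def __init__(self):
        self.queues = OrderedDict()   # device name -> deque of pending requests
        self.stalled = set()          # devices currently waiting on a read

    def submit(self, device, request):
        # Each device gets its own queue the first time it is seen.
        self.queues.setdefault(device, deque()).append(request)

    def dispatch_round(self):
        """Issue at most one request per device, skipping stalled devices."""
        issued = []
        for device, queue in self.queues.items():
            if device in self.stalled or not queue:
                continue
            issued.append((device, queue.popleft()))
        return issued
```

With a single shared queue, marking one device as stalled would block every request behind it; here only that device's own queue waits.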

Andrea defended his patch, saying, "I agree that it's _ugly_, but it's equally ugly that we currently may block due out of request slots, without being able to start I/O on the second blockdevice." He asked if Stephen really saw a performance hit under high IO load using two block devices.

Stephen had only been pointing out theoretical problems. He asked if Andrea had run any benchmarks himself. Andrea had done what he could, but nothing formal. He agreed that his solution was probably not best, but he had to use it till something better came along.

Linus Torvalds came in at this point, saying:

Anybody who is interested in FS performance should take a look at the latest pre-patch of 2.3.7 (only pre-6 and possibly later: do NOT get any earlier versions. pre-5 still causes file corruption, pre-6 looks good so far).

Careful, though: I fixed the problem that caused some corruption less than an hour ago, and while my tests indicate it all works fine, this is a very fundamental change. The difference to earlier kernels is:

  • ext2 (and some other block device filesystems that have been taught about it) uses write-through from the page cache instead of having a separate buffer cache and the page cache to maintain dirty state. This means much less memory pressure in certain situations, and it also means that we can avoid unnecessary copies.
  • the page cache has been threaded, so on SMP you can actually get noticeable speedups from processes that do concurrent file accesses.
  • lower-latency read paths, especially the cached case.

Both of these are big, and fundamental changes. So don't mistake me when I say it is experimental: Ingo, David and I have been spending the last weeks (especialy Ingo, who deserves a _lot_ of credit for this all: I designed much of it, but Ingo made it a reality. Thanks Ingo) on making it do the right thing and be stable, but if you worry about not having backups you might not want to play with it even so. It took us this long just to make it work reliably enough that we can't find any obvious problems..

The interesting areas are things like

  • writes to shared mappings now go blindingly fast. We're talking mondo cleanups here. We used to do really badly on this, now we do really well.
  • does bdflush still do the right thing? There may be a _lot_ of tweaking to do to get everything working at full capacity.
  • can people confirm that it is stable for everybody?
  • if anybody has 8-way machines etc, scalability is interesting. It should scale to 8-way no problem. We used to scale to 1-way, barely. Numbers?
  • fsync(). It doesn't work right now, but it should be easy to make it work well on big files etc - something we've never been able to do before (we used to lack the indexing from file to dirty blocks: now we have access to that quite automatically thanks to having the inode->page index in place, and the dirty blocks are right there)

and I'd really appreciate comments from people, as long as people are aware that it _looks_ stable but we don't guarantee anything at this point.

Stephen replied, "I'll merge in the new fsync code. I'm not sure I'll bother using the inode page list: it doesn't deal with metadata, so we still need a separate per-inode dirty buffer list anyway. Given that, it may turn out to be simpler just to use that one list for all the dirty blocks."

Linus recommended taking a look at 2.3.7-pre7, adding, "I did the fsync() stuff for the page cache, and I'm pretty confident that we don't even need a per-inode dirty list with it." (he replied to himself 20 minutes later with a simple fix).

Stephen disagreed with the idea that a per-inode dirty list was unnecessary, and they continued discussing implementation, with some help from Andrea and (a little) from Alan Cox. The main disagreement seemed to be over whether the performance hit of having to walk the inode indirection tree looking for dirty buffers was worth the other enhancements Linus had made. Stephen said no, and Linus wanted benchmarks. WIUM.

 

2. devfs
10 Jun 1999 - 24 Jun 1999 (230 posts) Archive Link: "RE: UUIDs (and devfs and major/minor numbers)"
Topics: BSD: FreeBSD, FS: devfs, FS: ext2, Hot-Plugging, USB
People: Ian Woodhouse, Dan Hollis, Richard Gooch, Stephen Frost, Daniel Taylor, Alan Cox, Pavel Machek, David Parsons, Theodore Y. Ts'o, Horst von Brand, Brandon S. Allbery, Clifford Wolf, H. Peter Anvin

The basic question is: should Linux continue to have its /dev directory as it is, or replace it with the devfs virtual filesystem?

Rather than try to summarize the actual discussion, I'll just give some choice general comments made during the debate. I'm very much aware that I've pigeon-holed some folks into the "pro" or "con" category who don't fit cleanly into either. Sorry. Doing my best.

PRO

Ian Woodhouse said, "Devfs offers a very good, logical interface. It's tidy (no wasted entries, no node creation. Believe me, /dev on *big* unix systems [AlphaServers, 100+ disks, FibreChannel, individually addressed by LUN] is exceptionally messy.). It's even human-readable to boot. On how you deal with persistance of ownership/mode over reboot, I have no comment. That's probably a site-issue. Symlinks, too, to name devices according to _your_ preference, are a site-issue."

Dan Hollis said:

I dont think devfs *requires* to be mounted on /dev ... Someone correct me if im wrong.

  1. If you want a physical /dev on the disk with 40,000 entries, disable the devfs compile option and create your physical /dev directory and populate it with 40,000 entries and all is well.
  2. If you want a physical /dev dynamically populated by a userspace daemon, then compile devfs in, mount it on /devices and have a userspace daemon populate /dev and all is well.
  3. If you want a fully dynamic devfs mounted on /dev, then compile it in and mount it on /dev and all is well.

Any questions?

Elsewhere, he added, "FreeBSD adopted devfs because it solves very real problems. I dont think the BSD developers are stupid. They reasoned it out and came to the same conclusion that Richard Gooch came to : devfs *is* useful."

And elsewhere: "devfs might not solve every possible problem at the moment, but it is certainly the correct infrastructure for solving problems."

Richard Gooch (author of devfs) said, "Let me state that I agree with the principle of putting as much as possible in user space. Devfs follows that principle. I know it would be possible to push some of the functionality of devfs into user space, but that would then make some things very hard or impossible to do in user space. This is why I say devfs provides the minimum necessary. The really cool things are then easy to do with devfsd."

Elsewhere, he added, "I get the strong impression that some of the most vocal opponents have not even bothered to download it, let alone compile it and try it out. I do actually monitor my ftp logs, you know..."

Elsewhere:

I'm tired of going through the following routine:

  1. Me: I think devfs should go in the kernel, because ...
  2. DO: DEVFS SUCKS! It's a BAAAAAAD idea
  3. Me: No, it allows you to do ...
  4. DO: YOU CAN DO IT ALL IN USER SPACE
  5. Me: Some things, yes, but not everything and not efficiently
  6. DO: Put a small hack into the kernel and the rest in user space
  7. Me: But then ... is hard/impossible
  8. DO: Make the kernel hack bigger
  9. Me: Now you've got something like devfs
  10. DO: But you've got names and that's "policy", and I don't like yours
  11. Me: So use devfsd and change them
  12. DO: goto (4).

"DO" is Devfs Opponent.

Sigh. Over the past year and a half, I've noticed a distinct pattern. It seems impossible to make any progress in this argument, because so often, when I press on one point, demonstrating why devfs wins out, the argument is shifted to another point. After a while, the first point is suddenly popped up.

Elsewhere:

The kernel always has and always will dictate some policy. And we've always been able to build a structure on top of that which follows a different policy. Devfs+devfsd actually makes that easier.

And there's a lot to be said for the kernel providing a decent, human-friendly policy that is guaranteed to be there. It allows portable programmes to be written, without having to worry about what a particular distribution maintainer has put in /dev.

With devfs+devfsd a distribution can provide a structure which they think is logical and works well for them, and get compatibility for free.

Stephen Frost said, "I do agree with the virtual /dev, because I want to be able to have devices be attached and detached w/o having to create/delete device nodes manually, or having to have alot of devices under /dev where only some are used. I also in general like the naming scheme devfs uses, I can see advantages both ways, and I think a scheme similar to Solaris may work, where /devices has a directory structure related to the Physical location of device, and /dev has it by type of device."

Daniel Taylor said:

I repeat (again):

The kernel provides access to devices, the kernel sets policy. To claim otherwise is folly.

The policy that devfs allows is:

  1. more flexible
  2. more robust
  3. easier on troubleshooters
  4. reliant on kernel code in fewer places
  5. more elegant (IMHO)
  6. independent of the root filesystem type

You may disagree with me on any of these particulars, but I want to see some sort of hard evidence that there is a better way. Like running code.

Elsewhere, he added, "I specificly use devfs because I do NOT use "vanilla" configurations. I use devices that RedHat never heard of, and I either need to mutter incantations over /dev to get everything working nicely or run devfs."

Alan Cox weighed in on the side of devfs, though not as violently as some other folks. He said:

Just say no to it in the configuration and you get what you had before. devfs fits a lot of sizes very well, and you can turn it off if you dont like it.

Thats hardly some "you must obey" single enforced policy

Pavel Machek said, "what is the problem with daemon looking into /devices, finding out their minor/major numbers, and then mknod()-ing same devices with correct permissions in /dev? I can see this doable with devfs. (And yes, you have additional advantage of being able to mount /devices over /dev if you decide not to use any daemon)."

David Parsons said, "In the ultra-paranoid case, where you want to do fancy shuffling of permissions for every device under the sun, but you want new devices to populate themselves into /dev, you set up /dev in the traditional pentagram and candles method, open devfs as /devices (with major/minor number generation enabled), the write a little daemon that simply carries the major and minor numbers across into /dev."
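
The daemon that Pavel and David describe can be sketched in a few lines of user-space code. This is only an illustration under assumptions: the function names are made up, the /devices mount point is simply whatever directory is passed in, and the default permissions are an arbitrary choice. It reads the major and minor numbers from each device node in one directory and creates matching nodes in another, leaving any node that already exists (and its permissions) alone.

```python
import os
import stat


def plan_nodes(source_dir, lstat=os.lstat):
    """Return (name, type bits, major, minor) for each device node in source_dir."""
    plan = []
    for name in sorted(os.listdir(source_dir)):
        info = lstat(os.path.join(source_dir, name))
        # Only character and block devices are carried across.
        if stat.S_ISCHR(info.st_mode) or stat.S_ISBLK(info.st_mode):
            plan.append((name, stat.S_IFMT(info.st_mode),
                         os.major(info.st_rdev), os.minor(info.st_rdev)))
    return plan


def mirror_nodes(source_dir, target_dir, perms=0o600):
    """Create matching nodes in target_dir; existing nodes are left untouched."""
    for name, kind, major, minor in plan_nodes(source_dir):
        path = os.path.join(target_dir, name)
        if not os.path.lexists(path):
            # os.mknod needs root privileges to create device nodes.
            os.mknod(path, kind | perms, os.makedev(major, minor))
```

A real daemon would also have to notice devices appearing and disappearing, and would not recurse into subdirectories the way this flat sketch ignores them; none of that is attempted here.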

CON

Theodore Y. Ts'o said:

The issue is not virtual FS versus some other kernel interface. The issue is what appears in /dev, and whether the kernel code should be hard coding what happears in /dev. It shouldn't. That's policy. The kernel shouldn't be dictating policy.

Instead, the kernel should be exporting sufficient information so that a user-mode daemon can provide whatever interesting naming scheme (and that naming scheme might include device names based on the UUID or the fslabel in the ext2 device, or something else far more general than what kernel-space code can provide.)

Earlier, he added, "Some of the argumenets made in the past for devfs (i.e., optimizing the speed of opening a device file) are really bad reasons. Does anyone really think that opening files in /dev is really something which happens often enough that it should be optimized?!? Opening device files simply isn't something that happens all that often, and I suspect that the incremental speedup between the dcache and devfs is barely measuarable."

Horst von Brand said:

How do you manage permissions on this cleanly? It has to be persistent (no "OK, after we boot this tiny script fixes up the whole mess")? How do you propose to manage default permissions for devices that might suddenly appear out of nowhere (i.e., USB or hot-pluggable PCI or PCMCIA or...)? No, "one size fits all" won't do.

I'm not against some devfs type scheme per se, but this is important, and devfs makes this problem _worse_ without solving much of the other problems that are there. The current way of populating /dev with everything there might ever be is broken, but gives you a clean, uniform way of setting persistent permissions using standard tools.

Brandon S. Allbery said, "I'm not convinced that a devfs which is directly mountable on /dev makes sense *in the presence of hot-swappable devices*. It's a bit *too* mutable. "devfsd == vold done right" appears to make more sense IMHO. If there are no hot-swappable devices available, it is workable; but IMHO it's a bad idea to set things up that way because things will go wacky as soon as someone starts hot-swapping USB devices... it's bad to have a default setup which breaks in strange ways when one starts taking advantage of hot-swappable devices on modern computers."

H. Peter Anvin was quite vocally opposed to devfs. All 16 posts of his this week were on the subject, but almost none were quotable (not because they weren't good, but because they wouldn't make sense outside of the context of the discussion). He was mainly responding to particular technical points, and arguing that devfs was unnecessary.


Under the Subject: [PATCH] devfs v112 available (http://kernelnotes.org/lnxlists/linux-kernel/lk_9906_04/msg00741.html), Richard Gooch announced devfs v. 1.1.2 at http://www.atnf.csiro.au/~rgooch/linux/kernel-patches.html, to apply against kernels 2.3.5 through 2.3.8.

Under the Subject: devfsd-v1.2.3 available (http://kernelnotes.org/lnxlists/linux-kernel/lk_9906_04/msg00742.html), he also announced version 1.2.3 of devfsd, at http://www.atnf.csiro.au/~rgooch/linux/, which would work with devfs v. 1.1.2.

And under the Subject: OT: The ROCK Linux Distribution (http://kernelnotes.org/lnxlists/linux-kernel/lk_9906_03/msg01127.html), Clifford Wolf announced, "I'm maintaining "ROCK Linux". It's a Distribution with the focus on high skilled Users: No configuration utility (use the config files luke, be one with the config files). Based on Glibc 2.1.1 and the Kernel 2.2.10. It is the first distribution which is using Richard's devfs (AFAIK). It is a small distribution, but it's not a "mini distribution". It comes with over 200 Packages including X11 and the GNOME Desktop." He gave the URL at http://linux.rock-projects.com/.

 

3. FS Corruption With Later 2.2.x?
16 Jun 1999 - 24 Jun 1999 (75 posts) Archive Link: "Massive e2fs corruption with 2.2.9/10?"
Topics: FS: NFS, SMP
People: Stephen C. Tweedie

This was a long thread, in which several people reported filesystem corruption on the later 2.2.x kernels. None of the major developers were able to reproduce the problems, however. At one point, Stephen C. Tweedie said, "Well, after an hour or so of hammering a large 2.2.10 (build on egcs-1.1.2) SMP box with all manner of lmdd big file copies/compares, 100 process and 150 process dbench stressers, parallel kernel builds and concurrent NFS loading, absolutely nothing has gone wrong for me."

Then one by one, most or all of the reports were shown to be probably hardware related (bad RAM, overclocked CPUs, etc.), although it did seem like one or two might be actual kernel problems.

 

4. The Future Of OS Design
20 Jun 1999 - 25 Jun 1999 (72 posts) Archive Link: "Some very thought-provoking ideas about OS architecture."
Topics: FS, Ioctls, Microsoft, Replacing Linux, Virtual Memory
People: Eric S. Raymond, Linus Torvalds, Alan Cox, Pavel Machek, Rik van Riel

Eric S. Raymond said:

Gents and ladies, I believe I have may have seen what comes after Unix. Not a half-step like Plan 9, but an advance in OS architecture as fundamental at Multics or Unix was in its day.

As an old Unix hand myself, I don't make this claim lightly; I've been wrestling with it for a couple of weeks now. Nor am I suggesting we ought to drop what we're doing and hare off in a new direction. What I am suggesting is that Linus and the other kernel architects should be taking a hard look at this stuff and thinking about it. It may take a while for all the implications to sink in. They're huge.

What comes after Unix will, I now believe, probably resemble at least in concept an experimental operating system called EROS. Full details are available at <http://www.eros-os.org/>, but for the impatient I'll review the high points here.

EROS is built around two fundamental and intertwined ideas. One is that all data and code persistence is handled directly by the OS. There is no file system. Yes, I said *no file system*. Instead, everything is structures built in virtual memory and checkpointed out to disk every so often (every five minutes in EROS). Want something? Chase a pointer to it; EROS memory management does the rest.

The second fundamental idea is that of a pure capability architecture with provably correct security. This is something like ACLs, except that an OS with ACLs on a file system has a hole in it; programs can communicate (in ways intended or unintended) through the file system that everybody shares access to.

Capabilities plus checkpointing is a combination that turns out to have huge synergies. Obviously programming is a lot simpler -- no more hours and hours spent writing persistence/pickling/marshalling code. The OS kernel is a lot simpler too; I can't find the figure to be sure, but I believe EROS's is supposed to clock in at about 50K of code.

Here's another: All disk I/O is huge sequential BLTs done as part of checkpoint operations. You can actually use close to 100% of your controller's bandwidth, as opposed to the 30%-50% typical for explicit-I/O operating systems that are doing seeks a lot of the time. This means the maximum I/O throughput the OS can handle effectively more than doubles. With simpler code. You could even afford the time to verify each checkpoint write...

Here's a third: Had a crash or power-out? On reboot, the system simply picks up pointers to the last checkpointed state. Your OS, and all your applications, are back in thirty seconds. No fscks, ever again!

And I haven't even talked about the advantages of capabilities over userids yet. I would, but I just realized I'm running out of time -- gotta get ready to fly to Seattle tomorrow to upset some stomachs at Microsoft.

www.eros-os.org (http://www.eros-os.org). Eric sez check it out. Mind-blowing stuff once you've had a few days to digest it.
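
The checkpoint-and-restore idea Eric describes can be loosely imitated at application level. The sketch below is only an analogy and says nothing about how EROS itself is implemented: EROS checkpoints memory in the OS, while this toy pickles one Python object to a file. The function names and the snapshot path are assumptions made for the example. What it does show is the recovery property: after a crash, the program picks up whatever the last completed checkpoint contained.

```python
import os
import pickle
import tempfile


def checkpoint(state, path):
    """Write the whole state to disk, replacing the old snapshot atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())      # make sure the bytes reach the disk
    os.replace(tmp, path)         # old snapshot stays valid until this point


def restore(path):
    """Return the last checkpointed state, or an empty state if none exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {}
```

Anything changed after the last call to checkpoint() is lost on a crash, which matches the trade-off in the five-minute interval Eric mentions.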

Alan Cox noticed this in the EROS mailing list archives: "The EROS license requires that it be possible for me or my designates to do proprietary releases." But Eric hinted that he might have some influence over that aspect of things, in the future.

There was a lively debate about the technical virtues of EROS. A lot of people felt it didn't represent any sort of new idea, but was very interesting in its current form. Alan was particularly impressed by the security model (he also read pretty much every EROS doc in existence).

On sort of a side topic, Rik van Riel felt that the PC (i.e. the "monolithic computer") would soon go extinct, taking UNIX, Eros, Microsoft OS, etc., with it. He pointed to Alliance OS (http://www.allos.org/), a distributed system based on message passing, as the wave of the future.

Linus Torvalds replied:

That's a classic thing said by "OS Research People".

And it's complete crap and idiocy, and I'm finally going to stand up and ask people to THINK instead of repeating the old and stinking dogma.

It's _much_ better to have stand-alone appliances that can work well in a networked environment than to have a networked appliance.

I don't understand people who think that "distribution" implies "collective". A distributed system should _not_ be composed of mindless worker ants that only work together with other mindless worker ants.

A distributed system should be composed of individual stand-alone systems that can work together. They should be real systems in their own right, and have the power to make their own decisions. Too many distributed OS projects are thinking "bees in a hive" - while what you should aim for is "humans in society".

I'll take humans over bees any day. Real OS's, with real operating systems. Monolithic, because they CAN stand alone, and in fact do most of their stuff without needing hand-holding every single minute. General-purpose instead of being able to do just one thing.

He added:

I will tell you anything based on message passing is stupid. It's very simple:

  • if you end up doing remote communication, the largest overhead is in the communication, not in how you initiate it. This is only going to be more true with mobile computing, not less.
  • Ergo: optimizing for message passing is stupid. You should _always_ optimize for the local case, because it's the only case where the calling protocol really matters - once you go remote you have time to massage the arguments any which way you like.
  • Most operations are going to be local. Any operating system that starts out from the notion that most operations are going to be remote is going to die off as computers get more and more powerful.
  • Things may start off distributed, but in the end network bandwidth is always going to be more expensive than CPU power.
  • Truly mobile computing implies that a noticeable portion of the time you do _not_ want to be in contact with any other computers. Your computer had better be a very capable one even on its own. Anybody who thinks anything else is just unbelievably misguided.

This implies that your computer had better have a local filesystem, and had better be designed to work as well without any connectivity as it does _with_ connectivity. It can't communicate, but that shouldn't mean that it can't work.

So right now people are pointing at PDA's, and saying that they should be running a "light" OS, all based on message passing, because obviously all the real work would be done on a server. It makes sense, no?

NO. It does NOT make sense. People used to say the same thing about workstations: workstations used to be expensive and not quite powerful enough, and people wanted to have more than one. Where are those people today? Face it, the hardware just got so much better that suddenly REAL operating systems didn't have any of the alledged downsides, and while you obviously want the ability to communicate, you should not think that that is what you optimize for.

The same is going to happen in the PDA space. Right now we have PalmOS. It's already doing internet connectivity, how much do you want to bet that in the not too distant future they'll want to offer more and more? There is no technical reason why a Palm in a few years won't have a few hundred megs of RAM and a CPU that is quite equipped to handle a real OS. (If they had selected the strongarm instead of a cut-down 68k it would already).

In short: message passing as the fundamental operation of the OS is just an excercise in computer science masturbation. It may feel good, but you don't actually get anything DONE. Nobody has ever shown that it made sense in the real world. It's basically just much simpler and saner to have a function call interface, and for operations that are non-local it gets transparently _promoted_ to a message. There's no reason why it should be considered to be a message when it starts out.
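
Linus's closing point, that a call should start life as an ordinary function call and only be promoted to a message when it turns out to be non-local, can be illustrated with a small sketch. Everything here is hypothetical: the Service class, the stub, the JSON message format and the in-process "transport" are stand-ins chosen for the example, not anything proposed in the thread.

```python
import json


class Service:
    """Toy service with a plain function-call interface."""

    def add(self, a, b):
        return a + b


class RemoteStub:
    """Stands in for a service elsewhere; calls are promoted to messages."""

    def __init__(self, transport):
        self.transport = transport   # callable taking bytes, returning bytes

    def __getattr__(self, name):
        # Only reached for names that are not real attributes, i.e. operations.
        def call(*args):
            message = json.dumps({"op": name, "args": args}).encode()
            return json.loads(self.transport(message))
        return call


def serve(service, message):
    """Receiving side: unpack a message and make the ordinary call."""
    request = json.loads(message)
    result = getattr(service, request["op"])(*request["args"])
    return json.dumps(result).encode()
```

The caller writes `svc.add(2, 3)` either way. With `svc = Service()` it is a direct call with no marshalling cost; with `svc = RemoteStub(...)` the same expression is packed into a message behind the caller's back.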

Tidbits: Linus offered his stance/opinion on some less related issues. At one point, he said, "'ioctl()' and 'fcntl()' as they currently stand are just horribly ugly, and they are probably one of the worst features of UNIX as a design." And elsewhere, "I think we do want to move into a "web direction" where you can just do a open("http://ssss.yyyyy.dd/~silly", O_RDONLY) and it does the right thing." (Pavel Machek pointed out that his podfuk (http://atrey.karlin.mff.cuni.cz/~pavel/podfuk/podfuk.html) program could do that already.)

 

5. Treating Multiple Files As One
20 Jun 1999 - 26 Jun 1999 (84 posts) Archive Link: "I discussed reading directories as files with jra, Stallman, and loic"
Topics: Compression, FS: Coda, FS: FAT, FS: NFS, FS: NTFS, FS: ext2, Microsoft
People: Linus Torvalds, Wanderer no-last-name, Theodore Y. Ts'o, Hans Reiser, Alex Buell, Bill Huey

Hans Reiser proposed the idea of operating on directories full of files, as if they were just a single file. Linus Torvalds replied:

Note that the Linux VFS layer was pretty much _designed_ with something like this in mind. From very early on, I decided that the VFS layer should not make too much of a distinction between a directory and a regular file: both have "lookup" properties, and both have "read" properties.

Some of that has been corrupted over time, and some of it was never done because nobody actually used it - so there's a few places where the VFS layer does things like "if (!S_ISDIR(d_inode->i_imode))" etc and thus "knows" about the difference between a directory and a regular file, but that was never really meant to be a design goal, and I'd be happy to try to clean it up.

So basically it all should be doable today: if a low-level filesystem wants to export directories both as regular files and as pathname components, it can be done. The low-level FS can look at the O_DIRECTORY flag to know whether somebody wants to read the thing as a directory or not (ie "readdir()" obviously opens the directory, while normal operations open the default file), and it should all work pretty much today.

It's going to confuse a lot of UNIX applications, but at the same time a reasonable number of them won't ever really have to know.
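
For readers unfamiliar with the O_DIRECTORY flag Linus mentions, here is how it looks from user space on a present-day Linux system. This is a sketch of the existing flag only, with helper names invented for the example; it does not implement the dual file/directory behaviour discussed in the thread. On an ordinary filesystem, opening a regular file with O_DIRECTORY simply fails.

```python
import os


def open_as_directory(path):
    """Open path only if it is a directory; returns a file descriptor."""
    return os.open(path, os.O_RDONLY | os.O_DIRECTORY)


def is_openable_as_directory(path):
    """True if the kernel is willing to hand out path as a directory."""
    try:
        fd = open_as_directory(path)
    except NotADirectoryError:
        # ENOTDIR: the object exists but is not a directory.
        return False
    os.close(fd)
    return True
```

Under the scheme Linus outlines, a filesystem could answer both kinds of open for the same name, using the presence or absence of this flag to decide which view the caller wants.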

Wanderer no-last-name replied that this was very similar to a proposal he had once made for a "binder" extension, "A binder being a file that contains an internal directory of addressable files (or objects)." He went on to describe his proposal in more detail. Hans replied with some technical objections, and Theodore Y. Ts'o warned:

Before we go running into a deep technical discussion about how to design different streams inside a file, we should first stop ask ourselves how they will be *used*.

Something that folks should keep in mind is that as far as I have been able to determine, Microsoft isn't actually planning on using streams for anything. As near as I can tell it was added so that their SMB servers could replace Appleshare servers more efficiently, but that's really about it. I don't believe, for example, that MS Office 2000 is going to be using the streams functionality at all, and this is for a very good reason.

Streams really lose when you need to send them across the internet. How do you send a multifork file across FTP? Or HTTP? What if you want to put the multifork file on a diskette that's formatted with a FAT filesystem for transport to another OS? What if you want to tar a multifork file? Or use a system utility like /bin/cp or /usr/bin/mc that doesn't know about multifork files?

One of the reasons why the Apple resource-fork was a really sucky idea in practice was that executables stored dialog boxes, buttons, text, all in resources --- which would get lost if you tried to ftp the file unless you binhexed or otherwise prepped the file for transfer first.

So I question the whole practical utility of file streams in the first place. The only place where they don't screw the user is if the alternate streams are used to store non-critical information where it doesn't matter if the information gets lost when you ftp the file or copy the file using a non-multi-fork aware application. For example, the icon of the file, so the display manager can more easily figure out what icon to associate with the file --- and of course, some people would argue with the notion that the icon isn't critical information, and that it should be preserved, in which case putting it in a alternate stream may not be such a hot idea.

However, for speed reasons, a graphical file manager might do better to have all of the icons cached in a few dot files (for security reasons, you will need a different dot file for each user who owns files in a directory). Said dot file would have information associating the name of the file, the inode number and mod time with the icon. If the icon cache is out of date, and a file appears in a directory without also updating the icon cache, the graphical file manager will have to find some way of determining the right icon to associate with the file. (But, this is a problem the graphical file manager would have to deal with anyway). The advantage of using a few dot files in each directory is that it will result in a lot fewer system calls and files that need to be read and touched than if the graphical file manager has to open the icon resource fork in each file just to determine which icon to display for that one file. So I don't even buy the argument that multifork files are required to make graphical file managers faster; a few dot files in each directory would actually be more efficient, and would work across non-multi-fork aware remote filesystems like NFS. I don't think a graphical file manager that only worked on specialized filesystems would be all that well received!
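
(An illustrative aside, not part of Ted's post: the dot-file icon cache he describes could look roughly like the Python sketch below. The file name `.icon_cache`, the JSON layout and both function names are assumptions made for this example. For brevity it keeps one cache per directory rather than one per file owner, as Ted suggests for security.)

```python
import json
import os
import tempfile

CACHE_NAME = ".icon_cache"  # hypothetical dot-file name, not from the post


def record_icon(directory, filename, icon):
    """Store an icon for `filename`, keyed by its inode number and mod time."""
    st = os.stat(os.path.join(directory, filename))
    cache_path = os.path.join(directory, CACHE_NAME)
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    cache[filename] = {"inode": st.st_ino, "mtime_ns": st.st_mtime_ns, "icon": icon}
    with open(cache_path, "w") as f:
        json.dump(cache, f)


def lookup_icon(directory, filename):
    """Return the cached icon, or None if the entry is missing or stale."""
    try:
        with open(os.path.join(directory, CACHE_NAME)) as f:
            cache = json.load(f)
    except FileNotFoundError:
        return None
    entry = cache.get(filename)
    if entry is None:
        return None
    st = os.stat(os.path.join(directory, filename))
    if (entry["inode"], entry["mtime_ns"]) != (st.st_ino, st.st_mtime_ns):
        return None  # file changed since the cache was written
    return entry["icon"]
```

A file manager would read the one cache file per directory and fall back to its slower icon detection only when `lookup_icon` returns None, which is the stale-cache case Ted mentions.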

So before we design filesystems that support multi-forks, let's please think about how they will be used, and how they will interact with current systems that don't really support multiple forks, and in fact are quite hostile to the whole concept. What's the point of being able to treat a filesystem object as both a directory and a file if none of the system utilities, file formats (like tar) and internet protocols really support it? Does it really buy us enough to be worth the effort? And if we don't know exactly how it will be used, how will we know what sort of performance/feature tradeoffs we need to make before it will be useful?

Bill Huey started cursing wildly at Ted for his Apple comment, but apologized after some calmer folks took him aside. Not before making Alex Buell's (and others') killfile though.

Other folks had some technical replies to Ted's post, and Ted answered:

So, here's a quick back-of-the-envelope design for a completely user-space solution for folks who have been asking for multi-fork files. It's not intended to be a completely polished design, but I believe it's worth at least considering before rushing off and deciding that the only way to do things is to extend Linux's filesystem semantics.

I write this up because people have accused me of just being a conservative "Dr. No" who always thinks their great new ideas are bad. On the contrary, if application writers (especially office suite application writers) are demanding certain sets of functionality, we should take such requests seriously, and weigh the costs and benefits of what they ask for. It's just that I very strongly believe in trying to offer a user-space solution first before resorting to making in-kernel solutions. Especially if they are hacks that will only work on Linux systems! (Using one's OS market share as a club against interoperability is a despicable Microsoft tactic, and not one I want to encourage.)

Requirements analysis

So, let's try this as an exercise. Since no one has actually bothered to write down a list of requirements before galloping off to a solution, let me try to offer some:

  1. "Common" file manipulation operations should treat an "application logical bundle of data" (albod) as if it were a single file. (Forgive me for inventing a new acronym here, but "application logical bundle of data" is too long to type each time, and I don't want to bias people's thinking about how it is actually implemented.)
  2. Applications should be able to quickly and efficiently manipulate (read, modify, replace, delete, etc.) individual streams of data within an albod. This should be done without the file bloat and inefficiencies found in MS Office 97 format files.
  3. There should be standard file streams inside the albod whose semantics and data format are standardized, so that programs such as graphical file managers can determine basic information about an albod, such as which icon to use, who created it, which application should be invoked when the albod is activated, etc. quickly and easily. (Using file(1) on a data file to determine which application can interpret it is considered barbaric.)
  4. It should be easy to send these albod's across standard Internet protocols using standard, commonly available tools (ftp, http, rcp, scp, etc.).

Am I missing any other requirements?

Other solutions

Now then, which approaches have been used to address this problem in the past? In the NTFS and the Macintosh, this was done by adding specialized (but non-standard) semantics and new formats in the filesystem. This satisfied the first three requirements, but failed on the last.

The NeXT used a directory containing individual files, which satisfied requirements #2 and #3, but didn't satisfy #1 (except if you only used their graphical file manager) and #4 (unless you explicitly tar'ed stuff up first).

My proposed straw-man proposal

I now offer to you a design for a potential solution which is purely implemented in userspace, and has the advantage that it will work across all existing filesystems, including NFS, AFS, Coda, ext2, and doesn't require any linux-specific kernel hacks (which is important, since last time I checked, the GNOME and KDE folks weren't interested in solutions that only worked on Linux). The solution is a directory-based solution, like NeXT, but tries to address the rest of the requirements.

First of all, we need some way of distinguishing an "albod" from a normal directory. This can either be done using a filesystem specific flag, which is probably more efficient, but we would also like a filesystem independent way of doing this. So instead of (or perhaps in addition to) using a filesystem-provided flag, let's posit a magic dotfile in the directory which, if present, marks it as an albod bundle.

Now let's assume that we have a hacked libc (or a system-wide LD_PRELOAD) which intercepts the open system call. If an application does not declare itself (via some API call) to be albod knowledgeable, an attempt to open and read the albod results in the user-mode library emulation of open()/read() to return a tar-file-like flat-file representation of the albod. This allows cp, ftp, httpd, mimeencode, etc. to be able to treat an albod as if it were a single "bag of bits".
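
(An illustrative aside, not from Ted's post: the behaviour he describes for albod-unaware programs can be modelled in a few lines of Python. A real implementation would live in libc or an LD_PRELOAD shim written in C. The marker name `.albod` and the function names here are assumptions, and plain tar stands in for the unspecified "tar-file-like" format.)

```python
import io
import os
import tarfile
import tempfile

ALBOD_MARKER = ".albod"  # hypothetical marker name; the post does not specify one


def is_albod(path):
    """An albod is a directory that contains the magic marker dot file."""
    return os.path.isdir(path) and os.path.isfile(os.path.join(path, ALBOD_MARKER))


def read_for_naive_app(path):
    """Bytes that an albod-unaware program would see when reading `path`."""
    if not is_albod(path):
        with open(path, "rb") as f:
            return f.read()  # ordinary file: no emulation needed
    buf = io.BytesIO()
    name = os.path.basename(os.path.normpath(path))
    with tarfile.open(fileobj=buf, mode="w") as tar:
        tar.add(path, arcname=name)  # recurses into the bundle directory
    return buf.getvalue()
```

A program like cp would then copy the returned bytes as a single "bag of bits", marker file included, so the receiving side can recognise the bundle again.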

If the application declares itself to be albod-aware, it can then treat the albod as a directory hierarchy, and manipulate the various subcomponents of the albod as named streams, just like NTFS5 allows --- except that we can have hierarchical named streams, and not just a flat namespace!

How are albod's written? Well, an albod-aware application simply writes the appropriate component directories and files as if they were normal Unix files (which in fact, they are). If a non-albod-aware application such as /bin/cp writes it, there are two design choices. It's not clear which one is better, so let me outline both of them. One is to have the user-mode library notice that it is an albod flat-file representation by looking at its header, and then automatically unpacking it into its directory format as it is writing it out.

The other design choice is to simply allow the albod to be written out as a flat file, and when an albod-aware application tries to modify it, only then does the albod-flat-file-package get exploded into its directory-based form. If the flat-file format is compressed (which would be a great idea since applications would now get compression for free) then only expanding an albod when it is necessary to read it will save disk space for albod's which are accessed only occasionally, in read-only fashion.
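
(An illustrative aside, not from Ted's post: the second design choice, exploding a flat albod only when an aware application needs it, might be sketched as follows. Gzip-compressed tar is an assumed stand-in for the flat format, and `pack` and `explode_if_flat` are hypothetical names. "Looking at the header" is approximated by asking Python's tarfile module whether it can open the file.)

```python
import io
import os
import tarfile
import tempfile


def pack(directory):
    """Flatten a directory-form albod into compressed tar bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for dirpath, _dirs, files in os.walk(directory):
            for fname in files:
                full = os.path.join(dirpath, fname)
                # Store paths relative to the bundle root.
                tar.add(full, arcname=os.path.relpath(full, directory))
    return buf.getvalue()


def explode_if_flat(path):
    """Turn a flat albod at `path` into its directory form, in place.

    Returns False and leaves the file alone if it is not a tar archive.
    """
    with open(path, "rb") as f:
        data = f.read()
    try:
        with tarfile.open(fileobj=io.BytesIO(data)) as tar:
            files = {m.name: tar.extractfile(m).read()
                     for m in tar.getmembers() if m.isfile()}
    except tarfile.ReadError:
        return False
    os.remove(path)
    os.mkdir(path)
    for name, blob in files.items():
        parts = name.split("/")
        if os.path.isabs(name) or ".." in parts:
            continue  # refuse entries that would escape the bundle
        target = os.path.join(path, *parts)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as out:
            out.write(blob)
    return True
```

This sketch drops empty directories and file permissions, and a crash between `os.remove` and the final write would lose data. Those are among the "subtle issues" a real design would have to settle.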

Problems with this approach

What are the downsides of this approach? Since by default, a non-albod-aware application gets the entire packaged albod as a single flat-byte-stream representation, /bin/cp, etc. work fine. This is great if the albod contains some new application data format, such as a Word or an Excel or a Powerpoint competitor, since the actual application code which manipulates the application document is albod-aware.

However, if the albod contains a .gif, .mp3, etc. file, where the already-existing applications that know how to process the .gif or .mp3 file aren't albod-aware (think: xv), then having open() return the flat-file contents of the entire albod is the wrong behavior. Instead, you want to return the default data-fork contents in that case. So what we can do is to have a second magic .dotfile or flag which indicates that for this albod, when it is opened and read, the default data file should be returned instead of a flat-file representation of the albod. The tradeoff for using this optional mode is that a naive /bin/cp or Midnight Commander program which doesn't know about albod files won't know how to copy or move the entire albod. So an attempt to ftp or mail this alternate form of the albod will just result in the data fork being sent. But if all of the application-specific data (i.e., the .gif or the .au data) is in the default data fork, losing the other metadata format might not be a disaster, and so this might be the appropriate tradeoff. It depends on what extra metadata extensions GNOME or KDE wants to store in the albod alongside the .gif or .mp3 data.
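
(An illustrative aside, not from Ted's post: the choice between the two read modes reduces to checking for two dot files. All three names below, `.albod`, `.albod-datafork` and `data`, are assumptions made for this sketch.)

```python
import os
import tempfile

ALBOD_MARKER = ".albod"             # hypothetical: marks a directory as an albod
DATA_FORK_FLAG = ".albod-datafork"  # hypothetical: hand naive readers the data fork
DATA_FORK_NAME = "data"             # hypothetical name of the default data file


def naive_read_mode(path):
    """Decide what an albod-unaware open()/read() should return for `path`."""
    if not os.path.isfile(os.path.join(path, ALBOD_MARKER)):
        return "plain"      # ordinary file or directory: no emulation
    if os.path.isfile(os.path.join(path, DATA_FORK_FLAG)):
        return "data-fork"  # e.g. an image that xv must be able to read directly
    return "flat"           # default: the whole bundle as one byte stream


def read_data_fork(path):
    """Contents of the default data file, as an unaware viewer would see them."""
    with open(os.path.join(path, DATA_FORK_NAME), "rb") as f:
        return f.read()
```

In "data-fork" mode a naive copy carries only the bytes from `read_data_fork`, which is exactly the metadata loss Ted accepts as the tradeoff.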

The other downside with this solution is that it is admittedly pretty complex, and there are some subtle issues about how the LD_PRELOAD or hacked libc routines should actually work in practice. Some might even say that it is a kludge.

On the other hand, is it really that much worse than having kernel-mode "reparse points" that manipulate application specific data in the kernel?!? I would argue that in contrast, having user-mode library hacks may actually be cleaner, although admittedly both solutions aren't exactly pretty. Perhaps someone can come up with a yet cleaner solution. I hope so!!

Summary

This is obviously not a fully fleshed out design proposal. There are obviously lots and lots of details that would need to be filled in first, before this could be used as a set of functional specs which an implementor could implement. I won't even claim that this is the best way to meet the stated requirements solely in user space. Someone may come up with a more clever user-space-only solution.

Rather, this was intended to serve as some food for thought, and a proof by example that there is a way to do this in user-mode, without requiring Linux-specific filesystem hacks and extensions. While it requires some extra extensions to the libc, which might be considered kludgy, I believe it is no worse than the Microsoft NTFS-style "reparse points" suggestion which was offered to the kernel list in the last day or so.

There was a good bit of discussion back and forth, with folks agreeing and disagreeing on various points. Eventually, Hans said, "This discussion has reached the point where I want to write some code now...."

 

6. 2.3.7 Filesystem Reorganization And Breakage
20 Jun 1999 - 21 Jun 1999 (2 posts) Archive Link: "Linux-2.3.7.. Let's be careful out there.."
Topics: FS: FAT, FS: NFS, FS: ext2, Kernel Release Announcement, SMP
People: Linus Torvalds, Robert B. Hamilton, Jeremy Katz, Andrea Arcangeli, David S. Miller, Ingo Molnar, Arvind Sankar

Linus Torvalds announced Linux 2.3.7, saying:

The new and much improved fully page-cache based filesystem code is now apparently stable, and works wonderfully well performancewise. We fixed all known issues with the IO subsystem: it scales well in SMP, and it avoids unnecessary copies and unnecessary temporary buffers for write-out. The shared mapping code in particular is much cleaner and also a _lot_ faster.

In short, it's perfect. And we want as many people as possible out there testing out the new cool code, and bask in the success stories..

HOWEVER. _Just_ in case something goes wrong [ extremely unlikely of course. Sure. Sue me ], we want to indemnify ourselves. There just might be a bug hiding there somewhere, and it might eat your filesystem while laughing in glee over you being naive and testing new code. So you have been warned.

In particular, there's some indication that it might have problems on sparc still (and/or other architectures), possibly due to the ext2fs byte order cleanups that have also been done in order to reach the afore-mentioned state of perfection.

I'd be especially interested in people running databases on top of Linux: Solid server in particular is very fsync-happy, and that's one of the operations that have been speeded up by orders of magnitude.

Robert B. Hamilton reported, "If I mmap a large file PROT_READ, MAP_PRIVATE, and then proceed to read the mmap'ed area with a pointer, my program hangs"
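
(An illustrative aside: Robert did not post his program in the part quoted here, so the following Python sketch only reproduces the access pattern he describes, a private read-only mapping that is then read through. The function name and the one-byte-per-page stride are assumptions. It is Unix-only, and an empty file cannot be mapped.)

```python
import mmap
import os
import tempfile


def touch_every_page(path):
    """Map `path` with PROT_READ and MAP_PRIVATE, read one byte per page."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        with mmap.mmap(f.fileno(), size,
                       flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ) as m:
            total = 0
            for offset in range(0, size, mmap.PAGESIZE):
                total += m[offset]  # forces the page to be faulted in
            return total
```

On a kernel with the bug, a loop like this over a large file would be where the hang shows up.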

Under the Subject: ext2fs corruption on 2.3.7 (http://kernelnotes.org/lnxlists/linux-kernel/lk_9906_03/msg01225.html) , Jeremy Katz reported some filesystem corruption. Ingo Molnar posted his latest pagecache fixes, including a fix for a data corruption bug. Jeremy applied the patch, and replied that his swap was now oops'ing. So he disabled his swap and found that the file corruption was gone everywhere except over NFS, where he still saw it. He also posted the swap oops. Andrea Arcangeli analyzed the oops and noticed that some of the debugging code might be the problem. He offered a 2.3.7_andrea1 patch, but Jeremy said it locked his system solid.

Meanwhile, David S. Miller felt Andrea might be wrong about the debugging code having a problem, and Andrea took another look and agreed with him.

Later, under the Subject: 2.3.9-pre2 a Success!?!? (http://kernelnotes.org/lnxlists/linux-kernel/lk_9906_04/msg00708.html) , Jeremy was unable to reproduce the corruption or oopsen under 2.3.9-pre2. He added, "Now we can see how many features we can add and still have a 2.4 by the end of the year :)"

But Linus replied:

There's still something fishy going on, and David still has some problems on his sparc. So don't get all excited yet - we're steadily getting rid of bugs, but I want to have this stabilize over the next week or so before I start accepting seriously different patches.

(I've been dropping patches so far, I expect to drop patches for another week or so - I want this thing _stable_).

Arvind Sankar asked if FAT would still be broken in 2.3.9, and Linus said yes.

 

7. FENRIS Source Available
21 Jun 1999 - 29 Jun 1999 (42 posts) Archive Link: "FENRIS (nwfs) 1.4.2 Source Code Available,"
Topics: FS
People: Jeff V. Merkey, Alexander Viro, Alan Cox, Steven N. Hirsch

Jeff V. Merkey announced, "The FENRIS (nwfs) source code for 1.4.2 can be downloaded from 207.109.151.240. Please refer to the release notes attached to this email for info on what got fixed. Special thanks to Alan Cox for his help with the GNU compiler issues (and other issues). We will be posting one additional release later this week after we have regression tested the GNU compiler fixes (nwfs-1.4.3). Next week, mirroring support will be posted to the site in nwfs-1.4.4." Steven N. Hirsch asked how it was supposed to compile, and Jeff replied, as a stand-alone module. Once they got all the bugs worked out, Jeff said the idea would be to port it to the stock 2.0.x kernels.

Alexander Viro replied, "Jeff, if your filesystem is supposed to work with 2.0 you are in for a *BIG* work when you will port it to 2.2. Sorry, but It's Your Problem(tm). VFS changed big way and keeping the same codebase is next to impossible. Exactly because 2.0 was much dumber. If your design decisions are based on 2.0 - too bad, the thing will suck badly."

Jeff replied, "We have kept our dependencies to a minimum on inode related data structures and methods. We also know that the linux specific portions are different, and have anticipated this, and they are #ifdef'd out. You will note that about 95% of FENRIS is OS neutral, and we already have a single code base between NT and Linux, so having one between Linux and other Linux's isn't something that scares us, or that we don't already know about." He added:

By the way, not to slam Linus or anything, but making the types of sweeping changes that were made between 2.0 and 2.2 in the file system architecturally was unsound from an engineering perspective, although I do understand why it needed to change, most commercial software companies would have never allowed this to occur. The changes are what have broken Caldera's Netware clients and server software, and they are still working on it and getting it fixed.

Word of Advice -- these file system changes hurt Linux in the market because they delayed the commercial Linux vendors from getting key services up and running when customers needed them. In the future, we should not make such sweeping changes without making certain there is still a method to support the old interfaces as well.

Alan Cox replied, "Thats the difference between dying of legacy support and good efficient code. And yes it is a difficult tightrope to walk. Im surprised it broke netware so badly. The free netware client was updated very easily and speedily." He added:

The dentries stuff was probably the hardest change to deal with though. Its not a simple procedure to quickly fix up code without understanding the problem, which reduces the number of users who fix it. (and yes I do get patches from people that start 'I don't know C, but xyz wouldnt compile in the new 2.1.x release so I copied the change from this similar looking thing and now it works'.)

I don't claim we have the model right, but the rules are definitely different

Jeff agreed, saying, "I think all things being equal, linux today is **GREAT** technology. Yes, it did break Caldera's Netware stuff (It's still busted, I have to do a lot of work on 2.0.35 then move it to a 2.2.9 box to test). I also think a lot of the work being done with the Page Cache is totally **KICK ASS**. I'm just saying we should try to assess the impact on the commercial Linux vendors, and try to give them adequate warning, but progress also has a price and I agree it's a difficult tightrope to walk."

There followed a good bit of discussion, not about FENRIS, but about problems with Linux development kernels, leading one irate person to complain that Linux was doomed. Eventually, Jeff said:

I want to make it clear to everyone that I don't think Linux is doomed, however, this email thread is using my trademark (FENRIS) on the header of the email to "bitch" about every busted build in the last 5 years. If you folks want to complain to Linus or Alan about stuff getting busted, it's your right. If you want to use my Trademark name on the email header, think again. I don't want this email thread associated with our trademarks since we do not share your views to the same extreme as the folks in this email thread.

Please stop using my trademarks for your "bitch" list about Linux. Nothing in the world is perfect, including Linux, but it's getting better, and folks are working hard to improve it. If you want to simply title the email, "Why Linux is Doomed", then you certainly can do so -- without attempting to associate us with your personal views.

 

8. Linux For The Blind
22 Jun 1999 (5 posts) Archive Link: "Speech output for Linux: speakup-0.07 released"
Topics: Braille, Disability Support
People: Kirk Reiser

Kirk Reiser announced version 0.07 of speakup, a GPLed tool to translate all console output into audible speech. He gave a URL to some incomplete web pages (http://www.braille.uwo.ca/speakup) , and to the FTP site (ftp://ftp.braille.uwo.ca/pub/linux/speakup) .

Someone asked if this was not a user space issue (speakup is distributed as a set of kernel patches), and Kirk replied, "Actually, no. The overall goal of speakup is to make the entire machine available to the visually impaired person from boot to shut-off. That is its long term goal anyway. This is a question which gets asked all the time. Usually by people who don't really understand the reasons for wanting/needing comparable access for all! 'grin'"

 

9. Linux Moves On After Invasive Recoding
24 Jun 1999 - 25 Jun 1999 (10 posts) Archive Link: "[patch] fix for the `access beyond end of device' bug of 2.3.[789]"
People: Ingo Molnar, Linus Torvalds, Andrea Arcangeli

Andrea Arcangeli posted a one-liner against 2.3.8, causing much jubilation. Ingo Molnar replied, "yess, this was it :) I have re-checked this place a hundred times yesterday but missed the bug ;)" and Linus Torvalds said, "Whee.. That was indeed a silly bug, and one that was hard as hell to find. Good work, Andrea!" He added, "I'll make a 2.3.9 with this (and some other cleanups and fixes), and then I can finally start accepting other patches again (assuming it seems to be stable)."

We Hope You Enjoy Kernel Traffic
 

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License, version 2.0.