Kernel Traffic

Kernel Traffic #64 For 24 Apr 2000

By Zack Brown

Mailing List Stats For This Week

We looked at 1435 posts in 6025K.

There were 415 different contributors. 188 posted more than once. 153 posted last week too.

The top posters of the week were:

1. Linus On devfs

4 Apr 2000 - 19 Apr 2000 (253 posts) Archive Link: "Suggested dual human/binary interface for proc/devfs"

Topics: FS: autofs, FS: devfs, FS: ramfs, Hot-Plugging

People: Linus Torvalds, Theodore Y. Ts'o, Horst von Brand, Alexander Viro

This was a very meaty thread, in which Linus Torvalds finally explained his thinking behind including 'devfs' in the kernel, and where he expects it to go in the future. In the course of a very interesting discussion, he started off by explaining his view about /proc files containing binary data, which are easier to parse, as opposed to ASCII strings, which are more human-readable:

I want the numberic crap to GO AWAY. It's stupid, it's unmaintainable, and I do _not_ want to have the same old "device number" problems in new guises.

A hierarchical name-space with true names is the obviously correct way to name kernel parameters. And doing that any other way than exporting it as a filesystem is stupid and wrong.

Guys, remember what made UNIX successful, and Plan-9 seductive? The "everything is a file" notion is a powerful notion, and should NOT be dismissed because of some petty issues with people being too lazy to parse a full name.

The same is true of ASCII contents. Binary files for configuration data are BAD. This is true for kernel interfaces the same way it is true of interfaces outside the kernel.

I tell you, you don't want the mess of having things like the Windows registry - we want to have dot-files that are ASCII, and readable with a regular editor, that you can do grep's on, and that can be manipulated easily with perl.

Think of /etc/password. And think of the STUPIDITIES that a lot of UNIX vendors made with their user managment databases - it happened not once, but multiple times. All in the name of unified tools (never mind the fact that none of the standard tools worked any more), and in the name of efficiency (the "parsing ASCII wastes CPU cycles").

Do people think that .bashrc would be better in a binary format that uses special tools to edit it? I don't think so. Don't make the kernel interface files fall into that classic _stupid_ black hole.

Plain-text ASCII is a goodness. Readable naming is a goodness. Yes, it takes more care, but the end result is simply _better_.

Don't expand on the failures of a numeric sysctl(). Let it die in peace. Trust me, in the big picture everybody will end up happier.

A number of people replied to this and there were several long and interesting threads; but later, Linus posted some sample code for a very simple RAM-based filesystem. He explained, "let me append the ramfs thing I wrote as a dynamic ramdisk, and note how it leaves everything in the VFS caches. This is how any virtual filesystem should end up looking some day (Al Viro is looking at making /proc use this approach too - get rid of the "proc_directory_entry" thing, and just use the VFS layer dentry instead)." He went on, "One of the arguments against /proc as opposed to sysctl() has been complexity and size: this thing is _small_. It uses the VFS logic (which has to be there anyway if you have any filesystem at all) to do all the maintenance. Add number parsing and output, and you have a sysctl() without all the sysctl()-specific crud."
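To make the "add number parsing and output" remark concrete, here is a minimal user-space sketch in C of what an ASCII parameter file boils down to: one routine that formats an integer as text for readers, and one that parses text written back. The names param_show() and param_store() are hypothetical and are not taken from Linus' ramfs code or from any kernel API.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format an integer parameter as the ASCII text a
 * reader of the file would see, e.g. "42\n".
 * Returns the number of bytes written, or -1 if buf is too small. */
int param_show(int value, char *buf, size_t len)
{
    int n = snprintf(buf, len, "%d\n", value);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Hypothetical helper: parse ASCII text written to the file and update
 * the parameter. One optional trailing newline is accepted.
 * Returns 0 on success, -1 on malformed or out-of-range input, in which
 * case *value is left untouched. */
int param_store(int *value, const char *text)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(text, &end, 10);
    if (end == text || errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return -1;
    if (*end == '\n')
        end++;
    if (*end != '\0')
        return -1;
    *value = (int)v;
    return 0;
}
```

Everything else (naming, lookup, permissions) would come from the filesystem layer, which is the point Linus was making about size.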

Finally, he added, "Btw, richard, this approach would lend itself quite well to devfs too," and this determined the course of future discussion. From this point on, the subthread focused on 'devfs'.

Richard Gooch, the author of 'devfs', had a lot of questions about Linus' suggestion. In particular, he didn't understand how (for example) a 'chmod' operation would persist across reboots unless 'devfs' explicitly added the complexity Linus was trying to remove and saved the information somewhere. But Linus explained that this didn't matter. He said that even something like 'tar' was good enough to save those permissions at shutdown and restore them at bootup, and that no other special infrastructure was really needed aside from the normal VFS. Richard couldn't get his head around this at first, and argued that a number of people felt the 'tar' approach to be an ugly kludge. Theodore Y. Ts'o was one of those who felt that way, and made a lengthy case about how devfs wanted to have it both ways; i.e. to make use of the virtual filesystem, and also have control over the underlying disk. As he put it, "This is really all about moving the dirt around. You can sweep it under the carpet, but the dirt is still going to be there. The different solutions simply sweep it under different corners of the rug, and with the lump in some cases better distributed than others. :-)"

Horst von Brand was also repelled by the 'tar' approach, and said, "Would you advocate using a filesystem for your home where you have to drag permissions and ownership for files out of a backup each time you log in?" And he added:

Keep devices in inodes on disk. To do persistence right, you'll have to keep them in inodes on disk anyway... so this whole kludge can go. Solves the different devices visible with different permissions in different chroot(2)s too, without any extra effort.

You know, there were flamewars about this exact point around devfs for _years_... and no magic solution came forth, not even an idea for a workable solution.

But Linus replied:

How many times a week do you reboot?

And how many scripts do you already run on reboot?

In short, what's wrong with running one more script on reboot and shutdown, if it means that the filesystem magically becomes much less involved.

Permissions can (and obviously in my opinion, should) be handled by a user-space agent. The kernel uses some default permissions in the absense of any other knowledge, but this is not all that dis-similar from what /proc does in the end.

You're arguing for removing devfs, so you _like_ to make it sound like there is no solution. I'm telling you that you're wrong.

Then later, he revealed his thinking more fully, saying, "I don't actually seriously think that using "tar" is a good idea." He went on:

What I envision is more like "autofs", in fact. I really like autofs, and it has nothing to do with the fact that Peter works in the room next-door. It has everything to do with facts like being able to lookup the information over NIS etc - which is exactly the kind of feature that I think in the long run would be extremely cool for /devfs. And it's still reasonably small, because the kernel doesn't make any real decisions.

Think of the administration advantages of something like _that_. And that's why I think devfs is eventually really cool - not because of what it does, but because of the _potential_.

The main reason for me to include devfs into the kernel was to try to move the bickering about it from pure bickering to a more productive level. Still bickering, but now, because it's integrated, I hope that not only discussion but changes will be more open to people who might not otherwise have cared.

Alexander Viro replied bitterly, "Arrgh. In that you've definitely succeeded..."

Elsewhere, Linus went on:

I'm definitely envisioning something like autofs, where the decisions on naming, permissions, meaning etc can be made in user space.

At the same time it's definitely more than autofs, due to the interactions with hot-plug devices, and the kernel having a way of telling the deamon when a new device gets added. So if you just extended autofs to handle special devices too, it wouldn't be very interesting yet.

At least I'm thinking of it as a "autofs"+"device manager" kind of thing.

2. ioctl() Bug Hunt And Fix

10 Apr 2000 - 12 Apr 2000 (8 posts) Archive Link: "Bug with FIONREAD for TCP?"

Topics: Ioctls

People: Richard Gooch, David S. Miller, Alexey Kuznetsov

Richard Gooch reported that the FIONREAD ioctl(), which normally returns the number of bytes that are immediately available to be read on a file descriptor, had been yielding 1 byte on freshly closed descriptors, for the past few months. He described the situation:

The server creates a socket, binds to a port and listens and accepts the first connection. It does a select(2) on the FD, and upon an input event, calls ioctl(2) with FIONREAD. This yields 1 byte readable. Any attempt to read that byte fails, as it's not really there.

The client connects, waits a second, then calls close(2). It doesn't matter if the client writes some data (which the server reads), the server still gets 1 byte readable when the connection closes. This really doesn't seem right, since there is no data to read. Besides, how else is the server to know that the "input" event is in fact a closure (I can't use poll(2))?

David S. Miller felt this was probably a bug, and asked how other systems behaved. Richard replied, "I've been using this technique for nearly 10 years, and FIONREAD has always yielded 0 bytes readable on close for ConvexOS, SunOS 4.x, Solaris 2.x, IRIX 5.x and 6.x, AIX 3.x, HP-UX ?.?, UNICOS ?.?, Ultrix, DEC OSF/1 (aka Digital Unix aka Tru64 Unix aka tomorrow's new name). Even Linux worked this way, until sometime in the last 4 months, while I was away (2.3.x series of course)." Alexey Kuznetsov posted a short patch and asked for comments, and Richard and David both agreed it was a dead-on fix.
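The technique Richard describes (select(2) for input, then FIONREAD to tell data from a closure) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a socketpair(2) of AF_UNIX stream sockets instead of a real TCP connection to stay short, and the function names are hypothetical.

```c
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Block until fd reports an input event, then ask how many bytes can be
 * read immediately. With the traditional behaviour described above, a
 * result of 0 after an input event means the peer has closed.
 * Returns the byte count, or -1 on error. */
int pending_bytes_after_select(int fd)
{
    fd_set rfds;
    int nbytes = 0;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    if (select(fd + 1, &rfds, NULL, NULL, NULL) < 0)
        return -1;
    if (ioctl(fd, FIONREAD, &nbytes) < 0)
        return -1;
    return nbytes;
}

/* Assumption for illustration: a connected AF_UNIX stream pair stands in
 * for the TCP client and server. The "client" end closes immediately and
 * the function reports what the "server" end sees. */
int demo_close_detection(void)
{
    int sv[2];
    int n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    close(sv[1]);                       /* client closes */
    n = pending_bytes_after_select(sv[0]);
    close(sv[0]);
    return n;
}
```

A server loop built this way treats a count of 0 as end-of-connection and anything larger as data to read, which is why a spurious count of 1 on close breaks it.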

3. Cornering A Slowdown

10 Apr 2000 - 13 Apr 2000 (38 posts) Archive Link: "Asynch I/O gets easily overloaded on 2.2.15 and 2.3.99"

Topics: Big O Notation, Ioctls

People: Stephen C. Tweedie, Andrea Arcangeli, Andi Kleen, Jeff V. Merkey, Dimitris Michailidis, Andre Hedrick, Bert Hubert

Jeff V. Merkey noticed that increasing the LRU of his Netware Filesystem to over 3000 buffers would slow Linux down by a huge amount. This didn't seem to jibe with what he knew of LRUs. As "Unix Internals: The New Frontiers" by Uresh Vahalia puts it on page 285, "When the buffer is not locked, it is kept on a free list. The free list is maintained in least recently used (LRU) order. Whenever the kernel needs a free buffer, it chooses the buffer that has not been accessed for the longest time. This rule is based on the fact that typical system usage demonstrates a strong locality of reference: recently accessed data is more likely to be accessed before data that has not been accessed in a long time. When a buffer is accessed and then released, it is placed at the end of the free list (it is at this point most recently used). As time passes, it advances towards the list head. If at any time it is accessed again, it returns to the end of the list. When it reaches the head of the list, it is the least recently used buffer and will be allocated to any process that needs a free buffer."

Normally, adding more buffers to the cache would tend to speed things up rather than slow them down, and Jeff felt he'd tracked this problem to the kernel itself, rather than his NWFS code. Andi Kleen replied that he felt the problem was probably the elevator code from ll_rw_blk. The elevator algorithm sorts I/O requests so as to minimize the number of scans over the hard drive. Jeff said he'd do some benchmarks and report on them, but Stephen C. Tweedie replied, "The elevator doesn't sort buffers, it only sorts requests --- and there is a strict limit on how many requests we can have outstanding at once. I wouldn't have thought that elevator scalability would have had anything to do with performance going downhill above 3000 buffers." But he agreed that only profiling could tell for sure.

Andrea Arcangeli also felt that Andi must be wrong about the elevator code being the problem. Using big-O notation to describe how the worst-case running time of various algorithms grows with the size of the input (O(1) represents a better running time than O(n), which is better than O(n^2), etc.; see the end of this article), he said that the

new elevator merges requests in O(1) if they are requesting contigous sectors. 2.2.x algorithm is O(N) also for merging requests of contiougs I/O instead.

If they are seeking all over the place then, yes we have an O(n) complexity there but the queue is limited. It's hard limit and you are hitting such limit _each_ time you write to disk some mbyte of data. So if you have not hangs while kupdate runs, then the elevator isn't going to be the culprit.

But Andi countered, "I was more thinking about lots of wakeups in get_request_wait causing the elevator to do much work (kupdate is single threaded so there are only a few wakeups). With lots of threads calling ll_rw_block in parallel it may look differently." Andrea replied (semi-) historically, "With the previous elevator code in 2.3.5x if there wasn't available requests you are right, the revalidation was quite expensive. However with the 2.3.99-prex I fixed that," and went on to say that the only remaining slowdown in the code was still very fast and shouldn't account for Jeff's problem.

There was a bit more technical discussion, and then Jeff posted the results of his benchmarks, almost exactly 24 hours after the thread began. He said:

I tries runs of 500 buffers, 1000 buffers, 2000 buffers, 3000 buffers, and 4000 buffers.

And the winners are!

  1. ll_rw_blk (and add_request/make_request) (oink, oink..... oink, oink ... rooting around down in the hardware -- I think it's looking for truffles)

    and a close second:

  2. ide_do_request
    ide_delay_50ms (huge!!!)

There were a lot of replies to this. Dimitris Michailidis shared his experience regarding item 1 of Jeff's report:

I suspect that add_request/make_request are not the real culprits here. My experience with heavy disk I/O tests is that the bottleneck is usually __get_request_wait(), but that executes with interrupts off so profiling charges the callers instead. Here's an excerpt from a call graph profile (kernel is 99-pre3):

                0.87    1.19   20662/20662       generic_make_request [22]
[51]     0.5    0.87    1.19   20662         __get_request_wait [51]
                1.15    0.04  200799/379645      schedule [47]

As you can see processes sleep/wake_up a lot in __get_request_wait and generate more than half of all the scheduling activity. This is despite the wake_one, and actually having all processes wake up simultaneously doesn't make things all that worse (increases calls to schedule() by about 20% in my case). The real bottleneck under disk I/O load is the single request array, IMO.

Jeff was very interested in this, and said he'd post new numbers as soon as he had them. Elsewhere, Andre Hedrick replied to item 2 of Jeff's benchmarks, particularly the huge ide_delay_50ms time, saying, "You are stuck! It is a hardware issue in clocking..... I have begun adjusting to a schedule to easy the loop load and it appears harmless and stable."

The rest of the discussion focused on an oops Jeff had reported with his benchmarks, which was subsequently fixed. No actual solution surfaced for Jeff's original slowdown.
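For reference, the LRU free list from the Vahalia excerpt quoted at the start of this article can be sketched in a few lines of C. This is a generic illustration, not code from NWFS or from the Linux buffer cache, and all names are hypothetical. Every operation is constant-time, which is why a larger buffer count would not be expected to slow the list handling itself.

```c
#include <stddef.h>

struct buffer {
    int id;
    struct buffer *prev, *next;
};

/* Circular doubly linked list with a sentinel node:
 * head.next is the least recently used buffer, head.prev the most recent. */
struct free_list {
    struct buffer head;
};

void free_list_init(struct free_list *fl)
{
    fl->head.prev = fl->head.next = &fl->head;
}

static void unlink_buffer(struct buffer *b)
{
    b->prev->next = b->next;
    b->next->prev = b->prev;
    b->prev = b->next = NULL;
}

/* A released buffer goes to the end of the list (most recently used). */
void buffer_release(struct free_list *fl, struct buffer *b)
{
    b->prev = fl->head.prev;
    b->next = &fl->head;
    fl->head.prev->next = b;
    fl->head.prev = b;
}

/* A buffer that is accessed again while on the list returns to the end. */
void buffer_touch(struct free_list *fl, struct buffer *b)
{
    unlink_buffer(b);
    buffer_release(fl, b);
}

/* Allocation takes the buffer at the head (least recently used).
 * Returns NULL when the list is empty. */
struct buffer *buffer_get_free(struct free_list *fl)
{
    struct buffer *b = fl->head.next;

    if (b == &fl->head)
        return NULL;
    unlink_buffer(b);
    return b;
}
```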

There were two descriptions of big-O notation on the list this week, both occurring in a completely unrelated thread. Bert Hubert described it as follows:

O(whatever) is a notation by which you can describe how the running time of your algorithm depends on the amount of input. Finding books in a randomly ordered library is O(N) - if you have twice as many books to search, it will take twice as long.

In a perfectly ordered library, the time it takes is described by O(1) - you will always find your book in the same time. Perhaps O(sqrt(N)) would be more appropriate, because if there are more books, you may need to walk further before you are at the right place. The sqrt is because your library is two dimensional.

Naive sorting is O(N^2) - which can be good or bad. With large N, it is most surely bad. But you should know that the *actual* running time is described by something that looks more like t= offset + alpha*O(N^2) - if offset or alpha are large, the O(N^2) algorithm might be faster for small N.

Do yourself a favor and get the Knuth trilogy. Not meant to be read cover to cover, but very useful none the less.
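Bert's library analogy can be checked by simply counting steps. The sketch below (hypothetical function names, with an array of integers standing in for the shelf) counts how many books are examined in an unordered search, and contrasts that with a lookup where the slot is known in advance.

```c
#include <stddef.h>

/* Unordered library: examine books one by one until the wanted one turns
 * up. Returns the number of books examined, which is n in the worst case,
 * so doubling n doubles the work: O(N). */
size_t linear_search_steps(const int *shelf, size_t n, int wanted)
{
    size_t steps = 0;

    for (size_t i = 0; i < n; i++) {
        steps++;
        if (shelf[i] == wanted)
            break;
    }
    return steps;
}

/* Perfectly ordered library: the slot of the book is known up front, so
 * fetching it is a single step no matter how large n is: O(1).
 * Stores the book in *out when slot is valid; always returns 1. */
size_t indexed_lookup_steps(const int *shelf, size_t n, size_t slot, int *out)
{
    if (slot < n)
        *out = shelf[slot];
    return 1;
}
```

Searching for a missing book among 1000 and then 2000 entries examines 1000 and 2000 books respectively, while the indexed lookup stays at one step in both cases.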

Andrea himself described it like this:

Note it's not 0(1) but O(1). In this context it means that right now when you want a page, you get it at a fixed cost that doesn't change in function of how many pages are available and how many of them are of the kind we are interested about.

Take these two symbols:


N is a variable number. Assume N to be infinite. O(N) means that an algorithm needs to check all the N objects to give you a result, and in our case it will take infinite time. If the algorithm would had O(1) complexity it would instead take the usual fixed time despite of how many N objects exists in the system.

4. Difficulties Merging Drivers For 2.3.99

11 Apr 2000 - 13 Apr 2000 (18 posts) Archive Link: "What to do with 3c59x.c?"

Topics: Hot-Plugging, Networking, PCI

People: Andrew Morton, Jeff Garzik, Linus Torvalds, David Ford

Andrew Morton announced/asked:

I have pretty much completed the 2.3.99 merge of drivers/net/3c59x.c and drivers/net/pcmcia/3c575_cb.c.

This driver now supports

3c59x series
3c90x series
3c980 series
3c555 series
3c575 series (Cardbus).

That's EISA, PCI and Cardbus all in the one driver. Fairly cleanly.

I wish to get a patch out for some testing. (As it claims to support 27 different NICs it's gonna need some...)

My question is: how should this driver be integrated into the kernel tree?

Any suggestions?

After some discussion, he posted again, saying:

aargh. I'm screwed.

I'll do the cut-n-paste from Rules.make into pcmcia/Makefile for the while.

Really, mkdep should be taught to recur. This probably explains why a good old 'make clean' sometimes makes wierd things stop happening.

To Andrew's statement that the source should be compiled twice, Jeff Garzik replied, "What makes you think this? The source should not need to be compiled twice. The PCI driver interface is designed to seamlessly support hotplug and non-hotplug." Andrew replied, "One example: 3c59x.o could have been compiled for static (non-modular) use. Copying that onto 3c575cb.o and insmodding it wouldn't work very well :)" At this point, Linus Torvalds came in with:

Andrew, that's _exactly_ what I do with "tulip.o" - I just compile it into the kernel.

And it works wonderfully well with pcmcia.

PCMCIA does _not_ require modules. In fact, it would be a bug to load the _cb module when the non-cb one is already compiled into the kernel. There's absolutely no reason to have two object files, but if you want to maintain compatibility with the old pcmcia scripts you could have a symlink (/lib/modules/xxx/pcmcia/3c575cb.o -> /lib/modules/xxx/net/3c59x.o)

However, David Ford reported that compiling tulip directly into the kernel would not boot if he had a tulip PCMCIA card installed. Linus asked for more information, but the thread ended there.

5. A New Way To Clean Up /proc

11 Apr 2000 - 17 Apr 2000 (42 posts) Archive Link: "An alternative way of populating /proc"

Topics: Disks: IDE, FS: procfs, Virtual Memory

People: Matt Aubury, Linus Torvalds

Matt Aubury proposed a way of restructuring '/proc' to make it useful again. He described:

The recent debate about the multitude of possible formats for data in /proc caused me to think about a short-hand way of populating a /proc directory hierarchy. This scheme uses a format string to describe the hierarchical data layout, so:

create_proc_entries(NULL, "test:{bar:{x:%d,y:%d,z:%d},foo:%f}", &x, &y, &z, foo_fun);

creates a "/proc/test/" directory, which further contains a subdirectory "bar" and a file "foo". The "bar" subdirectory contains three files "x", "y" and "z".

The formatting argument "%d" takes a pointer to an integer. When reading such a file (in this case "x", "y", or "z"), the value is shown as ascii. Writing to the file (again in ascii) updates the value. The "%f" formatting argument allows you to pass an arbitrary user function for generating output. Clearly, there are potentially quite a number of standard/useful formatting arguments.

I've done a quick, dirty, unfinished implementation of this idea, so people can get the picture. Attached.

Many people will hate this because (1) it's doing parsing within the kernel, (2) it tends to favour ascii I/O, (3) it tends to favour deep directory hierarchies,(4) it uses recursion :-)

On the other hand, (1) it's very lightweight (lsmod shows size=744 including demo code), (2) it makes creating a lot of /proc entries stupidly simple, (3) it might reduce code duplication.
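To show how such a format string maps onto a directory tree, here is a small user-space sketch in C that walks a string in Matt's first syntax ("name:{...}" for directories, "name:%c" for files) and lists the paths it describes. It is not Matt's implementation: the function names are hypothetical, nothing is created in procfs, and the format codes are only recorded, not interpreted.

```c
#include <stdio.h>
#include <string.h>

/* Append "path kind\n" to out, truncating silently if out is full. */
static void emit(char *out, size_t outlen, const char *path, const char *kind)
{
    size_t used = strlen(out);

    if (used < outlen)
        snprintf(out + used, outlen - used, "%s %s\n", path, kind);
}

/* Parse one "name:contents" entry at *fmt, where contents is either
 * "{entry,entry,...}" (a directory) or a two-character "%c" code (a file).
 * Advances *fmt past the entry. Returns 0 on success, -1 on bad syntax. */
static int walk_entry(const char **fmt, const char *parent,
                      char *out, size_t outlen)
{
    char path[256];
    const char *p = *fmt;
    size_t namelen = strcspn(p, ":{},%");
    int n;

    if (namelen == 0 || p[namelen] != ':')
        return -1;
    n = snprintf(path, sizeof path, "%s/%.*s", parent, (int)namelen, p);
    if (n < 0 || n >= (int)sizeof path)
        return -1;
    p += namelen + 1;

    if (*p == '{') {
        emit(out, outlen, path, "dir");
        p++;
        if (*p == '}') {                /* empty directory */
            p++;
        } else {
            for (;;) {
                if (walk_entry(&p, path, out, outlen) < 0)
                    return -1;
                if (*p == ',') {
                    p++;
                    continue;
                }
                if (*p == '}') {
                    p++;
                    break;
                }
                return -1;
            }
        }
    } else if (*p == '%' && p[1] != '\0') {
        char kind[3] = { '%', p[1], '\0' };

        emit(out, outlen, path, kind);
        p += 2;
    } else {
        return -1;
    }
    *fmt = p;
    return 0;
}

/* List every path described by fmt, rooted at "/proc", one per line. */
int list_proc_entries(const char *fmt, char *out, size_t outlen)
{
    if (outlen == 0)
        return -1;
    out[0] = '\0';
    if (walk_entry(&fmt, "/proc", out, outlen) < 0)
        return -1;
    return *fmt == '\0' ? 0 : -1;
}
```

Called with the string from Matt's example, it lists /proc/test and /proc/test/bar as directories, the three files x, y and z under bar with code %d, and /proc/test/foo with code %f. The recursion mirrors the nesting of the braces, which is objection (4) in Matt's own list.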

Everyone including Linus Torvalds was very impressed with this idea. There was a fair bit of discussion, which Matt summarized:


The "create_proc_entries" suggestion has caused quite a bit of interest, so I'm hoping to make a proper patch this weekend. A bunch of issues have been raised (sorry to address them all in one place):

The remaining issues that I see are:

Poking around my 2.2.12 /proc I see the following in the root owned parts:

187 -rw-r--r--
103 -r--r--r--
 12 -r-------- (kcore, kmsg, IDE stuff -- don't know why)
 11 -rw------- (IDE, some firewall stuff, some VM stuff)
 52 dr-xr-xr-x

So it seems that the vast majority of cases can be boiled down to just whether or not root is allowed to write.

As for data types, integer I/O (decimal and hex) seems to be very common. String output is also frequent. We could potentially also do range checked integers, string input, arrays... How useful would these be? We'll still have the ability to pass function pointers for output, and I'll be adding support for generic input too.

As I say, I'll try to put something better together over the weekend. I'd be very interested in any more thoughts people have.

There was more discussion, and Matt posted a new patch, and said:

As promised I've created an "proper patch" version of the create_proc_entries API. Hopefully I've addressed a few of the issues raised, without making the interface (or code size) too bloated. According to my numbers (which I'm not sure I believe) it adds only 502 bytes to the size of vmlinux.

The new API is described in the DocBook'd code comments in the patch; scroll down for details. (Although I've had trouble actually getting the DocBook documentation to build properly -- I'd welcome any tips).

Once the API has settled down I'll submit it for inclusion: I think it would be good if it made it into 2.4.0, but if you don't agree please voice your objections.

The DocBook code comments in the patch explained:

This function is used to create a hierarchy of files and directories in procfs.

The format for a file is "name:contents", where name is the filename, and contents is zero or more format codes (described below). The contents may be prefixed with an asterix ("*") to specify that the file is to be created root writable. Otherwise file permissions are readable by all, writable by none.

The format for a directory is "name/{contents}", where name is the directory name, and contents are zero or more files or directories, separated by commas. Directory permissions are always readable and executable by all.

The valid format codes are:

%d - Decimal integer, input and output. Supply an (int *) as the argument.

%x - Hexadecimal integer, input and output. Supply an (int *) as the argument.

%s - Zero-terminated string, output only. Supply a (char *) as the argument.

%v - "Virtual" entry, supply a (struct proc_dir_entry **) pointer as the argument and this will be set the the entry that is created. This allows you to either tweek the settings once the entry is created. For instance, you may use this to arbitrarily set the read_proc and write_proc functions.

The function returns zero on success, and a negative error code on an error. Check the source for the meaning of the error code.

Example use is:

int beta;
char gamma[] = "Gamma";
struct proc_dir_entry *entry;
create_proc_entries(NULL, "alpha/{beta:%d,gamma:%s%v}", &beta, gamma, &entry, NULL);
entry->uid = 500;

6. SSL Accelerator Cards

11 Apr 2000 - 16 Apr 2000 (7 posts) Archive Link: "SSL Accelerator cards"

Topics: Networking, PCI

People: Michael T. Babcock, Dan Kegel

Michael T. Babcock asked if anyone was working on support for crypto-accelerator cards such as the Compaq AXL 200 SSL card. He added, "I realise that true support for such beasts would need to be done within the software (such as Apache), but cryptographic module hooks in the kernel (as NT now does) would probably be useful for plugging drivers into." There was some small discussion, and Dan Kegel posted a bunch of relevant URLs. Later, he added that after more research, he'd put up a dedicated SSL web page. He also described:

I list a bunch of products and several reviews on that page. Here are the vendors that make PCI cards that support Linux:

Phobos makes a transparant SSL proxy thingy for $1900 that might do nicely, and wouldn't require any software changes. It can live in a Linux box, but it just gets power and configuration from the PCI bus; all data goes via ethernet only, not the PCI bus.

nCipher makes a fast card that supposedly supports both Solaris and Linux. It probably goes for a little more than the Phobos card, and can handle 75 or so SSL connections / second.

IBM makes a well-supported crypto coprocessor, but it's based on a 486, and you just have to wonder :-)

7. User Mode Port In The Main Tree

11 Apr 2000 - 13 Apr 2000 (6 posts) Archive Link: "user-mode port 0.19-2.3.99-pre4"

Topics: User-Mode Linux

People: Jeff Dike, Alan Cox, Pavel Machek

Jeff Dike announced the latest release of the user-mode Linux port and gave URLs to the SourceForge page and the actual files. Pavel Machek really liked it, and asked if Jeff would be submitting it to Linus for inclusion in the main tree. Jeff replied, "My plan is to submit it to Linus when he opens up 2.5. I'm also thinking about seeing about getting it into an early 2.4. If those IBM people can get the S390 port into 2.2, I think I ought to be allowed to get um into 2.4..." Alan Cox replied, "S/390 is a target for early 2.4 or 2.4pre too. I'd like to see UML in. Its security value for isolating virtual machines is fascinating." Jeff did a little jig, and asked:

Are there guidelines anywhere about how best to submit a largish body of code so it doesn't get dinged for non-functional reasons? One thing that I know about is changing my 2-char indents to 8-char ones. Are there others?

Would some 2.4 releases be better to target than others (i.e. should I try to go for a 2.4pre in preference to 2.4.low)?

I'm currently cleaning things up in preparation for 2.4/2.5, so the sooner I know what needs doing, the sooner I can get going on them.

Alan replied, "Format it up nicely so it follows the coding style. Read over it and look at all the things that make you think uggh. Then seperate out the changes to the mainstream code and look very hard at them and see if you can sanely avoid them or if they are pointers to generic code that needs fixing or are just plain sensible to add. Then send me a set of diffs." And that was it.

8. Benchmarks Comparing Linux And NT

13 Apr 2000 (6 posts) Archive Link: "Performance data..."

People: Rik van Riel, Dan Kegel, Ashok Raj

Ashok Raj asked for benchmarks comparing Linux and NT for I/O and networking; and some folks gave pointers to the Mindcraft test. Rik van Riel put in:

After the (carefully tuned) Mindcraft test, which painfully pointed out some weak points in early 2.2 kernels, Linux has been improved quite a bit.

There is, however, a somewhat more balanced patch (stresses the whole system, not just a few points) done by the German magazine C'T. You should be able to find it on their site:

Dan Kegel also gave a pointer to his page containing all the benchmarks he was aware of on the web.

9. More 'devfs' Discussion

14 Apr 2000 - 17 Apr 2000 (9 posts) Archive Link: "question on your MOUNT_REWRITE changes."

Topics: FS: autofs, FS: devfs, FS: procfs, Feature Freeze

People: Alexander Viro, Jamie Lokier, David Parsons

In the course of discussion, Alexander Viro mentioned, "lookup_dentry() is going out - walk_name() is the replacement." Jamie Lokier raised an eyebrow, and asked sardonically, "Dentries have completely changed their meaning during the second feature freeze this year? How is it you're able to thoroughly mangle this stuff and I can't get a simple DT_DIR patch looked at?" Alexander first peacefully pointed out that the new behaviour was not all that different from the old. But with great venom he added that for the dentry code changes "you can be grateful to devfs. I would be _glad_ to postpone these changes. But now they became pretty much mandatory - thanks to the fs with sufficient set of methods and need to do multiple mounts." He also posted this lengthy technical explanation (quoted in full):

the main problem with multiple mounts is the following:

  1. We should never have more than one dentry for a writable directory. Print it and hang it on the wall. It's a fundamental requirement. There is no way to work around it in our VFS. I tried to invent a scheme that would allow that for more than a year. And I've done most of namespace-related code in our VFS since the moment when Bill Hawes stopped working on it, so I suspect that right now I have the best working knowledge of that stuff. There is no fscking way to survive multiple dentries for writable directory without major lossage. Period.
  2. Consequently, we should not have several dentry trees for the same filesystem.
  3. Consequently, if we want to have the filesystem mounted in several places we have to share the dentry tree. Including the root dentry.
  4. Consequently, ->d_covers and ->d_mounts are BAD ideas. We have to separate that information (mount linkage) from the struct dentry.

Notice that unified tree consists of chunks that come from individual filesystems. And linkage between them (what is mounted where) is already kept in a different way than the linkage between dentries within the chunk. Chunks themselves form a tree. Each node in that tree corresponds to one mountpoint and thus to the directory tree of filesystem mounted there. 1--4 means that we have to separate that 'tree of chunks' from dentry trees and make the nodes in that tree _refer_ to dentry trees. Moreover, we must permit to have several chunks refering to the same dentry tree - that's precisely what we get for multiple mounts.

Let's explore what changes it would require. First of all, dentry becomes insufficient for walking through the unified tree. _If_ we want to do such a walk we also need to know which chunk we are talking about. HOWEVER, we rarely need such walking. Most of the kernel couldn't care less for the chunk we are in - if you want rmdir() you don't care about the mounting, you just want the bloody dentry and that's enough. Even more so for read/write/ stat/lstat/readlink/almost everything. Almost all operations are local to thei individual filesystem and don't care where and how many times it's mounted. So we can keep using dentries almost everywhere we used to.

Now, let's see what _will_ change. First of all, we should carry the information about the chunk we are in through the lookup. Just as we carry the pointer to dentry we are in. Not a big deal. We should be able to keep track of crossing the chunk boundaries, but we have to do it anyway. However, we should know which chunk we are in when we start walking. IOW, we need to

  1. know the chunk where the cwd is.
  2. know the chunk where the root is.

Easy enough - we need to extend fs_struct a bit and take care to set both "dentry" and "chunk" components upon chdir()/chroot()/fchdir(). Not a big deal for the first two, but the third requires to store the chunk in struct file of opened directory. IOW,

  3. in addition to f_dentry we need to store the chunk.

Fine. There are not that many places where we open files (see below). We also need to

  4. know which chunk we are in when we follow the link.

Trivial, since we keep track of it in the sole caller of ->follow_link() anyway. That's it for namespace walking. Another problem is that we need to know the chunk if we ever want to know the full pathname of object. It adds to the list above

  5. chunk where the swap component sits.

That's it. The rest is covered in (1)--(3).
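
To see why the full pathname of an object needs the chunk as well as the dentry, here is a minimal sketch in C. It is a toy model, not kernel code: the struct layouts, the field names and the functions prepend() and full_path() are all hypothetical and chosen only for illustration. The walk goes rootwards inside one dentry tree and, on reaching the root of that tree, hops to the mountpoint dentry in the parent chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy dentry (hypothetical layout): a name plus a parent link inside
 * ONE filesystem's own tree. It knows nothing about mounts. */
struct dentry {
    const char *name;        /* "" for the root of a filesystem */
    struct dentry *parent;   /* NULL for the root of a filesystem */
};

/* Toy chunk (hypothetical layout): one mount of a dentry tree. */
struct chunk {
    struct chunk *parent;        /* NULL for the root chunk */
    struct dentry *mountpoint;   /* dentry in the parent chunk we sit on */
    struct dentry *root;         /* root dentry of the mounted filesystem */
};

/* Prepend "/name" in front of the string starting at buf[*pos]. */
static int prepend(char *buf, size_t *pos, const char *name)
{
    size_t len = strlen(name);

    if (*pos < len + 1)
        return -1;               /* not enough room left */
    *pos -= len;
    memcpy(buf + *pos, name, len);
    buf[--*pos] = '/';
    return 0;
}

/* Build the full pathname of (mnt, d) at the end of buf.
 * Returns a pointer into buf, or NULL if buf is too small or d does
 * not belong to the dentry tree that mnt refers to. */
const char *full_path(const struct chunk *mnt, const struct dentry *d,
                      char *buf, size_t size)
{
    size_t pos;

    assert(size > 0);
    pos = size - 1;
    buf[pos] = '\0';
    for (;;) {
        if (d == NULL)
            return NULL;         /* walked off a tree we were not in */
        if (d == mnt->root) {
            if (mnt->parent == NULL)
                break;           /* reached the root of the unified tree */
            d = mnt->mountpoint; /* cross the chunk boundary rootwards */
            mnt = mnt->parent;
            continue;
        }
        if (prepend(buf, &pos, d->name) < 0)
            return NULL;
        d = d->parent;
    }
    if (buf[pos] == '\0') {      /* the object is the global root itself */
        if (pos == 0)
            return NULL;
        buf[--pos] = '/';
    }
    return buf + pos;
}
```

With one filesystem mounted on both /mnt/a and /mnt/b, the same dentry yields two different pathnames depending on which chunk is passed in, which is exactly the information a bare dentry cannot supply.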

Now, (ignoring the stuff with the places where we open files) we need to choose the structure that would represent nodes in the "tree of chunks". Fortunately, we already had such a structure - struct vfsmount. It was a natural candidate for the per-mountpoint stuff, just as struct super_block is for mountpoint-independent data. That required moving the quota options into struct super_block (obviously). With that done we got the material for chunks tree.

What do we need to know about every chunk? Well, obviously we need dentry of mountpoint, root dentry of mounted fs and parent chunk. That allows for trivial crossing the mountpoints, erm, rootwards. For crossing them in other direction (into the mounted fs) we need a bit more - several mountpoints _may_ (normally will not, but we have to account for that) have the same dentry (in different chunks, indeed). So we have a _set_ of chunks over the dentry of mountpoint. They all have different parent chunks, thus crossing the mountpoint turns into

find a chunk that would
belong to set over current dentry and
had the parent equal to current chunk.

Data structure for that is a separate story and final choice will take profiling for different uses, but one of the trivial (and efficient in normal cases) variants is the cyclic list of vfsmounts anchored in dentry. In absence of the case when two mountpoints have same dentry (in different chunks) it's as efficient as the old scheme was.
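
The "cyclic list of vfsmounts anchored in dentry" variant can be sketched as follows. This is an illustrative toy in C under stated assumptions, not the actual kernel structures: the field names (mounts, next_on_dentry) and the helpers attach_mount(), find_child_mount() and cross_up() are hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

struct vfsmount;

/* Toy dentry (hypothetical layout). */
struct dentry {
    struct dentry *parent;     /* NULL for the root of a filesystem */
    struct vfsmount *mounts;   /* one member of the cyclic list of mounts
                                  sitting on this dentry, or NULL */
};

/* Toy node of the "tree of chunks" (hypothetical layout). */
struct vfsmount {
    struct vfsmount *parent;          /* parent chunk, NULL for the root */
    struct dentry *mountpoint;        /* dentry we are mounted on */
    struct dentry *root;              /* root dentry of the mounted fs */
    struct vfsmount *next_on_dentry;  /* cyclic list anchored in mountpoint */
};

/* Record that 'mnt' is mounted on 'mountpoint' inside chunk 'parent'. */
void attach_mount(struct vfsmount *mnt, struct vfsmount *parent,
                  struct dentry *mountpoint, struct dentry *root)
{
    assert(mountpoint != NULL);
    mnt->parent = parent;
    mnt->mountpoint = mountpoint;
    mnt->root = root;
    if (mountpoint->mounts == NULL) {
        mnt->next_on_dentry = mnt;    /* single-element cycle */
        mountpoint->mounts = mnt;
    } else {
        mnt->next_on_dentry = mountpoint->mounts->next_on_dentry;
        mountpoint->mounts->next_on_dentry = mnt;
    }
}

/* Crossing into a mounted fs: find the chunk that belongs to the set
 * over dentry 'd' and has parent equal to the current chunk 'cur'. */
struct vfsmount *find_child_mount(struct vfsmount *cur, struct dentry *d)
{
    struct vfsmount *first = d->mounts;
    struct vfsmount *m = first;

    if (first == NULL)
        return NULL;                  /* nothing is mounted here */
    do {
        if (m->parent == cur)
            return m;
        m = m->next_on_dentry;
    } while (m != first);
    return NULL;                      /* mounted, but not in this chunk */
}

/* Crossing rootwards: from the root of a mounted fs step onto the
 * mountpoint in the parent chunk. Returns 1 if a boundary was crossed. */
int cross_up(struct vfsmount **mnt, struct dentry **d)
{
    if (*d != (*mnt)->root || (*mnt)->parent == NULL)
        return 0;
    *d = (*mnt)->mountpoint;
    *mnt = (*mnt)->parent;
    return 1;
}
```

In the common case each mountpoint dentry carries a one-element cycle, so the search ends on the first comparison. The loop only does extra work when the same dentry tree is mounted several times and something else is mounted on the same dentry within more than one of those chunks.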

As for the files opening, the problem was in the binfmt drivers, mostly due to the fact that do_execve() left opening to the ->load_binary(). Which was a BAD idea, since it led to code duplication in all of them. Fixed by shifting the opening into exec.c and passing struct file instead of struct dentry.

That's mostly it as far as design counts - everything else was the matter of coding, choosing decent interfaces, etc. and is the matter of putting it into the tree in small steps. Large part is already there and I'd rather postpone the description of all gory details until all this stuff will go. Infrastructure is already there (almost all - the last piece is sent to Linus and it deals with the <expletives> /usr/gnuemul/{solaris,etc.} handling). Once it will be in I'll post the description of new interfaces.

Main part of pending patches consists of almost complete rewrite of fs/super.c, so changes in that area are _not_ a good idea right now. Filesystems are already there - in that part all changes are already done, except the changes in autofs - it is intimately tied to the mount-related stuff. I have this stuff done, so it's not going to be a problem.

Resulting design gives a lot of interesting opportunities, e.g. it allows to store all metadata in the dentry tree and don't bother with 'backplane' trees a-la current procfs. Other obvious things include loopback mounts (add a new vfsmount and we are done) and dealing with filesystems not visible to any user (create a vfsmount visible only to the kernel and you are done).

Folks, could you please wait until the interface of lookup will settle down? I promise to give complete description of the interfaces and data structures once the thing will be there. Right now it's in transit. I hope that mess above gives some idea of where we are moving to - it definitely contains all crucial ideas.

The only reply came from David Parsons, who said, "Why should I comment? For once you've actually bothered to put actual technical commentary in your email instead of the traditional unsupported whining about Richard's coding style. This is *good* and you should keep doing it."

10. Things To Do Before 2.4: Saga Continues

16 Apr 2000 - 17 Apr 2000 (8 posts) Archive Link: "Linux 2.3.99pre6-3 job list"

Topics: Compression, Disk Arrays: RAID, Disks: IDE, Disks: SCSI, FS: Coda, FS: FAT, FS: NFS, FS: NTFS, FS: UMSDOS, I2O, Networking, PCI, Power Management: ACPI, SMP, Security, USB, Virtual Memory, VisWS

People: Alan Cox, Alexander Viro

Alan Cox posted his latest job list:

  1. Fixed
    1. Tulip hang on rmmod (fixed in .51 ?)
    2. Incredibly slow loopback tcp bug (believed fixed about 2.3.48)
    3. COMX series WAN now merged
    4. VM needs rebalancing or we have a bad leak
    5. SHM works chroot
    6. SHM back compatibility
    7. Intel i960 problems with I2O
    8. Symbol clashes and other mess from _three_ copies of zlib!
    9. PCI buffer overruns
    10. Shared memory changes change the API breaking applications (eg gimp)
    11. Finish softnet driver port over and cleanups
    12. via rhine oopses under load ?
    13. SCSI generic driver crashes controllers (need to pass PCI_DIR_UNKNOWN..)
    14. UMSDOS fixups resync
    15. Make NTFS sort of work
    16. Any user can crash FAT fs code with ftruncate
    17. AFFS fixups
  2. In Progress
    1. Merge the network fixes (DaveM)
    2. Merge 2.2.15 changes (Alan)
    3. Get RAID 0.90 in (Ingo)
    4. Finish I2O merge
    5. Still some SHM bug reports

  3. Fix Exists But Isnt Merged
    1. Signals leak kernel memory (security)
    2. msync fails on NFS
    3. Lan Media WAN update for 2.3
    4. Semaphore races
    5. Semaphore memory leak
    6. Exploitable leak in file locking
    7. Merge the RIO driver (probably do post 2.4.0 as it is large) (in AC tree)
    8. S/390 Merge (merged in AC tree)
    9. 1.07 AMI MegaRAID
    10. Mark SGI VisWS obsolete
    11. 64bit lockf support
    12. UMSDOS was broken by the fs changes
    13. Get the Emu10K merged
    14. TTY and N_HDLC layer calls poll_wait twice per fd and corrupts memory
    15. ATM layer calls poll_wait twice per fd and corrupts memory
    16. Random calls poll_wait twice per fd and corrupts memory
    17. PCI sound calls poll_wait twice per fd and corrupts memory
    18. sbus audio calls poll_wait twice per fd and corrupts memory

  4. To Do
    1. Restore O_SYNC functionality
    2. Fix eth= command line
    3. Trace numerous random crashes in the inode cache
    4. Fix Space.c duplicate string/write to constants
    5. VM kswapd has some serious problems
    6. vmalloc(GFP_DMA) is needed for DMA drivers
    7. put_user appears to be broken for i386 machines
    8. Fix module remove race bug (mostly done - Al Viro)
    9. Test other file systems on write
    10. Directory race fix for UFS
    11. Security holes in execve()
    12. Audit all char and block drivers to ensure they are safe with the 2.3 locking - a lot of them are not especially on the open() path.
    13. Stick lock_kernel() calls around driver with issues too hard to fix nicely for 2.4 itself
    14. IDE fails on some VIA boards (eg the i-opener)
    15. PCMCIA/Cardbus hangs, IRQ problems, Keyboard/mouse problem (may be fixed ?)
    16. Use PCI DMA by default in IDE is unsafe (must not do so on via VPx x<3)
    17. Use PCI DMA 'lost interrupt' problem with some hw [which ?]
    18. Crashes on boot on some Compaqs ?
    19. pci_set_master forces a 64 latency on low latency setting devices. Some boards require all cards have latency <= 32
    20. usbfs hangs on mount sometimes
    21. Loopback fs hangs
    22. Problems with ip autoconfig according to Zaitcev
    23. SMP affinity code creates multiple dirs with the same name
    24. TLB flush should use highest priority
    25. Set SMP affinity mask to actual cpu online mask (needed for some boards)
    26. pci_socket crash on unload
    27. Quota mount options are now broken

  5. To Do But Non Showstopper
    1. Make syncppp use new ppp code
    2. Finish 64bit vfs merges (lockf64 and friends missing)
    3. NCR5380 isnt smp safe
    4. DMFE is not SMP safe
    5. ACPI hangs on boot for some systems
    6. Go through as 2.4pre kicks in and figure what we should mark obsolete for the final 2.4
    7. Per Process rtsigio limit
    8. Fix SPX socket code
    9. Boot hangs on a range of Dell docking stations (Latitude)
    10. HFS is still broken
    11. iget abuse in knfsd
    12. Paride seems to need fixes for the block changes yet
    13. Some people report 2.3.x serial problems
    14. AIC7xxx doesnt work non PCI ?
    15. USB hangs on APM suspend on some machines
    16. PCMCIA crashes on unloading pci_socket
    17. DEFXX driver appears broken
    18. ISAPnP IRQ handling failing on SB1000 + resource handling bug
    19. TB Multisound driver hasnt been updated for new isa I/O totally.

  6. Compatibility Errors
    1. exec() returns wrong codes on a file not found

  7. Probably Post 2.4
    1. per super block write_super needs an async flag
    2. address_space needs a VM pressure/flush callback
    3. per file_op rw_kiovec
    4. enhanced disk statistics

  8. Drivers In 2.2 not 2.4
  9. To Check
    1. Truncate races (Debian apt shows it nicely) [done ? - all but Coda]
    2. Elevator and block handling queue change errors are all sorted
    3. Check O_APPEND atomicity bug fixing is complete
    4. Make sure all drivers return 1 from their __setup functions
    5. Protection on isize (sct) [Al Viro mostly done]
    6. Mikulas claims we need to fix the getblk/mark_buffer_uptodate thing for 2.3.x as well
    7. Network block device seems broken by block device changes
    8. Fbcon races
    9. Fix all remaining PCI code to use new resources and enable_Device
    10. VFS?VM - mmap/write deadlock (demo code seems to show lock is there)
    11. rw semaphores on page faults (mmap_sem)
    12. kiobuf separate lock functions/bounce/page_address fixes
    13. Fix routing by fwmark
    14. Some FB drivers check the A000 area and find it busy then bomb out
    15. rw semaphores on inodes to fix read/truncate races ? [Probably fixed]
    16. Not all device drivers are safe now the write inode lock isnt taken on write
    17. File locking needs checking for races
    18. Multiwrite IDE breaks on a disk error
    19. AFFS doesn't work on current page cache
    20. ACPI/APM suspend issue
    21. NFS bugs are fixed
    22. BusLogic crashes when you cat /proc/scsi/BusLogic/0
    23. Floppy last block cache flush error
    24. NFS causes dup kmem_create on reload
    25. Quota exceeded can cause bogus files on disk (-1 bytes long)

Alexander Viro replied to 1.14 (UMSDOS fixups resync), "Sorry. Fixed in incoming patches, but -pre6-3 check_pseudo_root() is b0rken." To 4.10 (Directory race fix for UFS), Alexander replied, "What? It's fixed _long_ ago." To 6.1 (exec() returns wrong codes on a file not found), Alexander reported that problem fixed as well. To 4.27 (Quota mount options are now broken), he asked for details, and Alan replied that that was also an out of date bug. There were a few other replies to Alan's list, but no discussion.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.