Kernel Traffic #2 For 21 Jan 1999

By Zack Brown

Introduction

Well, the first issue of Kernel Traffic seems to be a success. Thanks for all the letters and submissions.

Mailing List Stats For This Week

We looked at 1445 posts in 5525K.

There were 545 different contributors. 247 posted more than once. 226 posted last week too.

 

1. Partitions Under Linux
14 Jan 1999 - 16 Jan 1999 (11 posts) Archive Link: "Re: (un)corrupted ext2 partitions"
Topics: BSD: OpenBSD, FS
People: Theodore Y. Ts'o, Riley Williams, Andries Brouwer, Marc Espie, Arvind Sankar

Arvind Sankar posted a question that developed into a very interesting discussion. Basically he was creatively repartitioning his drive using extended partitions and noticed that linux seemed to care which partition the root directory was in. This didn't make sense to him, and he posted to the list.

At first there was some interesting technical explanation of what was going on, and Theodore Y. Ts'o made this informative statement: "One of the reasons why I don't like extended partitions is that they work essentially using a linked list scheme, where the first block in each extended partition is a partition table used to describe the rest of the partition. It's a very fragile data structure, and it's very easy to get corrupted."

If you don't know what a linked list is, it's essentially what he describes: each extended partition is an element of a list, and maintains a data structure pointing to the next partition (the next element of the list). Mess up one of the data structures, and you lose the data that would let you access the others. Most of the time, linked lists are a great and wonderful programming resource, but with something as fragile as a hard disk, you don't want pockets of data on which everything else depends.
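To make the fragility concrete, here is a minimal sketch of a chained layout. It is a toy model and not the real on-disk format: the sector numbers, the partition names, and the `walk_chain` helper are all made up for illustration. Each table describes one partition and points to the next table, so damaging one table hides everything after it.

```python
# Toy model of chained extended partitions (NOT the real on-disk format).
# Each "table" describes one logical partition and points to the next table.

def walk_chain(disk, first_table):
    """Follow the chain of partition tables, returning the partitions found."""
    found = []
    seen = set()
    sector = first_table
    while sector is not None and sector not in seen:
        seen.add(sector)
        table = disk.get(sector)
        if table is None:          # unreadable or corrupted table: chain ends here
            break
        found.append(table["partition"])
        sector = table["next"]
    return found

def make_disk():
    # Hypothetical sector numbers and partition names, chosen for illustration.
    return {
        100: {"partition": "hda5", "next": 200},
        200: {"partition": "hda6", "next": 300},
        300: {"partition": "hda7", "next": 400},
        400: {"partition": "hda8", "next": None},
    }

healthy = walk_chain(make_disk(), 100)

damaged_disk = make_disk()
del damaged_disk[200]              # simulate corruption of the second table
damaged = walk_chain(damaged_disk, 100)

print(healthy)   # all four logical partitions
print(damaged)   # only the partition described before the damaged table
```

In this model the data of hda6, hda7 and hda8 is untouched, yet all three become unreachable because a single link was lost.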

Riley Williams then pointed out, "An alternative way to handle extended partitions, which both MSDOS and Win9x recognise, although neither creates, is to use multiple entries in the base partition table for up to three valid partitions (four if they fill the entire extended partition) before allocating another extended partition to deal with the rest of them." Andries then added, "Also OS/2 and Linux recognise this. Linux fdisk and cfdisk will not create this; sfdisk will, if you ask for it." Could someone explain the origin of these methods of defining extended partitions?

Then the discussion turned toward how this situation could be improved under linux, by developing a new partitioning scheme that would not suffer from such fragility.

The objection was quickly made that using an already existing system such as BSD disklabel would be better than making up yet another partitioning system. But it was also pointed out that BSD's disklabel had a limit of 8 logical partitions, and that any alternative partitioning scheme would be incompatible with DOS/Windows, a step we may not be ready to take just yet.

It then came up that Marc Espie has already got a patch into the 2.2.0 pre series supporting BSD disklabel, but apparently there is some work left to do, and for now he has this recommendation: "If you want to play safe, all decent OSes documentations recommend to use their own fdisk-like program to edit the area of the disk they're concerned about: use linux's fdisk for linux partitions; use OpenBSD's fdisk for OpenBSD partitions."

Apparently everyone agrees that trying to write a single fdisk that correctly handles all different partition types would be a very difficult project.

 

2. Ensuring Unique Inodes In Microsoft's FAT Filesystem
16 Jan 1999 - 18 Jan 1999 (7 posts) Archive Link: "[RFC] What should we do with FAT inode numbers?"
Topics: FS: FAT, FS: NFS, FS: VFAT, FS: smbfs, Legacy Support, Microsoft
People: Horst von Brand, H. Peter Anvin, Colin Plumb, Linus Torvalds, Hal Duston, Alexander Viro, Helge Hafting

A very interesting discussion of a very difficult issue.

First of all, FAT stands for File Allocation Table. It is Microsoft's way of keeping track of files under DOS (VFAT is their extension of this for Windows9X, allowing long filenames). An inode is a data structure containing information about a file. Each file on a filesystem is supposed to have a unique inode number. According to the Free Online Dictionary of Computing (http://wombat.doc.ic.ac.uk/foldoc/index.html), an inode on a unix system contains "the device where the inode resides, locking information, mode and type of file, the number of links to the file, the owner's user and group ids, the number of bytes in the file, access and modification times, the time the inode itself was last modified and the addresses of the file's blocks on disk."

Obviously if two files share the same inode number, the security of each file's data is in jeopardy (to put it mildly).

(ed. -- Actually, Hal Duston emailed us a correction (and many thanks go to him for it): it is not that each file's data would be in jeopardy, but the two files would have to be identical, and any change in one would be reflected in the other as well. As he put it, "think hard links.")
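For anyone who wants to see what "think hard links" means in practice, here is a small sketch for a unix-like system. The file names are arbitrary, and it assumes the temporary directory lives on a filesystem that supports hard links. Two names that share one inode number are the same file, so a change made through one name shows up through the other.

```python
import os
import tempfile

# Create a file and a hard link to it in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    original = os.path.join(scratch, "original.txt")
    alias = os.path.join(scratch, "alias.txt")

    with open(original, "w") as f:
        f.write("first version\n")

    os.link(original, alias)       # a second name for the same inode

    same_inode = os.stat(original).st_ino == os.stat(alias).st_ino

    # Writing through one name is visible through the other.
    with open(alias, "a") as f:
        f.write("appended via alias\n")
    with open(original) as f:
        contents = f.read()

print(same_inode)
print(contents)
```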

The problem raised in linux-kernel is that in Microsoft's FAT filesystem there is no way to ensure a unique inode for each file. FAT filesystems use directory-entry locations as the inode number, the assumption being that each file has a unique entry in a directory, thus making that entry suitable for inode information. The flaw in this design, as pointed out by Alexander Viro, shows up with an unlinked-but-opened inode: once the directory entry is gone, its location can be handed to a new file, so the inode number is reused while the old inode is still in use. This could happen if you rename an open file, for instance.

To safeguard the integrity of FAT filesystems, linux must compensate for this behavior by using an intelligent method of assigning new inode numbers so no duplications occur. This formed the subject of the thread.

Horst von Brand made this suggestion: "How about using the number of the first cluster in the file as the inode number? As far as I understand, on FATish filesystems that is the way you get at the real file data, so it can't change under you, whatever else is going on."

This apparently would completely solve the problem, except for the fact that, as H. Peter Anvin said, "it doesn't work for zero-length files (no clusters allocated.)"
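Here is a rough sketch of why empty files spoil the scheme. It uses a toy directory rather than real FAT structures: the file names, slot numbers and helper functions are invented for illustration, and a first cluster of 0 stands in for "no clusters allocated".

```python
# Toy FAT-like directory: each entry has a slot position and a first cluster.
# A first cluster of 0 stands for "no clusters allocated" (an empty file).
# All names and numbers are made up for illustration.
directory = [
    {"name": "README.TXT", "slot": 0, "first_cluster": 12},
    {"name": "EMPTY1.TXT", "slot": 1, "first_cluster": 0},
    {"name": "KERNEL.IMG", "slot": 2, "first_cluster": 40},
    {"name": "EMPTY2.TXT", "slot": 3, "first_cluster": 0},
]

def inode_by_first_cluster(entry):
    """The suggested scheme: the first data cluster doubles as the inode number."""
    return entry["first_cluster"]

def find_collisions(entries, inode_of):
    """Group file names by inode number, keeping numbers shared by two or more files."""
    by_inode = {}
    for entry in entries:
        by_inode.setdefault(inode_of(entry), []).append(entry["name"])
    return {ino: names for ino, names in by_inode.items() if len(names) > 1}

collisions = find_collisions(directory, inode_by_first_cluster)
print(collisions)
```

Files with data get distinct numbers because no two files start in the same cluster, but every empty file ends up with the same number.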

Furthermore, Colin Plumb added in a different part of the thread, "There just *is* no good fix to the FAT inode number problem for zero-length files," although Helge Hafting came up with the bizarre solution of allocating a single cluster even for zero-length files ("A waste, but not a big one," he added).

Linus Torvalds responded with his own possible partial solution, adding that "the inode number doesn't have to be unique, [as long as] under normal circumstances you can never see any inode numbers that are the same."

It was pointed out that a server crash would be a case in which the direction Linus was leading wouldn't work, and Linus had the interesting reply that he didn't care if it broke in that circumstance. He added, by way of justification, "at that point it's a conceptual issue. We have other filesystems that will act the same (smbfs comes to mind), where the inode number simply isn't 'unique enough' to be used as a NFS server re-population mechanism. And they may break if/when a dentry is dropped for some other reason even without a crash."

Another way of putting it might be, if you have to sacrifice something to the volcano god, it might as well be something that's already in the volcano.

Who thought Linus' use of "dentry" above was a typo? Nope, it's a kernel construct. Jeremy Impson's Kernel Glossary (http://source.syr.edu/~jdimpson/linux/glossary/) says dentries allow "faster filename lookups, better SMP performance, and greater flexibility in programming and in writing new filesystem drivers." Unfortunately it doesn't say what a dentry actually is.

Cross Referencing Linux (http://lxr.linux.no/) (see left sidebar for other source tours) reveals that a dentry is a struct defined in /usr/src/linux/include/linux/dcache.h. The struct contains data associated with a directory, such as inode information, parent directory, and mount point. Unless I'm mistaken, dentries are simply the data structure linux uses to keep track of directories. Presumably, what Linus meant by a "dropped" dentry was (according to the comment for d_drop in dcache.h) the invalidation of a directory entry by removing it from the hash table, so VFS (Virtual File System) lookups wouldn't find it.

Would someone in the know care to confirm and elaborate on this?

 

3. Tracking Kernel Patches And Configurations
14 Jan 1999 - 16 Jan 1999 (52 posts) Archive Link: "[PATCH] show patch names"
Topics: Configuration, Version Control
People: Ingo Molnar, Oliver Xymoron, Linus Torvalds

This turned into one of the biggest discussions of the week.

Oliver Xymoron wrote a very short patch that helps people keep track of what patches are included in the kernels they compile. It got a little initial support, but is not likely to make it into the kernel. Ingo Molnar summed up his objections: "i think this a step in the wrong direction. i dont think it's good in the long run to support forked versions. forking is hard conceptually, and we should not pretend it's easy, developers should have a maximum incentive to merge code ..."

Oliver defended with, "every feature that Linus didn't put in himself is a fork, if only for a short time. Some things remain forked, in this weak sense, for much longer, and some of them get widespread use" and "The purpose here is not 'support forks' - they're already a fact of life. The idea is to support support, and make it easy to recognize when these patches are in play."

The discussion itself quickly forked to a related issue: /proc/config, which would allow users to see how their kernel was configured, by catting a file in the /proc/ directory. This might be useful if you're using a kernel you compiled long ago, and you find that while it works, you can't seem to compile any new kernels that work as well. If only you could remember how you configured that working kernel, you could configure the new kernel the same way. But without /proc/config (or a backup of your old config file) you can't do it.
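If you did keep a backup of the old config file, comparing it with the new one is simple enough to sketch. The snippet below assumes the usual .config line shapes ("CONFIG_X=y" and "# CONFIG_X is not set"). The helper names and the sample options are only for illustration, and treating a missing option as "n" is a simplifying assumption.

```python
NOT_SET = " is not set"

def parse_config(text):
    """Turn .config-style text into a {option: value} dict."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(NOT_SET):
            options[line[2:-len(NOT_SET)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options

def diff_configs(old, new):
    """Return {option: (old_value, new_value)} for options that differ."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name, "n"), new.get(name, "n")  # missing means "n" here
        if before != after:
            changed[name] = (before, after)
    return changed

# Sample configs, invented for the example.
old_text = """
CONFIG_SMP=y
CONFIG_EXT2_FS=y
# CONFIG_VFAT_FS is not set
"""
new_text = """
# CONFIG_SMP is not set
CONFIG_EXT2_FS=y
CONFIG_VFAT_FS=m
"""

changes = diff_configs(parse_config(old_text), parse_config(new_text))
print(changes)
```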

These types of issues (making config available and maintaining patch lists) are related, and are always coming and going on the kernel mailing list. People write a patch to accomplish one or the other of the issues, it receives some support, and then Linus Torvalds says he won't do it because it bloats the kernel and people should keep track of their patches and back up their config files on their own.

I would guess Linus' general sense of these types of issues is that they are frivolous, and a mere convenience for people who lack the skill to keep track of their systems. Of course, none of that helps the people who lack those skills...

 

4. Future Press Release For 2.2
14 Jan 1999 - 15 Jan 1999 (33 posts) Archive Link: "draft for review - press release"
Topics: History, Licencing, Project Publicity
People: Linus Torvalds

Greg Smart did some work on a press release for the not-yet-released stable version 2.2 and posted it on linux-kernel for comments. This turned into one of the biggest threads of the week, with many people going over technical details and suggesting corrections and rewordings. Unfortunately by Kernel Traffic press time the discussion seems to have degenerated and risks becoming a flame war, and the document itself seems to be trying to imitate some of the worst practices of the commercial software industry.

In spite of this, some interesting things were said, and the press release itself is interesting for the open source nature of its composition. Like linux, there really is no identifiable author. It would be exciting to see a larger writing project, perhaps a manual or even a novel, developed with that model. The difficulty, of course, would be that one can see if a piece of software works by running it, while significant portions of a document must actually be read in order to get a decent impression of it.

A nice tidbit that came out of the first days of the discussion is that the earliest versions of the linux kernel were under a different licence. Instead of the GPL they were released with a "no commercial use" prohibition. Things would have been mighty different today if Linus Torvalds had stayed with that licence.

 

5. Ancient Bug Found And Hammered
15 Jan 1999 - 16 Jan 1999 (8 posts) Archive Link: "Is there something wrong here?"
Topics: Development Philosophy
People: Simon Kirby, Linus Torvalds

A short but exciting thread...

Simon Kirby wrote, "There has always been something that has felt wrong for me with regards to how Linux blocks processes when something is being written to disk...I'm going to attempt to describe it with an example I just saw of it on our mail server."

Linus Torvalds responded, saying that Simon had uncovered a bug which "must have been there since day one. I'm almost afraid of looking at linux-0.01 because I suspect it will be there too."

He then included a patch which he felt might fix the problem, "but you'd better be aware," he added, "that the above function is the 'heart' of the Linux buffer management, and changing it is not to be done lightly."

So if you have noticed your machine being unresponsive while writing out data, you can look forward to that slowdown being gone in 2.2.0.

No embarrassment, no frantic attempts to contain the situation and prevent press leaks, no assertions that the machine is supposed to lock while writing. Merely a straightforward recognition of a really tough bug, and the opening out of that recognition to an unlimited number of people who might help. The 'heart' of open source.

Would someone care to explain the technical issues involved in this bug and why it was so hard to track, and/or find out which version introduced it?

 

6. Dangers Of Hotplugging Mouse Or Keyboard
14 Jan 1999 - 15 Jan 1999 (15 posts) Archive Link: "Re: ISSUE: psaux does not load in 2.2.0pre6"
Topics: Hot-Plugging, Modules
People: Charles Cazabon, David Lang

Someone had a problem because the psaux mouse code was merged with the kbd code, and so can no longer be compiled as a module.

An interesting tidbit came out of the discussion. Charles Cazabon said, "on a lot of hardware, inserting or removing a PS/2 mouse or keyboard while powered-on can fry the keyboard controller chip on the motherboard. It will work 99 times out of 100, but every so often the hardware will blow up. I've seen over a dozen machines fried this way over the years."

David Lang added, "when I was working hardware repair a few years ago the shop I was in had 5-10 machines a month (within the first year warranty) that would show up with the complaint 'keyboard/mouse will not work'. The usual fix was to replace the fuse for the keyboard (soldered onto the MLB). hot plugging your keyboard/mouse can cost you!"

 

7. Scheduling Discussion
14 Jan 1999 - 19 Jan 1999 (24 posts) Archive Link: "[uPATCH] SMP scheduling fix (?)"
Topics: Networking, Process Scheduling, SMP, Security
People: Rik van Riel, Neil Conway, Robert Hyatt, Pavel Machek, Ingo Molnar

SMP stands for "Symmetric MultiProcessing", and deals with computers that have more than one CPU.

Rik van Riel started this thread with a patch submission aimed at addressing this problem: "I (and other people? ) have noted that, on SMP systems, niced tasks get more CPU than they deserve. In fact, you often see that 2 non-niced tasks have to share a CPU while 2 niced tasks get the other CPU..."

Neil Conway pointed out (and Rik replied that he had noticed this right after posting the patch), "This patch is flawed. If we use it, it'll reintroduce the crappy interactive response when interactive jobs compete with non-niced CPU-hogs," a problem Neil himself (as he pointed out) had recently fixed.

Robert Hyatt entered the discussion with a feature request: "I'd _really_ like to have a nice value that says 'don't run unless you are twiddling your thumbs'. Ie a nice 20 perhaps, that says I don't want this to run unless there is _nothing_ else to schedule. But I'd also like to be able to pick a nice value that means 'something'. IE on the old LTSS system running on Crays out at Lawrence Livermore, we used a fractional priority system. If you ran at 1.000, and I ran at 2.000, I simply got two times as much cpu as you did. (I got charged twice as much, too). The nice thing was that I could figure out what priority to use to kind of 'order' how things would finish."
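The arithmetic behind Robert's fractional priorities can be sketched in a few lines. This is only an illustration of the idea as he describes it, not how LTSS or the linux scheduler actually compute anything, and the `cpu_shares` helper and task names are made up.

```python
def cpu_shares(priorities):
    """Split the CPU in proportion to each task's priority value."""
    total = sum(priorities.values())
    return {task: value / total for task, value in priorities.items()}

# "If you ran at 1.000, and I ran at 2.000, I simply got two times as much cpu."
shares = cpu_shares({"you": 1.0, "me": 2.0})
ratio = shares["me"] / shares["you"]

print(shares)
print(ratio)
```

With these two tasks, "me" ends up with two thirds of the CPU and "you" with one third.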

Pavel Machek objected that this was impossible under linux because it creates a security hole that would allow a denial-of-service (DoS) attack. A DoS attack is when an attack on your system prevents you from doing something, but doesn't actually read, write, or delete anything of yours. An example of this (which came up in a different thread this week) might be an attack that renders your ppp connection useless, so you have to disconnect and reconnect if you want to regain internet access.

The DoS attack Pavel identified was based on the fact that Robert's nice-value of 20 would cause the program to wait forever if any other program was running at the same time. As Pavel put it, "make your nice 20 task hold some lock, and then run 'normal' task so that nice 20 task will never ever get cpu again. BOom, you are holding lock and are not going to release it."
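Pavel's scenario can be played out with a toy simulation. It assumes a strict rule that the idle-only task runs only on ticks where nothing else is runnable; the task names, tick counts and the `simulate` function are hypothetical and have nothing to do with the real scheduler code.

```python
def simulate(ticks, normal_task_runnable):
    """Return who holds the lock after `ticks` under strict idle-only scheduling."""
    lock_holder = "idle_task"   # the idle-priority task grabbed a lock earlier
    idle_task_work_left = 3     # ticks it still needs before releasing the lock
    for _ in range(ticks):
        if normal_task_runnable:
            continue            # a normal task always wins the CPU
        idle_task_work_left -= 1
        if idle_task_work_left == 0:
            lock_holder = None  # lock finally released
            break
    return lock_holder

# With a CPU hog always runnable, the lock is never released.
starved = simulate(ticks=1000, normal_task_runnable=True)

# On an otherwise idle machine, the task finishes and lets go.
released = simulate(ticks=1000, normal_task_runnable=False)

print(starved)
print(released)
```

Anything else waiting on that lock is stuck for as long as the normal task keeps running, which is the denial of service.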

Meanwhile, Rik apparently has an old patch (2.1.132) that implements the 'don't run unless you're twiddling your thumbs' nice value, but says it can trigger a "rare race condition" (maybe the same DoS Pavel noticed), so he's not going to work on it until 2.2 stabilizes.

Massimiliano Ghilardi (Max) also has an old patch (2.0.36) that does the same thing, and may port it into current kernels.

At this point, Neil Conway chimed in with an apparently simple point about how much CPU time a process gets depending on its nice value, but it turned into a big noise when he found that the actual situation (at least with 2.0.35) didn't seem to jibe with what the source claimed to do.

This sparked a one-day bug-hunting expedition. Ingo Molnar felt that the problem was merely with the source documentation, and issued a patch to correct a comment in the code. But whether the code should behave the way it does is still in question. There is also the question of whether the problem even matters, since it is only a distinction between whether a niced process gets 5% or 9% of the CPU.

Meanwhile Robert, responding to Neil's discovery, clarified his original feature request. What he wanted, as he put it, was:

  1. A way to 'scale' cpu activity so that I can say these two processes should run at a ratio of 1:2. Or whatever.
  2. A way to say "do not run this process unless there is _nothing_ left to run" (ie an idle-only task).
  3. A way to say "run this process, period, except when it is blocked." IE a process with more-or-less 'infinite' priority.
  4. A way to say "give this process 1/4 of the cpu, at least, no matter *what else* is running."

No one replied to this directly, and the latest additions to the thread continue to debate Pavel's DoS attack. Apparently not everyone is convinced that it really does render Robert's original request "impossible", as Pavel initially suggested.

 

8. How To Lock Up 2.2.0pre7
15 Jan 1999 - 17 Jan 1999 (13 posts) Archive Link: "A simple way to lock up 2.2.0-pre7"
Topics: Security
People: Mike Galbraith, Alan Cox, Linus Torvalds

Cezary Sliwa had the first post on a thread that ended up having a wild finish.

According to him, giving the command "while true; do (cat /dev/tty > /dev/null &); done" at the console of 2.2.0pre7 can lock the machine if you try to ctrl-c out of it. At first, some people had trouble reproducing the lock-up, but Mike Galbraith (who turned out to be the hero of this thread) managed to reproduce it and posted some call trace information.

Linus Torvalds objected that the call trace looked as though it must have come from a heavily patched kernel, and was inclined to disregard the whole problem on that basis--especially since he couldn't reproduce it himself.

Mike tried again with a clean kernel, reproduced the lock-up and provided a fresh call trace. Meanwhile there were some other reports from people able to reproduce the bug.

Then Mike reported on some creative hacking that (he hoped) would help isolate the problem. He included a patch, with this comment: "I took a wild guess (intuitive reasoning:) and did the following to read_chan in n_tty.c. I don't _even_ think that this is correct mind you, but it might serve to illuminate the little beastie. With this in, I can no longer induce the error, and the tty still works."

Alan Cox and Linus immediately agreed that the patch was a dead-on fix.

Mike replied, "Hmm.. musta flinched, I was only aiming to wing the critter."

And rode off into the sunset. Yee haw!

 

 

 

 

 

 

We Hope You Enjoy Kernel Traffic
 

Kernel Traffic is hosted by the generous folks at Tux.Org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License, version 2.0.