Kernel Traffic

Kernel Traffic #7 For 24 Feb 1999

By Zack Brown

Mailing List Stats For This Week

We looked at 1120 posts in 4378K.

There were 457 different contributors. 205 posted more than once. 154 posted last week too.

The top posters of the week were:

1. More Spam From

18 Feb 1999 (7 posts) Archive Link: "I found a misspelled word on your website..."

Topics: Mailing List Administration, Sound: ALSA, Spam

People: Chris Johnson, Alain Toussaint, Shaw Terwilliger, Matt Busigin, Henrik Olsen


A post under the name of Chris Johnson (and with no return address) hit linux-kernel this week:

I happened upon your web site and noticed that "receive" is misspelled (you have it as "recieve"). The URL of the page is:

Sorry for making such a big deal of a simple misspelling; my therapist says I have an obsessive-compulsive personality<g>. But these web pages seem to live forever and this could be embarrassing.

When I find something like this on my website, I use a free utility that helps me correct typos and make quick edits right in my browser. It's called JustEdit and you can get a free copy from

Alain Toussaint replied, "wonder if it's not a troll,i received the exact same e-mail on the berlin list,as indicated in the second paragraph,i suspect he's both obsessive and the developpers of that particular apps (want to get some pub),i think there's better way to advertise a software."

Shaw Terwilliger added, "This guy is out to spam every mailing list in existance to sell his (shareware?) software. He hit the ALSA lists this morning, with the same "misspelling" but with a different mechanically generated URL. It seems he's harvesting list archives and sending messages to the list addresses."

Matt Busigin said, "Its an obvious bot. It probably recursively searches webpages and finds fairly close spelling errors, and most likely with some sort of percentage chance, finds an email address on the page, and mails a randomly picked message like the above."

Henrik Olsen analyzed the headers and said:

Another * originated spam, currently it seems like 60% of the spam I see is posted from these guys.

I've never gotten any response from them either, not even when I was trying to track down who it was who had cracked one of our servers. I managed to get the connecting ip by a combination luck in being 5 minutes from a connected machine when named died letting him in and good monitoring software paging me because of that death, but that didn't help since noone answered my mail.

My personal reaction with uunet has been to mail them, then block mail from that subdomain once they don't answer.

2. The Bubble Bursts For BIOS Backward Compatibility

20 Feb 1999 - 23 Feb 1999 (11 posts) Archive Link: "New new enhanced memory detection patch; testers wanted"

Topics: Backward Compatibility

People: David Parsons, H. Peter Anvin

Would someone in the know care to comment on this in more detail?

David Parsons wrote in, looking for testers. He said, "The memory detect scheme I wrote for Linux 1.2.13 and 2.0.x has become obsolete. Since I wrote it, many many bioses have stopped supporting int 15h, 0e801h for extended memory detection and instead gone over to int 15h, 0e820h. So, I've tweaked my memory detect patch to use 0e820 as well as 0e801 and 088." He added that the patch was only good for 2.0.x, but he'd produce one for 2.2.x on demand.

A stunned H. Peter Anvin replied, "Which BIOSes would that be?? That seems quite odd -- you don't just discontinue something like that."

David answered, "Well, I know that the bios on the HP Omnibook 5700 doesn't support e801 (it says it does, but e801 only returns 10mb on my 128mb home machine) and that the bios on the FIC 601(V?) motherboard doesn't either." He went on to explain, "apparently e820 is the recommended Windows way of detecting memory, because you can map around memory holes with it. And since nobody in the computing world gives a damn about backwards compatability unless it's forced on them at gunpoint, e801 becomes simply an inconvenient entry point which you can either ignore or return garbage in."
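A rough illustration of why a region map helps: the sketch below assumes a hypothetical list of (base, length, type) entries in the style of what int 15h, 0e820h reports, where type 1 marks usable RAM. The sample map, the probe callables and the function names are all invented for illustration; real BIOS calls cannot be made from Python, and this is not David's patch.

```python
from typing import Callable, List, Optional, Tuple

MIB = 1024 * 1024
E820_RAM = 1        # entry type for usable RAM
E820_RESERVED = 2   # entry type for reserved ranges (holes)

Region = Tuple[int, int, int]  # (base address, length in bytes, type)

# Hypothetical map for illustration only; not taken from any real BIOS.
SAMPLE_MAP: List[Region] = [
    (0x00000000, 0x9FC00, E820_RAM),        # conventional memory
    (0x0009FC00, 0x00400, E820_RESERVED),
    (0x00100000, 14 * MIB, E820_RAM),       # 1 MiB up to 15 MiB
    (0x00F00000, 1 * MIB, E820_RESERVED),   # a hole at 15-16 MiB
    (0x01000000, 112 * MIB, E820_RAM),      # memory above the hole
]


def usable_bytes(regions: List[Region]) -> int:
    """Sum only the usable RAM entries, skipping reserved holes."""
    return sum(length for _base, length, kind in regions if kind == E820_RAM)


Probe = Callable[[], Optional[int]]


def detect_memory(probes: List[Tuple[str, Probe]]) -> Tuple[str, int]:
    """Try each probe in order; return (name, bytes) for the first that answers."""
    for name, probe in probes:
        result = probe()
        if result:
            return name, result
    raise RuntimeError("no probe reported any memory")
```

Calling detect_memory with stand-in probes ordered e820, e801, 88 mirrors the fallback order David describes. A probe that answers with a wrong figure, as he reports for e801 on the Omnibook, cannot be caught by this kind of fallback, which is one reason to try e820 first.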


3. A Sad Rejection

18 Feb 1999 - 20 Feb 1999 (6 posts) Archive Link: "[patch] required signals"

People: Albert D. Cahalan, Alan Cox, Linus Torvalds

Albert D. Cahalan submitted a patch that got rejected. He said, "four Linux ports violate standards: i386, ppc, m68k, and arm. We must have SIGSYS to meet the Single UNIX specification v2. Those four ports are also the only ones that lack SIGEMT," and added, "this patch documents the non-standard signals and, when possible, defines non-standard signals in terms of standard signals rather than the reverse."

Alan Cox said, "this isnt a kernel issue at all, " and went on, "and it makes no odds - glibc doesnt even use the kernel header file for signal numbers."

Albert disagreed that it wasn't a kernel issue, saying, "the kernel has code that is signal-specific, so it must define the numbering. The kernel itself uses the numbers in many places. Disagreement over allocation would be terrible." He added that "apparently glibc follows the kernel numbering -- a wise choice. (perhaps it _is_ the kernel numbering, after a trip through perl)"

Linus Torvalds put in, "there is no disagreement: the kernel specifies the numbers, and user space lives with them," and added, "the numbers the kernel doesn't care about might as well be defined by the library."

Albert insisted, "unfortunately the kernel _does_ care about them. Every signal has a default behavior determined by the kernel, and that behavior matters." He finished with, "so you don't think the patch is important... but is is bad? Proper default behavior can not be defined without defining what each signal maps to."

End of thread.

4. An Attempted Fix

22 Feb 1999 - 24 Feb 1999 (5 posts) Archive Link: "[RFC] inode generation numbers."

Topics: FS: NFS

People: G. Allen Morris III, Linus Torvalds

G. Allen Morris III wanted to add a field to 'struct inode' because "NFS is in need of an inode generation number for correctness and possibly some added security," and asked if there would be any adverse effects. Linus Torvalds replied, "In the long run, no. But I'm not accepting test-patches with it, simply because it breaks module compatibility, so I'd like to have a really strong release candidate before I do that. And I'd prefer to try to avoid it for 2.2.x at all."
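To see what a generation number buys an NFS-like server, here is a toy model, assuming a file handle is just the pair (inode number, generation). ToyFS, StaleHandle and the method names are hypothetical; this is not the proposed kernel patch, only a sketch of the idea that a recycled inode number must not satisfy an old handle.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Handle = Tuple[int, int]  # (inode number, generation)


class StaleHandle(Exception):
    """Raised when a handle refers to an inode that has since been reused."""


@dataclass
class Inode:
    number: int
    generation: int
    data: str


class ToyFS:
    def __init__(self) -> None:
        self._inodes: Dict[int, Inode] = {}
        self._last_generation: Dict[int, int] = {}

    def create(self, number: int, data: str) -> Handle:
        # Each reuse of an inode number bumps its generation.
        generation = self._last_generation.get(number, 0) + 1
        self._last_generation[number] = generation
        self._inodes[number] = Inode(number, generation, data)
        return (number, generation)

    def unlink(self, number: int) -> None:
        del self._inodes[number]

    def read(self, handle: Handle) -> str:
        number, generation = handle
        inode = self._inodes.get(number)
        if inode is None or inode.generation != generation:
            raise StaleHandle(handle)
        return inode.data
```

Without the generation check, a client holding a handle to a deleted file would silently read whatever new file happened to be given the same inode number.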

5. A Double Fix

18 Feb 1999 - 19 Feb 1999 (2 posts) Archive Link: "[Patch] Couple of races on MIPS."

People: Alexander Viro

This was a two-post thread in which Alexander Viro posted a patch to fix some exploitable races. Ralf responded that he had already fixed the problem. End of thread.

6. fsync(); syslogd; Ext2 Extensions; Linus Chastised

12 Feb 1999 - 24 Feb 1999 (135 posts) Archive Link: "fsync on large files"

Topics: BSD, Disks: SCSI, FS: FAT32, FS: ext2, FS: ext3, Raw IO, Scheduler, Security

People: Alan Curry, Peter Svensson, Simon Kirby, Riley Williams, Oliver Xymoron, Zygo Blaxell, Theodore Y. Ts'o, Andi Kleen, Linus Torvalds, Alan Cox, Alexander Viro, Stephen C. Tweedie, Chris Wedgwood

Intro: The Problem

This thread was discussed last week in the Linux Weekly News, and KT gave a brief quote from Linus Torvalds. This week the discussion continued, and so far it takes up over 130 posts, although it looks just about dead by now. It all started when Alan Curry figured out that "syslogd's fsyncs have been the cause of some major performance problems on an ISP's central server. Load averages have been unreasonably high and gradually getting worse. Today we realized that there is a weekly cycle to it, and it matches the cycle of the log rotation of /var/log/messages. As this log file grows (currently 36 megs with 2 days left before rotation) and beyond, with syslogd fsync'ing every line it writes, syslogd hangs for long periods of time, and when syslogd is hung, lots of other stuff hangs."

Part 1: Some Suggestions

There followed a five-day debugging subthread. There were some suggestions that Alan Curry rotate the log files daily instead of weekly, but Peter Svensson pointed out, "I suppose this performance problem may show up in other places too where this simple workaround is not possible. Is there any non-negotiable deep reason why fsync on large fils must be slow? Once 2.2 has stabilazed I suppose performance tuning in the 2.2 and 2.3 trees will be interesting to many developers."

Simon Kirby said, "There's no way in hell I'd run syslogd with fsync() enabled on a machine that logs anything more than 1 line per minute. Disable fsync()ing in syslogd by prepending "-" to each log file you want to turn it off for." But Alan Curry replied, "I know that, and I've already done that as a temporary solution. But syslogd syncs every line written for a good reason, namely that if the machine crashes you don't want to lose the last few lines that were logged. They are the most likely place to look for suspicious happenings. So I want a proper fix, not a "disable the safety feature" kludge."

There were several replies to this. Riley Williams suggested, "it sounds like what's needed is a facility to sync just a particular file, combined with more frequent rotation of the logs..."

Oliver Xymoron suggested, "A _real_ fix, which also deals with log security issues nicely, is to log across a serial port to a secure (possibly non-networked) host."

Simon suggested, "I don't see disabling fsync() as a kludge, though. Perhaps you should try remote logging and disabling fsync() on the remote machine? If the logging server goes down for whatever reason, you can enable fsync() again if you want to see if you can get a better trace of it (it could just be spat to the console anyhow), but if any other server goes down it should be logged there."

Simon's post also grew a number of branches.
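For readers who want to see the two logging modes side by side, here is a minimal sketch, assuming a hypothetical helper called append_lines with a sync_each_line flag that plays the role of the "-" prefix Simon mentions (flag off corresponds to the prefixed entry). It is not syslogd's code; it only shows what "fsync after every line" means in terms of system calls.

```python
import os


def append_lines(path: str, lines: list, sync_each_line: bool) -> None:
    """Append lines to path, optionally forcing each one to disk."""
    with open(path, "a", encoding="utf-8") as log:
        for line in lines:
            log.write(line + "\n")
            if sync_each_line:
                log.flush()             # hand Python's buffer to the kernel
                os.fsync(log.fileno())  # ask the kernel to write it to disk
```

With the flag on, every line costs a trip to the disk, which is the behavior Alan Curry wants to keep and Simon wants to turn off. With it off, the kernel is free to batch the writes, at the price of possibly losing the last few lines in a crash.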

Part 2: The Original Poster's Solution And The End Of That Branch Of The Thread

Alan Curry replied to Simon's suggestion of remote logging while disabling fsync(), "Someone wiser than me puts a safety feature into a critical daemon, and I take it out because it's inconvenient? That's a kludge." He added, "here's the really weird part: we already have a secondary logging machine, which is by all measurements less powerful than the main machine, but it handles all the same log data, with the same syslogd configuration, with hardly any cpu load at all," and went on, "here's what I think the real problem is. On the main machine, there are lots of things trying to talk to syslogd. But syslogd can't accept their connections because it is contantly fsync'ing, so the other daemons block trying to connect." He continued, "Then when an fsync finishes, syslogd accepts another connection, and 40 processes suddenly become runnable, causing the scheduler to freak out and eat a bunch of cpu time that does not show up as used by any particular process. On the other machine, syslogd is the only process that ever runs, so there isn't a scheduling problem. The fsync'ing is not a killer by itself."

He finished with this bit of advice to himself, and that was the last we heard from him on the subject: "I think I'll just start trusting the log server to do its job, and leave the syncing turned off on the main box. Anyone who doesn't have a dedicated log server, I guess the lesson is rotate your logs before they get too big."

Zygo Blaxell responded to Alan Curry's "Someone wiser than me..." comment, with the following historical/technical explanation (quoted in full):

It's only a kludge if you assume that the people who wrote syslogd are in fact smarter than you. I have reason to disagree.

The people who wrote syslogd with that "safety" feature were university grad students who have consistently demonstrated a lack of understanding of real-world computer issues including performance, security, and administration over the last decade. Their coding was atrocious and they failed to do some basic research on the API's they were using in a few notorious cases.

In fact, I think the wiser people are the ones who implemented the feature to _disable_ the fsync() mode, which actually occurred after the source code had been bouncing around the Linux world for a few years.

I've found syslogd in fsync() mode next to useless for crash analysis. For me Linux systems crash three ways:

  1. They crash over a period of several hours, days, weeks, or months. I've had Linux systems that ran for six months after part of the networking subsystem died a horrible death.
  2. They crash by going into a kernel panic instantly, and sometimes crash before even that.
  3. The SCSI subsystem (or a device on it) fails. Really. This happens to me more often than power failures.

In case #1, there's plenty of time for the normal filesystem to sync.

In case #2, there's not enough time for syslogd to record the messages-- the kernel stops running any processes at all, so syslogd is irrelevant. The serial console doesn't have this problem. If you want to log crash messages, use that.

In case #3, sync() on syslogd only exacerbates the problem (that is, although it doesn't cause the problem in and of itself, it provides extra opportunities for the problem to occur). Quiz: What's the worst thing you can do if one of your SCSI disks has bad firmware and can't handle the load? Send it _more_ I/O requests.

The cost in I/O performance for the unlikely event that syslog without fsync() will lose a message that syslog with fsync() would _not_ lose (and there's a window of only a few seconds here) is incredibly high. There are cases where fsync() still won't help you: it's likely that anything that can _deliberately_ bring down your system can be turned into something that can destroy the log files too.

If you have really an entire disk spindle to spend on syslog messages, put that spindle into a separate dedicated log server (on a separate UPS) and store your log messages there.

Part 3: Ted Ts'o Submits A Patch

Meanwhile, Theodore Y. Ts'o responded to Alan Curry's original post with a patch he'd written, intended to deal with syslog calling fsync() all the time: "What the patch does is keep an array of the last four blocks which were modified since the last fsync(). If there have been no more than four blocks modified, then the ext2 filesystem can do a "fast fsync", which just flushes those four (or fewer) blocks to disk, without having to walk all of the indirect blocks looking for modified blocks. This should be extremely effective for programs like syslog which are doing frequent fsync()'s with minimal amounts of data written between calls to fsync()."
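The idea Ted describes can be modelled in a few lines. This is a toy sketch, assuming a hypothetical ToyFile class that remembers up to four block numbers modified since the last fsync(); the "slow" path merely stands in for walking the indirect blocks, and none of this is taken from the actual ext2 patch.

```python
class ToyFile:
    FAST_SLOTS = 4  # size of the "recently modified" array in Ted's description

    def __init__(self) -> None:
        self.dirty = set()       # every dirty block (what a full walk would find)
        self.recent = []         # at most FAST_SLOTS blocks since the last fsync
        self.overflowed = False  # True once more than FAST_SLOTS blocks are dirty

    def write_block(self, block: int) -> None:
        self.dirty.add(block)
        if block in self.recent:
            return
        if len(self.recent) < self.FAST_SLOTS:
            self.recent.append(block)
        else:
            self.overflowed = True

    def fsync(self):
        """Return (path taken, blocks flushed) and reset the tracking state."""
        if self.overflowed:
            result = ("slow", sorted(self.dirty))  # stand-in for the full walk
        else:
            result = ("fast", list(self.recent))
        self.dirty.clear()
        self.recent.clear()
        self.overflowed = False
        return result
```

A syslog-style workload that touches one or two blocks between calls always takes the fast path here; a program that dirties five or more blocks falls back to the slow one.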

Andi Kleen also responded with, "this is a very important patch, because fsync() is needed in some transaction oriented databases (they have to call fsync() or fdatasync() on the log frequently to commit operations). For some of them 4 blocks could be not enough though, perhaps a more generic version of your patch that handles more than 4 blocks could be found - like keeping the dirty blocks per inode on a special list."

Linus said he'd prefer a patch that "just made the dirty lists a per-inode thing." Ted came up with an alternative that he felt fit Linus' idea: "Most files never have fsync() called upon them, so a way we can make things efficient is to only create the dirty list after the first call to fsync() on an inode." Linus didn't like it, and said, "I'd much rather just always add it to the "inode dirty list"," adding, "It's just a few pointer operations, and the advantage of having a clean and simple design probably results in better performance anyway."

But Ted objected, "This works for ext2, but it doesn't work with filesystems (most notably FAT filesystems, and possibly some B-tree based filesystems) where metadata might be shared by multiple files, so a block might have to be on multiple inode dirty lists."

Alan Cox suggested, "As a percentage the number of 'shared blocks' is low. So just keep a single 'might be shared list' and write that list out too," and Alexander Viro put in, "That really starts to resemble *BSD buffer cache architecture - they just index buffer cache by inode + offset instead of device + offset, assign negative offsets to metadata (on normal filesystems) and for FAT they keep metadata assigned to the device itself."

Meanwhile Linus also replied to Ted, continuing his push for simplicity. At one point he said, "A filesystem that has lots of shared dirty blocks is going to have a slow and complex fsync(), but I think it's fairly simple to just consider the exclusive blocks a separate issue, and then just accept the fact that there are shared data structures that have to be maintained separately by the filesystem," and added, "just having an exclusive dirty list would make 99% of all fsync() issues just go away. The remaining 1% is not likely to be a real problem, imho, and is very obviously not something that the VFS layer can really help with anyway."
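The combination of Linus' per-inode dirty list and Alan Cox's single "might be shared" list can be sketched as follows. DirtyTracker and its methods are hypothetical names, and the model ignores ordering, locking and everything else a real buffer cache has to worry about.

```python
from collections import defaultdict


class DirtyTracker:
    def __init__(self) -> None:
        self.per_inode = defaultdict(list)  # inode -> blocks owned by that inode only
        self.maybe_shared = []              # blocks whose metadata may serve several files

    def mark_dirty(self, block: int, inode=None) -> None:
        """Record a dirty block; inode=None means the block may be shared."""
        target = self.maybe_shared if inode is None else self.per_inode[inode]
        if block not in target:
            target.append(block)

    def fsync(self, inode: int) -> list:
        """Flush the inode's own blocks, then the whole shared list."""
        flushed = self.per_inode.pop(inode, []) + self.maybe_shared
        self.maybe_shared = []
        return flushed
```

On an ext2-like filesystem the shared list would stay nearly empty, so fsync() touches only the inode's own blocks. On a FAT-like filesystem, every fsync() also pays for whatever shared metadata is dirty, which is the "slow and complex" case Linus is willing to accept.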

Part 4: Linus Puts His Foot Down And Gets Chastised

In response to this, Stephen C. Tweedie and Alexander Viro had a four-post, one-day staircase in which it came out that Stephen was extending ext2.

Linus came out with, "Stephen, I've told you before: I will not accept these kinds of extensions to ext2. Make a new filesystem, and if you want, re-use the code (and the layout) of ext2." He went on, "There's not a chance in hell that I will ever release a kernel with these kinds of major fs modifications - call it "ext3" and after a year or so of in-production use we can drop ext2."

Alan Cox rebuked Linus, with, "Linus before sounding off why not have a look at the patches concerned. Its bad when you go around pronouncing on things before looking hard," and added, "And funnily enough the instructions for applying his patch start "copy fs/ext2 to fs/ext3 then...""

Linus shot back, "Good. Then don't go around calling it ext2 any more. I don't want to have people even _wondering_ about the stability of the central Linux filesystem." He went on, "I would also suggest that Stephen actually drop ext2 altogether. There's just too much historical stuff in most filesystems - things like having "." and ".." in directories, even though Linux doesn't need them and they only complicate renaming and loopback mounting a lot. There's also a lot of code to handle concurrent writes etc, which can't happen any more," and added, "This is why I'm so upset at even the notion of extending ext2 - not only do I dislike the fact that Stephen was going to do it in-place (and I'm happy to hear he no longer considers that), I think that if people are doing a new filesystem, it should be done like "ext2" was originally done: by designing a new one, rather than building more scaffolding on top of an old one."

Stephen came back with, "it _isn't_ called ext2. It is called ext3." He added, "there is an _urgent_ need for a journaled filesystem for Linux, and there is an urgent need for a solution which is of production quality. Implementing a new filesystem from scratch is hardly the way to achieve that." He continued, "please check what I'm actually trying to do. I am explicitly _not_ designing a new filesystem," and, "The ext3 code is nothing more than a set of calls to demarkate the start and end of complete, consistent filesystem operations."
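Stephen's phrase about calls that mark the start and end of a complete, consistent operation can be pictured with a toy model. The sketch assumes a hypothetical ToyJournal whose operation() context manager stages block writes and applies them all at once; a real journal writes to an on-disk log and replays it after a crash, and nothing here is taken from the ext3 code.

```python
from contextlib import contextmanager


class ToyJournal:
    def __init__(self) -> None:
        self.disk = {}        # block number -> contents
        self._pending = None  # writes staged by the operation in progress

    @contextmanager
    def operation(self):
        """Bracket a group of writes that must reach disk together or not at all."""
        self._pending = {}
        try:
            yield self
        except Exception:
            self._pending = None  # incomplete operation: nothing reaches disk
            raise
        else:
            self.disk.update(self._pending)  # complete operation: apply everything
            self._pending = None

    def write(self, block: int, contents: str) -> None:
        if self._pending is None:
            raise RuntimeError("write outside of an operation")
        self._pending[block] = contents
```

The existing filesystem code only has to be wrapped in the start and end calls, which is roughly why Stephen argues he is not designing a new filesystem.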

Linus explained himself:

I certainly see your argument about "make something work quickly - base it on ext2". However, while _you_ may consider this a stop-gap measure, _I_ know that we may end up being red in the face over some bad design when it turns out three years later that people are still using the stop-gap measure, and the stop-gap measure was good enough that nobody ever bothered to design something nice.

FAT32 was a "stop-gap measure". Sure, it made sense to just extend FAT from a "we need to have something that can handle larger filesystems NOW" standpoint. But stop-gap measures have a way of staying around. Forever.

I would hope that Linux development is not guided by those kinds of constraints. If we have to wait an extra year for a journaling filesystem, I don't think it's a bad tradeoff to discuss the possibility of just starting from a clean slate.

Meanwhile, Alan Cox responded to Linus' "Good. Then don't go around calling it ext2 any more" post, in particular where he said, "I think that if people are doing a new filesystem, it should be done like "ext2" was originally done: by designing a new one, rather than building more scaffolding on top of an old one."

Alan Cox's reply was, "Linux has a lot of cruft in it maybe while doing journalling we should throw the OS away and rewrite that too ? - I hope Im misunderstanding your argument."

Linus responded (as quoted last week in KT):

No, you're not misunderstanding the argument. Eventually some hungry programmer will decide that Linux has too much cruft, and he'll want to take over the world, and he'll come up with a system called Davix or something. That's how these things go, and that's how things _should_ work.

The only thing I can aim at is to minimize the amount of cruft, and pushing out the inevitable as far into the future as humanly possible. That's what "maintenance" means, Alan.

Alan, the _only_ beef I ever have about you as a developer is that you're looking about a year into the future - not ten.

Start thinking ten years down the road, and think "how do I avoid the complexity issue"? Then, after you've given that some serious thouhgt, come back to me about this all.

Alan Cox replied:

I have. A lot of what is going on now is right. The problem is there is a Linus tendancy to go "No way ever" not "I can't see how we can do this cleanly". Linus Torvalds is not god, and pronouncing blindly from on high is a bad idea.

You could much more productively have said

"Stephen please call it ext3 because ....."

"I can't see a way to use all 4Gig of memory so for now I'm working on the basis its not likely to occur"

Free Software works a darn sight better when the response to something is not "No, go away" but "I don't believe you can do it, show me.."

Im sure the latter is what you really mean in mosty cases, but it doesn't come across that way.

Personally I'm pretty sure that in 5 years time Linux will support very large amounts of RAM, a lot of processors, direct DMA I/O on file systems concurrent writes to a file and journalling.

Linus replied, "Almost certainly. But it's not going to be done badly."

Stephen counted to ten, and wrote:


Linus, this is one of the most frustrating things about trying to actually implement some of this stuff: you get so bogged down in the "don't do it badly" that you never get around to talking about "do it well".

When the flames about raw IO errupted, the _only_ thing I was able to get out of you was "raw device IO is evil". It seemed to be impossible to get you to turn the discussion around to the requirements for file IO, because every pronouncement you made about the whole subject ended up with "raw device IO is evil".

Again with journaling, it's a case of "don't you *dare* touch ext2, ext2 is inviolate" rather than "yes, go ahead but keep the existing code intact". See the difference?

Part 5: The Smoke Clears

Alan Cox replied to Linus on a more technical level, and Linus replied back, also on only technical terms.

Chris Wedgwood also had a technical reply to Linus regarding the '.' and '..' directories, which had come up previously. He said, "Surely ".." is of some value (perhaps not now, but maybe in the future) to e2fsck is the disk gets corrupted?" To which Linus replied, "Not really. You can use it as an extra sanity-check (and as far as I know fsck does), but that's true of just about any redundant information - that doesn't necessarily make it actually useful. It's less so if you journal your metadata anyway," and added, "So yes, you can use ".." to check that both the parent and the child agree about each other, but even if they don't there's not all that much you can do with the information."

Ted replied with:

Actually, there is something that can be done with the information; if part of the parent's directory blocks have been corrupted, the child directory will be disconnected from the directory tree. The '..' link will tell you where in the directory tree the child should be reconnected, although not what the name of that child should be.

This information shows up in the fsck transcript, and can be used as a hint by the system administrator about how to restore the disconnected directory from the lost+found directory to the correct place in the directory tree.

I've considered automating this to make it a little easier for system administrators to actually fix things. (For example: reconnecting the directory to the parent directory if possible, and putting a symbolic link in lost+found so the administrator knows where to find the child directory so he/she can rename it to the correct name.)

Even better would be to store the name of the child directory in the child's directory block; this additional amount of redundant information would allow for a completely automated recovery procedure in many cases, which would be a big win.

To which Alexander Viro objected, "Erm... rename() will become a living horror that way. Either you'll have to reshuffle the stuff in first block or you've got another block to write on rename(). I.e. one more way to screw up."

Oliver Xymoron also replied to Chris Wedgwood's "Surely '..' is of some value ... to e2fsck is the disk gets corrupted?" with, "Yes, definitely. I've certainly found it useful in reconstructing damage by hand. Doesn't mean the kernel has to use it, it just has to maintain it. I can see how having the kernel not rely on them (already the case) is a win, but I can't see why saving the couple dozen bytes per directory and the overhead at mkdir time is a big deal. Redundancy, if done right, can increase robustness."
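The kind of recovery Ted and Oliver have in mind can be sketched on a toy directory table. The layout below (a dict mapping inode number to a "dotdot" parent and a name-to-inode "entries" map) and the placeholder name "#<inode>" are assumptions made for illustration; this is not how e2fsck is implemented.

```python
ROOT = 2  # ext2's root inode number, used here only as a label


def find_orphans(dirs: dict) -> dict:
    """Return {orphan inode: parent named by its '..'} for unreferenced directories."""
    referenced = {child for d in dirs.values() for child in d["entries"].values()}
    return {
        ino: d["dotdot"]
        for ino, d in dirs.items()
        if ino != ROOT and ino not in referenced
    }


def reconnect(dirs: dict) -> dict:
    """Re-attach each orphan under its '..' parent, using a placeholder name."""
    fixed = {}
    for ino, parent in find_orphans(dirs).items():
        if parent in dirs:
            name = f"#{ino}"  # the original name is lost; only the parent is known
            dirs[parent]["entries"][name] = ino
            fixed[ino] = (parent, name)
    return fixed
```

As Ted notes, ".." says where the child belongs but not what it was called, so the sketch can only pick a placeholder and leave the renaming to the administrator.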

Linus responded:

Redundancy, if done wrong, can also screw you quite badly. So it cuts both ways.

For example, a directory structure that is actually fairly powerful is a special case of hardlinked directories: not something you want to allow most people to do, but what I have actually been asked about a few times is a way to have the same directory show up in multiple places. That implies that ".." is actually dependent not on the directory itself, but on how you got there.

The standard answer to this in unix is symlinks, but I bet I'm not the only person who has ever cursed about "ls subdir/.." being very different from "ls .". And there are actually filesystems out there that can do it, it's just that they cannot have ".." entries in their directories.

Loopback mounts do this right, but they tend to be higher overhead. Nobody does loopback mounts of /usr/X11 -> /usr/X11R6, people use symlinks and live with the confusion of ".." because they are used to it.

Now, as it is, the linux dentry cache would be confused by having potentially multiple aliases for the same path, but that's something that I consider to be a misfeature - but one that wasn't worth fixing considering that we didn't have any serious filesystems that could take advantage of it anyway (I think the only two filesystems we support right now that can do it at all are AFFS and iso9660, and for the latter I don't actually know of anybody who does that kind of disks).

But being able to handle hardlinked directories would actually be really nice: if you avoid loops (which is easy to do), it goes from being a dangerous feature to a really nice one.

Basically, I used to think that hardlinked directories are evil, and horrible. What changed my opinion was that (a) I found a few cases where I really wanted to use them and (b) I noticed that most of the reason for hating them was because they were hard to do right rather than anything fundamentally wrong with the concept.

We can't do them right as is, but getting rid of ".." in the on-disk directory structure would be one step, and I think I can handle the dentry aliasing issue too.

Imagine, for example, a directory tree with a shared component. Wouldn't it be nice to just link it into the tree at multiple points? Imagine a chroot() environment, for a moment - symlinks don't work to the outside, but hardlinking does.

Maybe it's not worth it, but it _is_ an example of redundancy that just screws you.
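The symlink confusion Linus curses about is easy to reproduce. The sketch below assumes a scratch directory on a filesystem that supports symlinks; dotdot_views is a hypothetical helper that builds base/real/sub plus a symlink base/link pointing at real/sub, then lists ".." both the way the kernel resolves it and the way a purely textual reading of the path would.

```python
import os


def dotdot_views(base: str):
    """Return ('..' as the kernel resolves it, '..' as the path text suggests)."""
    os.makedirs(os.path.join(base, "real", "sub"))
    os.symlink(os.path.join("real", "sub"), os.path.join(base, "link"))
    # The kernel follows the symlink first, so ".." is the parent of real/sub.
    via_kernel = sorted(os.listdir(os.path.join(base, "link", "..")))
    # normpath collapses "link/.." textually, giving the directory holding the link.
    via_text = sorted(os.listdir(os.path.normpath(os.path.join(base, "link", ".."))))
    return via_kernel, via_text
```

The two listings differ because the ".." stored on disk belongs to the directory itself, not to the path used to reach it, which is the dependency Linus is pointing at.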

Alexander Viro responded to Linus' "if you avoid loops (which is easy to do)" with:

mkdir a
mkdir b
mkdir a/c
ln a/c b/c
mkdir a/c/d
mv b a/c/d

If you have a way to deal with that... BTW, I strongly suspect that any solution involving scanning potential ancestors is *not* good - you can construct very unpleasant DoS that way.

And to Linus' "getting rid of '..' in the on-disk directory structure would be one step, and I think I can handle the dentry aliasing issue too," Alexander said, "Could you elaborate? I am trying to figure out the way to do that and for the case of multiple links *from the same directory* I have a kinda-sorta solution. For generic thing... I would really like to hear your variant."

To Alexander's hardlink loop, Linus replied, "Ayee. Good spotting. Nasty. I was wrong, it's not all that easy at all." At the end of the post, Linus concluded, "I'd still like to allow hard links too, but my mind isn't quite as twisted as yours is, judging by your nasty example ;)"
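Alexander's six commands can be replayed on a toy directory graph to see where the loop comes from. Dir, reaches and move are hypothetical names; the check shown is a plain reachability scan, which is exactly the sort of scanning Alexander warns could be turned into a DoS, so treat it as an illustration of the problem rather than a proposed fix.

```python
class Dir:
    def __init__(self, label: str) -> None:
        self.label = label
        self.entries = {}  # name -> Dir; one Dir may sit under several parents


def reaches(start: Dir, target: Dir, seen=None) -> bool:
    """True if target can be reached by descending from start."""
    seen = set() if seen is None else seen
    if start is target:
        return True
    if id(start) in seen:
        return False
    seen.add(id(start))
    return any(reaches(child, target, seen) for child in start.entries.values())


def move(src_parent: Dir, name: str, dst_parent: Dir) -> None:
    """Move src_parent/name into dst_parent, refusing moves that create a loop."""
    node = src_parent.entries[name]
    if reaches(node, dst_parent):
        raise ValueError(f"moving {name} would put it inside itself")
    del src_parent.entries[name]
    dst_parent.entries[name] = node
```

Replaying the sequence: after "ln a/c b/c", directory b contains c, and c contains d, so "mv b a/c/d" asks to place b underneath one of its own descendants. With only one parent per directory, that check is a short walk up ".."; with hardlinked directories there is no single chain of parents to walk.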

At this point the thread seems pretty much over. It's strange to see the top developers getting annoyed at each other, although it's not uncommon. More interesting is how quickly (and the way in which) the discussion returned to a peaceful state. But the problems raised may not go away so easily.







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.