Kernel Traffic

Kernel Traffic #80 For 14 Aug 2000

By Zack Brown


Thanks go to Frederic Stark, who noticed that the links into the mailing list archives were not working, and who also found a broken cross reference. Thanks, Fred!

There is also a new indexing feature this week, extending through all back issues of KT and all the Cousins. You can find it in the left nav bar and in the little '[*]' links in each issue's text. Please send me bug reports and feature ideas.

Mailing List Stats For This Week

We looked at 1263 posts in 5351K.

There were 441 different contributors. 195 posted more than once. 157 posted last week too.

The top posters of the week were:

1. ext3-0.0.2f Released; Consistency Checkers; New "Phase Tree" Algorithm

5 Jul 2000 - 2 Aug 2000 (56 posts) Archive Link: "ext3-0.0.2e released"

Topics: BSD: FreeBSD, Disk Arrays: LVM, FS: JFS, FS: NFS, FS: ext2, FS: ext3, Virtual Memory, Web Servers

People: Stephen C. Tweedie, Theodore Y. Ts'o, Manfred Spraul, Nick Cabatoff, Andi Kleen, Alan Cox, Daniel Phillips, Andreas Dilger, Victor Yodaiken, Bill Huey, Andrew Morton

Stephen C. Tweedie announced:

ext3-0.0.2e has been uploaded to

This release fixes a few problems seen on rare occasions, plus one much more serious crash-on-unmount. It includes patches for both 2.2.17pre9, and the current Red Hat 2.2.16-3 errata kernel.

It also includes "orphan-list" code based on an implementation by Andreas Dilger, for cleaning up inodes which have been unlinked but are still held open by a process --- such inodes need to be deleted properly on a crash.

The full list of changes is below.

I will now be starting to merge in a substantial amount of newer code, including the new error-handling infrastructure for ext3 and the metadata-only journaling. The plan is to keep this 0.0.2e release as a 0.1 stable branch for those relying on ext3 while the new code is being merged in.

Thanks to all who have helped with the testing of 0.0.2d so far.

He listed the changes in this release:

Port forward to current (2.2.17pre9, and Red Hat errata 2.2.16-3) kernels

Merge in a number of ext2 fixes from 2.2.15+:

Fix a number of buffer leaks in recovery (prevents set_blocksize errors on mounting filesystems)

sync(2) waits for current transactions correctly

Set the superblock s_dirt flag on all transaction completions

Fixed the order of asserts and buffer writes in fs/buffer.c: this was causing false assertion failures on Mylex raid controllers

Delete the filesystem commit timer on unmount in all cases

Include Andreas Dilger's implementation of the "orphan list" code:

The orphan list maintains an on-disk list of inodes needing cleaned up on recovery, including:

He replied to himself the next day with a patch and a warning:

This has now been superseded by ext3-0.0.2f (patch enclosed), which fixes a major bug --- the new truncate code in 0.0.2e did not propagate extensions of existing directories to disk (existing, sufficiently-padded directories would not be affected, but appending a lot of new dirents to an existing directory could leave the new dirents unreachable after a reboot). e2fsck should be able to restore the directories if this has caught anybody --- the contents of the directories was not lost, only the update of the on-disk copy of the directory's size was being missed.

I'll push a complete set of clean ext3-0.0.2f patches out to shortly, but in the mean time please apply the patch below if you are running 0.0.2e.

Andreas, I also found that your orphan list code was missing the case of a "rmdir" of a directory still being used as the working directory of an existing process. 0.0.2f should also clean up such a case on reboot.

Later he gave a link to the patch.
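The case the orphan list covers, an inode that has been unlinked while a process still holds it open, is easy to reproduce from user space. The following is a minimal Python sketch for a POSIX system, not ext3 code; the function name is made up for illustration. It shows that the file's name disappears at once while its data stays reachable through the open descriptor, which is exactly the state a crash would leave dangling on disk.

```python
import os
import tempfile

def unlinked_but_open_demo():
    """Unlink a file while a descriptor is still open and show that
    its data remains readable until the descriptor is closed."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"still here")
        os.unlink(path)                      # the directory entry is gone
        name_exists = os.path.exists(path)   # so the name no longer resolves
        links = os.fstat(fd).st_nlink        # link count as seen via the fd
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 64)               # but the inode is still alive
    finally:
        os.close(fd)                         # only now can the inode be freed
    return name_exists, links, data
```

If the machine crashes between the unlink and the close, nothing in the directory tree points at the inode any more, so recovery needs some other record of it.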

In the course of discussion, Theodore Y. Ts'o said:

Even with journaling filesystems, there will be cases you will need to run some kind of filesystem consistency checker.

  1. In case of disk drive problems.
  2. In case of memory problems (particularly cache memory)
  3. In case of kernel bugs (many times what people think of as "bugs" in filesystem code is really bugs in the VM or buffer cache parts of the kernel.)

This is true for all journaling filesystems; they aren't magic. What journaling filesystems do protect you against is the need to run fsck in case of a power failure or a kernel crash, or some other kind of unclean shutdown (so long as that unclean shutdown doesn't cause any other forms of on-disk corruption.)

Andrew Morton asked if this consistency checking could be done while the FS were online. Theodore replied:

The Multics operating system was able to run its filesystem recovery tool while the filesystem was online. Then again, Multics was also designed so that if a circuit breaker snapped off and one of its three memory cabinets got uncleanly shutdown, only processes that had memory pages on the downed memory subsystem would get killed. The thinking was: just because you lost 1/3 of your memory and you have to kill off 22 users' processes, why should you have to ruin the other 45 users' day? :-)

In practice, though, I'm not aware of any filesystem consistency checker since the days of Multics that could do this. It's possible, but you have to put all sorts of very careful interlocking between the checking code and the filesystem code, and this adds a *lot* of complexity. In the case of Multics, the filesystem consistency checker was actually part of the kernel (it ran in Ring 0), and this tends to go against the general Unix and Linux design principles of keeping as much as possible in userspace.

Manfred Spraul replied:

Windows 95/98 scandisk 8-)

Their implementation is well documented (they need 3 lock levels), but very slow (scandisk restarts from scratch if someone else writes to the disk)

The filesystem reduces metadata caching, and scandisk uses a special interface for atomic read-fsck-write cycles. (conditional writes, similar to the atomic instructions on most RISC cpus)
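The "conditional write" idea can be sketched in a few lines. This is a toy model in Python, not the Windows interface: the `Disk` class, its per-block version counters and the `repair_block` helper are all assumptions made for illustration. The checker reads a block, works out a fix, and the write only goes through if nobody else has touched the block in the meantime; otherwise it retries.

```python
class Disk:
    """Toy block store with a version counter per block (hypothetical)."""

    def __init__(self, nblocks):
        self.data = [b""] * nblocks
        self.version = [0] * nblocks

    def read(self, n):
        return self.data[n], self.version[n]

    def write(self, n, data):
        self.data[n] = data
        self.version[n] += 1

    def conditional_write(self, n, seen_version, data):
        """Write only if the block is unchanged since it was read."""
        if self.version[n] != seen_version:
            return False
        self.write(n, data)
        return True


def repair_block(disk, n, fix, max_tries=10):
    """Read-check-write cycle that retries when a writer got in first."""
    for _ in range(max_tries):
        data, ver = disk.read(n)
        if disk.conditional_write(n, ver, fix(data)):
            return True
    return False
```

The retry loop is also where the slowness Manfred mentions would come from: every competing write forces the checker to start its cycle again.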

Nick Cabatoff also said, "There's one for UFS/FFS now on the way in FreeBSD 5.0: Kirk McKusick just released alpha code to do what he calls snapshots, which I'm told will enable background fscking, among other things. See (or the freebsd-arch archives) if you're curious." Theodore replied:

That's not a full filesystem consistency checker, though. He's running fsck on a consistent snapshot of the filesystem in order to detect orphaned blocks which can then be freed in the live filesystem. (The BSD soft update code can leak blocks from inodes which are open at the time of a system crash, which is why this is necessary.)

This technique can't be used to deal with arbitrary filesystem corruption, however. It only addresses a very specific case which can't be handled any other way given the BSD Soft Updates approach.

And Andi Kleen added, "You can do the same today on Linux with LVM snapshots. They are only useful for read-only consistency checking because they are read-only. (so in case of a problem you'll need to umount and rerun fsck on the normal device)" Alan Cox put in, "Also there is ext2 based work going on using phase tree rather than journalling which gives you similar journal properties, in future snapshots and also very nice handling of multipath error recovery."

Steve Whitehouse asked for an explanation/reference for "phase tree", and Daniel Phillips replied at length:

It's the algorithm used in my Tux2 filesystem that I've been working on since around Christmas, or longer than that if you count 10 years of thinking about doing it :-)

Phase tree is my name for an algorithm similar to one that has been used in WAFL and in an OS named Auragen that you can ask Victor Yodaiken about. I developed the algorithm independently and it was only on reading a posting from Victor on linux-kernel from June, 1997, that I realized I wasn't the only one to have thought of it.

My phase tree algorithm is different enough from the other two that I think it's fair to call it a new algorithm, or at least a close cousin. I'm writing a white paper on it for presentation at the ALS this fall. An abstract is available now. I have a working prototype of Tux2 "with some issues" that I'm now busily porting from 2.2.12 to 2.4.0.test.

I have attached Victor's original email, which makes very good reading. You can get the Tux2 abstract by emailing me... (I'm very interested in finding out exactly who is interested.)

Here is a brief description:

Tux2 is based on Ext2. It is not a journalling filesystem, but it does what a journalling filesystem does: keep your files safe in the event of a processing interruption. It does that for both data and metadata and, according to my early benchmarks, should do it at about the same speed at which a JFS does metadata-only. We shall see.

Tux2 uses my "phase tree" algorithm (so christened by Alan Cox - I called it tree/phase but I like his name more). Phase tree imposes a partial ordering on disk writes to ensure that a filesystem on disk is always updated atomically, with a single write of the filesystem metaroot. To work properly, the entire filesystem including all metadata, must be structured as a tree. Ext2 is not structured as a tree, therefore, the major difference between Ext2 and Tux2 is that all metadata has been rearranged into a tree.

Once you have the filesystem in the form of a tree you can make a copy of the metaroot, then for all updates, apply a "branching" algorithm that works from the updated block towards the metaroot doing a copy-on-write at each node that needs updating. After some number of updates (the exact number is a performance-tuning parameter) you store the new metaroot on disk, which gives you an atomic update. So far this is similar to Auragen and WAFL.
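As an illustration only (this is not Tux2 code, and `Node`, `cow_update` and `Filesystem` are hypothetical names), the branching step can be sketched in Python with an immutable tree. Each update copies the nodes on the path from the changed block up to the root and shares everything else; the "on-disk" root changes only when the new root is stored in a single assignment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    payload: str = ""
    children: tuple = ()   # tuple of (name, Node) pairs


def child(node, name):
    for n, c in node.children:
        if n == name:
            return c
    return None


def cow_update(node, path, payload):
    """Return a new tree with payload stored at path; node is left untouched."""
    if not path:
        return Node(payload, node.children)
    head, rest = path[0], path[1:]
    old = child(node, head) or Node()
    new_child = cow_update(old, rest, payload)
    # Siblings off the path are reused as-is, not copied.
    kids = tuple((n, c) for n, c in node.children if n != head)
    return Node(node.payload, kids + ((head, new_child),))


class Filesystem:
    def __init__(self):
        self.on_disk_root = Node()            # what a reboot would see
        self.working_root = self.on_disk_root

    def write(self, path, payload):
        self.working_root = cow_update(self.working_root, path, payload)

    def commit(self):
        # Stand-in for the single atomic metaroot write.
        self.on_disk_root = self.working_root
```

Until `commit()` runs, a reader of `on_disk_root` sees the old tree in full, however many updates have piled up in `working_root`.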

Tux2's phase tree algorithm works almost entirely in cache and is intimately coupled to the buffer cache system. A third metatree is added, to allow filesystem updating to continue without pause while the second tree is committed to disk, eventually replacing the first metatree using the abovementioned atomic write.

In tree phase terminology, the three trees are called "phases". The three phases are:

Tux2 has its own update daemon that handles its "phase transitions". A phase transition is the act of committing a new metaroot wherein the second phase tree becomes the first, the third becomes the second and a new metaroot is created. (This is a function analogous to kflushd, though kflushd in its current form can't possibly know what it would need to know to cause phase transitions at appropriate times, and in any event, it has no way to initiate one.)
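A rough model of the rotation, again in Python and again purely illustrative: the attribute names `first`, `second` and `third` are placeholders rather than Tux2's own phase names, and plain dictionaries stand in for the metatrees. New updates only ever touch the third tree, and a crash always recovers the first.

```python
class ThreePhase:
    """Toy three-tree pipeline; dictionaries stand in for metatrees."""

    def __init__(self):
        self.first = {}    # last state known to be safely on disk
        self.second = {}   # state currently being written out
        self.third = {}    # state receiving new updates

    def update(self, key, value):
        # Build a new mapping rather than mutating one another phase may share.
        self.third = {**self.third, key: value}

    def phase_transition(self):
        """Second becomes first, third becomes second, a new third is started."""
        self.first = self.second
        self.second = self.third
        self.third = dict(self.second)

    def recover(self):
        """What a reboot after a crash would see."""
        return dict(self.first)
```

In this model an update becomes crash-safe two transitions after it is made, which matches the pipeline described above: one transition to start writing it out, one more to make it the recorded state.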

That's basically it. There are some other wrinkles in Tux2 that serve to flatten the filesystem tree, reduce the number of block writes required and keep cpu usage to a reasonable level.

As Alan mentioned, there are many interesting things you can do when you have a filesystem's metadata in the form of a tree. Tux2 doesn't do most of those things at this point, since its main purpose in life is to demonstrate the efficacy of the phase tree algorithm and to allow me to do kernel development without putting my precious files at risk every time I need to reset the system.

Bill Huey said this looked a lot like journalling, and Daniel replied:

Yes, that's the point. It is supposed to do what a journaled FS does, i.e., keep your files safe and make fsck go away, but with less overhead. There are other advantages over journalling: there are *far* fewer boundary conditions to deal with. Basically, there is only one ordering constraint to worry about per phase: write one entire batch of updates before writing the next. Within a phase the order of writing is completely unconstrained, so an elevator algorithm is free to choose the shortest path across the disk surface. This decouples the filesystem from the lowlevel I/O in a very satisfying way.

Note also that most journalling filesystems do not attempt to preserve the integrity of data within files rigorously because of the associated overhead of writing every data block twice (roughly speaking). Tux2 does provide an integrity guarantee for *both* data and metadata.

To be fair, there are two things a journal can do that phase tree cannot: (1) roll forward and (2) preserve filesystem integrity right up to the last completed disk write. It's not really clear to me why (1) is useful. But (2) is important for something like a network transaction server that wants to report each transaction "complete and safe" absolutely as soon as possible. So an aggressive transaction server would report completion as soon as the journal entry had been made, allowing the client application to stop waiting and go on about its business. Phase tree has to wait until the upcoming metaroot write completes before it can report any transaction complete; this will be any time from a few tens to a few thousands of disk operations later, depending on how the phase change heuristics are designed and configured.

This means that journalling is better than phase tree for transaction serving. In most other applications, IMHO, phase tree will offer higher throughput while still giving an acceptable transaction latency. I think that is good.
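The single ordering constraint Daniel describes (finish one batch before starting the next, any order within a batch) can be written down very compactly. A minimal Python sketch, with `flush_order` as a made-up name and block numbers standing in for positions on the disk:

```python
def flush_order(phases):
    """Return the order in which blocks are written.

    phases is a sequence of batches of block numbers. Batches go out
    strictly in sequence; inside a batch the blocks are sorted so the
    head can cover them in one ascending sweep.
    """
    order = []
    for blocks in phases:
        order.extend(sorted(set(blocks)))
    return order
```

For example, `flush_order([[90, 3, 41], [7, 2]])` gives `[3, 41, 90, 2, 7]`: block 2 is not moved ahead of block 90 even though it sits lower on the disk, because it belongs to the later phase.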

2. Linux CVS Archive

24 Jul 2000 - 1 Aug 2000 (14 posts) Archive Link: "[ANN] Linux Kernel Source Reference"

Topics: Version Control

People: Thomas Graichen, Ivan Passos, Riley Williams, Gary Lawrence Murphy

Thomas Graichen gave a link to The Linux Kernel Source Reference and described, "it's basically a cvs tree with all linux versions starting from 1.0 until the latest one with a cvsweb www frontend and pserver remote functionality on top of it ... this way you can easily get or diff or whatever any ever released (since 1.0 :-) version of the linux kernel source." He invited comments and criticism, and some folks mentioned Riley Williams' online index of all known kernels. Riley added that he'd just finished updating it after a brief hiatus. Ivan Passos pointed out that Riley's collection wasn't CVSed, and folks agreed that the two projects complemented each other. Gary Lawrence Murphy suggested merging the two, and also mentioned LXR (Linux Cross Reference), a searchable kernel archive.

Regarding Thomas' archive, David Schleef asked if the whole thing were available as a tarball for download, so he could set up his own local high-speed repository. Thomas thought something like 'rsync' would be great, and said he'd try to work on that in the next few days.

3. Linus Still Accepting Major Rewrites To USB Code

30 Jul 2000 - 6 Aug 2000 (19 posts) Archive Link: "[linux-usb-devel] USB status in 2.4.0-test5"

Topics: FS: devfs, Hot-Plugging, Modems, Networking, PCI, SMP, USB

People: Randy Dunlap, Miles Lane, Alan Cox

In the linux-usb-devel mailing list, Randy Dunlap explained and announced:

For some time now, Alan Cox has maintained a list of problems of various severity for 2.4. He and his gnomes have given up the ghost on this list, but Linus wanted it to be kept up, so Ted Ts'o volunteered to take it over. Ted updates the list and posts it to the kernel mailing list and to .

I should have done this long ago (and maybe some of you thought that I did), but I'm trying to use this same method to track USB problems/status in 2.4.0-testN. I'm using the same format that Alan/Ted use. After this list (that I just threw together) has been sanitized/reviewed/corrected, Ted can have it..... so please send me updates, corrections, additions, etc., for this USB 2.4 status list.

He posted his list:

USB Status/Problems in 2.4.0-testN

  1. Should Be Fixed
  2. Capable of corrupting your FS
    1. Problems with USB storage drives (ORB, maybe Zip) during APM sleep/suspend
  3. Security
  4. Boot-Time Failures
  5. Compile-Time Failures
  6. In Progress
    1. usb-uhci and uhci to handle control/bulk IN STALLS better
    2. usb-uhci should not set PCI Latency Timer register to 0
    3. usb-uhci SMP spinlock/bad pointer crash
    4. hotplug (PNP) and module autoloader support
  7. Obvious Projects for People (well if you have the hardware..)
  8. Fix Exists But Isn't Merged
  9. To Do
    1. race conditions on devices in use and being unplugged
    2. cpia camera driver with OHCI HCD locks up or fails
    3. pegasus (ethernet) driver crashes often
    4. SANE backend can't communicate to its scanner (sometimes, some scanners)
    5. OHCI memory corruption problem
    6. Fix differences in UHCI and OHCI HCD behaviors/semantics
  10. To Do But Non-Showstopper
    1. add bandwidth allocation support to usb-uhci and OHCI HCDs
    2. acm (modem) driver is slow compared to Windows drivers for same modems (probably a host controller driver problem, not acm driver)
    3. printer driver can lose data when printing huge files (like 100 MB)
    4. printer driver aborts on out-of-paper or off-line conditions instead of retrying until the condition is fixed
    5. speed up device enumeration (hub driver has large delays in it)
    6. add devfs support to drivers that don't have it
    7. add DocBook info to main USB driver interfaces (usb.c)
  11. Compatibility Errors
  12. Probably Post 2.4
    1. spread out interrupt frames for devices that use the same interrupt period (interval)
    2. add USB 2.0 EHCI HCD
  13. Drivers in 2.2 and not 2.4
  14. To Check
  15. Fixed

To item 9.3 (To Do: pegasus (ethernet) driver crashes often), Petko Manolov replied that he was suspicious of these crashes, which seemed to be more HCD and USB core related, but he said he'd go change a lot of Pegasus code anyway. Miles Lane pointed out, "That, perhaps, is not the greatest idea. My understanding is that Linus has been quite adamant about only accepting bug fixes. A major rewrite of any big chunk of code may simply introduce many new bugs." But he replied to himself the next day, "My apologies to you, Petko. Randy has informed me that Linus is still taking major rewrites of USB driver code. I guess I should have gathered that without having to be told, but sometimes the obvious eludes me."

4. Symlinks In The Kernel; Kernel/Library/etc Interface Dispute

27 Jul 2000 - 3 Aug 2000 (185 posts) Archive Link: "RLIM_INFINITY inconsistency between archs"

Topics: Backward Compatibility, FS: NFS, FS: ext2, SMP, USB

People: Boszormenyi Zoltan, Linus Torvalds, Mike A. Harris, Kai Henningsen, Theodore Y. Ts'o, Alan Cox, Ulrich Drepper, James Lewis Nance, Adam Sampson

Boszormenyi Zoltan had some trouble compiling the latest 'egcs' snapshots on a Linux 2.4.0 system, and traced the problem to the fact that "/usr/include/asm is a symlink to /usr/src/linux/include/asm, as in the original distribution but /usr/src/linux is a 2.4.0-testX tree. With a 2.2.X source tree, it does not produce any warning." Linus Torvalds replied:

I've asked glibc maintainers to stop the symlink insanity for the last few years now, but it doesn't seem to happen.

Basically, that symlink should not be a symlink. It's a symlink for historical reasons, none of them very good any more (and haven't been for a long time), and it's a disaster unless you want to be a C library developer. Which not very many people want to be.

The fact is, that the header files should match the library you link against, not the kernel you run on.

Think about it a bit.. Imagine that the kernel introduces a new "struct X", and maintains binary backwards compatibility by having an old system call in the old place that gets passed a pointer to "struct old_X". It's all compatible, because binaries compiled for the old kernel will still continue to run - they'll use the same old interfaces they are still used to, and they obviously do not know about the new ones.

Now, if you start mixing a new kernel header file with an old binary "glibc", you get into trouble. The new kernel header file will use the _new_ "struct X", because it will assume that anybody compiling against it is after the new-and-improved interfaces that the new kernel provides.

But then you link that program (with the new "struct X") to the binary library object archives that were compiled with the old header files, that use the old "struct old_X" (which _used_ to be X), and that use the old system call entry-points that have the compatibility stuff to take "struct old_X".

Boom! Do you see the disconnect?

In short, the _only_ people who should update their /usr/include/linux tree are the people who actually make library releases and compile their own glibc, because if they want to take advantage of new kernel features they need those new definitions. That way there is never any conflict between the library and the headers, and you never get warnings like the above..
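The disconnect is easy to see with a toy model. The Python sketch below uses the standard `struct` module; the two layouts and the `uid`/`gid` fields are invented for illustration and do not correspond to any particular kernel structure. A program packs data using the new layout, and a "library" that was built against the old layout reads it back.

```python
import struct

# Hypothetical layouts: the same two fields, widened from 16 to 32 bits.
OLD_X = struct.Struct("<HH")   # "struct old_X" (what the library was built with)
NEW_X = struct.Struct("<II")   # new "struct X" (what the new headers define)


def program_build_ids(uid, gid):
    """Stand-in for a program compiled against the NEW headers."""
    return NEW_X.pack(uid, gid)


def library_read_ids(buf):
    """Stand-in for a library compiled against the OLD headers."""
    return OLD_X.unpack_from(buf)


buf = program_build_ids(1000, 100)
print(library_read_ids(buf))   # the library sees (1000, 0), not (1000, 100)
```

Nothing fails loudly here: the first field happens to come out right and the second is silently wrong, which is what makes this kind of mismatch hard to track down.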

He went on:

I would suggest that people who compile new kernels should:

And yes, this is what I do. My /usr/src/linux still has the old 2.2.13 header files, even though I haven't run a 2.2.13 kernel in a _loong_ time. But those headers were what glibc was compiled against, so those headers are what matches the library object files.

And this is actually what has been the suggested environment for at least the last five years. I don't know why the symlink business keeps on living on, like a bad zombie. Pretty much every distribution still has that broken symlink, and people still remember that the linux sources should go into "/usr/src/linux/" even though that hasn't been true in a _loong_ time.

Is there some documentation file that I've not updated and that people are slavishly following outdated information in? I don't read the documentation myself, so I'd never notice ;)

Mike A. Harris commented, "I very much like the idea of what you describe below however as it solves NUMEROUS problems indeed. This information should be put in the top level README file, and emphasis put on the 'dont compile in /usr/local' part, because it would sure save people a lot of headaches IMHO." Also in reply to Linus, Kai Henningsen pointed out that in Debian at least, "/usr/include/asm is a directory, and its contents come with the libc6-dev package."

In reply to Linus' question about misleading docs that might be floating around, several folks piped up. Jeff Lightfoot pointed out that a ton of files in the 'Documentation' directory referenced '/usr/src/linux/', and James Lewis Nance and André Dahlqvist independently posted patches to clean that up in the main README. Adam Sampson added that the 'glibc' installation instructions had similar problems, and Kai added that in the Linux sources, the problem existed in "Lots of places, actually. 'find -type f | xargs grep /usr/include' and shudder."

Also in reply to Linus, Theodore Y. Ts'o suggested having /usr/src/linux be a symlink to the header files of whatever kernel booted by default. Since only root could actually install a kernel (even though any user could do the compilation themselves), the question of where the link should point would always be clear. He explained, "The problem is that unless you are trying to say that you want to outlaw external source packages which generate kernel modules, there needs to be some way for such packages to be able to find the kernel header files." But Linus replied that this would force kernel header files to maintain source-level backward compatibility forever, which would cause big problems. In terms of how external packages could find header files, Linus replied:

By hand. By the maintainer. And _independently_ of what random user Joe Blow has on his particular installation.

Because it's not unreasonable AT ALL to have those packages be compiled with newer header files than the user even has access to. Imagine an ext2 library that wants to support new features of the filesystem, compiled on a box that only has 2.2.13 installed. Never ever had anything newer.

Should that newer source package dumb itself down to 2.2.13 level, so that the e2fsck doesn't know how to handle new filesystems? Sure, the user obviously isn't using them _now_, but wouldn't it be a lot nicer if you just had a source tree that ended up generating the same binary that you as the maintainer has? With all the new features, just suppressed by the fact that it ends up running on a old-style filesystem image..

Trust me, it's STUPID to have user-level binaries that end up different depending on what machine they were compiled on. We've had exactly that happen, and it's a BUG. It's nasty to debug.

Think about it. You have machine X and machine Y, and they both have the ext2-programs compiled with the same compiler from the same sources with the same libraries. Would you _really_ consider it acceptable if they act differently?

I don't. And that is why I will continue to maintain that it is WRONG to have that symlink. No ifs, buts or other crap. Just face reality.

Elsewhere in the same vein, he went on:

I know people who _routinely_ compile stuff over NFS on another machine simply because that other machine is a lot faster, and the network is fast. They expect the binary to be the same. And I agree 100% percent. It should NOT depend on your particular kernel configuration (and yes, some kernel header files actually _change_: they depend on whether the kernel was compiled for a PII or a i386 etc).

Say you have a build-server that runs an older kernel because it doesn't really matter, and it's not running gnome etc. Say your desktop uses USB and you've upgraded. Or the reverse may be true, where the build-server is a SMP machine that uses a newer kernel because it handles the load better.

With your approach, that build-server would be unable to generate programs that take advantage of the new features that somebody wanted to have in the program. They would generate programs that are doing things that the locally generated programs wouldn't be doing.

What I mean is that the above generation-script should be generated _once_. The source gets distributed with the generated file, so that whatever happens you at least get reliable results in a reasonably heterogenous environment.

A "normal user" would never generate nofollow.h at all. The generation script would be used by the _maintainer_ or by people who add new features (And yes, in the above example it's rather simplistic. A real example would generate the proper architecture ifdef's etc).

I expect that library versions and compiler versions should matter to compiling programs. But I do _not_ want kernel versions to do that. It's already painful for people that you have to have the right library version. I'd _hate_ to see source code that says "requires kernel 2.3.99 or higher sources in /usr/src/linux" in addition to saying "needs glibc-2.1.2 or newer for threading reasons".

He replied to himself:

Put another way that maybe is a clearer example:

A lot of old-time UNIX people seem to think that everybody compiles sources themselves. That's madness. Yes, it's important that you _can_. But you shouldn't have to. If I hear that the new feature 2.3.5 of package "foo" supports the new filesystem layout that I've been waiting for, should I have to pray that the person who compiled the binary happened to use one of the development kernels where that feature was actually implemented?

Or should I have to recompile it myself to make sure?

Or, wonder of wonders, should it just WORK?

I think the latter. And I hope I've made clear to everybody why a software package must NOT EVER depend on what kernel version happened to be installed when it was compiled. And why it is so _important_ that nobody even by mistake does this. EVER.

The defense rests.

Theodore replied that he hadn't meant userland programs, and said:

I'm talking about kernel modules. Like the external PCMCIA package; remember? The one which you recommended distro's should use because the 2.4 PCMCIA code wasn't quite up to snuff yet.

Kernel modules *inherently* depends on which kernel happens to be running on which machine. We can't change that, because we don't want to lock down kernel interfaces.

It would be nice, however, if there was a painless way to compile such external kernel modules so they easily work with whatever kernels happens to be on the machine.

I accept your arguments that user-mode programs shouldn't depend on the kernel which you happen to be compiling on. But this simply doesn't work for kernel modules.

Linus replied, "You're right, right now kernel modules need some way of specifying where the kernel is. I've always just had a define at the top of a makefile that the user actually had to edit by hand (this was how early USB-development was done, for example). Not very pretty, I guess. But at least it doesn't screw the "normal" user packages." And Theodore said, "I'd really, really, like some kind of convention that could be standardized." He proposed either:

He went on:

I could live with any of these; as long as we all can agree on a single convention, so that default is always right. If you don't like /usr/src/linux because of the past history, and how user-mode packages are using it incorrectly, let's create a new convention. I personally think /lib/include (ala /lib/modules) is probably the best one but it means dropping approximately 4 megabytes into /lib, which might cause some problems for some partitioning schemes.

My external kernel module packages use a define at the top of a makefile as well (and currently defaulted to /usr/src/linux; I can change that). This is fine for me, but I'd like to be able to support users that don't necessarily know how to edit Makefiles. I'd like for them to be able to type "make" and "make install" as root, and that's about it. In order to do this, we need some kind of convention. Conventions are Good Things.

Alan Cox also advocated standardizing on something, and suggested:

Symlinks are wonderful things


neither needs to be a source tree in full nor a copy. In fact it's ideal since make modules_install will know enough to make the link so the link will de facto get put in the right place when people install new kernels. Self updating to new features is good.

Linus replied, "I like this one. It puts the thing in the same tree as the modules themselves, so it's self-contained. Let's _document_ it as a symlink, and make "make modules_install" do that part too (I don't use modules so I'd rather somebody else sent me the tested - likely one-liner - patch to do this)." Theodore posted a very small patch, and added, "Vendors should test this against their kernel packaging tools, which tend to do all sorts of non-standard stuff because they try to build multiple kernels and multiple sets of modules from a single kernel source tree." There followed some implementation discussion about various pitfalls to be avoided, and how best to code the patch to avoid them.
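For an external module package, a convention like this reduces "where is the kernel?" to a one-line lookup. The Python sketch below assumes the link is named `build` and sits in the per-release modules directory, i.e. `/lib/modules/<release>/build`, which is the layout `make modules_install` sets up in later kernels; the exact path agreed in this thread is not shown above, so treat it as an assumption. The `override` parameter is a hypothetical stand-in for the hand-edited makefile define Linus and Ted both mention.

```python
import os
import platform


def kernel_build_dir(release=None, modules_root="/lib/modules", override=None):
    """Guess where the headers for a given kernel release can be found.

    override:     explicit path chosen by the user, wins if given
    release:      kernel release string; defaults to the running kernel
    modules_root: where per-release module trees live
    """
    if override:
        return override
    release = release or platform.release()
    return os.path.join(modules_root, release, "build")
```

A package's build script could call `kernel_build_dir()` to get a sensible default and still let a maintainer point it somewhere else, so the result follows whichever kernel the modules are being built for rather than whatever happens to be in /usr/src/linux.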

Elsewhere, Ulrich Drepper had some angry words for Linus regarding the whole discussion:

Your style of development, these sudden, unplanned changes, is what makes it necessary to not add all the content to the libc headers. In addition, and I repeat this probably for the thousandth time, where the f*ck is the sysconf() function which is so very much needed?

Until you provide solutions for this you cannot expect others to do more work. I would have to release a new glibc version every week since something changed and somebody will run into the problems. And no, your argument that the people who are doing such low-level work should know what to do doesn't cut. Those people might know, but what they produce and ideally distribute in source form has to be compiled by the clueless. They don't know how to change their system (if they even have the permission) and hardcoding new values is also out of question.

Maybe you should spend some time thinking how *you* can improve the process of using more recent kernels before complaining about others. The first and obvious thing is to implement __sys_sysconf (maybe do it on top of sysctl, I don't care).

Regarding sysconf(), Linus replied:

I've never needed it.

Uli, maybe you forgot about that "open source" thing?

And btw, the kernel doesn't even _know_ many of the sysconf values. They depend on library implementation, and apparently even on things like the implementation of the "expr" binary. So "sysconf()" is not a kernel thing.

A subset of those sysconf values are things that you should ask the kernel, but go look at what sysconf should return: it's definitely not a system call.

You're barking up the wrong tree.
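
For readers unfamiliar with the interface being argued over, the sketch below shows how a user-space program asks for such limits at run time through the POSIX sysconf() and pathconf() calls, instead of trusting compile-time header constants. It only illustrates the library-level API. The helper names runtime_ngroups_max() and runtime_name_max() are made up for this example, and nothing here says how glibc or the kernel obtain the values internally.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: run-time limit on supplementary group IDs.
 * sysconf() returns -1 if the limit is indeterminate or on error. */
long runtime_ngroups_max(void)
{
    return sysconf(_SC_NGROUPS_MAX);
}

/* Hypothetical helper: longest file name allowed in directory `dir`.
 * pathconf() is the path-based sibling of fpathconf(); it returns -1
 * if there is no fixed limit or on error. */
long runtime_name_max(const char *dir)
{
    assert(dir != NULL);
    return pathconf(dir, _PC_NAME_MAX);
}
```

A caller would compare the returned value against -1 before using it, since both calls use -1 to mean "no definite answer".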

Ulrich replied that he had no time to work on the kernel, and that in any case he wasn't asking for a full implementation, only one that would expose the kernel parameters; the library could take care of the rest. He went on, "Just recently I needed the real value of NGROUPS_MAX. How should I get it? Also, fpathconf() is needed. And no, I'm not misdirecting this. There were in the past some tries to implement this and you ignored them." To Ulrich's time constraints, Linus replied:

So don't complain if I'm not interested in some esoteric glibc issue that I find totally removed from the kernel.

In particular, why curse at me when it's your own problem.

In short, go away until you can behave.

Ulrich replied:

It's a problem caused by you and the short-sighted way the kernel interfaces are designed so that they need constant attention. You are unwilling to cooperate in any way. Saying that writing a kernel version of sysconf/fpathconf is *my* problem is simply ridiculous. According to your logic it is my problem to keep the libc interface and it is my problem to keep (ehm, make) the kernel interface sane. You are happy living in the kernel-only world. Probably using a shell kernel module or so since, as you mentioned, the libc problems are only "esoteric" problems for you.

Why are you constantly rejecting advices and even implementations of proper interface for the kernel? I know that you don't think it's fair to compare yourself to the developers of the other (commercial) Unix kernels. But how about just taking a look at the interfaces? Why do you think they have, for instance, kernel sysconf and fpathconf interfaces? The reason is very similar to the situation we are in here: they have separate groups working on the kernel and user-level stuff, they allow the admin to reconfigure the kernel. This all cries out loud for a sane and stable interface.

If you don't want to work on these things, fine, nobody can blame you. But I think you owe it to all the other people working on and using the system to listen to their comments and accept some changes for which you in the kernel-only world see absolutely no need.

Having said this it is I think time to call for volunteers once again. Maybe we can actually find some if you are stating that you are actually willing to seriously consider using what they are coming up with. What is needed is:

Please consider these advices.

To Ulrich's statement that kernel interfaces were shortsightedly designed and required constant attention, Linus replied, "No. I've told you (in fact this whole thread is all _about_ that) how to not need constant attention. The fact that you repeatedly ignore this is your problem, not mine."

Ulrich accused Linus of taking the easy way out, and at one point Alan said he felt the balance was somewhere in the middle, and elsewhere volunteered to work on sysconf(), if Ulrich would provide a precise list of items. Ulrich replied, "The ones I mentioned in one of the last mails are those I'm currently aware of. But the scheme should be easily extendible anyway since there will be new requirements in the future. And ideally modules will be able to register their own extensions."

But Linus interposed:

Don't do this.

Make it a _minimal_ list, not the kind of "this is everything I can think of, and I'll also add a way for modules to add their own" stuff.

Yes, Uli, I know you like overdesigning things.

Ulrich said he wasn't overdesigning, just looking to the future, and cited, "For example, there are still people using the STREAMS stuff. This code should also export its parameters." But Linus replied:


That code should just DIE.

sysconf() isn't even important enough to overdesign for. Who really cares whether _SC_STREAM_MAX gets the exact right value? I've never seen anybody use it.

The way code gets added to the kernel is when somebody cares enough to write it, and it looks good enough to add.

Code does NOT get added to the kernel just because somebody makes a big deal of nothing.

Ulrich said the 'streams' thing was just an example of the kind of thing he was talking about, not a specific case where it definitely should be done. But he concluded, "I'll stop trying to convince you since I don't have much hope. When glibc 2.2 comes out I'll provide some information on how much code is necessary to handle all the different kernel versions. Almost all of these changes could have been avoided if this purely minimalistic approach to kernel interface design would be replaced by something more flexible."

At this point the discussion veered off.

5. Trouble With PS/2 Hotplugging In Stable Series

28 Jul 2000 - 1 Aug 2000 (16 posts) Archive Link: "ps/2 mouse (synaptics touchpad)"

Topics: Hot-Plugging, Version Control

People: Alan Cox, Vladimir Dergachev, Vojtech Pavlik, Andrew McNabb

Vladimir Dergachev noticed that 'gpm 1.19.3' gave tons of errors in 2.4.0-test4, while under 2.2.14 it worked fine. Andrew McNabb reported seeing the same problem when he'd upgraded to 2.2.16, and recommended just removing 'gpm', since it was unmaintained anyway. But Vladimir replied with a patch, having tracked the problem to some PS/2 reconnect code, introduced into 2.4 and 2.2 at about the same time. His fix was to remove the new code, after which the system worked fine again (although it would be impossible to hotplug PS/2 devices). He also mentioned that 'gpm' was being maintained again, at least as of June. Alan Cox replied that removing the code was not the right answer, and started asking debugging questions. No solution presented itself, aside from re-implementing reconnect-event determination (Vojtech Pavlik gave a link to Linux Input Drivers), and at one point Alan said, "If someone has infinite bandwidth to go digging in that CVS and cares to send me the relevant pieces let me know. Otherwise I'll worry about this after 2.2.17"

6. Feature Consideration

28 Jul 2000 - 4 Aug 2000 (25 posts) Archive Link: "[PATCH] Decrease hash table memory overhead"

Topics: Networking

People: Andi Kleen, Linus Torvalds

Andi Kleen posted a patch for 2.4, and reported:

Linux uses double linked list heads in the inode and dcache hash tables. That wastes a lot of memory, especially since neither inode nor dcache ever try to access the tail of the hash list. The following patch adds a new hlist_* implementation that works on double linked lists with a single pointer head. It adds a few jumps over the list_* rings, but IMHO the decreased cache line usage in the hash heads is more than worth it (you can do a lot of jumps in a single cache miss)

This saves about 96K memory on my 128MB machine, more on machines with bigger ram.
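
A minimal sketch of the kind of structure Andi describes, assuming nothing about his actual patch: the bucket head holds a single pointer, while each node keeps a forward pointer plus a pointer to whichever pointer currently points at it (the "pprev" idiom that comes up later in the thread). The names bucket_head, bucket_node, bucket_add_head() and bucket_del() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Node: forward link plus the address of the pointer that points at us. */
struct bucket_node {
    struct bucket_node *next;
    struct bucket_node **pprev;
};

/* Head: one pointer only, half the size of a two-pointer ring head. */
struct bucket_head {
    struct bucket_node *first;
};

/* Insert n at the front of the chain hanging off h. */
void bucket_add_head(struct bucket_node *n, struct bucket_head *h)
{
    assert(n != NULL && h != NULL);
    n->next = h->first;
    if (h->first != NULL)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

/* Unlink n without needing to know which head it lives on. */
void bucket_del(struct bucket_node *n)
{
    assert(n != NULL && n->pprev != NULL);
    *n->pprev = n->next;
    if (n->next != NULL)
        n->next->pprev = n->pprev;
    n->next = NULL;
    n->pprev = NULL;
}
```

The extra conditional branches on `next` are the "few jumps" Andi mentions; in exchange, a table of N buckets needs N pointers rather than 2N, and the tail of a chain is simply not reachable from the head.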

But Linus Torvalds replied, "I'd rather have just one list function than save a few kB of RAM. Avoid confusion, and make people so used to that one list-handling functionality that bugs don't crop up as easily." Andi said he was almost certain that the patch would also speed up the system, and offered to do a benchmark; and Linus replied, "Hey, feel free. That might motivate me if it is noticeable." Andi posted some good numbers, but Linus objected:

I'm not interested in made-up benchmarks that cannot be reproduced under real load.

Can you make it show up on a real filesystem even with a contrived user-mode benchmark?

(Btw, even if you do convince me, please don't use a name like "hlist". "hlist WHAT?" What's the "h" for? "hash"? Why? Basically, it sounds nonsensical).

Btw, from past experience I've found that it can be a lot more advantageous to just dynamically move the hash entries to the front of the list when accessed, rather than worry about how the list is set up. Hashes are bad on the caches by design, and whether the hash table takes up x or 2x of memory is pretty much immaterial for performance. But whether you find the entry on the first or the fifth try is noticeable.

I suspect you'd find more of a performance advantage from trying something like that instead..
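
The move-to-front idea Linus describes can be sketched on a single hash chain as follows. This is an illustration under simple assumptions (integer keys, a singly linked chain, no locking), not code from the kernel; struct chain_entry and chain_lookup_mtf() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One entry on a hash chain; a real entry would carry a payload too. */
struct chain_entry {
    int key;
    struct chain_entry *next;
};

/* Look up `key` on the chain starting at *head. On a hit that is not
 * already first, unlink the entry and reinsert it at the front, so a
 * repeated lookup of the same key succeeds on the first comparison. */
struct chain_entry *chain_lookup_mtf(struct chain_entry **head, int key)
{
    struct chain_entry **link = head;
    struct chain_entry *e;

    assert(head != NULL);
    while ((e = *link) != NULL) {
        if (e->key == key) {
            if (link != head) {
                *link = e->next;   /* unlink from current position */
                e->next = *head;   /* splice in before the old first */
                *head = e;
            }
            return e;
        }
        link = &e->next;
    }
    return NULL;
}
```

The trade-off is that every hit on a non-first entry writes to the chain, which in a concurrent setting means the lookup path needs the same protection as an insert.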

Andi explained, "Doing it completely from user space would probably add so many other variables and variances that the results would be hard to interpret," and Linus came back with:


The other way to say the same thing is

"Doing it from user space might show that it's not a performance optimization that can be noticed".


Andi gave up on the benchmark idea as being too much work, but added, "Anyways, hlists are already used all over the kernel (e.g. try grep pprev net/ipv4/*), just everybody is reinventing the wheel on them all the time. I did that myself several times. It would be nicer to use list_*() macros all the time, just without the bloat of the list_* list heads." To which Linus agreed:

Now THIS is a valid argument that I can find no holes in.

The argument of "inode.c could be speeded up/shrunk/xxxx" doesn't strike me as being a very good argument especially just before 2.4.x.

The argument that "lots of code already does this, except they aren't very clean about it and do it by hand", is an argument I can buy into.

You might consider just going about it a different way: pick the places that _already_ use this kind of list, and clean them up using a generic list package. I still don't like "hlists" as a name, because I still don't see the "hash" in them conceptually, but I would certainly consider any cleanup a good thing.

And once you come from that direction, it's going to be a lot easier convincing me to eventually potentially switch over some of the current lists.h users to a new implementation.

Andi replied that he'd look into this for 2.5, and posted a new patch containing only the pure bugfixes from his initial code.

7. Stopping Buffer-Overrun Attacks

28 Jul 2000 - 1 Aug 2000 (14 posts) Archive Link: "Stopping buffer-overflow security exploits using page protection"

Topics: Security

People: Bruce Perens, James Sutherland, Alan Cox, Oliver Xymoron, Lamont Granquist, Derek Martin

Bruce Perens gave a pointer to an article and asked, "Is there any good reason that we can not run Linux executables with the execute permission turned off, by default, on all stack and data pages? Wouldn't this stop buffer-overflow security exploits that try to inject executable code onto the stack or into function tables? i386 won't support it, but other architectures do." James Sutherland replied that this sounded like the "nonexecutable stack" idea that had been floating around for a while (see Issue #50, Section #6 (27 Dec 1999: Unexecutable Stack) and Issue #51, Section #1 (28 Dec 1999: Unexecutable Stack Saga Continues)). He added, "It doesn't stop anything - just changes the nature of the exploit needed (i.e. the skr1pt k1dd13s need to find v2 of their little skr1pt). The opinion round here seems to be that this isn't worth the hassle?" Alan Cox added:

As for the number of exploits this would stop, including this in the mainstream kernel would only be a stopgap measure. All it took to open the floodgates for stack smashing exploits was a single well-written article - Aleph One's "Smashing the Stack for Fun and Profit". Now writing an exploit once you find an overflow is a cookbook exercise. A 2nd edition of the cookbook would be all it would take to render the patch meaningless.

The problem isn't Intel's fault or any OS's, it's a problem in the C language and compiler. There are 5 fixes:

  1. write safe code (which has so far proved hard)
  2. compile with bounds-checking (big performance hit)
  3. compile with StackGuard, etc. (doesn't stop exploits that corrupt other locals)
  4. separate the return address stack from the automatic variable stack (ditto)
  5. use another language (performance)

Derek Martin said he didn't understand why folks were against closing up the security holes that they could, even if there were others they couldn't. Oliver Xymoron explained, "We have n exploitable buffer overruns. The non-exec patch will leave us with n exploitable buffer overruns next week and a false sense of security. Meanwhile the patch is disgustingly complex - it's like putting four deadbolts on your front door while leaving your back door open.."

And Lamont Granquist put in:

This should really be a FAQ.

The problem is that you don't reduce any potential vulnerabilities at all. For every buffer overflow exploit out there you can modify it and produce a version which will work against a non-exec stack page on an x86. It is not hard. I was actually considering producing a "Smashing the Stack for Fun and Profit, Part II: Non-Exec Stacks" text to show just how easy it really is. For now, I suggest you check out the VULN-DEV archives -- a few very helpful people on that list walked me through how to produce non-exec stack exploits.

If a non-exec stack ever got accepted into the kernel, then exploit writers would simply start coding for non-exec stacks. The end result is that you would gain precisely nothing. And what you would lose is that you would have broken the x86 API -- for nothing. So, yes, there is a drawback, and no you don't reduce any vulnerabilities. Linus has already rejected such patches for this reason. Check out the Libsafe documentation for a little bit of background and references.

8. mount() History And Proposal

30 Jul 2000 - 7 Aug 2000 (13 posts) Archive Link: "[RFC][Long][Horror story] Mount flags"

Topics: FS: NFS, POSIX

People: Alexander Viro, Hans Reiser, H. Peter Anvin, Matthias Andree, Andries Brouwer

Alexander Viro gave an amazing history of mount(), and proposed (quoted in full):

Sorry for the length of that, but I really felt that the whole story was needed to appreciate the situation. In short, I think that there is a need of new variant of mount(2). Yep, new syscall number. See below for the reasons. Here it comes:

Mount Flags, or
A Story of Interface Rot.

Once upon a time life was simple, interfaces pleasant and look at the mount(2) didn't raise a suspicion that Frankenstein's monster got what he wanted. Back then mount(2) had 3 arguments - directory, device and rw flag (unused, by the way). Alas, it didn't last. In March '92 mount(2) got a new argument - fs type. So far, so good, but the story didn't end on that - somewhere in July '92 msdosfs went in and brought mount(8) options that were obviously fs-specific. And that brought a new argument - void *data. sys_newmount(9), you are saying? You wish... That's what had actually happened:

Flags is a 16-bit value that allows up to 16 non-fs dependent flags to be given to the mount() call (ie: read-only, no-dev, no-suid etc).

data is a (void *) that can point to any structure up to 4095 bytes, which can contain arbitrary fs-dependent information (or be NULL).

NOTE! As old versions of mount() didn't use this setup, the flags has to have a special 16-bit magic number in the high word: 0xC0ED. If this magic word isn't present, the flags and data info isn't used, as the syscall assumes we are talking to an older version that didn't understand them.


do_mount() does the actual mounting after sys_mount has done the ugly parameter parsing. When enough time has gone by, and everything uses the new mount() parameters, sys_mount() can then be cleaned up.

Needless to say, this interface is still with us. Nevermind that current kernel will simply refuse to exec() a binary from '92, the kludge is still there. First bunch of flags was nice and sweet: ro, nodev, nosuid, noexec and sync. Bits 0--4, indeed. But in January '93 we've got remount and it had been implemented as a new flag. It wasn't a flag, indeed, but hey, why not encode the action into the same argument and avoid API changes? So there it went and got the bit #5. And so it stayed for a while. Flags (real flags, that is, not remount one) were mirrored into ->i_flags of every inode and everyone was happy.

In August '94 ->i_flags got two new bits - S_APPEND and S_IMMUTABLE. They could not be passed by mount(2), indeed. They got bits #8 and #9, apparently to make them visibly separate from the rest. Well, putting them at #16 and above might be wiser, but hey, who will ever need more than 8 (OK, 7) mount flags?

In the late '95 we got an implementation of quota. And ->i_flags got a new bit - S_QUOTA, #7. Originally it got an inventive name S_WRITE, but that insanity had been fixed in '98.

Fast-forward to October '96. POSIX mandatory braindam^Wlocking goes in. Since nobody wants the overhead hitting all filesystems we are getting a new mount flag. This time - real. OK, #6 is still free, so there it goes.

November '96, and we have one more flag - noatime. Oops, looks like we had made a bad choice when append-only and immutable went in. Oh, well, who actually cares? #10 it is.

Originally remount could change only read-only bit. Well, mandatory locking and noatime also became changeable, so in September '97 somebody asked himself why the rest didn't? At that point MS_RMT_MASK (flags that can be changed by remount) started to look somewhat ugly. It got worse three months later, when nodiratime went in (bit #11).

In October '98 RMK noticed that remount doesn't update ->i_flags, so macros got uglier - now we were checking both for ->i_flags and ->s_flags.

In April '99 unrelated events (rename() cleanup) had led to Yet Another Mount Flags Ugliness(tm). This time the guilty party is known - it's me (AV). I needed a way to tell rename() that some filesystems need special treatment (silly-rename ones). Instead of putting that into ->s_type->fs_flags (after all, that's a property of filesystem type) I've added a new bit to ->s_flags (#15 - at that point we were visibly low on space; why not #16? Hell knows, I plead temporary braindamage inflicted by contact with NFS).

A year later one more bit got there, this time in ->i_flags - S_DEAD. That time I had finally seen the light (OK, actually I had seen the dire lack of space, but let's pretend that I was clever) and it went into #16.

About the same time we've got Plan9-ish bindings. I made some noises about a new syscall, but they were not too convincing. For several reasons: first of all, the name (bind) had been already taken and bind9(2) was a half-hearted proposal at best. Moreover, I wanted to debug it fast and didn't want to change mount(8) source. So the quick kludge^Whack went in - -t bind. In other words, passing the thing through the "type" argument.

The same batch of changes introduced unlimited stacking. Which looked fine at first, but brought a lot of complaints, arguments and finally such an example of misuse that drove the point through. It was an obvious exploit, letting any user who can mount something (floppy, CD, whatever) to drive the system into OOM. Worse yet, cleaning up after that was damn hard, and I don't mean washing the LART. That was it - we need more flags, since the ability to overmount must be root-only. And checks should be in mount(8), since it's suid-root and from the kernel POV all calls of mount(2) are done by root. On the other hand, the actual test for presence of another filesystem at the mountpoint must be left to mount(2) to avoid races.

OK, but we also want to be able to support union-mounts at some future point. That means two more flags (head/tail of the union). We also want to get rid of the -t bind kludge, so that's one more bit going our way. However, currently we have only 3 unused bits - #12, #13 and #14. We can get more if we relocate S_QUOTA, S_APPEND, S_IMMUTABLE and MS_ODD_RENAME, though... OK, assume that we've done that, what do we have?

#0 to #4, #6, #10 and #11 are used for real flags. Fine. #5 and some of the rest are used for "action" flags. So we can fit into 16 bits, but it's getting really, really crowded here. We can get a bit more if we notice that MS_RENAME, MS_AFTER, MS_BEFORE and MS_OVER are mutually exclusive, but that gets really ugly - we could fit into 3 bits instead of 4, but they would be spread over not-contiguous area. And we can't do anything about that without breaking every existing binary of mount(8). Moreover, we can't do anything about the 0xc0ed kludge - all kernels since '92 are going to send us to hell if we change that. Yes, Virginia, removal of that check had been overdue for some 7 years, but there is no helping to that.

_Or_ we can do what needed to be done back in '92 and '94 and introduce sys_newmount(action, mountpoint, type, flags, device, data). Why "action" separate from "flags"? Well, see the story above. Mixing the bitmap and number into one integer _never_ pays. And inside the kernel we will have to start with separating them anyway. I could buy an argument about the register pressure, but damnit, it's mount(2) we are talking about. If it's a hotspot of your program I want to know what the hell are you trying to do.
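
To make the complaint concrete, here is a small sketch of what any consumer of the legacy argument has to do before it can act on it: check the 0xC0ED magic in the high word, then pull the "remount" action (bit #5, per the story above) apart from the genuine flag bits. The type and function names (mount_request, split_mount_word() and the MOUNT_* macros) are hypothetical and chosen for this example; this is not the kernel's code, only an illustration of why a separate "action" argument is attractive.

```c
#include <assert.h>
#include <stdint.h>

#define MOUNT_MAGIC_VAL   UINT32_C(0xC0ED0000)  /* magic in the high word */
#define MOUNT_MAGIC_MASK  UINT32_C(0xFFFF0000)
#define MOUNT_BIT_REMOUNT (UINT32_C(1) << 5)    /* an action, not a flag */

enum mount_action { ACTION_MOUNT, ACTION_REMOUNT };

struct mount_request {
    enum mount_action action;  /* what to do */
    uint32_t flags;            /* real flags only (ro, noexec, ...) */
    int uses_data;             /* 1 if flags/data should be honoured */
};

/* Split the single legacy word into an action and a flag bitmap. */
struct mount_request split_mount_word(uint32_t raw)
{
    struct mount_request req = { ACTION_MOUNT, 0, 0 };

    /* No magic: treat as an old-style caller, ignore flags and data. */
    if ((raw & MOUNT_MAGIC_MASK) != MOUNT_MAGIC_VAL)
        return req;

    raw &= ~MOUNT_MAGIC_MASK;
    if (raw & MOUNT_BIT_REMOUNT) {
        req.action = ACTION_REMOUNT;
        raw &= ~MOUNT_BIT_REMOUNT;
    }
    req.flags = raw;
    req.uses_data = 1;

    /* After splitting, no action bit may remain among the flags. */
    assert((req.flags & MOUNT_BIT_REMOUNT) == 0);
    return req;
}
```

With a separate action parameter, as in the proposed sys_newmount(), the splitting step disappears and all of the flag word is available for real flags.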

Hans Reiser replied, "Changing mount for the reasons you cite sounds reasonable as a general proposition, I'll let others comment on whether you picked the best possible parameters definition for mount()." H. Peter Anvin also said, "It seems to me that it would make more sense to introduce Viro's proposed mount6() system call if we're going to introduce a new API."

Andries Brouwer (mount() maintainer) was not in favor of the proposal, and suggested that all the problems could be solved without major changes. He gave specific technical examples, but Matthias Andree objected, "Enhancing the mess is no good. It's no good joining things together that the kernel needs to separate again later either. I vote in favor of Al's approach." Andries was not convinced, and the thread petered out.

9. Some Discussion Of gcc/Kernel interactions

31 Jul 2000 - 4 Aug 2000 (78 posts) Archive Link: "2GIG-file"

Topics: Version Control

People: Victor Khimenko, Linus Torvalds

In the course of discussion, Victor Khimenko mentioned, "If RELEASED gcc miscompiles kernel it's kernel problem (BTW I've been using gcc 2.95 compiled 2.2.x kernels for last year without problems). If UNSTABLE gcc miscompiles kernel then it's not even kernel issue ..." To the 'released gcc' proposition, Linus Torvalds replied, "Not always. There have been gcc releases that are buggy too. Sometimes the kernel ends up having work-arounds. Sometimes the end result is to tell people not to use them." And to the 'unstable gcc' proposition, he went on:

Not necessarily true either. Quite often new compilers just do optimizations that were always legal but just didn't trigger, and nobody noticed some bug in the kernel. So even a new snapshot of gcc may be fine, and miscompile the kernel even so. I'll try to fix the kernel asap, of course (sometimes that fix is to simply disable an optimization that isn't appropriate for the kernel - this was the case with the strict alias analysis code, for example).

It _sounds_ like gcc-2.96 is just not quite stable. Somebody claimed that the new 2.96-based one in 7.0beta was ok again. I certainly know of people using the latest CVS snapshots to compile the kernel, and it can often be a case of "it works for them" and then end up that some other configuration of Linux might show problems.

It's not a clear-cut problem. There have certainly been bugs in both gcc and the kernel, in all combinations of "stable vs experimental".

For more on 'gcc' issues, see Issue #77, Section #9 (14 Jul 2000: 'gcc-2.91.66' Recommended For Kernel Compilation).

10. Twisted VM Tweaking

31 Jul 2000 - 1 Aug 2000 (7 posts) Archive Link: "kupdate, high CPU usage"

Topics: Virtual Memory

People: Rik van Riel, Andrea Arcangeli

At one point Rik van Riel said, "Disabling kupdate or kflushd is dangerous to your data and should never ever be done." Andrea Arcangeli replied:

Disabling kflushd can impact the stability of the system.

Disabling kupdate is useful on the airplane to save battery power 8). (do it at your own risk of course)

And Rik said:

You don't want to disable it.

Just set it to one-hour wakeup intervals so it'll flush every piece of data once per hour ;)

The friendliest exchange Rik and Andrea have shared in a while...

11. Status Of Crypto Patches

1 Aug 2000 - 5 Aug 2000 (66 posts) Archive Link: "Crypto"

People: Sandy Harris, H. Peter Anvin

Cindy Cohn asked if cryptography would be included in the mainstream sources, and at one point Sandy Harris said, "As I recall, someone from said a few months back that their lawyers were looking at this. Anyone know the results?" H. Peter Anvin replied:

Indeed we do :) The current policy on now is that cryptographic software is OK as long as it's Open Source and the source is available on itself.

The "no government end user" restriction -- which we were originally very concerned about -- turns out not to apply for Open Source software. Also, the BXA seems to have recently expressed "intent to clarify" ( that the Open Source exception applies as well to "object code compiled from source code that is considered publically available".

12. Ancient ext2 Race Uncovered

5 Aug 2000 (6 posts) Archive Link: "BUG in ext2"

Topics: FS: ext2

People: Andreas Dilger, Alexander Viro, Andrew Morton

Andrew Morton was getting repeatable assertion failures on 2.4.0-test6-pre2 in the ext2 block allocation code, and Andreas Dilger replied:

Unfortunately, the whole ext2 block allocation code was re-written recently by Al Viro for test6-pre1, and it looks like it has bugs (see also thread <test6-pre2 loop in ext2_get_block>)... I understand that there may have been some locking problems with the old code because of the VFS re-design, but it seems like a bad move IMHO to change such an important piece of code in a drastic way right now.

I think the majority of the change was a FEATURE to have zero-locking block allocation and while this itself is a good thing, RIGHT NOW is not the time to do it. My online resize patch, which is by far less intrusive since it is only called at mount time and resize time and is mostly just moving existing code into a subroutine, was rejected (rightfully so) because ext2 is too important to break at this late date.

If it were up to me, I'd back out this patch and fix only the minimum required areas.

To the idea that the changes were to implement a feature and not to fix a bug, Alexander Viro replied, "No, it was not. The reason of the change was to close several bad (read: fs-corrupting) races in ext2. Locking didn't change, BTW - it's still under BKL. List of the crap that required that fixing will be posted as soon as fix will be in 2.2 - I'm not too happy about posting "here's how user nobody can chew the fs, fsck quotas, eat reserved blocks and panic your box" recipes. If you want it right now - ask and I'll send it off-list. BTW, if you volunteer to help with minix/sysv/UFS - be my guest, they require the same bunch of fixes."

Andrew also replied to Andreas, saying that he suspected the problem was not as bad as Andreas feared. He posted some data, and Alexander pegged it as a slow block leak, adding, "that leak had been there since long (I'm afraid that "long" may be something about '93). Oh, well, one more race that needs fixing..."

13. VM Hangs On For 2.5

5 Aug 2000 - 6 Aug 2000 (9 posts) Archive Link: "[PATCH] lock troubles in pre6-2"

Topics: Virtual Memory

People: Paul Rusty Russell, Rik van Riel, David S. Miller, Rusty Russell

Paul Rusty Russell posted a patch, and David S. Miller pointed out a problem. Rusty replied:

Errr... Yeah, I guess I'll take your word for it, because I can't follow that code at all 8(. I see that try_to_swap_out() does an unlock without a lock anywhere in sight, but I can't see the path between this and swap_out_mm().

Please: am I too stupid to understand this, or is the code a convoluted mess?

Must be this warm English beer.

Rik van Riel explained:

It's not the English beer that's bothering you (well, maybe it is, but it's not the cause of this particular itch).

I'm ashamed to admit that the VM code still is a horrible mess, but code readability will be a major goal in the new VM implementation.

For more on the VM situation, see Issue #77, Section #1 (25 Jun 2000: New Plans For the Virtual Memory Subsystem).

14. Building XFS; Some Experiences With Other FSes

5 Aug 2000 - 6 Aug 2000 (9 posts) Archive Link: "how to actually build SGI's xfs?"

Topics: FS: JFS, FS: ReiserFS, FS: XFS

People: James Lewis Nance, Andi Kleen, Keith Owens

Jeremy Hansen was anxious to try out XFS, but couldn't find any docs on how to actually build the filesystem. There didn't seem to be any XFS-related mailing lists on the SGI site, so he posted to linux-kernel. Keith Owens pointed him to a general info page and a mailing list page, but there was no reply. James Lewis Nance was also interested in getting XFS to work, and mentioned peripherally:

I spent last week playing with different file systems. Reiserfs and jfs patch and compile w/o too much trouble. Xfs and NWFS seem to be missing good instructions on exactly how to patch them into the kernel, so I did not play with them.

BTW, I was quite impressed with JFS. It's definitely not ready for production yet, but it seems slightly faster than reiserfs (for my single benchmark, which is to build mozilla on the fs) and it's only 2/3 of the size of reiserfs.

He gave a link to IBM's JFS page, but Andi Kleen replied, "The Linux implementation of JFS is not journaled yet, so it is very likely to be faster than reiserfs for meta data intensive operations (like creating lots of small files in a compile) because it doesn't do any journal IO."

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.