Table Of Contents
|1.||3 Jan 2001 - 9 Jan 2001||(58 posts)||Impact Of Sudden Power Loss On Journalled Filesystems|
|2.||4 Jan 2001 - 10 Jan 2001||(17 posts)||Maximum CPUs And RAM Under 2.4 Kernels|
|3.||4 Jan 2001 - 8 Jan 2001||(3 posts)||ext3fs 0.0.5d And reiserfs 3.5.2x Mutually Exclusive|
|4.||4 Jan 2001 - 9 Jan 2001||(30 posts)||Driver Submission Policy For 2.2|
|5.||4 Jan 2001 - 8 Jan 2001||(15 posts)||Modutils 2.4.0 Available|
|6.||5 Jan 2001 - 8 Jan 2001||(14 posts)||MM/VM Todo List|
|7.||5 Jan 2001 - 8 Jan 2001||(13 posts)||Why Use Modules?|
|8.||5 Jan 2001 - 11 Jan 2001||(43 posts)||Bug Report Generation Tool|
|9.||6 Jan 2001 - 10 Jan 2001||(7 posts)||Patch Submission Policy For 2.4|
|10.||8 Jan 2001 - 10 Jan 2001||(19 posts)||Bug In 2.4.0 Virtual Memory Subsystem|
|11.||8 Jan 2001||(4 posts)||Superfluous Whitespace In The Kernel Sources|
|12.||9 Jan 2001 - 10 Jan 2001||(7 posts)||2.0.39 Announced|
|13.||10 Jan 2001||(4 posts)||2.4.0 On The IA64|
|14.||10 Jan 2001 - 11 Jan 2001||(2 posts)||Statistical Kernel Profiler Available|
|15.||10 Jan 2001||(5 posts)||LVM Fixes Slow To Get Into The Official Kernel|
|16.||11 Jan 2001 - 13 Jan 2001||(12 posts)||Comparing Khttpd, Boa, And Tux|
|17.||12 Jan 2001 - 14 Jan 2001||(15 posts)||Unexplained 2.4.0 Filesystem Corruption|
|18.||13 Jan 2001||(2 posts)||PowerPC In The Official Tree|
Kernel Traffic will be trying out a Friday schedule, as of last week. Otherwise it's just too tempting to work through the weekend... ;-)
Mailing List Stats For This Week
We looked at 1640 posts in 7373K.
There were 516 different contributors. 261 posted more than once. 176 posted last week too.
The top posters of the week were:
Impact Of Sudden Power Loss On Journalled Filesystems
3 Jan 2001 - 9 Jan 2001 (58 posts) Archive Link: "Journaling: Surviving or allowing unclean shutdown?"
Topics: FS: ReiserFS, FS: ext2, FS: ext3, Web Servers
People: Michael Rothwell, Daniel Phillips, Alex Belits, Stephen C. Tweedie, Andreas Dilger, Alan Cox, Stefan Traby, David Woodhouse, Marc Lehmann, David Lang
Dr. David Gilbert was unsure whether journalling filesystems were intended to merely survive the occasional improper shutdown, or if users should feel comfortable just powering them down as part of normal operation. Michael Rothwell pointed out that journalling filesystems only guaranteed the consistency of data that had been written prior to shutdown, and that any buffers left unflushed at power-off would be lost, and any applications not properly exited could also do bad things. "Journaling mostly means not having to run FSCK," he said. Daniel Phillips replied to David at greater length:
Welllllll... crashes tend to produce different effects from sudden power interruptions. In the first case parts of the system keep running, and bizarre results are possible. An even bigger difference is the matter of intent.
Tux2 (http://innominate.org/~phillips/tux2/) is explicitly designed to legitimize pulling the plug as a valid way of shutting down. Metadata-only journalling filesystems are not designed to be used this way, and even with full-data journalling you should bear in mind that your on-disk filesystem image remains in an invalid state until the journal recovery program has run successfully. You would not want to upgrade your OS with your filesystem in this state, nor would you want to remove a disk drive that didn't have the journal file on it.
Being able to shut down by hitting the power switch is a little luxury for which I've been willing to invest more than a year of my life to attain. Clueless newbies don't know why it should be any other way, and it's essential for embedded devices.
I don't doubt that if the 'power switch' method of shutdown becomes popular we will discover some applications that have windows where they can be hurt by sudden shutdown, even with full filesystem data state being preserved. Such applications are arguably broken because they will behave badly in the event of accidental shutdown anyway, and we should fix them. Well-designed applications are explicitly 'serially reusable', in other words, you can interrupt at any point and start again from the beginning with valid and expected results.
Alex Belits strongly disagreed that applications should be considered broken if they mis-handled sudden shutdowns. He said:
All valid ways to shut down the system involve sending SIGTERM to running applications -- only broken ones would live long enough after that to be killed by subsequent SIGKILL.
A lot of applications always rely on their file i/o being done in some manner that has atomic (from the application's point of view) operations other than system calls -- heck, even make(1) does that.
Daniel replied that the 'make' program in Alex's example was a perfect case of a broken application. Alex disagreed, and the subthread petered out.
Elsewhere, Stephen C. Tweedie said in response to Daniel's statement that the on-disk filesystem image of journalling filesystems would remain inconsistent until the journal recovery program had run, "ext3 does the recovery automatically during mount(8), so user space will never see an unrecovered filesystem. (There are filesystem flags set by the journal code to make sure that an unrecovered filesystem never gets mounted by a kernel which doesn't know how to do the appropriate recovery.)" Daniel replied, "Yes, and so long as your journal is not on another partition/disk things will eventually be set right. The combination of a partially updated filesystem and its journal is in some sense a complete, consistent filesystem." But he asked, "I'm curious - how does ext3 handle the possibility of a crash during journal recovery?" Andreas Dilger explained, "Unless Stephen says otherwise, my understanding is that a crash during journal recovery will just mean the journal is replayed again at the next recovery. Because the ext3 journal is just a series of data blocks to be copied into the filesystem (rather than "actions" to be done), it doesn't matter how many times it is done. The recovery flags are not reset until after the journal replay is completed." Alan Cox replied tersely, "Which means an ext3 volume cannot be recovered on a hard disk error." And Stephen replied:
Depends on the error. If the disk has gone hard-readonly, then we need to recover in core, and that's something which is not yet implemented but is a known todo item. Otherwise, it's not worse than an error on ext2: you don't have a guaranteed safe state to return to so you fall back to recovering what you can from the journal and then running an e2fsck pass. e2fsck groks the journal already.
And yes, a badly faulty drive can mess this up, but it can mess it for ext2 just as badly.
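Andreas Dilger's point above, that a journal made of whole data blocks can be replayed any number of times, can be illustrated with a toy model. This is only a sketch and not ext3 code: the dictionary standing in for a disk, the replay and recover functions, and the crash_after parameter are all hypothetical names invented for the illustration.

```python
# Toy model: the "disk" is a dict of block number -> contents, and the
# journal is an ordered list of (block number, new contents) pairs.
# Every name here is hypothetical; none of this is taken from ext3.

def replay(disk, journal, crash_after=None):
    """Copy journaled blocks onto the disk.

    If crash_after is given, stop after that many blocks to mimic a
    crash partway through recovery. Returns True only on completion.
    """
    for count, (block_no, data) in enumerate(journal):
        if crash_after is not None and count >= crash_after:
            return False  # "crashed": the caller must keep the flag set
        disk[block_no] = data
    return True


def recover(disk, journal, needs_recovery=True, crash_after=None):
    """Run recovery if flagged; return the new needs-recovery flag."""
    if not needs_recovery:
        return False
    finished = replay(disk, journal, crash_after)
    # The flag is cleared only after the whole journal has been replayed.
    return not finished


journal = [(3, b"inode table"), (7, b"bitmap"), (3, b"inode table v2")]

# Reference run: recovery completes in one pass.
clean = {3: b"old", 7: b"old", 9: b"data"}
recover(clean, journal)

# Interrupted run: crash after two blocks, then replay from the start.
crashed = {3: b"old", 7: b"old", 9: b"data"}
flag = recover(crashed, journal, crash_after=2)
flag = recover(crashed, journal, needs_recovery=flag)

print(crashed == clean, flag)
```

Because each journal entry overwrites a block with fixed contents rather than applying a change relative to the current state, the interrupted run ends in the same state as the uninterrupted one.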
Close by in the subthread, Stefan Traby asked, "I did not follow the ext3 development recently, how did you solve the "read-only mount(2) (optionally on write protected media)" issue ? Does the mount fail, or does the code virtually replays (without writing) only ?" Stephen explained:
The code currently checks if the underlying media is write-protected. If it is, it fails the mount; if not, it does the replay (so that mounting a root fs readonly works correctly).
I will be adding support for virtual replay for root filesystems to act as a last-chance way of recovering if you really cannot write to the root, but journaling filesystems really do expect to be able to write to the media so I am not sure whether it makes sense to support this on non-root filesystems too.
Stefan Traby had also added, "an unconditional hidden replay even if "ro" is specified is not nice. This is especially critical on root filesystem, because there is IMHO no way to specify mount arguments to the "/" mount, except ro/rw." Stephen asked, "In what way? A root fs readonly mount is usually designed to prevent the filesystem from being stomped on during the initial boot so that fsck can run without the filesystem being volatile. That's the only reason for the readonly mount: to allow recovery before we enable writes. With ext3, that recovery is done in the kernel, so doing that recovery during mount makes perfect sense even if the user is mounting root readonly." But David Woodhouse pointed out that there were other reasons for mounting a root filesystem readonly. The disk could be so damaged, he explained, that writing anything at all to it would be horribly bad; in which case one would mount it readonly, recover as much as possible from it, and throw it in the parts bin. He added, "You _don't_ want the fs code to ignore your explicit instructions not to write to the medium, and to destroy whatever data were left." But Marc Lehmann dissented:
The problem is: where did you give the explicit instruction? Just that you define "read-only" as "the medium should not be written" does not mean everybody else thinks the same.
actually, I regard "ro" mainly as a "hey kernel, I won't handle writes now, so please don't try it", like for cd-roms or other non-writeable media, and please filesystem stay in a clean state.
That ro means "the medium is never written" is an assumption that does not hold for most disks anyway and is, in the case of journaling filesystems, often impossible to implement. You simply can't salvage data without a log replay. Sure, you can do virtual log replays, but for example the reiserfs log is currently 32mb. Pinning down that much memory for a virtual log replay is not possible on low-memory machines.
So the first thing would be to precisely define the meaning of the "ro" flag. Before this has happened it is absolutely senseless to argue about what it means, as it doesn't mean anything at the moment, except (man mount):
ro Mount the file system read-only.
Which it does even with journaling filesystems...
Elsewhere, back on the subject of how to handle sudden shutdowns, and whether simply pulling the plug could be considered a legitimate way to end a typical single-user session, David Lang blurted, "for crying out loud, even windows tells the users they need to shutdown first and gripes at them if they pull the plug. what users are you trying to protect, ones too clueless to even run windows?" David W. replied, "Precisely. Users of embedded devices don't expect to have to treat them like computers." David L. listed in response:
in an embedded device you can
- setup the power switch so it doesn't actually turn things off (it issues the shutdown command instead)
- run from read-only media almost exclusively so that power events don't bother you much
- you can add extra power inside the device so that if someone does pull the plug, you have a few seconds of power to do the clean shutdown
- you can run out of ram and force the user to do an extra step to save any changes to non-volatile storage (and if they power off during the save the results are undefined)
I have seen all of these approaches used in different devices (that are not running linux). This is not a new problem and the people working in this space have a bunch of answers.
an improved filesystem that tolerates bad shutdowns reasonably well will be welcomed for other reasons, but should not be viewed as a fix for people pulling the plug on you.
Alan said that David L.'s items #1 and #3 were too expensive, item #2 depended on the device, and item #4 was "Frowned upon because you keep getting dead units back". He concluded, "If it doesnt fix the pulling the plug case (at least as far as 'after fsync returned this data is safe') then its not working."
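Alan's criterion ("after fsync returned this data is safe") and Daniel's idea of applications that can be interrupted at any point both correspond to a conventional pattern on the application side: write a temporary file, fsync it, then rename it over the old one. The following is a minimal sketch of that pattern, not something proposed in the thread. The function name durable_write is hypothetical, and whether the data truly survives power loss still depends on the filesystem and the drive.

```python
import os
import tempfile


def durable_write(path, data):
    """Replace the file at path with data so that an interruption leaves
    either the complete old contents or the complete new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so that the final
    # rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to push it to the disk
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory as well, so the rename itself is recorded.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A program that updates its state only through a function like this can be restarted from the beginning after a sudden power-off without ever reading a half-written file.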
Maximum CPUs And RAM Under 2.4 Kernels
4 Jan 2001 - 10 Jan 2001 (17 posts) Archive Link: "Confirmation request about new 2.4.x. kernel limits"
Topics: Big Memory Support, SMP
People: Anton Blanchard, Tigran Aivazian, Pavel Machek
Someone asked about various limits for the 2.4 kernels. They thought SMP systems running 2.4 had a 32-cpu limit; and Anton Blanchard replied, "Max CPUs is 64 on 64 bit architectures (well you have to change NR_CPUS). I am told larger than 32 cpu ultrasparcs have booted linux already."
The original poster also thought there was a 64 Gigabyte maximum RAM size, and asked if there were any slowdowns when accessing RAM over 4G on 32-bit machines. Tigran Aivazian replied, "realistic benchmarks (unixbench) will show about 3%-6% performance degradation with use of PAE. Note that this is not "accessing RAM over 4G" but (what you probably meant) "accessing any RAM in a machine with over 4G of RAM" or even "accessing any RAM in a machine with less than 4G of RAM but running kernel capable of accessing >4G". If you really meant "accessing RAM over 4G" then you are probably talking about 36bit MTRR support which is present in recent 2.4.x kernels and works very nicely!" Pavel Machek added elsewhere, "I believe you can get few terabytes with ultrasparc."
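The 4G and 64 Gigabyte figures above follow directly from address widths. As a quick check, assuming 32 address bits without PAE and 36 physical address bits with PAE (the thread only mentions "36bit" in connection with MTRR support, so the PAE width is an assumption here), and using a hypothetical helper name:

```python
GIB = 2 ** 30  # bytes in one gigabyte (binary)


def addressable_gib(address_bits):
    """Return how many gigabytes can be addressed with the given width."""
    return 2 ** address_bits // GIB


print(addressable_gib(32))  # 4
print(addressable_gib(36))  # 64
```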
ext3fs 0.0.5d And reiserfs 3.5.2x Mutually Exclusive
4 Jan 2001 - 8 Jan 2001 (3 posts) Archive Link: "ext3fs 0.0.5d and reiserfs 3.5.2x mutually exclusive"
Topics: FS: ReiserFS, FS: ext3
People: Stephen C. Tweedie, Chris Mason, Matthias Andree
Matthias Andree noticed that trying to patch ext3fs 0.0.5d onto a 2.2.18 kernel that already had reiserfs 3.5.28 would fail, because of overlapping patches in fs/buffer.c; he added that he'd reported this incompatibility some time before. Chris Mason, one of the reiserfs developers, said he'd start work on fixing it; and elsewhere, Stephen C. Tweedie, the ext3 author, said, "removing the extra debugging stuff and buffer.c code from the ext3 patches is on the todo list but is much lower priority than finishing off the tuning and user-space code for ext3-1.0."
Driver Submission Policy For 2.2
4 Jan 2001 - 9 Jan 2001 (30 posts) Archive Link: "Change of policy for future 2.2 driver submissions"
People: Alan Cox, Mark Hahn, Daniel Phillips, Wayne Brown, Tim Riker, Rik van Riel, Michael D. Crawford, Linus Torvalds, Nicholas Knight
Alan Cox announced:
Linux 2.4 is now out, it is also what people should be concentrating on first when issuing production drivers and driver updates. Effective from this point 2.2 driver submissions or major driver updates will only be accepted if the same code is also available for 2.4.
Someone has to do the merging otherwise, and it isnt going to be me...
There were mixed reactions to this. Nicholas Knight felt this policy was a mistake. Until the 2.4 series had stabilized, he felt, 2.2 would continue to be the kernel of choice for many people, in which case Alan's policy might result in less work being done on that kernel, and thus, fewer new features in 2.2; he suggested waiting until 2.4 had reached a state where users could upgrade safely. There were several replies. Mark Hahn said:
egads! how can there be "development" on a *stable* kernel line?
maybe this is the time to reconsider terminology/policy: does "stable" mean "bugfixes only"? or does it mean "development kernel for conservatives"?
Daniel Phillips replied:
It means development kernel for those who don't have enough time to debug the main kernel as well as their own project. The stable branch tends to be *far* better documented than the bleeding edge branch. Try to find documentation on the all-important page cache, for example. It makes a whole lot of sense to develop in the stable branch, especially for new kernel developers, providing, of course, that the stable branch has the basic capabilities you need for your project.
Alan isn't telling anybody which branch to develop in - he's telling people what they have to do if they want their code in his tree. This means that when you develop in the stable branch you've got an extra step to do at the end of your project: port to the unstable branch. This only has to be done once and your code *will* get cleaned up a lot in the process. (It's amazing how the prospect of merging 500 lines of rejected patch tends to concentrate the mind.) I'd even suggest another step after that: port your unstable version back to the stable branch, and both versions will be cleaned up.
Wayne Brown objected, "In other words, there's no longer any such thing as a "stable" branch. The whole point of having separate production and development branches was to have one in which each succeeding patch could be counted upon to be more reliable than the last. If new development is going into the "stable" kernels, then there's no way to be certain that the latest patches don't have more bugs than the earlier ones, at least not without thoroughly testing them. And if testing is necessary, then we might as well just use the development kernels for everything, because we have to test them anyway." Alan replied, "By your personal definition of stable 2.0.3x is the current stable kernel."
The subthread trailed off at that point, but elsewhere, Tim Riker also replied to Nicholas' criticism of Alan's initial announcement. He said:
here are some comments in Alan's favor:
He did not say people can not release 2.2 patches without 2.4 patches. He only said they will not be integrated into the kernel distribution without 2.4 patches.
If people continue to develop for 2.2 and have someone else, who is probably less familiar with the hardware, port to 2.4 for them, how soon would you trust the drivers over the 2.2 drivers?
In short, I agree with Alan completely. This is the correct move forward to cause 2.4 to become the stable release that everyone will be willing to adopt.
Rik van Riel also replied to Nicholas, regarding the suggestion that it was a mistake not to wait until 2.4 had stabilized before instituting Alan's new policy. Rik said:
This is *exactly* why Alan's policy change makes sense.
If somebody submits a driver bugfix or update for 2.2, but not for 2.4, it'll take FOREVER for 2.4 to become as "trustable" as 2.2...
This change, however, will make sure that 2.4 will be as reliable as 2.2 much faster. Unlike 2.2, the core kernel of 2.4 is reliable ... only the peripheral stuff like drivers may be out of date or missing.
Elsewhere, Michael D. Crawford suggested that Linus Torvalds had arbitrarily decided to release 2.4.0 just to increase the number of people testing it. He said, "I understand Linus' desire to have more widespread testing done on the kernel, and certainly he can accomplish that by labeling some random build as the new stable version. But I think a better choice would have been to advocate testing more widely - don't just announce it to the linux-kernel list, get on National Public Radio, the Linux Journal and Slashdot and stuff." Linus replied:
You don't understand people, I think.
No amount of publicity will matter all that much in the end: yes, it will result in many people who are not afraid of a compiler to try it out. And we've had that for over six months now, realistically.
But that's very different from having somebody like RedHat, SuSE or Debian make such a kernel part of their standard package. No, I don't expect that they'll switch over completely immediately: that would show a lack of good judgement. The prudent approach has always been to have both a 2.2.19 and a 2.4.0 kernel on there, and ask the user if he wants to test the new kernel first.
That way you get a completely different kind of user that tests it.
The other thing is that even if something like 2.4.0-test8 gets rave reviews, that doesn't _matter_ to people who crave stability. The fact is that 2.4.0 has been getting quite a lot of testing: people haven't even seen how the big vendors have all done testing in their labs etc.
And to the people who really want to have stability, none of that matters. They will basically "start fresh" at the 2.4.0 release, and give it a few months just to follow the kernel list etc to see what the problems will be. They'll have people starting to ramp up 2.4.0 kernels in their own internal test environment, moving it first to machines they feel more comfortable with etc etc.
None of which would happen if you just try to make the beta testing cycle much bigger.
Which is why to _me_ the most important thing is that I'm happy with the core infrastructure - because once you've tested it to a certain degree, it's not going to improve without a real public release.
Modutils 2.4.0 Available
4 Jan 2001 - 8 Jan 2001 (15 posts) Archive Link: "Announce: modutils 2.4.0 is available"
People: Erik Mouw, Wichert Akkerman, Anuradha Ratnaweera, Keith Owens
Keith Owens announced modutils 2.4.0 and gave a link to the sources and some RPMs (ftp://ftp.us.kernel.org/pub/linux/utils/kernel/modutils/v2.4). Anuradha Ratnaweera suggested also providing .deb packages, but Erik Mouw replied, "He just provides the rpms as a service, he doesn't have to do that. Install the "alien" package on your machine and you will be able to convert between rpm and deb." Wichert Akkerman replied:
Bad plan, considering packages rely on some infrastructure that is not in the rpm (update-modules). I tend to be pretty quick with making and uploading the deb anyway.
Having said that, I won't package 2.4.0 and will wait for 2.4.1 instead.
MM/VM Todo List
5 Jan 2001 - 8 Jan 2001 (14 posts) Archive Link: "MM/VM todo list"
Topics: Clustering, Virtual Memory
People: Rik van Riel, Ben LaHaise
Rik van Riel announced:
here is a TODO list for the memory management area of the Linux kernel, with both trivial things that could be done for later 2.4 releases and more complex things that really have to be 2.5 things.
Most of these can be found on http://linux24.sourceforge.net/ too
VM: better IO clustering for swap (and filesystem) IO
- Marcelo's swapin/out clustering code
- ->writepage() IO clustering support
- page_launder()/->writepage() working together in avoiding low-yield (small cluster) IO at first, ...
- VM: include Ben LaHaise's code, which moves readahead to the VMA level, this way we can do streaming swap IO, complete with drop_behind()
- VM: enforce RSS ulimit
Probably 2.5 era:
- VM: physical->virtual reverse mapping, so we can do much better page aging with less CPU usage spikes
- VM: move all the global VM variables, lists, etc. into the pgdat struct for better NUMA scalability
- VM: per-node kswapd for NUMA
- VM: thrashing control, maybe process suspension with some forced swapping ? (trivial only in theory)
- VM: experiment with different active lists / aging pages of different ages at different rates + other page replacement improvements
- VM: Quality of Service / fairness / ... improvements
Additions to this list are always welcome, I'll put it online on the Linux-MM pages (http://www.linux.eu.org/Linux-MM/) soon.
Why Use Modules?
5 Jan 2001 - 8 Jan 2001 (13 posts) Archive Link: "The advantage of modules?"
People: Michael Meissner, Drew Bertola
Evan Thompson asked if there were any real reason to prefer building drivers as modules instead of compiling everything into the kernel binary. Drew Bertola suggested that module developers could load and unload modules for test purposes, without having to reboot the entire system. Michael Meissner said at greater length:
A couple of thoughts:
- A full kernel with everything compiled in might not fit on boot media such as floppies, while modules allows you to not load stuff that isn't needed to until after the main booting is accomplished.
- There are several devices that have multiple drivers (such as tulip, and old_tulip for example). Which particular driver works depends on your exact particular hardware. If both of these drivers are linked into the kernel, whatever the kernel chooses to initialize first will talk to the device.
- Having drivers as modules means that you can remove them and reload them. When I was working in an office, I had one scsi controller that was a different brand (Adaptec) than the main scsi controller (TekRam), and I hung a disk in a removable chassis on the scsi chain in addition to a tape driver and cd-rom. When I was about to go home, I would copy all of the data to the disk, unmount it, and then unload the scsi device driver. I would take the disk out, and reload the scsi device driver to get the tape/cd-rom. I would then take the disk to my home computer. I would reverse the process when I came in the morning.
- If you have multiple scsi controllers of different brands, building one into the kernel and the other brand(s) as modules allows you to control which scsi controller is the first controller in terms of where the disks are.
Bug Report Generation Tool
5 Jan 2001 - 11 Jan 2001 (43 posts) Archive Link: "[PATCHlet]: removal of redundant line in documentation"
People: Jeremy M. Dolan, Matthias Juchem, Alan Cox, Richard Torkar, Pavel Machek, David Ford, Rafael E. Herrera
In the course of discussing patch submissions, Jeremy M. Dolan suggested:
why not include a script which takes care of ALL the leg work? All of the files it asks the reporter to include are o+r...
I can whip up a bug_report script to walk the user though all of the steps in REPORTING-BUGS, if the list isn't averse to 'dumbing down' the process to the point where maybe some people who shouldn't be submiting bugs (two words: 'user error') end up not being scared off by the process.
Is perl allowed for kernel scripts intended for users, or am I stuck with sh?
Matthias Juchem replied that he'd already started work on such a script, and Pavel Machek had some suggestions.
Elsewhere, under the Subject: [PATCH] new bug report script (http://www.uwsg.indiana.edu/hypermail/linux/kernel/0101.0/1306.html), Matthias posted a patch against 2.4.0 and explained, "It introduces a new bug reporting script (scripts/bugreport.pl) that tries to simplify bug reporting for users. I have also added a small hint to this script to REPORTING-BUGS." There was some discussion of possible fixes to the script, but elsewhere, Alan Cox objected, "The kernel doesnt require perl. I don't want to add a dependency on perl." Matthias pointed out several other perl scripts in the official sources, and suggested making it optional. But Alan replied, "None of these are needed for normal build/use/bug reporting work. In fact if you look at script_asm you'll see we go to great pains to ship prebuilt files too." Matthias argued, "Why can't I assume that perl is installed? It can be found on every standard Linux/Unix installation. And besides, the bug report script doesn't replace anything that doesn't need perl - ver_linux, REPORTING-BUGS and oops-tracing.txt are still there for the more advanced user. My script is intended for the one who likes to provide bug reports but is too lazy to look up all the information or simply is not sure about what to include."
David Ford asked why the script couldn't be done as a shell script, and Matthias replied:
It can be done in sh, surely. I only tried to promote my perl version because I've done it in perl and nobody told me earlier that perl is not liked in the kernel tree - and I've seen some perl scripts there.
I guess I'll have to convert the script to sh.
Elsewhere, under the Subject: bugreporting script - second try (http://www.uwsg.indiana.edu/hypermail/linux/kernel/0101.1/0915.html), Matthias announced, "I rewrote my previous bugreport.pl in bash. I would appreciate it if you had a look on this one. Run it once and give me feedback if you like." Richard Torkar reported success with it, though he'd been unable to test the ksymoops feature. After some more feedback from Richard, Matthias posted a link to a new version (http://www.brightice.de/src/bugreport.sh). Rafael E. Herrera posted a patch to the script, to enable the use of /proc/config.gz if it was available. Matthias liked this idea and adopted it into the script.
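Rafael's idea of picking up the kernel configuration from /proc/config.gz only when the file exists can be sketched as follows. Matthias's script was written in bash; this Python version is only an illustration and is not taken from either patch. The function names are hypothetical, and the sketch assumes the file is gzip-compressed text made of CONFIG_NAME=value lines.

```python
import gzip
import os


def read_kernel_config(path="/proc/config.gz"):
    """Return the decompressed kernel config text, or None if the
    running kernel does not provide the file."""
    if not os.path.exists(path):
        return None
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return handle.read()


def enabled_options(config_text):
    """Collect CONFIG_NAME=value lines into a dict, skipping comments
    such as '# CONFIG_FOO is not set'."""
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


config = read_kernel_config()
if config is None:
    print("no /proc/config.gz here; the report would omit the config")
else:
    print(len(enabled_options(config)), "options set")
```

Returning None when the file is missing lets a report generator fall back to asking the user for the configuration instead of failing.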
Patch Submission Policy For 2.4
6 Jan 2001 - 10 Jan 2001 (7 posts) Archive Link: "Linux-2.4.x patch submission policy"
Topics: FS: ramfs, Virtual Memory
People: Linus Torvalds, Alan Cox, Rik van Riel, Andrew Morton
Linus Torvalds stated:
I thought I'd mention the policy for 2.4.x patches so that nobody gets confused about these things. In some cases people seem to think that "since 2.4.x is out now, we can relax, go party, and generally goof off".
The linux kernel has had an interesting release pattern: usually the .0 release was actually fairly good (there's almost always _something_ stupid, but on the whole not really horrible). And every single time so far, .1 has been worse. It usually takes until something like .5 until it has caught up and surpassed the stability of .0 again.
Why? Because there are a lot of pent-up patches waiting for inclusion, that didn't get through the "we need to get a release out, that patch can wait" filter. So early on in the stable tree, some of those patches make it. And it turns out to be a bad idea.
In an effort to avoid this mess this time, I have two guidelines:
- I've basically thrown away all patches sent to me so far, and I will continue to do so at least over the weekend. I'm not going to bother thinking about patches for a few days.
- In order for a patch to be accepted, it needs to be accompanied by some pretty strong arguments for the fact that not only is it really fixing bugs, but that those bugs are _serious_ and can cause real problems.
- Obviously, the size of the patch matters too: if you can make an obvious fix in 5 lines, do it. Don't try to make a clean fix that fixes the problem the clever way in 150 lines.
In short, releasing 2.4.0 does not open up the floor to just about anything. In fact, to some degree it will probably make patches _less_ likely to be accepted than before, at least for a while. I want to be absolutely convinced that the basic 2.4.x infrastructure is solid as a rock before starting to accept more involved patches.
Right now my ChangeLog looks like this:
- Don't drop a megabyte off the old-style memory size detection
- remember to UnlockPage() in ramfs_writepage()
- 3c59x driver update from Andrew Morton
The first two are true one-liners that have already bitten some people (not what I'd call a showstopper, but trivially fixable stuff that are just thinkos). The third one looks like a real fix for some rather common hardware that could do bad things without it.
Now, I'm sure that ChangeLog will grow. There's the apparent fbcon bug with MTRR handling that looks like a prime candidate already, and I'll have people asking me for many many more. But basically what I'm asking people for is that before you send me a patch, ask yourself whether it absolutely HAS to happen now, or whether it could wait another month.
Another way of putting it: if you have a patch, ask yourself what would happen if it got left off the next RedHat/SuSE/Debian/Turbo/whatever distribution CD. Would it really be a big problem? If not, then I'd rather spend the time _really_ beating on the patches that _would_ be a big issue. Things like security (_especially_ remote attacks), outright crashes, or just totally unusable systems because it can't see the harddisk.
We'll all be happier if my ChangeLog is short and sweet, and if a 2.4.1 (tomorrow, in a week, in two, in a month, depending on what comes up) actually ends up being _better_ than 2.4.0. That would be a good new tradition to start.
And before you even bother asking about 2.5.x: it won't be opened until I feel happy to pass on 2.4.x to somebody else (hopefully Alan Cox doesn't feel burnt out and wants to continue to carry the torch and feels ok with leaving 2.2.x behind by then).
Historically, that's been at least a few months. In the 2.2.x series, 2.3.0 was the same as 2.2.8 with just the version changed - and it came out in May, almost four months after 2.2.0. In the 2.0.x series, 2.1.x was based off 2.0.21, four and a half months after 2.0.0.
Yes, I know this is boring, and all I'm asking is for people to not make it any harder for me than they have to. Think twice before sending me a patch, and when you _do_ send me a patch, try to think like a release manager and explain to me why the patch really makes sense to apply now.
In short, I'm hoping for a fairly boring next few months. The more boring, the better.
Alan Cox added regarding his own patches, "Think of -ac as a way to get patches you need that everyone else might not need yet, and a way to filter stuff. I'm happy to take sane stuff Linus doesn't (within reason) and propagate it on as (or more to the point if) it proves sane." Rik van Riel also volunteered to "gather all non-bug VM patches and combine them into a special big patch periodically. Once we are sure 2.4 is stable for just about anybody I will submit some of the really trivial enhancements for inclusion; all non-trivial patches I will maintain in a VM bigpatch, which will be submitted for inclusion around 2.5.0 and should provide one easy patch for those distribution vendors who think 2.4 VM performance isn't good enough for them ;)"
Bug In 2.4.0 Virtual Memory Subsystem
8 Jan 2001 - 10 Jan 2001 (19 posts) Archive Link: "VM subsystem bug in 2.4.0 ?"
Topics: Virtual Memory
People: Rik van Riel, Linus Torvalds, Stephen C. Tweedie, Tim Wright, Christoph Rohland
Sergey E. Volkov was testing an Informix IIF-2000 database server running on a dual Intel Pentium II 233MHz; when running 'make -j30 bzImage' on the kernel source tree, the system would completely hang. Trying the same thing on the same machine without Informix running, no hang occurred. He suspected the problem was that Informix allocated about 50% of the system's RAM as locked shared memory. So the kernel would try to swap out the locked segments, fail, and wait forever for them to swap out. Rik van Riel replied:
You are right. I have seen this bug before with the kernel moving unswappable pages from the active list to the inactive_dirty list and back.
We need a check in deactivate_page() to prevent the kernel from moving pages from locked shared memory segments to the inactive_dirty list.
He asked for advice from Christoph Rohland and Linus Torvalds, and Linus suggested:
The only solution I see is something like a "active_immobile" list, and add entries to that list whenever "writepage()" returns 1 - instead of just moving them to the active list.
Seems to be a simple enough change. The main worry would be getting the pages _off_ such a list: anything that unlocks a shared memory segment (can you even do that? If the only way to unlock is to remove, we have no problems) would have to have a special function to move all pages from the immobile list back to the active list (and then they'd get moved back if they were for another segment that is still locked).
Rik suggested just having a special "do not deactivate me" data-bit for each item on the list. "When this special bit is set," he said, "we simply move the page to the back of the active list instead of deactivating." He added, "when the bit changes again, the page can be evicted from memory just fine. In the mean time, the locked pages will also have undergone normal page aging and at unlock time we know whether to swap out the page or not." He admitted that this method would have higher overhead than Linus', but it seemed simpler and more flexible to him. Stephen C. Tweedie objected that he didn't see a way to clear the bit properly, saying, "Locking is a per-vma property, not per-page. I can mmap a file twice and mlock just one of the mappings. If you get a munlock(), how are you to know how many other locked mappings still exist?" Linus replied:
Note that this would be solved very cleanly if the SHM code would use the "VM_LOCKED" flag, and actually lock the pages in the VM, instead of trying to lock them down for writepage().
That would mean that such a segment would still get swapped out when it is not mapped anywhere, but I wonder if that semantic difference really matters.
If the vma is marked VM_LOCKED, the VM subsystem will do the right thing (the page will never get removed from the page tables, so it won't ever make it into that back-and-forth bounce between the active and the inactive lists).
Christoph posted a lightly tested patch, and Linus asked:
I'd really like an opinion on whether this is truly legal or not? After all, it does change the behaviour to mean "pages are locked only if they have been mapped into virtual memory". Which is not what it used to mean.
Arguably the new semantics are perfectly valid semantics on their own, but I'm not sure they are acceptable.
In contrast, the PG_realdirty approach would give the old behaviour of truly locked-down shm segments, with not significantly different complexity behaviour.
What do other UNIXes do for shm_lock()?
The Linux man-page explicitly states for SHM_LOCK that
The user must fault in any pages that are required to be present after locking is enabled.
which kind of implies to me that the VM_LOCKED implementation is ok. HOWEVER, looking at the HP-UX man-page, for example, certainly implies that the PG_realdirty approach is the correct one. The IRIX man-pages in contrast say
Locking occurs per address space; multiple processes or sprocs mapping the area at different addresses each need to issue the lock (this is primarily an issue with the per-process page tables).
which again implies that they've done something akin to a VM_LOCKED implementation.
Does anybody have any better pointers, ideas, or opinions?
In terms of how other UNIXes handled the situation, Tim Wright said:
It appears that the fine-detail semantics vary across the board. DYNIX/ptx supports two forms of SysV shm locking - soft and hard. Soft-locking (the default) merely makes the pages sticky, so if you fault them in, they stay in your resident set, but don't count against it. If, however the process swaps, they're all evicted, and when the process is swapped back in, you get to fault them back in all over again. Hard locking pins the segment into physical memory until such time as it's destroyed. It stays there even if there are currently no attaches. Again, such pages are not counted against the process RSS.
SVR4 only supports one form. It faults all the pages in and locks them into memory, but doesn't treat them specially wrt rss/paging, which seems none too clever - if they're locked into memory, you might as well use them :-)
The discussion ended around there.
Superfluous Whitespace In The Kernel Sources
8 Jan 2001 (4 posts) Archive Link: "Extraneous whitespace removal?"
People: David Weinehall, Rusty Russell, Jeremy M. Dolan
Jeremy M. Dolan took all whitespace off of the ends of lines in the kernel sources, removing almost 200 K and producing almost a 2 M patch. David Weinehall replied:
While I really like the idea with this patch, I'm 100% certain that Linus would not, under any circumstances, accept this patch.
I suggest that we instead force everyone to program with:
(Or equivalent Emacs/[insert favourite editor here]-setting instead)
While at it, force people to read linux/Documentation/CodingStyle and make them adhere to it.
Of course, I guess this is a free world (yeah, right) and everyone should have the right to code in their own way, but I'd wish that people at least could be consistent when indenting/spacing/bracing/whatever, and when patching other people's code, also follow the already set standard of that file instead of introducing a new one...
Rusty Russell added, "I've done this before, but never posted it, lest they think I'm insane. I vote this for 2.5.1." He suggested listing Jeremy in the MAINTAINERS file as the official whitespace maintainer.
2.0.39 Announced
9 Jan 2001 - 10 Jan 2001 (7 posts) Archive Link: "[Announcement] linux-kernel v2.0.39"
Topics: CREDITS File, Disks: IDE, FS: devfs, FS: ext2, FS: smbfs, MAINTAINERS File, Networking
People: David Weinehall, Matthew Grant, Jan Kara, Stephen C. Tweedie, Jari Ruusu, Andries Brouwer, Alan Cox, Ivan Passos, Andrea Arcangeli, Andre Hedrick, Jean Tourrilhes, Richard Gooch
David Weinehall announced 2.0.39:
Everyone laughs, I guess. The 2.0.39final didn't become the final release (could've told you so...) The good thing? Well, some bugs were found and removed. But this is it. Enjoy!
Changelog for v2.0.39
- Fix memory-leak in af_unix (Jon Nelson, Alan Cox, me)
- Added headerfiles for devfs to simplify backports of drivers (Richard Gooch)
- Fix a bug involving synchronous writes and -ENOSPC that could cause file-corruption (Jari Ruusu)
- Added new versions of PCI-2000 (Mark Ebersole)
- Added new versions of PCI-2220i (Mark Ebersole)
- Fixed a few typos in PCI-2000, PCI-2220i, PSI-240i and related files (me)
- Removed unused variable in xd.c (me)
- Renamed the initfunctions in pi2.c and pt.c, as their names clashed with paride-names (obviously, no one uses paride together with hamradio) (me)
- Changed most references to vger.rutgers.edu to vger.kernel.org (me)
- Fix the few vger.rutgers.edu references that I missed (Daniel Roesen)
- Fix a bug in af_unix that wrote to a socket after freeing it (aka the Win9x-related oops) (Michael Deutschmann)
- Fixed typo in Documentation (Martin Douda)
- IDE-patches (Andre Hedrick)
- Fixes for the IDE-patches (Andries Brouwer,
- Move memory-offset for dynamic executables (Michael Deutschmann)
- Fixes to the Cyclades-driver (Ivan Passos)
- Fix for a bug in ext2 (Stephen C. Tweedie)
- Added marketing-names for 3Com NICs in drivers/net/Config.in (Yann Dirson, me)
- Fix for a bug in smbfs (Rick Bressier)
- Large-disk fixes (Andries Brouwer)
- Wavelan-driver cleanup & bugfixes (Jean Tourrilhes)
- Security-fixes (Solar Designer)
- Quota-fixes (Jan Kara)
- Fixed GPF using IPsec Masquerade (Rudolf Lippan)
- Fixed Config.in bugs in drivers/net and drivers/isdn (Marc Martinez)
- Added IPX-routing of NetBIOS packages (Jan Rafaj)
- Fix for a bug in paride (Wolfram Gloger)
- Fix an erroneous printk in ip_fw.c (Todd Sabin)
- Fix for IP multicast on WAN-adapters (Matthew Grant)
- Big updates to MAINTAINERS (me)
- Big updates to CREDITS (me, others)
- Various updates in Documentation/* (me)
- Styled up all Configuration-files in a similar manner to newer v2.3 kernels, and various other cleanups (me)
- Updated CodingStyle to the one used in recent v2.3 kernels (me)
- Backported nls_8859-14 (me)
- Added support for sparse superblocks (Theodore Ts'o)
- Fix for the ping -s 65468 exploit (Andrea Arcangeli, others)
2.4.0 On The IA64
10 Jan 2001 (4 posts) Archive Link: "2.4.0 release and ia64"
People: Bill Nottingham
Someone asked if 2.4.0 would run on the IA64 or if some special patches were required. Bill Nottingham replied, "There's a patch for it in ports/ia64 on your favorite linux kernel mirror." The original poster replied that those patches appeared to be only for test kernels, not the official 2.4.0 release. Bill replied:
There *should* be a patch for 2.4 final:
If not, your mirror isn't up to date.
Statistical Kernel Profiler Available
10 Jan 2001 - 11 Jan 2001 (2 posts) Archive Link: "[ANNOUNCE] oprofile profiler"
People: John Levon, Karim Yaghmour
John Levon announced:
oprofile is a low-overhead statistical profiler capable of instruction-grain profiling of the kernel (including interrupt handlers), modules, and user-space libraries and binaries.
It uses the Intel P6 performance counters as a source of interrupts to trigger the accounting handler in a manner similar to that of Digital's DCPI. All running processes, and the kernel, are profiled by default. The profiles can be extracted at any time with a simple utility. The system consists of a kernel module and a simple background daemon.
Typical overhead is around 3 or 4 percent. Worst case overhead on a Pentium II 350 UP system is around 10-15%
You can read a little more about oprofile, and download a very alpha version at :
oprofile is released under the GNU GPL.
Karim Yaghmour replied:
This is really interesting. Great stuff.
As Alan had once suggested, it would be very interesting to have this information correlated with the content of the traces collected using the Linux Trace Toolkit (http://www.opersys.com/LTT). For instance, you could see how many cache faults the read() or write() operation of your application generated and other unique info. It would also be possible to enhance the post-mortem analysis done by LTT to take in account this data. You could also use LTT's dynamic event creation mechanism to log the profiling data as part of the trace.
There are definitely opportunities for interfacing/integrating here.
Let me know what you think.
There was no reply.
LVM Fixes Slow To Get Into The Official Kernel
10 Jan 2001 (5 posts) Archive Link: "Oops in 2.4.0 (@ LVM)"
Topics: Disk Arrays: LVM, Version Control
People: Andreas Dilger, Paul Jakma
Gustavo Zacarias got an oops from LVM running under 2.4.0, and Andreas Dilger replied:
There is a patch to the LVM kernel code which should help: ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0ac4/lvm-fix-2
You should also get the LVM user tools from CVS (with TAG LVM_0-9-patches) to solve this problem. There will hopefully be a new LVM release soon.
Paul Jakma asked, "any word on when the kernel fixes are going to linus?" Andreas replied, "I've heard "soon" on the LVM list, but I'm just one of the chickens. If it were up to me, the fixes would go to Linus as soon as they are found." And Paul said, "indeed. it looks bad when code is updated irregularly, and it's a pain for users." End Of Thread.
Comparing Khttpd, Boa, And Tux
11 Jan 2001 - 13 Jan 2001 (12 posts) Archive Link: "khttpd beaten by boa"
Topics: Web Servers
People: Christoph Lameter, Lars Marowsky-Bree, David S. Miller, H. Peter Anvin, Arjan van de Ven, Dean Gaudet, Alan Cox
Christoph Lameter reported losing an argument over which web server was faster, khttpd or boa. He posted some numbers and said that in the first test, "boa won hands down because it supports persistent connections." They'd run the same test with persistent connections turned off, but found that boa still won. He said:
This shows the following problems with khttpd:
1. Connect times are on average longer than boa. Why???
2. Transfers also take longer.
What is wrong here?
Lars Marowsky-Bree replied disgruntledly, "This just goes on to show that khttpd is unnecessary kernel bloat and can be "just as well" handled by a userspace application, minus some rather very special cases which do not justify its inclusion into the main kernel." David S. Miller added, "My take on this is that khttpd is unmaintained garbage. TUX is evidence that khttpd can be done properly and beat the pants off of anything done in userspace." H. Peter Anvin suggested, "Then why don't we unload khttpd and put in Tux?" Elsewhere, Arjan van de Ven remarked, "TuX is certainly the "next and better" generation, and I look forward to working with Ingo and others on it." But Alan Cox mentioned that, since tux required the 'zero copy' patches, those patches would have to go in before Tux could be considered.
Elsewhere, under the Subject: khttpd beats boa with persistent patch (http://www.uwsg.indiana.edu/hypermail/linux/kernel/0101.1/1573.html) , Christoph said with glee:
I applied the persistent khttpd patch + my vhost patch and now khttpd beats boa!!! (patch against 2.4.0 follows at the end of the message)
The connection times of boa are still better but khttpd wins in transfers.
Dean Gaudet pointed out that running the test locally ignored network latencies, and was thus a meaningless benchmark. He explained, "latency is as important, or even more important than raw throughput. anything beyond a second or two is the point where humans start giving up on the server. if you study a real benchmark such as specweb99 you'll find that if you don't have good response latency then your score is not valid. they actually have a minimum throughput that each connection must meet or else it's considered an error -- it's similar to having a latency budget, with some slight differences."
Unexplained 2.4.0 Filesystem Corruption
12 Jan 2001 - 14 Jan 2001 (15 posts) Archive Link: "2.4 ate my filesystem on rw-mount"
Topics: Disks: IDE
People: Tobias Ringstrom, Alan Cox, Vojtech Pavlik
Tobias Ringstrom gave a hair-raising account of his 2.4.0 experiences:
I've never seen anything like it before, which I'm happy for. The system had been running a standard RedHat 7 kernel for days without any problems, but who wants to run a 2.2 kernel? I compiled 2.4.0 for it, rebooted, and blam! The RedHat init scripts got to the "remounting root read-write" point, and just froze solid.
Rebooting into RH7 failed, because inittab could not be found. In fact the filesystem was completely messed up, with /dev empty, lots of device nodes in /etc, and files missing all over the place. I had to reinstall RH7 from scratch.
I do not understand how this could happen during a remounting root rw. Is the filesystem really that unstable?
Am I right in suspecting DMA, which was enabled at the time? Any other ideas? Is it a known problem?
This is on a 450 MHz AMD-K6 with the following IDE controller:
00:07.1 IDE interface: VIA Technologies, Inc. VT82C586 IDE [Apollo] (rev 06)
Alan Cox replied, "There are several people who have reported that the 2.4.0 VIA IDE driver trashes hard disks like that. The 2.2 one also did this sometimes but only with specific chipset versions and if you have dma autotune on (that's why currently 2.2 refuses to do tuning on VP3)"
Vojtech Pavlik also replied to Tobias, saying, "Wow. Ok, I'm maintaining the 2.4.0 VIA driver, so I'd like to know more about this." He asked for specific hardware details, which Tobias provided, and they went back and forth for a bit, though no solution appeared on the list.
PowerPC In The Official Tree
13 Jan 2001 (2 posts) Archive Link: "PPC 2.4 ?"
People: Cort Dougan, Giuliano Pochini
Giuliano Pochini asked when the PowerPC tree would be merged into the official sources, since none of the official versions would even compile. Cort Dougan replied:
Grab a tree from http://www.fsmlabs.com/linuxppcbk.html. Those always compile and are up-to-date.
I send patches, but they don't always make it into the main tree. In the mean time, you have a consistent source of kernels with the above web site.
We Hope You Enjoy Kernel Traffic
Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License, version 2.0.