Kernel Traffic #73 For 26 Jun 2000

By Zack Brown

Introduction

Thanks go out to Marius Gedminas for finding some serious HTML errors in last week's issue, and a bug in the compiler that produces these pages! Good eye, Marius! Thanks!!

Mailing List Stats For This Week

We looked at 1363 posts in 5519K.

There were 463 different contributors. 217 posted more than once. 173 posted last week too.

1. SPX Unfinished In Stable Series

7 Jun 2000 - 13 Jun 2000 (10 posts) Subject: "how about SPX ?"

People: Jay Schulist, Jeff V. Merkey, Alan Cox, H. Peter Anvin

Lorinczy Zsigmond got an oops trying to use Sequenced Packet Exchange (SPX) (http://developer.novell.com/research/appnotes/1995/december/03/04.htm) in 2.2.15. He posted the error and asked who the SPX maintainer was. As far as he could tell, it hadn't been touched since January 1999. Jay Schulist replied:

I am the maintainer, but I hardly have any time for SPX at the moment. I would be more than happy to work with you to get it working again.

Let me know if you are available to debug and test it with me.

There was no reply to this, but Alan Cox replied to Lorinczy (apparently without having seen Jay's post), confirming that SPX was broken, and didn't seem to have a maintainer. Jeff V. Merkey volunteered, with, "To whomever is attempting to use SPX, please forward me the Ooops info, and I'll attempt to take a look at it and fix it. The current SPX engine in Linux is incomplete, but I am willing to take a stab at looking at the bugs. Alan -- I'll try to help keep it current." Alan replied with his assessment, saying, "Finish writing it is closer. The connection accept logic for example is entirely ficticious ;)"

Jeff replied that he'd be happy to do a full SPX implementation, with NetWare-compatible APIs; and added that he'd done it on three other OSes as well. But H. Peter Anvin warned, "Note that we already have a socket-compatible interface, which is the interface of choice. A NetWare-compatible API should be done as a user-space library." He went on, though:

However, would certainly appreciate your help in getting the IPX/SPX implementation cleaned up and improved -- it clearly has been suffering from neglegt/disuse -- and to get that user-space library written!

This, combined with your NWFS implementation, should make Linux pretty much a drop-in replacement for NetWare, I'm guessing.

There was no reply, but elsewhere, a week later, Jeff reported briefly, "I'm going through this code in detail now. Packet burst isn't working correctly either."

2. Linux 2.5 To Do List Looks For Web Server Space

8 Jun 2000 - 14 Jun 2000 (10 posts) Subject: "Linux 2.5 TODO on the web"

People: Alan Cox, James Sutherland, Matthias Andree, Bartlomiej Zolnierkiewicz, Gary Lawrence Murphy, Kenneth C. Arnold

Kenneth C. Arnold posted a URL to the 2.5/2.6 To Do List (http://kena.8k.com/linux-kernel/). But he replied to himself about an hour and a half later, to explain the "404" error some folks had been getting from the server. Apparently the host company had HTML requirements, so that banner ads could be displayed automatically on the page. Since the To Do list was in plain text, several different browsers were running up against this wall. Bartlomiej Zolnierkiewicz commented unhappily that the wrapper page could list all "broken" browsers, but Alan Cox replied, "It isnt a 'broken' browser. Its a correctly implemented browser. Referrer is an unacceptably flawed privacy problem. Good tools do not send referrer entries." James Sutherland replied historically:

Mozilla dropped this entry from their request headers for a while, until they discovered that broke too many WWW sites, when they had to put it back in. (According to one of the developers, they had been trying to make requests as small (=> fast) as possible, but been a little overzealous.)

One of the privacy settings available in Squid is to strip out these header fields. They do make special mention of User-Agent, since enough sites modify the results based on which browser you are using, but last time I looked, Referer wasn't mentioned.

Maybe if enough people report this to FreeServers as being a problem, they'll fix it?

Elsewhere, Matthias Andree interjected, "Fuck it! If they want to give me their information only if I sell my privacy for marketing or advertising sake, I give a shit. Please move the information elsewhere. Espionage as industry standard? How deep have we fallen? The marketing dudes everywhere plug the fingers in your asses and you are considering which browser works and which fails. That's the wrong approach. Unregister that stuff and place it elsewhere." He offered to host it on his own web site, so long as it stayed smaller than 64K. James also offered 10M of web space. Gary Lawrence Murphy also offered to host the To Do list, if Kenneth didn't mind banner ads for other GPLed projects.

3. Stallman Advocates "MSDOS-Style Floppy Handling"

8 Jun 2000 - 18 Jun 2000 (64 posts) Subject: "Floppy handling"

People: Richard M. Stallman, Ian McKellar, Alex Buell, Alan Cox, Jeff Garzik, Adam Sampson

Richard M. Stallman requested:

Is there any possibility of making Linux handle file systems on floppies like MSDOS, so that there is no need to explicitly mount and unmount a floppy drive in order to access floppies through the file system?

Because of the inconvenience of mounting and unmounting floppies, I never do that; I only use mcopy. mcopy is covenient enough, for people who don't mind learning another few commands. But that isn't very easy for non-hackers, the kind of people who now prefer Windows. A complaint from a non-hacker about this is what inspired me to write this message.

We want to make GNU/Linux appeal to Windows users, and this is one of the things necessary to do that. And if MSDOS could do this, surely we can.

There were a number of replies, and a couple of longer threads. Adam Sampson suggested that Stallman's feature could be implemented via a user-space filesystem. But he acknowledged that Linux currently had no interface for such a thing, although 'podfuk' worked by emulating the userspace portion of a networked filesystem.

Elsewhere, Ian McKellar mentioned he'd been working on implementing MSDOS filesystem support in the GNOME Virtual Filesystem layer. But Richard replied, "Despite my loyalty to GNOME (as a part of the GNU Project), on technical grounds I have to disagree. File access needs to be available to all the programs on the system, and the only natural way to do that is through the kernel and the C library. Doing this in GNOME will produce a feature that half-works, which means it will appear to the users as an unreliable feature." Ian ended the subthread with, "I sort of agree with you on this, but you can extend this argument to just about any feature of a non-core library. In GTK we implement widgets which the X libraries don't. This leads to user interface inconsistencies. In gnome-vfs we support remote filesystems which libc doesn't. This leads to inconsistent filesystem handling between applications."

Alex Buell also replied to Richard's initial request, saying, "Over my dead body, mate. There's the mtools package to do this anyway, and RH 6.x defaults to non-root users mounting /dev/fd0. I don't see why we have to bend over backwards when it's better for them to learn to get to grips with UNIX rather than be cosseted with historical MS crud. Even CP/M needed to mount floppies." Richard replied in all seriousness, "I do plan to try to find someone to implement this fully, but I think it would be unfortunate if your death were the result. I hope you will find the courage to keep on living despite the existence of this feature."

Alan Cox also replied to Richard's initial request, suggesting 'supermount' as a possibility. But he added, "It needs more hands to port it to 2.3.x and it needs a very good review of the code to fix remaining questionable habits (or rewriting as a stackable fs). The stuff needed is out there however." Jeff Garzik gave a historical overview:

Porting supermount from 2.0.x to 2.2.x was a cosource.com project, undertaken by Alexis Milhailov. He also ported it to 2.3.x. Unfortunately the web page where supermount is supposed to be located doesn't resolve (http://supermount.cornpops.cx/)

I uploaded the latest I have, against 2.3.99-pre5, at http://gtf.org/garzik/kernel/files/supermount-0.3.1-2.3.99-pre5.diff.gz

It definitely needs review, especially by the original SuperMount author (sct?) Al Viro looked at one version of the patch and worried about races.

There were several other threads in which 'supermount' came up as the solution to Richard's problem, and Richard eventually said, while discussing a different possible solution, "Supermount seems to come closer to what I had in mind, and maybe could do the job. But some people say it is not reliable. If it is made reliable and gets included as a standard part of Linux, maybe it will do the job."

4. Linus On Micro-Kernels

9 Jun 2000 - 14 Jun 2000 (39 posts) Subject: "linux and micro kernel"

Topics: Microkernels

People: Linus Torvalds, Matthew Wilcox, Val Henson

Tonglu Yi asked if Linux might become micro-kernel-based in the future. Val Henson gave a pointer to the FAQ (http://www.tux.org/lkml/#s1-5), and to Liedtke's L4Linux paper (http://os.inf.tu-dresden.de/pubs/sosp97). Most replies indicated very slim chances for this, and at some point in the discussion, Matthew Wilcox tried to remember a quote by Linus Torvalds on the subject, and someone else found it in the archives. Linus had said, "message passing as the fundamental operation of the OS is just an excercise in computer science masturbation. It may feel good, but you don't actually get anything DONE." For Linus' full quote (it's long) and a summary of the thread in which it came up, see Issue #25, Section #4 (20 Jun 1999: The Future Of OS Design).

5. Putting Restrictions On Untrusted Code

9 Jun 2000 - 16 Jun 2000 (18 posts) Subject: "Running Untrusted Code in a Restricted Process"

Topics: User-Mode Linux

People: Jeff Dike, Jesse Hammons, Brian Gerst, David A. Wagner, Alan Cox, Pavel Machek

Jesse Hammons (working off the 2.2.12 kernel) made his first ever post to linux-kernel, in which he proposed a way to restrict the system calls an untrusted binary could make. Someone gave a link to an overview of various access control systems (http://research-cistw.saic.com/cace/), and someone else gave a pointer to the Medusa (http://medusa.fornax.sk/) project.

Jeff Dike said that a dedicated sandbox arrangement would probably be better than Jesse's proposal; and gave a pointer to his user-mode linux (http://user-mode-linux.sourceforge.net) kernel port, adding, "It gives you a virtual machine whose disk space consumption, cpu consumption, memory comsumption, and network traffic can be completely controlled. Plus, it's all in user-space. Nothing needs to be added to the kernel at all." Jesse agreed that uml did sound better than what he'd envisioned, and asked if any resource stats were available. Jeff replied:

No good ones at this point. The user-space kernel is larger than a native one, but I haven't added much code to it, so I imagine that I've done some stupid code-bloating things. After I look for them and fix them, I imagine that it will be comparable to a native kernel. So, I would look at the size of a native kernel, and I think that will in the ballpark of what you can expect to see.

Also, if you care enough about something to stick it in a virtual machine, a couple of megs is probably not a big deal. If it is, and you have a bunch of things that need to live in virtual machines, you can make them all live in the same one, where they can all infect each other with viruses and send love notes to each other :-)

In the same post, he added, "this is somewhat more resource-intensive than other sandboxes, but it's also more secure."

In the midst of replying to Jeff on other levels, Jesse suggested:

I hope it doesn't sound silly to say this but assuming just for a moment that I could compile this sucker on say, windows (maybe using cygwin32), and reimplement the part that does system calls in terms of windows system calls, could this be used to run sandboxed linux (elf) plugins on windows as well? *That* would be cool.

I guess you would need a way to trap system calls from the windows OS. I don't know if they provide that facility.

Jeff replied that he'd heard the Windows-port suggestion before, and added, "You could do that. Apparently, 95 doesn't have the capabilities needed, 98 is iffy, and NT seems to be ok. If you're (or anyone else is) interested in this, let me know, and I'll point you to the (scanty) information I have on doing a Windows port." As far as the specific ability to trap system calls, he repeated, "NT supposedly has the ability to do that, as well as the mmap stuff that's needed." The subthread ended there.

Elsewhere, Brian Gerst pointed out that Jesse's original proposal could already be done under Linux using the ptrace() system call. He explained, "Ptrace can intercept system calls made by the traced process (strace uses this) and can modify or deny them." Jesse checked this out, and found that this was true for 2.3.x, but not for the kernel he was using (2.2.12). Alan Cox replied that the functionality was indeed in 2.2.x, but Jeff clarified, "For the record, this was added in 2.3.22 and 2.2.15 for i386."

Elsewhere but still on 'ptrace', David A. Wagner said, "As others have noted, you can use ptrace() to selectively deny syscalls. See http://www.cs.berkeley.edu/~daw/janus/ for an implementation that used this idea in a more general context." And Jeff replied, "And see Pavel Machek's site (http://atrey.karlin.mff.cuni.cz/~pavel/dipl/eng.html) for how Janus (and any other ptrace syscall filterer) can be faked out. Plus a bunch of other sandbox possibilities."
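The decision a ptrace-based filter makes at each intercepted system call can be sketched as a small policy function. This is only a toy model of the allow/deny step: it does not call ptrace() itself, and the `ALLOWED` set, the `/sandbox/` prefix, and the `decide` function are all hypothetical names chosen for illustration.

```python
import posixpath

# Hypothetical policy: system calls that are always permitted.
ALLOWED = {"read", "write", "close", "exit"}


def decide(name, args=()):
    """Return "allow" or "deny" for one intercepted system call.

    A real monitor would be invoked at each syscall-entry stop of the
    traced process; here the call is just a name plus a tuple of args.
    """
    if name in ALLOWED:
        return "allow"
    if name == "open":
        path, mode = args
        # Normalise the path first so "/sandbox/../etc" cannot slip through.
        path = posixpath.normpath(path)
        if mode == "r" and path.startswith("/sandbox/"):
            return "allow"
    return "deny"
```

Anything not explicitly allowed is denied, which is the usual default for this kind of filter. The model says nothing about the ways a real ptrace monitor can be bypassed.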

6. Some Debate Over POSIX And Symlinks

11 Jun 2000 - 19 Jun 2000 (29 posts) Subject: "[BUG] Kernel 2.4.0-test1-ac10 changes open of symlink behavior."

Topics: BSD, POSIX

People: Alexander Viro, Alan Cox, Ulrich Drepper, Andries Brouwer, Daniel Pittman

Under the most recent 2.4.0test-* kernels, Daniel Pittman found that trying to edit files through a symlink would give an ENOENT error when he tried to save to disk. Under 2.2.15, he could save just fine. He explained that at the time of editing, the symlink pointed to a nonexistent file; but his understanding was that the referenced file should be created automatically in that circumstance. Andries Brouwer agreed, and quoted the new POSIX draft, "In general the open() function follows the symbolic link if path names a symbolic link." But Alexander Viro replied, "Excuse me, but I'll take difference from POSIX over a bunch of very real races, thank you very much. Again, feel free to propose race-free implementation if you want that thing back. Until then O_CREAT without O_EXCL will return -ENOENT on broken symlinks. Userland should not rely on objects' creation/removal following symlinks. Period." But Alan Cox replied in turn, "POSIX says otherwise. Period ;)" Alexander replied that for this particular case, "POSIX draft is broken and we would be better off fixing that bug instead of casting it in stone. Behaviour in question is inconsistent with every other case when links are created/removed/renamed." Alan suggested, "You might want to take that up with the relevant posix committee then."

Ulrich Drepper replied, "Symlinks are not in the current POSIX standard and therefore the existing POSIX standard is OK. Symlinks are coming in the next revision of the standard as an option (due to the merging with the Unix specs). If you find something inadequate in the Austin group draft tell me what it is and a solution for it. I'll make sure it gets handled in the next meeting." Alexander proposed:

Case in question: foo/bar/baz being a dangling symlink, open() with O_CREATE applied to it. Behaviour mandated by the draft: create a file in place where symlink points to. Problem: it is wildly inconsistent with every other case when we create/remove/rename objects. In every other case it's "you've got foo/bar/baz, you either create/remove/rename the entry 'baz' in directory 'foo/bar' or fail". Here the operation is applied to directory potentially different from the foo/bar.

Proposed (minimal) change: "Portable programs can not rely on open()/create() following dangling links. It should not be confused with O_EXCL semantics - there we refuse to follow _any_ symlinks, as well as open existing files."

In other words, proposed semantics looks so:

  1. Broken symlinks are never followed.
  2. Normal symlinks are followed unless we have O_NOFOLLOW
  3. If O_EXCL is passed we refuse to open existing file, be it a result of following a symlink or not.
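The three proposed rules can be read as a small decision table. The sketch below is one reading of them, not kernel code: the `Target` enum and `proposed_open` function are hypothetical. Returning ELOOP for O_NOFOLLOW on a symlink, and treating O_EXCL as meaningful only together with O_CREAT, are assumptions taken from common Linux behaviour rather than from the thread.

```python
from enum import Enum


class Target(Enum):
    MISSING = "missing"            # nothing exists at the path
    FILE = "file"                  # a regular file
    LINK_TO_FILE = "link_to_file"  # symlink whose target exists
    DANGLING_LINK = "dangling"     # symlink whose target does not exist


def proposed_open(target, creat=False, excl=False, nofollow=False):
    """Return "open", "create", or an errno name under the proposed rules."""
    is_link = target in (Target.LINK_TO_FILE, Target.DANGLING_LINK)
    # Rule 3: with O_CREAT|O_EXCL, anything already at the path is refused,
    # and symlinks (dangling or not) are never followed.
    if creat and excl and target is not Target.MISSING:
        return "EEXIST"
    # Rule 2: normal symlinks are followed unless O_NOFOLLOW is given.
    if is_link and nofollow:
        return "ELOOP"
    # Rule 1: broken symlinks are never followed, even with O_CREAT.
    if target is Target.DANGLING_LINK:
        return "ENOENT"
    if target is Target.MISSING:
        return "create" if creat else "ENOENT"
    return "open"
```

Under this reading, `proposed_open(Target.DANGLING_LINK, creat=True)` gives ENOENT, which matches the behaviour Daniel Pittman ran into, where the draft quoted by Andries would instead create the file the link points to.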

Andries said that the behavior Alexander was complaining about was the standard path resolution for symlinks; and guessed Alexander hadn't read the POSIX drafts. He quoted POSIX at length, and Alexander replied, "Lovely. IOW, draft sucks badly for mkdir() too - AFAICS with security consequences. And makes 4.4BSD, Solaris and Linux non-compliant in bargain. mkdir() on a dangling symlinks does not follow links on any of these systems." Andries replied, "Why don't you get the text yourself and read it, before shouting wildly? Concerning mkdir() I can reassure you. Quoting from the mkdir() section: "If path names a symbolic link, mkdir() shall fail and set errno to EEXIST"."
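The mkdir() behaviour Andries quoted is easy to check from user space. The following is a minimal sketch, assuming a POSIX system where unprivileged symlink creation is allowed; the function name and the file names inside the temporary directory are made up for the example.

```python
import errno
import os
import tempfile


def mkdir_on_dangling_symlink():
    """Try mkdir() on a dangling symlink and report what happens."""
    with tempfile.TemporaryDirectory() as tmp:
        link = os.path.join(tmp, "link")
        target = os.path.join(tmp, "no-such-target")
        os.symlink(target, link)  # the target is never created
        try:
            os.mkdir(link)
        except OSError as exc:
            # Report the symbolic errno name, e.g. "EEXIST".
            return errno.errorcode[exc.errno], os.path.exists(target)
        return "created", os.path.exists(target)
```

On a system that behaves as the quoted text requires, the call fails with EEXIST and nothing is created at the link's target.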

The subthread ended there, but elsewhere Andries gave pointers to The Austin Common Standards Revision Group (http://www.opengroup.org/austin/), also to a Long Description of the Proposed Common Standards Revision Project (http://www.opengroup.org/austin/docs/austin_9r6.txt), and finally the Austin Group's mailing list page (http://www.opengroup.org/austin/lists.html).

7. Attempt At New Slab Allocator

11 Jun 2000 - 15 Jun 2000 (5 posts) Subject: "New slab allocation (pre-alpha)"

Topics: SMP

People: Mark Hemment

Mark Hemment announced:

For the last few months, I've been working (on and off) on a replacement for my implementation of the Slab allocator.

The new allocator has the features;

  1. Designed with SMP in mind.
    The allocator has "clips", which are per-cpu lists of objects without locks. To allow support for large clips and good slab reaping, the objects in the clips can be unloaded back into their slabs.
  2. Support for different types of memory.
    Previously, DMA support was a "kludge"! The new design supports subcaches. Each cache has subcaches for the type of memory it supports. Currently, only two types of memory are supported; NORMAL and DMA. A better name might be zone-caches - matching the 2.4.x page types.
  3. Better object packing.
    Yet another object packing method has been added. This can increase the number of objects per slab with on-slab bufctls. All with reduced branching on the allocation/release paths!
  4. Page allocator friendly.
    Always tries to use zero-order page sized slabs, even if it causes memory to be wasted. A bit of waste is better than order-1 page allocations!
  5. Ability to destroy caches cleanly.
    Well, it will be "clean" when I finish it off and tidy the code up. The "cleaniess" refers to not having to continiously hold a lock to prevent a cache being destroyed (removing scalability bottlenecks on SMP). This will probably panic/lockup your machine at the moment.
  6. Ability to add new general size caches on the fly.
    This is coded, but never tested. Again, will probably cause a panic/lockup. To avoid having to lock the general cache linkage for SMP (searched from a kmalloc(), so a lock would be bad), some "tricks" are pulled.
  7. Cleaner code.
    It will be - eventually. For now, some comments are correct, some aren't, and some are missing.
  8. Some more stuff.

At present, the code is for 2.2.15 only and is pre-alpha quality. I might decide to re-design it tomorrow if I think of something better.

I will soon be moving to 2.4.x - infact, I will need to when finalising this. The page structures aren't cached aligned in 2.2.x, so the slab's internal usage of this structs' members may well cross L1 cache lines (not good for performance).

This code has only been run on an UP box.

I don't have access to an MP box, but hope to by next weekend. Note; the MP code uses different methods to add general size caches, and to empty clips - none of which can be easily tested on UP (even running an MP kernel on UP doesn't help much). MP owners, bewarned.

A patch is available at;

http://www.nextd.demon.co.uk/slab.01.patch

Or to just view the allocator's source;

http://www.nextd.demon.co.uk/slab.c
http://www.nextd.demon.co.uk/slab.h

Hopefully, I've loaded up the latest version. :(

There were a couple replies, but no real discussion.
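The "clips" idea from item 1 of Mark's list can be illustrated with a toy user-space model: each CPU keeps a private list of free objects that it uses without taking a lock, and only refilling or unloading a clip touches the shared, locked pool. This is a sketch of the general technique, not Mark's code. The class name, the batch size of half the clip limit, and the `lock_acquisitions` counter are assumptions made for illustration.

```python
import threading


class ClipCache:
    """Toy object cache with per-CPU lock-free 'clips'."""

    def __init__(self, ncpus, objects, clip_limit=4):
        self.lock = threading.Lock()
        self.slab_free = list(objects)           # shared pool, lock-protected
        self.clips = [[] for _ in range(ncpus)]  # one private list per CPU
        self.clip_limit = clip_limit
        self.lock_acquisitions = 0               # how often the lock was needed

    def _refill(self, cpu):
        # Slow path: move a small batch from the shared pool into the clip.
        with self.lock:
            self.lock_acquisitions += 1
            batch = self.clip_limit // 2 or 1
            for _ in range(min(batch, len(self.slab_free))):
                self.clips[cpu].append(self.slab_free.pop())

    def _unload(self, cpu):
        # Slow path: hand surplus objects back so they can be reaped.
        with self.lock:
            self.lock_acquisitions += 1
            clip = self.clips[cpu]
            keep = self.clip_limit // 2
            while len(clip) > keep:
                self.slab_free.append(clip.pop())

    def alloc(self, cpu):
        clip = self.clips[cpu]
        if not clip:
            self._refill(cpu)
        return clip.pop() if clip else None      # fast path takes no lock

    def free(self, cpu, obj):
        clip = self.clips[cpu]
        clip.append(obj)                         # fast path takes no lock
        if len(clip) > self.clip_limit:
            self._unload(cpu)
```

In a kernel the fast path would also need to keep the task on its CPU while it touches the clip; the model ignores that and only shows where the lock is and is not needed.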

8. NFSv3 In The Stable Series: The Saga Continues

11 Jun 2000 - 14 Jun 2000 (12 posts) Subject: "Linux 2.2.17pre1"

Topics: FS: NFS, Kernel Release Announcement

People: Alan Cox, Matthias Andree, Paul Jakma

Alan Cox announced 2.2.17pre1, adding, "I'm going for stabilising the oddments 2.2.16 got a bit wrong before we move onwards. This even though a pre patch should be somewhat more solid than 2.2.16." Matthias Andree asked if NFSv3 might get into 2.2.18, and Paul Jakma was also hopeful. For more on the ongoing saga of NFSv3 in the stable series, see Issue #61, Section #14 (21 Mar 2000: Status Of NFSv3).

Alan replied, "Various bits are going on. It may be in .17. Just for the moment .17 is the bug fixes." Matthias said, "I got the imagination you were planning 2.2.17 as bugfix-2.2.16-only release." And Alan corrected, "For the moment Im putting in the bug fix patches. What happens then depends how many and how serious."

9. Developers Discuss Microsoft

11 Jun 2000 - 16 Jun 2000 (8 posts) Subject: "Ballmer speaks a truth"

Topics: Microsoft, Patents

People: Rick Hohensee, Brandon S. Allbery, Rik van Riel, James Simmons, David Ford, Richard Torkar, Andrew Sharp, Ricky Beam

There was a bit of revelling this week, starting when Rick Hohensee quoted:

"So far, Linux doesn't have a lot of traction on the client [Microsoft-ese for desktop computers], except in some university environments."

Steve Ballmer of Microsoft, as quoted and remarked by John Schwartz in the Washington Post, June 11 2000

Brandon S. Allbery replied with a smirk, "Heh. We *are* pissing them off. Good. :)" And Rik van Riel said with a grin:

It must be worrying for MS how much Linux has increased in popularity. I believe we've gone from "Linux will never be successful" to the above "Linux isn't successful yet" in just about one year ... :)

Kind of worrysome that even MS is admitting the success of Linux :) (outside of their legal arguments)

James Simmons mentioned hearing that, "M$ in retaliation to the DJ will release advance technology they have been keeping from the people to wipe out their competitors. According to M$ linux will be gone in a year if they do this." David Ford replied, "Considering the past history of 'innovation', we already have this 'new technology'. And to m$ wiping out Linux.. a) can't happen and b) is pretty darn funny thinking 'bout it." Richard Torkar let his fiendish laugh be heard, "*muahahahahahahahaha* Since when has MS come up with an innovation?" And Andrew Sharp added, "Reality check...since when would M$ have EVER held back a technology that would do harm to their competitors? If the 2000 programmers Microserf has assigned to writing a Linux virus had anything ready to go, they wouldn't be "holding it back."" Ricky Beam answered the question of Microsoft's most recent innovations, estimating:

About 20 years ago. The wave they think they are riding evaporated long ago. Let's see, they bitch about linux being "based on 30yr old technology" -- they seem to forget how old MS-DOS, windows, and NT are and how much of the same "30yr old tech" they've adopted -- but very recently Microsoft gave windows the same capabilty UNIX has had for "ever" -- probablly only because 3rd parties are making money selling such capabilities.

I am, of course, referring to "Terminal Server" and "X"... X has had the ability to have it's API calls sent to any display from the first line of code.

Billy boy can say what ever he wants. There's two decades of evidence of his greed -- you should buy everything from Microsoft... There are too many cases to count of Microsoft ignoring various laws and IP -- patents, copyrights, out-right theft of technology...

10. Developer Philosophy: Quietly Breaking Hardware Ports In Unstable Series

12 Jun 2000 - 16 Jun 2000 (18 posts) Subject: "Linux 2.4.0-test1-ac16"

Topics: Assembly, I2O, Kernel Release Announcement, SMP

People: Alan Cox, Jens Axboe, Ed Carp, Matthew Wilcox, Peter Rival, Tigran Aivazian, Arnaldo Carvalho de Melo, Mike Phillips, Ingo Molnar, David Woodhouse, Russell King, Richard Torkar, Ben LaHaise, Chris Ricker

Alan Cox announced 2.4.0-test1-ac16:

These patches are versus 2.4.0-test1 from your favourite kernel mirror (ftp.us.kernel.org:/pub/linux/kernel/v2.4.0 (ftp://ftp.us.kernel.org/pub/linux/kernel/v2.4.0)). The patches are in /pub/linux/kernel/people/alan (ftp://ftp.us.kernel.org/pub/linux/kernel/people/alan).

James Cloos is maintaining diffs between -ac versions at
<http://jhcloos.com/pub/linux-2.4.0-test1-ac/>
<ftp://jhcloos.com/pub/linux-2.4.0-test1-ac/>

2.4.0-test1-ac16

  1. Squash PSN on the Transmeta CPU too | Not sure if its needed but.. (me)
  2. Small shmiq driver optimisation (Tigran Aivazian)
  3. Updates for arch/arm (Russell King)
  4. Fix drivers/char/Makefile for arm (Russell King)
  5. Fix etherh driver for 8390 changes (Russell King)
  6. Update ARM docs/Configure.help (Russell King)
  7. Fix docs references in config.in files (Andrzej M. Krzysztofowicz)
  8. Bring UDF into line with the inode wait flags (Jens Axboe)
  9. Fix a couple of CD driver bugs (Jens Axboe)
  10. tdfxfb fixes for flicker and for cursor (Alexander Lukyanov)
  11. Update Changes file (Chris Ricker)
  12. Fix potential buffer handling overrun with 512 byte blocks (Ben LaHaise)
  13. Fix a small glitch in an i2o timeout (Steve Ralston)
  14. Fix warnings in vesafb (Arnaldo Carvalho de Melo)
  15. Large raid patch update (Ingo Molnar)
  16. Fix missing blk_cleanup_queue on loop (Jens Axboe)
  17. Fix races in exec handling (Al Viro)

Mike Phillips replied that item 9 (Fix a couple of CD driver bugs) broke the compile by removing the CDROM_CAN macro definition, and Richard Torkar confirmed the breakage using 'gcc' 2.95.1; Jens Axboe replied, "Yup, pretty stupid error on my part, don't know how that macro disappeared. Just reverse that part of the patch, Alan already has a fix."

Peter Rival reported that both ac15 and ac16 would hang his Alpha ES40, right after enabling swap. David Woodhouse confirmed a similar problem on his ia32 SMP, though it was not repeatable, and happened on 2.3.99pre8 and 2.4.0test1-ac12.

Jim Barriault reported that ac16 and ac17 (apparently available by the time he posted) would both give compiler errors on his DS20, and Alan replied, "All the non x86 platforms got broken by the ptrace change. This fixes an ugly kernel race and needs asm level changes for the other platforms. Painful but important to do." Ed Carp chastised:

An extremely shortsighted thing to do. The x86 platform isn't the only platform that runs Linux, and these sorts of changes should be done for ALL platforms at once, rather than fixing "ugly" kernel stuff (which should've been fixed before anyway) at the expense of non-x86 platforms.

Then again, since it's Alan's personal release, maybe it doesn't matter, as long as it gets fixed for all platforms by the time that 2.4 gets kicked out the door.

There were three replies to this. Matthew Wilcox said, "you don't understand. this is the normal way of doing changes which require assembly or machine-specific changes. break the build on those architectures then the port maintainers notice immediately and fix them."

Alan also replied to Ed, "Tough. Its up to you to fix the other platforms. Its always worked like this. I am not playing nanny and co-ordinator to all the port maintainers. It broke the PPC and Alpha folks have updated their ports. Either fix the sparc one or wait until someone does. If I wait for all the maintainers then everyone suffers. If we break ports now and then most of them don't." Ed replied, "In other words, Alan tells the non-x86 folks that they can (in the words of one of my favorite movies) "all line up and kiss his ass!" ROFL!" And in the same post, he went on, "Oh, pretty please, DOCUMENT THE ASM CHANGES so we don't have to figure out what the hell you did to fix it? I hate to be a pain in the ass and scream yet again about kernel docs, but it really needs to be done if people want to consider themselves professional programmers working on a system they want to be accepted by the mainstream. It also makes porting code to other platforms A LOT LESS PAINFUL. See, one of my clients is this really big hardware company and I'd really like them to continue down the Linux road, but it's a tough sell if the technical gang tells them that VxWorks or PsOS is a lot better because they're designed to be relatively easy to port to other platforms." To this last, Matthew replied, "that's bollocks. who told you to use the latest development kernel? would you go to VxWorks, demand the latest nightly snapshot of their build and judge them on that? just because the latest snapshot of linux is available to you, doesn't mean we recommend it." And to the idea of Alan telling the non-x86 folks to kiss his ass, Matthew said, "you'd prefer alan to guess what the correct sparc, mips, ppc, arm, s/390, m68k, alpha and parisc asm code is? or ship a kernel with a known security hole? or not ship a kernel at all for weeks until all the port maintainers have had enough time to submit changes? face it, this is the only way to work. and you're the only one who has a problem with it."

Peter also replied to Ed's initial post, "it's already been fixed for Alpha, and I believe sparc & sparc64 either work or almost work now as well. What is _really_ annoying is when there are releases put out where nobody says anything about non-x86 ports being broken until someone complains and then the reaction is "oh yeah, we knew about that". I'm not complaining, it's just frustrating when I'm trying to get work done and I have to waste time trying to update to a release that is known at release time not to work." In the same post, he requested, "Alan - any chance we could get a list of things known to be broken/not to work added to the 2.4.0test-acXX release announcements? I don't care that things get broken, I'd just like to not have to find it out if people already know. Or, smack me in the head and tell me "moron, things are always documented - look here". Either way - I'm not offended... ;)" Alan replied that there would be so much to report, that he'd have to gzip his posts. But he also pointed out where the release notes had said "Fix ptrace races | may need tweaks for non x86". And Peter concluded the thread with, "Good 'nuff - my bad. I just didn't realize that "may need tweaks" could also mean "doesn't build". I'm just gonna shut up now and try to find some more caffeine. :)"

11. Developers Argue Over Virtual Memory: 'classzone' Vs. 'strict zone'

12 Jun 2000 - 15 Jun 2000 (31 posts) Subject: "[patch] improve streaming I/O [bug in shrink_mmap()]"

Topics: Virtual Memory

People: Zlatko Calusic, Andrea Arcangeli, Rik van Riel, Stephen C. Tweedie

The discussion started peacefully enough. Zlatko Calusic posted a one-liner to fix a long-standing problem in the virtual memory system. He explained, "While searching for a discardable page in shrink_mmap() Linux was too easily failing and subsequently falling back to swapping. The problem was that shrink_mmap() counted pages from the wrong zone, and in case of balancing a relatively smaller zone (e.g. DMA zone on a 128MB computer) "count" would be mistakenly spent dealing with pages from the wrong zone. The net effect of all this was spurious swapping that hurt performance greatly." Stephen C. Tweedie was impressed, and added that Zlatko's bug might be the same thing causing the excessive CPU usage recently reported. For more on this, see Issue #66, Section #3 (22 Apr 2000: 'kswapd' Instability; Debugging Deadlocks), also Issue #68, Section #5 (8 May 2000: Virtual Memory Problems Persist In Development Series), and Issue #69, Section #6 (16 May 2000: Possible Fix For 'kswapd' CPU Overuse). Classzone had a brief mention in Issue #70, Section #2 (18 May 2000: Things To Do Before 2.4: Saga Continues), and there was a brief exchange in Issue #72, Section #6 (3 Jun 2000: More VM Bug Hunting).
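The effect Zlatko described can be reproduced with a toy model of a scan that has a fixed budget. This is a sketch of the accounting problem only, not the kernel's shrink_mmap(): the function name, the page tuples, and the `charge_wrong_zone` switch are all hypothetical, and the real one-line patch may differ in detail.

```python
def shrink(pages, target_zone, count, charge_wrong_zone):
    """Scan pages for a discardable one in target_zone within a budget.

    pages is a list of (zone, discardable) tuples in scan order.
    Returns True if a page was freed, False if the caller would have
    to fall back to swapping.
    """
    for zone, discardable in pages:
        if count <= 0:
            return False
        if zone != target_zone:
            if charge_wrong_zone:
                count -= 1  # buggy accounting: budget spent on other zones
            continue
        count -= 1
        if discardable:
            return True
    return False


# Twenty NORMAL pages sit in front of the single DMA page we need.
pages = [("NORMAL", True)] * 20 + [("DMA", True)]
buggy = shrink(pages, "DMA", count=8, charge_wrong_zone=True)
fixed = shrink(pages, "DMA", count=8, charge_wrong_zone=False)
```

With the budget charged for pages in the wrong zone, the scan gives up before reaching the DMA page. With the corrected accounting it finds the page and no swapping is needed.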

Rik van Riel, the main VM maintainer, agreed with Stephen, and added that only one known bug remained with the current 'shrink_mmap()'. Andrea Arcangeli, author of the classzone patch (the only patch confirmed to solve the 'kswapd of death' problem), also replied to Stephen. He remarked that the "strict zone" approach preferred by Rik and others would have higher loads anyway, just by its nature. He described an exploit, "You boot, you allocate all the normal zone in cache doing some fs load, then you start netscape and you allocate the lower 16mbyte of RAM into it, then doing some other thing you trigger kswapd to run because also the lower 16mbyte are been allocated now. Then netscape exists and release all the lower 16m but kswapd keeps shrinking the normal zone (this shouldn't happen and it wouldn't happen with classzone design)." He went on, "I think Linus's argument about the above scenario is simply that the above isn't going to happen very often, but how can I ignore this broken behaviour? I hate code that works in the common case but that have drawbacks in the corner case. It would be better if I wouldn't know what the current code is doing, then I could accept it more easily."
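Andrea's scenario can be sketched with a small Python model that compares the two ways of deciding whether the normal zone still needs shrinking. This is a loose illustration under stated assumptions, not either implementation: the zone names, the page counts, the "low" watermark field and the idea of summing watermarks across a class are all made up for the example.

```python
ZONE_ORDER = ["dma", "normal"]  # lower zones first; names are illustrative


def needs_shrinking_strict(zones, name):
    """Strict-zone reading: each zone is judged only by its own free pages."""
    zone = zones[name]
    return zone["free"] < zone["low"]


def needs_shrinking_classzone(zones, name):
    """Classzone reading: a zone is judged together with every lower zone,
    since an allocation aimed at it could also be served from those."""
    members = ZONE_ORDER[: ZONE_ORDER.index(name) + 1]
    free = sum(zones[m]["free"] for m in members)
    low = sum(zones[m]["low"] for m in members)
    return free < low


# State after netscape exits: the DMA zone is free again, while the normal
# zone is still full of cache. All page counts are invented.
zones = {
    "dma": {"free": 4000, "low": 128},
    "normal": {"free": 50, "low": 256},
}

print(needs_shrinking_strict(zones, "normal"))     # True
print(needs_shrinking_classzone(zones, "normal"))  # False
```

Under the per-zone check, kswapd would keep working on the normal zone even though plenty of memory has just been released lower down, which is the behaviour Andrea objected to.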

Rik agreed that there was a theoretical load increase, but that it was balanced out by other benefits. He went on:

Let me summarise the drawbacks of classzone and the strict zone approach:

Strict zone approach:

Classzone:

Here you'll see that both systems have their advantages and disadvantages. The zoned approach has a few (minimal) performance disadvantages while classzone has a few stability disadvantages. Personally I'd chose stability over performance any day, but that's just me.

The big gains in classzone are most likely from the _other_ changes that are somewhere inside the classzone patch. If we focus on merging some of those (and maybe even improving some of the others before merging), we can have a 2.4 which performs as good as or better than the current classzone code but without the drawbacks.

Andrea replied point-by-point, disagreeing with the drawbacks Rik found with classzone. He argued that it was strict zone, and not classzone, that had the incorrect behavior. He put it, "Classzone provides the correct behaviour but at a potentially major fixed cost during allocations/deallocations and the lock is not per-zone anymore. However this additional information that we collect we'll avoid us to waste CPU and memory so it's not obvious that classzone will decrease performance."

Elsewhere Rik and Andrea had a long angry staircase, in which they both had to step back at times and take a deep breath before going on. Finally, they were unable to see eye to eye on the technical points, and the thread ended inconclusively. Since Rik is the official maintainer of the current code, the burden of proof seems to rest on Andrea, if he wants to see his classzone patch in the main tree.

12. Alan Cox Not Updating EXTRAVERSION In -ac Patches

13 Jun 2000 - 19 Jun 2000 (9 posts) Subject: "Alan, tie a string around your finger"

People: Xuan BaldaufAlan CoxGarst R. Reese

Garst R. Reese chastised Alan Cox, saying that Alan kept forgetting to set the EXTRAVERSION variable to indicate the new ac version numbers. Xuan Baldauf agreed, and explained:

everytime you forget to update EXTRAVERSION, I get my modules overwritten in the wrong place. As a consequence, I cannot use the modules on the yet running ac18 anymore, because the modversions are for ac19, not ac18.

So *please* use a script which automatically increments EXTRAVERSION (I'm sure you use a script to produce patches). If that's to difficult to act on a whole file, create a file "extraversion", and import it into the makefile. It's not important how you do it, but please do it.

Alan had no reply.
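For the curious, the kind of script Xuan asked for could be as small as the following Python sketch. It assumes the top-level Makefile carries a line such as "EXTRAVERSION = -ac18" that ends in a number; that layout, the sample version string and the bump_extraversion() name are assumptions made for illustration, and this is not a script anyone in the thread posted.

```python
import re


def bump_extraversion(makefile_text):
    """Increment the trailing number on the EXTRAVERSION line.

    Assumes a line of the form 'EXTRAVERSION = -ac18' (any prefix followed
    by trailing digits); that layout is an assumption of this sketch.
    """
    pattern = re.compile(r"^(EXTRAVERSION[ \t]*=.*?)(\d+)[ \t]*$", re.MULTILINE)

    def increment(match):
        return match.group(1) + str(int(match.group(2)) + 1)

    new_text, replacements = pattern.subn(increment, makefile_text, count=1)
    if replacements == 0:
        raise ValueError("no EXTRAVERSION line ending in a number was found")
    return new_text


sample = "VERSION = 2\nPATCHLEVEL = 4\nSUBLEVEL = 0\nEXTRAVERSION = -ac18\n"
print(bump_extraversion(sample), end="")
```

Run as part of whatever produces the patch, something like this would give each release its own module directory, which is all Xuan was asking for.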

13. Alan's Latest List Of Things To Do Before 2.4 Can Come Out

13 Jun 2000 - 17 Jun 2000 (64 posts) Subject: "Semi up to date JOBS list"

Topics: Compression, Disk Arrays: RAID, Disks: IDE, Disks: SCSI, FS: FAT, FS: NFS, FS: NTFS, FS: UMSDOS, FS: devfs, FS: ext2, FS: ramfs, Forward Port, I2O, Networking, PCI, Power Management: ACPI, Real-Time, SMP, Samba, Security, USB, Virtual Memory, VisWS

People: Alan CoxJeremy KatzRichard GoochRik van RielRogier WolffAlexander Viro

Alan Cox posted the most recent version of his list of things to do before 2.4 could come out. He listed:

  1. Should Be Fixed (Confirmation Wanted)

    1. IDE fails on some VIA boards (eg the i-opener)
    2. Floppy driver broken by VFS changes. Other drivers may be too (Stuff gets called after _close now - unload race possibly too)
    3. Fbcon races
    4. Fix all remaining PCI code to use new resources and enable_Device (mostly done)

  2. Capable Of Corrupting Your FS

    1. E820 memory setup causes crashes/corruption on some laptops
    2. Use PCI DMA by default in IDE is unsafe (must not do so on via VPx x<3)

  3. Security

    1. Fix module remove race bug (mostly done - Al Viro)
    2. exec loader permissions
    3. access_process_mm oops/lockup if task->mm changes (Manfred) [user can cause deliberately]

  4. Boot Time Failures

    1. AHA29xx driver appears to stomp other cards (may be BIOS)
    2. AHA27xx is broken (maybe 28xx too)
    3. Use PCI DMA 'lost interrupt' problem with some hw [which ?] (NEC Versa LX with PIIX tuning)
    4. HT6560/UMC8672 ide sets up stuff too early (before region stuff can be done)
    5. Crashes on boot on some Compaqs ? (may be fixed)
    6. ACPI hangs on boot for some systems
    7. Boot hangs on a range of Dell docking stations (Latitude)

  5. In Progress

    1. Dcache threading (Al Viro)
    2. Merge the network fixes (DaveM)
    3. Finish I2O merge (Intel/Alan)
    4. Exploitable leak in file locking (Willy)
    5. Restore O_SYNC functionality (Stephen) - core code and ext2 done

  6. Obvious Projects For People (well if you have the hardware..)

    1. pci_socket crash on unload
    2. DEFXX driver appears broken
    3. NCR5380 isnt smp safe
    4. DMFE is not SMP safe
    5. Make syncppp use new ppp code
    6. Fix SPX socket code
    7. Merge the 2.2 ServeRAID driver into 2.4
    8. Merge the current Compaq RAID driver into 2.4

  7. Fix Exists But Isnt Merged

    1. Update SGI VisWS to new-style IRQ handling (Ingo)
    2. 64bit lockf support
    3. Support MP table above 1Gig (Ingo)
    4. Dont panic on boot when meeting HP boxes with wacked APIC table numbering (AC)
    5. Scheduler bugs in RT (Dimitris)
    6. HFS is still broken
    7. AIC7xxx doesnt work non PCI ? (Doug says OK, new version due anyway)
    8. Fix hpfs_unlink (Al Viro)
    9. Loopback fs hangs
    10. Fix boards with different TSC per CPU and kill TSC use on them
    11. Floppy last block cache flush error
    12. TB Multisound driver hasnt been updated for new isa I/O totally.

  8. To Do

    1. mount crashes on Alpha platforms
    2. Tulip hang on rmmod/crashes sometimes
    3. Devfs races, Sockfs (removing NULL ->i_sb stuf) (Al Viro)
    4. Debian report that the gcc 2.95 possibly miscompiles fault.c or mm/remap.c (Perl script available from Arjan)
    5. Fix further NFS races (Al Viro)
    6. Test other file systems on write
    7. Audit all char and block drivers to ensure they are safe with the 2.3 locking - a lot of them are not especially on the open() path.
    8. Stick lock_kernel() calls around driver with issues to hard to fix nicely for 2.4 itself
    9. PCMCIA/Cardbus hangs, IRQ problems, Keyboard/mouse problem (may be fixed ?)
    10. Fix mount failures due to copy_* user mishandling
    11. Fix default mount behaviour to disallow repeat mounting

  9. To Do But Non Showstopper

    1. Finish 64bit vfs merges (lockf64 and friends missing)
    2. Go through as 2.4pre kicks in and figure what we should mark obsolete for the final 2.4
    3. Union mount (Al Viro)
    4. Per Process rtsigio limit
    5. iget abuse in knfsd
    6. Some people report 2.3.x serial problems
    7. USB hangs on APM suspend on some machines
    8. PCMCIA crashes on unloading pci_socket
    9. ISAPnP IRQ handling failing on SB1000 + resource handling bug
    10. DVD-RAM is apparently not working for write currently (Rogier Wolff)
    11. Parallel ports should set SA_SHIRQ if PCI (eg in Plip)
    12. Devfs compiled in but not mounted causes crap for ->mnt_devname of root (Al Viro)

  10. Compatibility Errors

    1. Xterm broke in 2.3.99pre6 (FIONREAD/select loop)

  11. Probably Post 2.4

    1. per super block write_super needs an async flag
    2. addres_space needs a VM pressure/flush callback (Ingo)
    3. per file_op rw_kiovec

  12. Drivers In 2.2 not 2.4
  13. To Check

    1. Check O_APPEND atomicity bug fixing is complete
    2. Protection on isize (sct) [Al Viro mostly done]
    3. Mikulas claims we need to fix the getblk/mark_buffer_uptodate thing for 2.3.x as well
    4. Network block device seems broken by block device changes
    5. VFS?VM - mmap/write deadlock (demo code seems to show lock is there)
    6. rw sempahores on page faults (mmap_sem)
    7. kiobuf seperate lock functions/bounce/page_address fixes
    8. Fix routing by fwmark
    9. Some FB drivers check the A000 area and find it busy then bomb out
    10. rw semaphores on inodes to fix read/truncate races ? [Probably fixed]
    11. Not all device drivers are safe now the write inode lock isnt taken on write
    12. File locking needs checking for races
    13. Multiwrite IDE breaks on a disk error [minor issue at best]
    14. ACPI/APM suspend issue - IDE related stuff ?
    15. NFS bugs are fixed
    16. Chase reports of SMB not working
    17. Locking on getcwd
    18. IRDA calls get random bytes before random is set up
    19. Some AWE cards are not being found by ISAPnP ??
    20. SHM segments not always being detached and destroyed right ?
    21. RAM disk contents vanishing on cramfs (block change) and bforget cases

  14. Fixed

    1. Incredibly slow loopback tcp bug (believed fixed about 2.3.48)
    2. COMX series WAN now merged
    3. VM needs rebalancing or we have a bad leak
    4. SHM works chroot
    5. SHM back compatibility
    6. Intel i960 problems with I2O
    7. Symbol clashes and other mess from _three_ copies of zlib!
    8. PCI buffer overruns
    9. Shared memory changes change the API breaking applications (eg gimp)
    10. Finish softnet driver port over and cleanups
    11. via rhine oopses under load ?
    12. SCSI generic driver crashes controllers (need to pass PCI_DIR_UNKNOWN..)
    13. UMSDOS fixups resync (not quite done)
    14. Make NTFS sort of work
    15. Any user can crash FAT fs code with ftruncate
    16. AFFS fixups
    17. Directory race fix for UFS
    18. Security holes in execve()
    19. Lan Media WAN update for 2.3
    20. Get the Emu10K merged
    21. Paride seems to need fixes for the block changes yet
    22. Kernel corrupts fs and gs in some situations (Ulrich has demo code)
    23. 1.07 AMI MegaRAID
    24. Merge 2.2.15 changes (Alan)
    25. Get RAID 0.90 in (Ingo)
    26. S/390 Merge
    27. NFS DoS fix (security)
    28. Merge the RIO driver
    29. Fix Space.c duplicate string/write to constants
    30. Elevator and block handling queue change errors are all sorted
    31. Make sure all drivers return 1 from their __setup functions (Done ?)
    32. Enhanced disk statistics
    33. Complete vfsmount merge (Al Viro)
    34. Merge removed-buf-open directory stuff into VFS (Al Viro)
    35. Problems with ip autoconfig according to Zaitcev
    36. NFS causes dup kmem_create on reload (Trond)
    37. vmalloc(GFP_DMA) is needed for DMA drivers (Ingo)
    38. TLB flush should use highest priority (Ingo)
    39. SMP affinity code creates multiple dirs with the same name (Ingo)
    40. Set SMP affinity mask to actual cpu online mask (needed for some boards) (Ingo)
    41. heavy swapping corrupts ptes (believed so)
    42. pci_set_master forces a 64 latency on low latency setting devices.Some boards require all cards have latency <= 32
    43. msync fails on NFS (probably fixed anyway)
    44. Find out what has ruined disk I/O throughput. (mostly)
    45. PIII FXSAVE/FXRESTORE support
    46. The netdev name changing stuff broke GRE
    47. put_user is broken for i386 machines (security) - sem stuff may be wrong too
    48. BusLogic crashes when you cat /proc/scsi/BusLogic/0 (Robert de Vries)
    49. Finish sorting out VM balancing (Rik Van Riel, Juan Quintela et al)
    50. Fix eth= command line
    51. 8139 + bridging fails
    52. RtSig limit handling bug
    53. Signals leak kernel memory (security) [FIX in ac tree]
    54. TTY and N_HDLC layer called poll_wait twice per fd and corrupt memory
    55. ATM layer calls poll_wait twice per fd and corrupts memory
    56. Random calls poll_wait twice per fd and corrupts memory
    57. PCI sound calls poll_wait twice per fd and corrupts memory
    58. sbus audio calls poll_wait twice per fd and corrupts memory
    59. IBM MCA driver breaks on Device_Inquiry at boot
    60. SHM code corrupts memory (Russell)
    61. Linux sends a 1K buffer with SCSI inquiries. The ANSI-SCSI limit is 255.
    62. Linux uses TEST_UNIT_READY to chck for device presence on a PUN/LUN. The INQUIRY is the only valid test allowed by the spec.
    63. truncate_inode_pages does unsafe page cache operations
    64. Fix the ptrace code to be back compatible and add a new PTRACE call set for getting the PIII extra registers
    65. EPIC100 fixes
    66. Tlan and Epic100 crash under load

Jeremy Katz replied with an entry to add to section 4 (boot time failures). He said, "Add "AIC7xxx driver doesn't work with Western Digital drives". This was fixed in Doug's 2.2.x drivers (the 5.1.29 one was the one with the fix iirc) but it hasn't been forward ported to 2.3 yet."

To item 8.3 (Devfs races, Sockfs (removing NULL ->i_sb stuf) (Al Viro)), Richard Gooch replied, "Actually, I've been working on fixing the devfs races. My latest patch improves things a lot."

To item 3.1 (Fix module remove race bug (mostly done - Al Viro)), Alexander Viro replied that this was in progress.

To item 14.3 (VM needs rebalancing or we have a bad leak), Rik van Riel replied:

There are two small things still needed to be done.

  1. in a highly unlikely situation (all lru pages are either of a 'wrong' zone or unfreeable) shrink_mmap() can get into an infinite loop
  2. I see if the classzone patch has something worthwhile left we should merge (I believe we should have most of it by now) OR put in the active/inactive/scavenge thing I'm still working on now

(I could have finished the active/inactive/scavenge queue thing a week ago, but I want the code to be obvious, readable and understandable for the untrained eye ... in the long run I want all VM code to be easily readable and maintainable, if only so it's easy to spot and fix bugs when somebody needs to do some modifications)

Alan added to Rik's list, item 3: "Figure out why it all went to shit about ac14." Rik pointed out that there had been virtually no VM changes since ac10, and Alan replied, "I suspect the ac10 change may be the actual one that did the damage." They went on to have a brief hunt for the exact time of breakage.

14. Dell Binary-Only Drivers May Go Open Source

14 Jun 2000 - 15 Jun 2000 (8 posts) Subject: "Compiling Linux 2.2.16+ with Dell Proprietary PERC 3/Si Raid Device"

People: Byron StanoszekMatt DomschRik van RielAlan Cox

Byron Stanoszek asked:

I have a couple of Dell PowerEdge 4350 computers with a PERC 3/Si raid controller installed as the boot device. I know about Dell's website that has a binary-only version of their percraid.o module, but they provide no means of upgrading this kernel to 2.2.15 or 2.2.16 while using the same binary driver.

Is there any chance that anyone is working on an open-source driver that can be included into the kernel to enable support for this device in the future?

There were several replies, including one from Dell representative Matt Domsch, who said, "Yes. Dell and our partners have been working on making this driver open-source. We're in the "seriously-pound-it-until-you-find-most-bugs" stage, and it's not quite ready for release. We understand the concern, and are anxious to make this driver ready for everyone. Supporting binary-only drivers (particularly with MODVERSIONS enabled) is really really hard, and I'm looking forward to dropping that one from my list of worries."

Alan Cox confirmed that he'd heard a rumor about the driver being Open Sourced at some time in the future. Nicholas Marouf asked what he and others could do to speed the process along, and Rik van Riel replied, "Not much. The only thing we can do is advice people to buy elsewhere until the Dell hardware is properly supported."

15. Possible Solution For Recent VM CPU Hogging

15 Jun 2000 - 18 Jun 2000 (14 posts) Subject: "kswapd at 96% CPU on my 16Mb system"

Topics: Virtual Memory

People: Rik van Riel

Kees Bakker reported bad 'kswapd' performance on 2.4.0test1-ac17 (see Issue #73, Section #11 (12 Jun 2000: Developers Argue Over Virtual Memory: 'classzone' Vs. 'strict zone') in this issue). After some back-and-forth, in which Kees posted some log output, Rik van Riel said:

Ahh, I see the problem. The function do_try_to_free_pages() continues to free pages long after we have reached enough free memory.

The problem is that if that happens, shrink_mmap() will loop for a long long time but at the same time refuse to free pages because zone->free_pages > zone->pages_high.

In effect, shrink_mmap() enters something quite close to an infinite loop ...
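The interaction Rik described can be imitated with a toy Python model: a caller keeps asking for pages to be freed, while the scan refuses to free anything from a zone whose free_pages is already above pages_high. The function names, the dictionary layout and the numbers are invented for this sketch, and the stop_when_balanced early exit is a hypothetical remedy shown only for contrast, not Rik's actual fix.

```python
def shrink_mmap_model(lru, zones):
    """Scan the LRU once for a page to free; return (freed, pages_examined)."""
    examined = 0
    for zone_name in lru:
        examined += 1
        zone = zones[zone_name]
        if zone["free_pages"] > zone["pages_high"]:
            continue  # zone already has enough free memory: refuse to free
        zone["free_pages"] += 1
        return True, examined
    return False, examined


def free_pages_model(lru, zones, rounds, stop_when_balanced):
    """Call shrink_mmap_model up to `rounds` times; return total pages examined."""
    total = 0
    for _ in range(rounds):
        balanced = all(z["free_pages"] > z["pages_high"] for z in zones.values())
        if stop_when_balanced and balanced:
            break  # hypothetical early exit, not taken from any actual patch
        _, examined = shrink_mmap_model(lru, zones)
        total += examined
    return total


zones = {"normal": {"free_pages": 600, "pages_high": 512}}
lru = ["normal"] * 1000

print(free_pages_model(lru, zones, rounds=8, stop_when_balanced=False))  # 8000
print(free_pages_model(lru, zones, rounds=8, stop_when_balanced=True))   # 0
```

Without the early exit, every round walks the whole list and frees nothing, which is the CPU-burning pattern Kees saw from kswapd.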

Sharon And Joy
 

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.