Kernel Traffic #63 For 17 Apr 2000

By Zack Brown

Introduction

Many thanks go to Arun Sharma, who pointed out that in Issue #62, Section #13 (30 Mar 2000: Intel eepro100 Driver To Be GPL-Compatible?), the article title, "Intel eepro100 Driver Going Open Source?" implied that the Intel driver was not already Open Source. As Arun pointed out, the code and license are available on Intel's site (http://developer.intel.com/support/network/adapter/pro100/30504.htm).

Originally I figured I'd just make sure, so I went over to opensource.org (http://www.opensource.org) to see if Intel's license was listed in the Open Source Initiative's list of certified licenses (http://opensource.org/licenses/). When I couldn't find it there, I figured I'd just change the article title anyway, since the main point in the KT article was that Intel's code was incompatible with the GPL, not that it was closed source.

I changed the title to "Intel eepro100 Driver To Be GPL-Compatible?" and wrote back to Arun, letting him know I'd changed the article, but that actually, the license was not Open Source after all.

Within minutes I got a reply. Arun said, "The text matches http://www.opensource.org/licenses/bsd-license.html as far as I can tell. Am I overlooking something ?" I ran to the site to check it out, and lo and behold, he was right! Arun, you completely rock!

In relation to this, thanks also go to Alan Cox, who took the time to answer a few of my questions. He said of the BSD license on opensource.org (and therefore Intel's copy), "They have a variant of the advertising clause which is not GPL compatible. Intel seem keen to sort that out. I just have to finish clarifying that with Intel then we can merge the two drivers into one great one." Thanks for the extra information, Alan!

Mailing List Stats For This Week

We looked at 1366 posts in 5959K.

There were 447 different contributors. 205 posted more than once. 169 posted last week too.

The top posters of the week were:

1. NetWare Filesystem Sources Published

27 Mar 2000 - 30 Mar 2000 (53 posts) Subject: "NWFS Source Code Posted at 207.109.151.240"

Topics: Backward Compatibility, Clustering, Disk Arrays: LVM, Disk Arrays: MD, Disk Arrays: RAID, FS: NFS, FS: NTFS, FS: ROMFS, FS: ext2, Ioctls, SMP

People: Jeff V. Merkey, Pavel Machek, Steve Dodd, Matthew Kirkwood, Erez Zadok, Alan Cox, David S. Miller, Stephen C. Tweedie, Alexander Viro, Christoph Hellwig, Andi Kleen, Richard Gooch, Stephen Tweedie

Jeff V. Merkey announced, "The Open Source Release of NWFS 2.2 for the Linux 2.0 and 2.2 kernels is posted to our site at www.timpanogas.com (http://www.timpanogas.com) and 207.109.151.240 (ftp://207.109.151.240) . Included are the release notes. 2.4 will be posted Wednesday, March 29, 2000 at 7:00 a.m. Eastern Time." He listed the new features:

  1. Full Asynch IO Manager (SMP)
  2. NetWare-ish LRU Mirrored Block Cache.
  3. Handle Based Virtual Partition Mirroring and Hotfixing Engine (NWVP).
  4. Full SMP Support (and we even tested it).

Andi Kleen asked for more explanation of why NWFS used its own buffer cache instead of the standard Linux one. He also asked, regarding item 1, what NWFS' asynch manager offered that was better than the normal asynchronous block device interface. Jeff explained:

The Linux Buffer Cache does not present a logical block cache for NetWare's flavor of mirroring support, although basically what's there is implemented as a Logical Block Cache on top of the Linux Buffer Cache. You will notice that NWFS can use either its own LRU or the linux buffer cache (there's an #if/#else/#endif for LINUX_BUFFER_CACHE in block.c -- this will make NWFS sit on top of buffer.c). You would want to crank down the MAX_BLOCK_BUFFERS count in nwfs.h to something smaller than 2000.

The Async processes basically provides an elevator for remirroring and concurrent mirroring I/O so the file system doesn't get starved. It sits on top of the linux disk interface. These abstractions are also necessary for our clustered file system for Linux (M2FS). Look carefully at the layering in nwvp.c, and it will be clear why we did what we did.

This code base also runs on several other platforms, and some of what is there is not applicable to Linux, but is to NT and DOS. The tools use these interfaces and we have to have them as convenient abstractions since they are not available on every platform (not all OS's are as blessed as Linux).

What's here is a smaller version of the internal architecture of Native NetWare. Some of the services and what they do is a little different.
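For readers unfamiliar with the term, the kind of LRU block cache Jeff describes can be sketched in a few lines. This is only an illustration in Python, not NWFS code: the class name, the read_block callback, and the max_blocks limit (standing in for something like MAX_BLOCK_BUFFERS) are all assumptions made for the example.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Toy least-recently-used cache keyed by logical block number."""

    def __init__(self, max_blocks, read_block):
        self.max_blocks = max_blocks    # hypothetical cap, in the spirit of MAX_BLOCK_BUFFERS
        self.read_block = read_block    # hypothetical callback that fetches a block from disk
        self.blocks = OrderedDict()     # block number -> data, least recently used first

    def get(self, block_no):
        if block_no in self.blocks:
            self.blocks.move_to_end(block_no)   # hit: mark as most recently used
            return self.blocks[block_no]
        data = self.read_block(block_no)        # miss: go to the backing store
        self.blocks[block_no] = data
        if len(self.blocks) > self.max_blocks:
            self.blocks.popitem(last=False)     # evict the least recently used block
        return data
```

A smaller max_blocks trades memory for more trips to the backing store, which is presumably the trade-off behind Jeff's advice to lower the buffer count when sitting on top of the Linux buffer cache.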

Two threads branched out of this explanation. Pavel Machek pointed out, "putting additional abstraction layers is considered evil... I don't think you can push nwfs into linus's tree with design like that. It is fine for occasional access of nwfs drives, though."

Jeff replied characteristically, "It's a design that Novell uses to gross more $$$$$$$ routinely in one month than Linux has brought in with all the Linux companies revenues combined in the past five years. :-)" And then added:

Wasn't trying to push anything, there are lots of people who want NetWare stuff, just trying to make Linux more attractive to Novell's 9,000,000 server installed base and get the NetWare file system on Linux out the door. You guys are always welcome to carve up NWFS like a christmas turkey and chunk the pieces that don't make sense for Linux moving forward. You'll find I'm not too religious about the sanctity of my own code, I'd rather be a contributor and team player with you guys. :-)

With the page cache, the file systems in Linux are more or less reduced to meta-data drivers (very similar to where NT is at today). The extra layers are not really all that heavy, and are there to support our clustering stuff (which we will open source as well). Most of them use fast paths in the code (which you will see if you take the time to look at it). Unlike most of the File System code in Linux, NWFS has COMMENTS, NOTES, EXPLANATIONS, DETAILED TECHNICAL INFO, etc. :-)

You guys might learn something by looking at it. I am right at this moment staring at the NCPFS code in 2.3.99Pre3 since there is NO DOCUMENTATION ON THE PAGE CACHE, DCACHE, VFS or anything else here to help besides poring over source code. :-)

Don't knock NetWare til you try it. :-)

He replied to himself shortly afterwards, with a reproach for Pavel:

Oh Yeah, and I almost forgot. Remember those emails I sent you 6 months ago. The one's asking for your help with creating an md driver for the NetWare mirroring so I didn't have to replicate a NetWare style LRU so I could tightly integrate NWFS with the Linux Buffer Cache and md (in fact, I even was willing to give you the nwvp code and sent it to you so you could put it into the current RAID driver)?

Did you ever respond. NO ......

How many emails did I send you? Was it seven or eight?

Did you answer any of them. NO .......

If you are still interested in doing what I suggested, unlike you, I will answer your emails and help you take over the code (and let you control it for Linux) so Linux will be able to enter Novell's market with a tighter implementation.

Think about it ......

:-) :-) :-).

Touche .....

Andi brought things back to a technical level, and there followed a bit of implementation discussion.

Steve Dodd also replied to Jeff's explanation of the buffer cache, saying that as he understood things, "Al Viro, Stephen Tweedie, and a number of others have some interesting plans for the block device layer, buffer cache and page cache, and the interactions between them and with the VFS -- for 2.5. I assume they're busy with tidying up 2.4pre loose ends at the moment, but hopefully after 2.4 is out the door and stable there will be some discussion of the planned changes on linux-fsdevel. It sounds (I've not looked at your code yet) like your input would be useful when that happens. An I/O system that meets everyone's needs for 2.6 would be a great goal.." He added, "regarding your comments about Linux documentation - have you looked at the work Alan Cox is doing on this yet?" To this last point, Jeff replied, "Yes, i've looked at Alan's Documentation stuff. And as always, Alan's work is absolutely excellent -- however, what's really needed is something like NT puts out with their IFS Kits, a sample VFS code example for a NULL file system for folks who are porting to Linux. Would cut down on time along with an "oh shit" list of interface issues to watch out for. Richard Gooch did a decent job on the vfs.txt file (it was better than nothing). We need more though to get on par with NT."

Matthew Kirkwood replied with an interesting comment:

As I understand it, your nwfs is probably the first filesystem to have been successfully "ported" to Linux. Pretty much everything else (with, perhaps, the exception of the abomination that is the NTFS driver) started off native.

The multiple times that I have written 30 to 70% of a filesystem, I found the romfs and minixfs code to be most instructive as a guide to the VFS interfaces. The buffer and page cache stuff is rather harder to track down canonical examples for, though again minixfs is pretty helpful, if rather simplistic.

Jeff replied that he also used Minix for reference. Steve pointed out, "ext2 really isn't too bad an example, either. The page cache stuff seemed to be surprisingly simple (unless I've missed some important wrinkles) to figure out from that, and the relevant bits of vfs code." And Erez Zadok added:

Whenever Ion and I wanted to understand the VFS-f/s interaction, we often looked at the VFS code, and samples of three file systems:

Alan Cox replied to Jeff's comments about his documentation activities, pointing out that documenting functions was more important than documenting the structures. He added that too many programmers thought documentation was "boring and uncool," adding, "Documentation is worth it just to be able to answer all your mail with 'RTFM'."

Jeff volunteered to create and maintain the code and docs for a NULL filesystem driver, and Alan gave his blessings.

The thread ended, but under the Subject: NWFS 2.2.1 Source Code Released (nwfs0328), Jeff announced:

The Open Source Code for NWFS 2.2.1 has been posted to www.timpanogas.com (http://www.timpanogas.com) and 207.109.151.240 (ftp://207.109.151.240). This version corrects a reported mirroring coalescence problem and fixes a bug in the nwvp_vpartition_map_asynch_write() function.

NWFS 2.2.1 Sources are in the nwfs0328.zip and nwfs0328.tar.gz files on our FTP server at 207.109.151.240 or www.timpanogas.com. We will post NWFS 2.2 for Linux Kernel 2.4 support tomorrow morning at 7:00 a.m.

There was no reply to that, but under the Subject: Oops in 2.2.15 with NWFS using the Linux Buffer Cache SMP, Jeff reported:

I am seeing an oops and message indicating that the VFS buffer free lists are getting corrupted under 2.2.15 when I use NWFS with multiple SMP threads with the Linux Buffer Cache. The NWFS LRU works just fine, however.

If I put a big lock over all accesses and make all the interactions with the buffer cache synchronous, the problem goes away.

Is it required that I hold the kernel lock with lock_kernel() over the Linux Buffer Cache when doing I/O to it in 2.2.15?

David S. Miller replied, "Yes. The buffer cache was not thread safe until I did that work circa ~2.3.13 or so." Stephen C. Tweedie also confirmed, "the bulk of the 2.2 VFS is still single-threaded," and later added, "the SMP threading is one of the more pervasive changes between 2.2 and 2.3. A lot of people have worked on that, and the VFS is much, much more scalable on SMP in 2.3 now. That necessarily makes life more complex for driver or filesystem writers, but there's really no way to avoid that when you go for fine-grained SMP."
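The "big lock" workaround Jeff describes amounts to serializing every access to a shared structure that is not thread safe. A minimal sketch of that pattern in Python follows. It is not kernel code: big_lock stands in for what lock_kernel() provides, free_list for the buffer free lists, and the function name is hypothetical. Python's own interpreter lock hides most races of this kind, so the example shows only the shape of the technique.

```python
import threading


def drain_with_big_lock(n_buffers=1000, n_threads=4):
    """Hand out every entry of a shared free list, one thread at a time."""
    big_lock = threading.Lock()          # stand-in for a single global lock
    free_list = list(range(n_buffers))   # stand-in for a shared free list
    taken = []

    def worker():
        while True:
            with big_lock:               # every touch of shared state is serialized
                if not free_list:
                    return
                taken.append(free_list.pop())

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return taken
```

The cost is the one Stephen points at: with one coarse lock the threads take turns, so correctness is bought at the price of SMP scalability.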

That thread ended, but elsewhere, under the Subject: Release of NWFS 2.2.2 on Linux 2.3.99pre3 Moved to March 31, 2000, Jeff announced that he was pushing back the date of release for NWFS 2.2.2 due to lack of sleep and remaining problems with the code. There was no reply to that, but under the Subject: block_dev.c not backward compatible with 2.2.15 APIs, Jeff reported that he couldn't find a good way for NWFS to scan for all the hard drives on the system. Alexander Viro replied, "I see absolutely no valid reasons for filesystem to scan all disks. What are you actually trying to do?" Jeff explained:

NetWare does not follow the Unix model of "one partition, one file system tree" (which is an ancient and limiting architecture for File Systems). The model linux is taking is almost identical to NT, a model where the I/O subsystem presents single partitions through a very restrictive I/O interface. (Doing this in NT is extremely nasty -- I have to build a disk and partition pointer map in user space, then pass it to the driver via an IOCTL to the Windows NT NetWare File System).

NetWare uses a multi-segmented architecture where several segments for a particular volume can be striped across multiple drives (a single NetWare partition can host 8 volume segments per partition, and these segments can be for the same volume or 8 other striped volumes). I need to be able to scan the volume tables on each NetWare partition in order to build a global map of all NetWare volume segments on all drives (a single sector read per drive to locate the volume segment tables).

I also need it to locate all mirror groups and mirror members before bringing the mirroring and hotfixing engine on line. This is why. If there is a "cleaner" way to do this with a published API, please show me.

Christoph Hellwig and Steve recommended checking out LVM. Elsewhere in the discussion, Stephen suggested, "Scan the gendisk_head list. That contains all present disks and partitions (as visible in /proc/partitions). It also contains the partition type information if you want to restrict your volume scan to specific partition types (this is how the software raid code performs its scan for raid autostart devices, for example)." Jeff liked this and switched his implementation to do that instead.
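The scan Stephen suggests (walk the list of known partitions, keep only those of the right type, and read each one's segment table) can be sketched in Python as below. This is a userspace illustration under stated assumptions, not the kernel interface: the partition records, the read_segment_table callback, and the 0x65 type ID used for NetWare partitions are all assumptions made for the example.

```python
from collections import defaultdict

NETWARE_TYPE = 0x65  # assumed partition type ID for NetWare partitions


def build_volume_map(partitions, read_segment_table):
    """Map volume name -> list of (partition name, segment index), in segment order.

    partitions: iterable of dicts with "name" and "type" keys (hypothetical layout).
    read_segment_table: hypothetical callback returning (volume, segment index)
    pairs found on one partition.
    """
    volumes = defaultdict(list)
    for part in partitions:
        if part["type"] != NETWARE_TYPE:
            continue                      # restrict the scan to one partition type
        for volume, seg_index in read_segment_table(part["name"]):
            volumes[volume].append((part["name"], seg_index))
    # order each volume's segments so a striped volume can be assembled
    return {v: sorted(segs, key=lambda s: s[1]) for v, segs in volumes.items()}
```

Filtering on partition type first keeps the number of table reads down to the partitions that can actually hold volume segments.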

Elsewhere, under the Subject: NWFS 2.2.2 Source Code for Linux Kernels 2.0/2.2/2.4 Released, Jeff announced the latest NWFS source code, and said, "There is still much work to do on Linux 2.4. Support for Linux Kernel's 2.0/2.2 is nearing completion." Some folks tried it out, had problems, and there was some implementation discussion.

2. devfs Bitterness

28 Mar 2000 - 10 Apr 2000 (92 posts) Subject: "Location of shmfs; devfs automagics"

Topics: FS: devfs, FS: procfs, Feature Freeze

People: H. Peter Anvin, Michael H. Warfield, Richard Gooch, Byron Stanoszek, Theodore Y. Ts'o, Jes Sorensen, Alan Cox

H. Peter Anvin protested with vehemence, "devfs seems to think it is above the normal way of doing things (unlike ALL OTHER filesystems, including procfs and shmfs) and not only will mount itself on /dev automatically, but will do so *by default*. This is incredibly antisocial behaviour, and has no justification. If there are conditions under which you would need devfs before you have mount(8) available -- which I do not believe is ever the case -- it may be justifiable to have it as an option, but making it default behaviour is pretty much unacceptable. Use /etc/fstab like everything else, please." A lot of people agreed with this (and with other objections to 'devfs') but some spoke out in support of 'devfs'. Namespace incompatibilities also came up. It was pointed out that even with 'devfsd', which is supposed to give '/dev' the traditional device names, some names were left out. As Michael H. Warfield said:

I don't see devfs creating anything for my usb printer when I plug it in, or the usbmouse either, for that matter. I've got 2.3.99pre3 and I've got devfsd running. If these things are supported, why isn't anything happening?

Patches for devfs support for the Computone boards were submitted weeks ago, but they're not in the kernel sources yet and I doubt they're going to make 2.4.0. How many other drivers have no support? Somehow, I suspect that most of the intelligent multiport boards fall into that category and I don't see anything in the pcmcia stuff either. I agree that some things "fall out in the wash" when they register things like standard tty ports and drives. But that still leaves a lot out that haven't been done yet. Sure, I figured out that you aren't having any problems with this because you aren't using any of these things.

I also sent Richard (and the devfs list) a list of the ports and devices that the Computone boards use that are not in devfsd. Maybe one day they'll show up in devfsd and maybe they won't. I haven't heard a peep back on that front. I'm sure the patches will make it into the kernel eventually, I'm just not sure when. That's in the hands of the man in charge.

Richard Gooch replied to this, "The problem has been one of time. I've spent the last few weeks first at a conference, then travelling, then came back and promptly caught the flu (misery for a week). And then bring my main machine up to date after an extended absence. So soon I should be able to start tackling that mountain of devfs email I've got in my inbox :-( Oh. And I need to write a better HOWTO :-("

Elsewhere, under the Subject: Shmem filesystem? DevFS? Why., another 'devfs' debate broke out, and it turned into a bitter argument. Byron Stanoszek started it out by saying that although he liked the idea of 'devfs', he felt its implementation left something to be desired. He said, "The current /devfs method forces all user-level programs to be rewritten to use the new format, i.e. changing say /dev/tty10 to /devfs/tty/10. This also causes the potential problem that new programs will be written only for /devfs and not maintain compatibility with the old /dev structure." Richard replied simply, "My original devfs patch had what you wanted: the old names were preserved. Linus mandated that the old names would not appear in the kernel. So instead, I updated devfsd which can automatically create the old names." Theodore Y. Ts'o spoke up with some harsh criticism of that process:

A number of people thought devfs was a bad idea. It went in, anyway, during feature freeze, because Linus was God.

The reason why I dislike devfs is precisely demonstrated by a lot of people kvetching with the namespace. It causes policy to be dictated in the kernel; in this case Linus single-handedly dictated the new naming convention, and now everyone has to live with it. Of course, it's *his* kernel, so he has the right to do this, even if it causes a lot of people pain and annoyance. Personally, it'll be interesting to see how many distributions actually ship with devfs turned on, or with devfs mounted somewhere else, like /devfs.

A number of user utilities, such as e2fsprogs, need modification to work well with a mandatory devfs with the new names. I don't plan to do that work right away, since I have other higher priority items to worry about. So if any distributions do ship with devfs, they *will* have to ship with devfsd. Perhaps some of them will simply decide to not ship with devfs at all. (It after all is still marked as experimental.)

The discussion skewed off into linguistic history at this point, because Ted added as a P.S., "Can we at least *please* change all occurrences of "disc" to "disk" in the publicly visible devfs namespace? The kernel uses American english everywhere else, and having a mixture of British and American english is just going to cause massive confusion." There was much humour made of this, but at some point, Jes Sorensen put in, "Ted's request is perfectly valid, just about every single Linux application I have seen refers to disk as "disk" and not disc. Changing it in one place like that is silly and going to cause a lot of nuisance for the users." Alan Cox recommended against using 'devfs'. He said, "it's policy in the wrong place, if you dont like it you are screwed."

(ed. My own personal opinion of this discussion is that something very important is taking place. In general, the top kernel hackers accept Linus' decisions because they trust him in the wider sense, whether they agree about a particular issue or not. This is one of the only cases (if not the only case) I've seen in which a fair number of major contributors are strongly and vocally dissatisfied by Linus' decision about something. I'm very interested to see how this eventually resolves itself. Will the disgruntled folks quit submitting patches? Will they fork a parallel project? Will 'devfs' improve enough to satisfy their objections? Will Linus change his mind? Or will the whole debate just fade away? It's a very strange and interesting situation. -- Z.B.)

3. XFS Goes GPL!! (Finally)

30 Mar 2000 - 4 Apr 2000 (14 posts) Subject: "Source code for Linux XFS now available!"

Topics: Access Control Lists, FS: EFS, FS: JFS, FS: XFS, Raw IO, Version Control

People: Jim Mostek, Christoph Hellwig, Russell Cattelan, Peter Rival, Stephen C. Tweedie, Steve Lord, Dominik Kubla

The possibility of Open Sourcing XFS was first covered in Issue #20, Section #4 (16 May 1999: EFS Filesystem Appears In 2.3.2); SGI's announcement was still spawning discussion in Issue #21, Section #2 (20 May 1999: XFS Going Open Source). XFS came up in a general journaling filesystem discussion in Issue #43, Section #2 (28 Oct 1999: Journalled Filesystem For Linux), and again in Issue #49, Section #3 (17 Dec 1999: ReiserFS Or Ext3 In Standard Kernel?).

This time there was no speculation necessary: XFS has been GPLed (http://oss.sgi.com/projects/xfs/license.html). Jim Mostek made the announcement:

Source code for Linux XFS now available!

A complete linux 2.3.99pre2 tree including the XFS filesystem is available for cvs checkout.

Please refer to: http://oss.sgi.com/projects/xfs/cvs_download.html for instructions.

A snapshot of the CVS tree is also available: ftp://oss.sgi.com/projects/xfs/ftpdir/03302000linux-2.3-xfs.tgz This tar file will not be generated on a regular basis. A "cvs update -d" should be performed once the tree is downloaded and unpacked.

*************** PRELIMINARY WORK IN PROGRESS CODE *****************

While most of the basic functionality of the XFS file system is working, this code is still very unstable.

IT MAY CRASH OR HANG YOUR SYSTEM! sooner or later. THIS IS BLEEDING-EDGE CODE.

This release has only been tried on IA32 systems. People are certainly free to work on getting it running on other architectures.

Many of the more advanced XFS features are yet to be completed:

For a list of items currently being worked on or soon to be worked on refer to: http://oss.sgi.com/projects/xfs/todos.html

This list will be updated as new items are found.

A beta release is planned in a few months and at that time we will release an xfs rpm.

There is a linux-xfs@oss.sgi.com (mailto:linux-xfs@oss.sgi.com) mail list that you can subscribe to and watch problems/issues as they get fixed and are found.

Please e-mail linux-xfs with any issues/problems/... that you find in the code or while running.

If you want to help with specific work items, please e-mail xfs-masters@oss.sgi.com (mailto:xfs-masters@oss.sgi.com) .

There were a few threadlets in reply. Christoph Hellwig wanted to know where the archives of the linux-xfs mailing list were, and after asking a couple times, Dominik Kubla gave a link to http://oss.sgi.com/cgi-bin/archive/linux-xfs. However, Steve Lord from SGI replied that this link appeared not to be working, and said it was being looked into. By KT press time, however, the problem seems to have been cleared up, and the archives go all the way back to February!

Several people also asked for discrete patches against the Linux sources, instead of having to download an entire CVS tree. Jim Mostek replied that Russell Cattelan had just uploaded a patch to ftp://oss.sgi.com/projects/xfs/ftpdir/03312000linux-2.3.99pre2-linux-2.3-xfs.patch.gz. Christoph replied, "Ok, I've downloaded it. BTW: the patch has all the CVS dirs included, so it is not very well readable. Could you please remove this in the next version?" Russell replied with two points. First, he said, "Neither the patch file nor the snapshot will be generated on a regular basis. The CVS stuff will be needed to keep up with the current state of the tree." And second, he asked why anyone would want to read the patch anyway, since it was so large. Christoph said simply, "To know what's inside."

In Christoph's original reply to the announcement, he also mentioned that direct I/O, as touted in the announcement, seemed to him to not be a filesystem issue. But Peter Rival replied, "No, Direct I/O _is_ an FS issue. Direct I/O != raw devices. It's more like taking both of them, slamming them together, and coming up with something really fast that people like Oracle get all sorts of excited about. It's actually very cool, and I'd imagine not too hard with all the work sct" [Stephen C. Tweedie] "(et. al.) put in 2.3 with kiobufs and all." There was no reply.

4. Nearing 2.2.15; Assembly Warnings; Tape Drives

2 Apr 2000 - 5 Apr 2000 (18 posts) Archive Link: "Linux 2.2.15pre17"

Topics: Assembly, Disks: IDE, Disks: SCSI, Kernel Release Announcement, Networking

People: Alan Modra, Alan Cox, H. Peter Anvin, Andre Hedrick, Peter Svensson, Ha Quoc- Viet, Philipp Thomas, Chris Kloiber, Ron Flory, Barry K. Nathan, Richard Henderson

Alan Cox announced 2.2.15-pre17, and said that if there were no new problems, he'd release 2.2.15 in a day or so. He posted the changelog:

Henrik Storner reported some "Warning: using `%eax' instead of `%ax' due to `l' suffix" warnings he hadn't seen before, on his Red Hat 6.2 system, with binutils 2.9.5.0.22 and egcs 1.1.2; and Ron Flory confirmed seeing these warnings on a Red Hat 6.2 system running 2.2.15-pre15; but Alan Modra, also replying to Henrik, explained, "Ignore them, at least for 2.2.x. The newer assemblers have much better syntax checking than older ones, and are just warning that the asm code here isn't completely self-consistent. It's done that way for compatibility with really old assemblers that add a redundant opcode prefix on certain instructions. In the case of 2.3.x and the coming 2.4.x, these warnings should be fixed. You require a fairly new assembler for 2.3.x anyway, to assemble the x86 boot code."

Elsewhere, Andre Hedrick asked Alan C. if he could get the basic OnStream driver in for ide-tape. Chris Kloiber supported this and all of Andre's IDE patches, but Alan C. replied, "No. 2.2.16pre1 for the pending new driver stuff and anything not obviously correct - that wont be long."

H. Peter Anvin replied peripherally, asking about the performance, reliability, and portability of tape backup devices, adding that they seemed to be "the only affordable backup drives in their size range -- currently 25 GB actual." Andre replied, "Yes, their DI-30 flys at ~1200KB per second. This is an internal hardware limit of the device," and offered to give H. Peter the name of a contact inside the company. There was no reply to that, but Peter Svensson replied to H. Peter's post, adding, "There is also the VXA drive from Ecrix which (at least in Sweden) is priced similar and with a similar tape size (33GB uncompressed). It works as advertised under Linux." Ha Quoc- Viet replied to this with his own experiences of OnStream tape drives:

In my own experience :

now, for the SCSI version of OnStream tape drives, Documentation/ide.txt mentions that no support is currently available. I haven't tried, though, and I don't know how updated this file is.

as far as reliability is concerned, I have written 2 tapes so far, about 10Gigs on a 15 actual gig tape :

please remember, this is an IDE tape drive. SCSI tape drives perform better usually.

performance is as expected, no more no less.

There was no reply to that, but elsewhere, Philipp Thomas also replied to H. Peter's initial question, correcting him about the size of the available drives. As he put it, "OnStreams IDE drives currently only support the 15 GB tapes. And if I remember correctly, that's 15 * 10^9 bytes, making it ~13 GB for normal computer users. But even at that capacity they're still an attractive buy." H. Peter pointed out that SCSI drives were larger, to which Philipp replied, "Yes, they do. But at least here in Germany the SCSI drives cost nearly twice as much, although at that capacity that's still a lot cheaper than comparable streamers."
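Philipp's point about decimal versus binary gigabytes is easy to check. The sketch below takes the tape's nominal 15 GB to mean 15 x 10^9 bytes (an assumption about the vendor's units) and converts it to gigabytes of 2^30 bytes; the function name is made up for the example.

```python
def decimal_gb_to_binary_gb(decimal_gb):
    """Convert gigabytes of 10**9 bytes into gigabytes of 2**30 bytes."""
    return decimal_gb * 10**9 / 2**30


# A nominal 15 GB tape, read as 15 * 10**9 bytes
print(round(decimal_gb_to_binary_gb(15), 2))  # prints 13.97
```

That comes to just under 14 binary gigabytes, in the same ballpark as the figure Philipp quotes from memory.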

5. Mounting Audio CDs; The Open Source Development Process

2 Apr 2000 - 10 Apr 2000 (68 posts) Subject: "Music CD's"

Topics: Disks: SCSI, FS: FAT, FS: NFS, FS: devfs, Ioctls

People: David Elliott, Jens Axboe, JM Geremia, Alan Cox, Matthew Kirkwood, Richard Gooch, Thomas Molina, Albert D. Cahalan, Linus Torvalds, David Balazic

Someone wanted to mount audio CDs, but several people, including Alan Cox, pointed out that audio CDs did not contain filesystems, and therefore couldn't be mounted. Several other folks pointed out that tracks could be interpreted as files, and so something should be workable. At one point, David Elliott said:

Sorry, but I have to say that that has to be the dumbest idea I have ever seen.

Reading audio off of an audio CD is not a perfect process. If you read the Audio CD specification you'll notice that the best resolution you can get is 1/75 of a second (since every frame is 1/75 of a second). So to fix that problem you need things like jitter correction algorithms (unless your CD-ROM already does it) and algorithms to correctly get around scratches.

Trying to put all that crap in the kernel is pretty dumb. And if you just do a half-assed job and only let it work for perfect non-scratched CDs on CD-ROMs with built-in jitter correction, then we will have even more half-assed MP3s with pops and skips in them.

By far, the best thing to do is keep this crap OUT of the kernel. The cdparanoia program can do just about anything you want, and NEVER pops or skips even on shitty drives. If you are making MP3s of CDs I strongly suggest that you use cdparanoia and a good encoder.
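The frame length David mentions follows from the standard CD audio parameters: 44.1 kHz sampling, 16-bit samples, two channels, and 75 frames per second. A quick check in Python (the constant names are chosen for the example):

```python
SAMPLE_RATE = 44100       # samples per second, per channel
CHANNELS = 2              # stereo
BYTES_PER_SAMPLE = 2      # 16-bit samples
FRAMES_PER_SECOND = 75    # addressable frames (sectors) per second

bytes_per_second = SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE   # 176400
bytes_per_frame = bytes_per_second // FRAMES_PER_SECOND        # 2352
samples_per_frame = SAMPLE_RATE // FRAMES_PER_SECOND           # 588 per channel

print(bytes_per_frame, samples_per_frame)  # prints 2352 588
```

Since a drive can only be asked for whole frames, a read that lands slightly off target is off by some multiple of 588 samples, which is the kind of misalignment that jitter correction in a tool like cdparanoia has to detect and repair.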

Jens Axboe burst into applause, saying, "This "audio file system" topic has been brought up so many times before, your comment is right on the spot. Should be added to an FAQ." And JM Geremia added:

As someone who has written an extremely sophisticated audio cd filesystem, I must say that I completely agree with Jens. It's a BAD IDEA. Actually, he yelled at me about this about a year ago :) This would be a great topic to add to the FAQ!

There are two schools of thought:

  1. Things like jitter and scratch correction are not obvious kernel functionality. They bloat the kernel and can be done just as effectively in user space. Implementing corrections for drives that do it with hardware is a waste. Not implementing corrections is inadequate for the rest. The whole idea behind standards like ide and SCSI is that the drivers can be more generic.
  2. The CD-ROM is a device, and doing corrections are the job of the driver. However, the audio sector size is bigger than the pagesize and this means doing translation if the filesystem is to use the normal blkdev approach.

    The inodes and superblock would all have to be virtual since the only directory structure on an audio cd is the TOC. I have no problem with this, personally.

    This means that the cd drivers (all of them) would need to be changed to translate between pagesized blocks and framesize blocks and then have a fairly intricate cache scheme. Otherwise, the audiofs implementation would need to make ioctl() calls. I, for one, don't think that the kernel should use ioctl's. Personally, I don't like them because the kernel has to turn off memory write verification using get_FS() and set_FS(). I'm not happy about that.

Now, honestly, do you think Option (2) is worth all that? Having implemented it all (for fun, mainly) I can tell you to stick with the userspace rippers/encoders. They are the right way to go.

Elsewhere, a discussion cropped up about the Linux development process in general. Leading in, Alan reiterated that an audio FS should not be in the kernel. As he put it, "If its in kernel someone did it wrong. You can do it just as well using userspace and the coda vfs hooks." David Balazic replied that the coda VFS hooks could be used to implement any other filesystem (FAT, NFS, UFS, etc.) as well; and Richard agreed, saying he didn't understand where the distinction was drawn. Why would one filesystem be done in the kernel, while another would have to be in user space? Matthew Kirkwood stepped forward:

Isn't this what makes Linux good? We don't have to have absolute, rock-solid, hard-and-fast rules about what goes where, or how it is done.

We can use what has gone before as precedent, but not be tied by it.

Do we always have to draw a line?

Richard replied, "We don't really have guidelines either, or discussion papers giving at least halfway rational reasons. What we have now is agitation and flaming and a lot of hysteria." But Matthew warned, "Anyone who has watched linux-kernel for a while should have picked up a reasonable intuition about such things. A set of written heuristics would quickly degenerate to dogma, and prevent people from judging each case on its own merit, IMO." Richard came back with:

But it's not consistent. I'm not asking for dogma. If it's not written down somewhere (even if marked "in the opinion of the author"), then every new FS^H^Hpiece of code has to survive a trial by flamewar. And inevitably, the same old ground is covered each time. It's wasteful of the time of the people flaming, and it makes linux-kernel more voluminous, which leads to more people filtering it (or unsubscribing).

Something that might work is that opposing views are written up, by respected people (i.e. they should have contributed real code, rather than vocal hangers-on), and referred to in the FAQ. And they should have links to each other, as well.

Thomas Molina entered the debate, with:

IMHO this flamage/trial by fire is a strength, not a weakness. Yes, each "new" idea generates the same/similar arguments. However, it is through the trial of the arguments that the ideas are tested and retested to see if they are valid.

Certainly anything which has gone through the wringer more than once deserves mention in the FAQ. If it generates identical arguments then it deserves to be ignored. However, what about when someone says I understand that a, b, and c were extensively discussed before, but here is a new answer which I think invalidates some of the objections and here is the code which demonstrates that.

Good/Bad/Indifferent ideas are certainly floatin around all over the place. I've seen a number of them, and a percentage of those have matured in their seperateness and eventually got included in the kernel. vfat support and devfs are cases in point.

We also have the court of final authority in our benevolent dictator, Linus Torvalds -- even if sometimes he claims to wear a brown bag. He's said more than once, "I won't even consider X, so don't bother submitting patches for it." Let's use that; such things also belong in the FAQ. The system works well, even if it occasionally generates numerous messages. That is what procmail recipies and the 'n' key are for.

Albert D. Cahalan replied somewhat cryptically:

Trial by fire excludes some very important parts of the world. There is a cultural issue here. Consider world population and economic strength, then notice a really major country from which there are seldom any linux-kernel posts. (no, English is not the most serious problem - many can write it well enough)

Yes, really, there are places where flaming isn't accepted.

Not that I have an answer though, and don't bother asking me to single out the place that comes to mind.

6. 2.4 Jobs List: Saga Continues

3 Apr 2000 - 10 Apr 2000 (16 posts) Subject: "Linux 2.4 Jobs Update"

Topics: Compression, Disk Arrays: RAID, Disks: IDE, Disks: SCSI, FS: Coda, FS: FAT, FS: NFS, FS: NTFS, FS: UMSDOS, FS: ext2, I2O, Networking, PCI, Power Management: ACPI, SMP, Security, USB, Virtual Memory, VisWS

People: Alan Cox, Tigran Aivazian, Mark Hemment, Andre Hedrick, Stephen C. Tweedie

Alan Cox posted his latest list of things to do before 2.4:

  1. Fixed
    1. Tulip hang on rmmod (fixed in .51 ?)
    2. Incredibly slow loopback tcp bug (believed fixed about 2.3.48)
    3. COMX series WAN now merged
    4. VM needs rebalancing or we have a bad leak
    5. SHM works chroot
    6. SHM back compatibility
    7. Intel i960 problems with I2O
    8. Symbol clashes and other mess from _three_ copies of zlib!
    9. PCI buffer overruns
    10. Shared memory changes change the API breaking applications (eg gimp)
    11. Finish softnet driver port over and cleanups
    12. via rhine oopses under load ?
    13. SCSI generic driver crashes controllers (need to pass PCI_DIR_UNKNOWN..)

  2. In Progress
    1. Merge the network fixes (DaveM)
    2. Merge 2.2.15 changes (Alan)
    3. Get RAID 0.90 in (Ingo)

  3. Fix Exists But Isnt Merged
    1. Signals leak kernel memory (security)
    2. msync fails on NFS
    3. Semaphore races
    4. Semaphore memory leak
    5. Exploitable leak in file locking
    6. Merge the RIO driver (probably do post 2.4.0 as it is large) (in AC tree)
    7. S/390 Merge (merged in AC tree)
    8. 1.07 AMI MegaRAID
    9. Any user can crash FAT fs code with ftruncate
    10. Mark SGI VisWS obsolete

  4. To Do
    1. Restore O_SYNC functionality
    2. Fix eth= command line
    3. Trace numerous random crashes in the inode cache
    4. Fix Space.c duplicate string/write to constants
    5. VM kswapd has some serious problems
    6. vmalloc(GFP_DMA) is needed for DMA drivers
    7. put_user appears to be broken for i386 machines
    8. Fix module remove race bug (mostly done - Al Viro)
    9. Test other file systems on write
    10. Security holes in execve()
    11. Directory race fix for UFS
    12. Audit all char and block drivers to ensure they are safe with the 2.3 locking - a lot of them are not especially on the open() path.
    13. Stick lock_kernel() calls around driver with issues to hard to fix nicely for 2.4 itself
    14. IDE fails on some VIA boards (eg the i-opener)
    15. PCMCIA/Cardbus hangs, IRQ problems, Keyboard/mouse problem (may be fixed ?)
    16. Use PCI DMA by default in IDE is unsafe (must not do so on via VPx x<3)
    17. Use PCI DMA 'lost interrupt' problem with some hw [which ?]
    18. Crashes on boot on some Compaqs ?
    19. pci_set_master forces a 64 latency on low latency setting devices.Some boards require all cards have latency <= 32
    20. usbfs hangs on mount sometimes
    21. Loopback fs hangs
    22. Problems with ip autoconfig according to Zaitcev
    23. Still some SHM bug reports
    24. SMP affinity code creates multiple dirs wit the same name
    25. TLB flush should use highest priority
    26. Set SMP affinity mask to actual cpu online mask (needed for some boards)
    27. pci_socket crash on unload

  5. To Do But Non Showstopper
    1. Make syncppp use new ppp code
    2. Finish 64bit vfs merges (lockf64 and friends missing)
    3. NCR5380 isnt smp safe
    4. DMFE is not SMP safe
    5. ACPI hangs on boot for some systems
    6. Get the Emu10K merged
    7. Finish I2O merge
    8. Go through as 2.4pre kicks in and figure what we should mark obsolete for the final 2.4
    9. Per Process rtsigio limit
    10. Fix SPX socket code
    11. Boot hangs on a range of Dell docking stations (Latitude)
    12. HFS is still broken
    13. iget abuse in knfsd
    14. Mark NTFS as obsolete
    15. Paride seems to need fixes for the block changes yet
    16. PIII FXSAVE/FXRESTORE support
    17. Some people report 2.3.x serial problems
    18. AIC7xxx doesnt work non PCI ?
    19. USB hangs on APM suspend on some machines
    20. PCMCIA crashes on unloading pci_socket
    21. DEFXX driver appears broken
    22. ISAPnP IRQ handling failing on SB1000 + resource handling bug

  6. Compatibility Errors
    1. exec() returns wrong codes on a file not found

  7. Probably Post 2.4
    1. per super block write_super needs an async flag
    2. addres_space needs a VM pressure/flush callback
    3. per file_op rw_kiovec
    4. enhanced disk statistics
    5. AFFS fixups
    6. UMSDOS fixups resync

  8. Drivers In 2.2 not 2.4
    1. Lan Media WAN

  9. To Check
    1. Truncate races (Debian apt shows it nicely) [done ? - all but Coda]
    2. Elevator and block handling queue change errors are all sorted
    3. Check O_APPEND atomicity bug fixing is complete
    4. Make sure all drivers return 1 from their __setup functions
    5. Protection on isize (sct) [Al Viro mostly done]
    6. Mikulas claims we need to fix the getblk/mark_buffer_uptodate thing for 2.3.x as well
    7. Network block device seems broken by block device changes
    8. Fbcon races
    9. Fix all remaining PCI code to use new resources and enable_Device
    10. VFS?VM - mmap/write deadlock
    11. rw sempahores on page faults (mmap_sem)
    12. kiobuf seperate lock functions/bounce/page_address fixes
    13. Fix routing by fwmark
    14. Some FB drivers check the A000 area and find it busy then bomb out
    15. rw semaphores on inodes to fix read/truncate races ? [Probably fixed]
    16. Not all device drivers are safe now the write inode lock isnt taken on write
    17. File locking needs checking for races
    18. Multiwrite IDE breaks on a disk error
    19. AFFS doesn't work on current page cache
    20. ACPI/APM suspend issue

Tigran Aivazian pointed out, "One issue that you did not mention and I believe is a show-stopper is the malfunctioning of 8139too driver. I have it easily reproduced here on 100M hub. I have a suggestion from Mark Hemment to try if it is a RxFIFO overflow issue (then insert a reset in appropriate place) which I still haven't found a minute to try and confirm/disprove." Mark posted a couple of patches, and explained:

I've attached two patches for 2.2.15 (made against 2.2.2.15pre16).

The first prevents the RealTek 8139 NIC from wedging after a receive FIFO overflow. There maybe a better work around for this, but what I've got works here.

The second is not a high priority patch. In ide-pci.c/ide_match_hwif(), the code assumes that MAX_HWIFS is at least two. For those of us trying to build small kernels, this isn't always true.

Elsewhere, Andre Hedrick asked for more explanation of item 9.18 (Multiwrite IDE breaks on a disk error). Alan replied:

If you have one bad sector you should write the other 7..

Also the error recovery code was wrong. I've not checked this for a while and much has changed.

To this last comment, Andre asked exactly how long it had been since Alan had last checked it. Alan said about 6 weeks ago, and Andre replied that he was pretty sure it had been fixed since then.

To Alan's remark about bad sectors in the same post, Andre replied in disbelief, "Are you kidding that this does not happen?" Alan replied, "Not when I got a bad block it didnt. Its a corner case so relatively low priority. At the point it bites you have problems anyway" Andre replied after some thought:

Yes, but that is a "multi-mode" RW.

I need to think how to do a complete recovery.

A preferred solution in my mind would all for resetting the request cue and retry in the next sector and continue on; however, the need to update the badblock table that is written/created or needs to be similar to the "badblocks" call from "mke2fs". Do you agree?

Stephen C. Tweedie came in at this point, and had the last word, with, "There is no automatic bad block grokking in ext2 right now --- if you need to update the bad blocks list, "e2fsck -[Llc]" is currently the only way to do it."

7. Structural Changes Before Stable Series

4 Apr 2000 - 8 Apr 2000 (17 posts) Subject: "[patch] Space.c and -fwritable-strings"

People: Andrew Morton, Linus Torvalds, Alexey Kuznetsov, Nick Holloway, David S. Miller, Tim Waugh

Tim Waugh and Andrew Morton went back and forth on a patch, then took it to Linus Torvalds. Andrew explained:

Linus, Tim and I come to you hand-in-hand with a patch to Space.c.

It does the following:

A patch against 2.3.99-pre4-1 is attached.

Linus replied:

Ok, I have an alternative approach that I think should be a lot cleaner and that looks obviously correct. How about this one-liner patch instead:

include/linux/netdevice.h:

struct net_device {
...
- char *name;
+ char name[16];
...

The above gets rid of _all_ of the problems, and gets rid of the need to use the silly PAD macros - both the writable-strings and the PAD macro are only needed because this thing was misdesigned in the first place. Putting the device name into the net_device thing just automagically makes both things work correctly.

Sure, some drivers would have to be changed to do

strcpy(dev->name, namelist[index]);

instead of doing the current

dev->name = namelist[index];

but looking at at least one of the drivers (3c503.c), the whole and only reason for the name games is _again_ that what 3c503.c really wanted in the first place was really to have the name be done as an array rather than as a pointer (so at least 3c503.c would be trivially cleaned up a bit by that simple change too).
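As an illustration only (this is not code from the thread), the difference between the two layouts can be sketched in user-space C. Only the `char name[16]` member comes from Linus's patch; the struct names, `DEMO_NAMSIZ`, and the `demo_*` helpers are hypothetical. With a pointer member, the name usually points at a string literal, and writing through it is undefined behaviour unless the compiler makes literals writable, which is presumably why -fwritable-strings came up. With an embedded array, each device owns its own storage.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_NAMSIZ 16  /* mirrors the char name[16] in the one-liner */

/* Old layout (sketch): the name is a pointer, often into a literal. */
struct old_dev {
    const char *name;
};

/* New layout (sketch): the name lives inside the structure itself. */
struct new_dev {
    char name[DEMO_NAMSIZ];
};

/* With pointers, two devices can end up aliasing the same storage.
 * Returns 1 when both names point at the same bytes. */
static int demo_old_shares_storage(void)
{
    static const char *namelist[] = { "eth0", "eth1" };
    struct old_dev a = { namelist[0] };
    struct old_dev b = { namelist[0] };
    return a.name == b.name;
}

/* Copy a name into the embedded array. Unlike a bare strcpy(), this
 * hypothetical helper truncates over-long names and always
 * NUL-terminates. Returns 0 on success, -1 if it had to truncate. */
static int demo_set_name(struct new_dev *dev, const char *src)
{
    size_t len;
    int truncated;

    assert(dev != NULL && src != NULL);
    len = strlen(src);
    truncated = len >= sizeof(dev->name);
    if (truncated)
        len = sizeof(dev->name) - 1;
    memcpy(dev->name, src, len);
    dev->name[len] = '\0';
    return truncated ? -1 : 0;
}

/* Build a name such as "eth3" directly in the device's own buffer,
 * something that is not safe to do through a pointer to a literal. */
static void demo_name_unit(struct new_dev *dev, const char *prefix, int unit)
{
    assert(dev != NULL && prefix != NULL);
    snprintf(dev->name, sizeof(dev->name), "%s%d", prefix, unit);
}
```

A driver-style caller would declare a `struct new_dev`, then call `demo_set_name(&dev, "eth0")` or `demo_name_unit(&dev, "eth", 3)`. The bounded copy is a choice made for this sketch; the thread itself only mentions `strcpy`.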

Alexey Kuznetsov and David S. Miller both thought this hit the nail on the head (Alexey's comment was "Aggressively agreed."), but Nick Holloway objected:

Less obviously correct are the modifications to the network drivers :-)

I did start this (just to see how bad it really was), and I have 46 modified source files. This is only looking at "drivers/net/*.c".

I have checked it compiles with all network drivers as modules, but that is all. I may have goofed in moving from 2.3.99-pre2 to 2.3.99-pre3 (no compilation test performed).

If you want to have a look, point an HTTP client at:

http://www.alfie.demon.co.uk/dev_name-2.3.99-pre3.diff.gz

My own feeling is that this is rather a large upheaval close to 2.4. Then again, it isn't my call. In addition, many of the modified drivers should be using init_etherdev anyway.

Linus agreed that the network drivers would be more difficult to fix, but he felt the changes would still be fairly straightforward, and should be considered "cleanup" more than anything else. To the concern about an upheaval so close to a stable release, he replied, "What I hate having is to have pending structural changes immediately after a stable release - it makes it very hard to maintain patches between a stable and the next development series if there is some fundamental small detail that has changed, and there ends up being lots of small syntactic differences that hide the really big and scary ones ;)"

8. Some Discussion Of mmap

4 Apr 2000 - 5 Apr 2000 (5 posts) Subject: "mmap/mlock performance versus read"

Topics: Virtual Memory

People: Paul Barton-Davis, Linus Torvalds, Albert D. Cahalan

Paul Barton-Davis posted two benchmark programs, each of which would copy data from a file into memory. He reported:

One program uses mmap/mlock to accomplish the lockdown, and calls munmap to release the previously locked chunk.

The other locks down a malloc-ed buffer for each file, and uses read to move the data into user space.

I was very disheartened to find that on my system the mmap/mlock approach took *3 TIMES* as long as the read solution. It seemed to me that mmap/mlock should be at least as fast as read. Comments are invited.
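Paul's benchmark programs are not reproduced here. As a rough sketch of the two approaches he describes, the following C fragment reads a whole file once through each path and sums its bytes. All `demo_*` names are hypothetical, the byte sum simply forces every page to be touched, and `mlock()` is treated as best effort because it can fail under the process's locked-memory limit.

```c
#define _DEFAULT_SOURCE  /* expose mlock() and friends under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sum every byte so that both paths really touch all of the data. */
static unsigned long demo_sum(const unsigned char *p, size_t n)
{
    unsigned long s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Open the file and report its size; returns the fd or -1. */
static int demo_open(const char *path, size_t *len)
{
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    *len = (size_t)st.st_size;
    return fd;
}

/* read() path: lock a malloc-ed buffer, then copy the file into it. */
static int demo_sum_read(const char *path, unsigned long *out)
{
    size_t len, done = 0;
    int rc = 0, locked;
    int fd = demo_open(path, &len);
    unsigned char *buf;

    assert(out != NULL);
    if (fd < 0)
        return -1;
    buf = malloc(len);
    if (buf == NULL) {
        close(fd);
        return -1;
    }
    locked = (mlock(buf, len) == 0);      /* best effort */
    while (done < len) {
        ssize_t n = read(fd, buf + done, len - done);
        if (n <= 0) {                     /* error or unexpected EOF */
            rc = -1;
            break;
        }
        done += (size_t)n;
    }
    if (rc == 0)
        *out = demo_sum(buf, len);
    if (locked)
        munlock(buf, len);
    free(buf);
    close(fd);
    return rc;
}

/* mmap() path: map the file, lock the mapping, then unmap it. */
static int demo_sum_mmap(const char *path, unsigned long *out)
{
    size_t len;
    int fd = demo_open(path, &len);
    void *map;

    assert(out != NULL);
    if (fd < 0)
        return -1;
    map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                            /* the mapping stays valid */
    if (map == MAP_FAILED)
        return -1;
    (void)mlock(map, len);                /* best effort; faults pages in */
    *out = demo_sum(map, len);
    munmap(map, len);                     /* also drops any lock */
    return 0;
}
```

Timing each function over many files, for instance with `clock_gettime()` around the calls, would approximate the comparison Paul reports; the sketch itself makes no claim about which path is faster.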

Linus Torvalds explained:

People love mmap() and other ways to play with the page tables to optimize away a copy operation, and sometimes it is worth it.

HOWEVER, playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvment.

Downsides to mmap:

Upsides of mmap:

But your test-suite (just copying the data once) is probably pessimal for mmap().

To Linus' second downside ("page faulting is expensive" etc), Albert D. Cahalan asked, "Could mmap get a flag that asks for async read and map? So mmap returns, then pages start to appear as the IO progresses." Linus replied:

It's not the IO on the pages themselves, it's actually the act of populating the page tables that is quite costly. And doing that in the background is basically impossible.

You can do it synchronously, and that is basically what mlock() will do with "make_pages_present()". However, that path is not all that optimized (not worth it), and even if it was hugely optimized it would _still_ be quite slow. The page tables are just fairly complex data structures.

And on top of that you still have the actual CPU TLB miss costs etc. Which can often be avoided if you just re-read into the same area instead of being excessively clever with memory management just to avoid a copy.

memcpy() (ie "read()" in this case) is _always_ going to be faster in many cases, just because it avoids all the extra complexity. While mmap() is going to be faster in other cases.

9. Legal Status Of LVM

5 Apr 2000 (6 posts) Subject: "**** LVM 0.8final patch vs. 2.3.99-pre3 ***"

Topics: Disk Arrays: LVM

People: Bert Hubert, Heinz Mauelshagen

In the course of discussion, Bert Hubert asked, "I was wondering, LVM very closely emulates the 'look and feel' (So closely that my LVM Viewer tool (screenshot on http://ds9a.nl/lvm-howto) actually works on both HP/UX and Linux with the same code) of HP/UX Logical Volume Management - is there any chance of HP/UX getting upset about this?" Heinz Mauelshagen replied, "I don't hope that this could happen still, because i talked to HP and asked for legal constraints reimplementing (most of) their command line interface on Linux before i first released LVM 0.1 back in 1998."

10. Using 'reiserfs' As The Root Partition

9 Apr 2000 - 10 Apr 2000 (8 posts) Subject: "What boot loader supports reiserfs root partitions?"

Topics: FS: ReiserFS, FS: ext2

People: Erik Andersen, Albert D. Cahalan, Otto E Solares, Felix von Leitner, Christoph Hellwig

Felix von Leitner wanted to use reiserfs for his root partition, but neither 'lilo', 'grub', nor other bootloaders seemed to work; 'lilo', for example, would give a "Hole found in map file (alloc_page)" error. Erik Andersen offered a possible solution:

lilo works, but you must mount the reiserfs partition with the "notail" option. reiserfs tries to be efficient with disk space by taking the excess portions of files that do not fill up a whole block and it takes these "tails" and packs them together. This leaves holes in the files which prevents lilo from finding a continuous kernel image.

As root something like:

# mount /dev/root / -o remount,rw,notail
# cp -a /boot /boot1
# rm -rf /boot
# mv /boot1 /boot
# lilo

The process of copying the files after the remount will cause the files to all be rewritten without any tails

Moritz Schulte gave a similar answer, and suggested there might be more information at http://devlinux.org/namesys/.

Albert D. Cahalan said alternatively, "Look at GRUB again. There is a patch that lets it read Reiserfs. Unlike LILO, GRUB actually reads the filesystem. I think you can even type in a pathname to select a kernel." And Otto E Solares added from his own experience, "grub is more capable and user friendly than lilo. It actualy works nice here with reiserfs in twelve servers (for the idiots that think reiserfs doesn't have to be included in the kernel this is a serious complain, all good admins have take a look @ reiserfs, i just remember when ext2 generates massive fs curruption like 3 years ago on two of our servers and yes ext2 was IN the kernel)." Christoph Hellwig asked where the necessary patch for 'GRUB' could be found, and Otto replied with a pointer to http://www.mail-archive.com/bug-grub@gnu.org/msg01186.html. End of thread.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.