Kernel Traffic #52 For 24 Jan 2000

By Zack Brown

Introduction

A very big thanks goes to Martin Stromberg for pointing out that in last week's Issue #51, Section #4 (4 Jan 2000: Swapping Via NFS), swapping via NFS uses a file, not a partition. Sorry for the misinformation.

Thanks also go to Mads Kiilerich for suggesting that the link to the latest KT and KC issues should redirect to their permanent locations. That's been done.

Mailing List Stats For This Week

We looked at 1241 posts in 5483K.

There were 451 different contributors. 207 posted more than once. 177 posted last week too.

 

1. ToDo Before 2.4
4 Jan 2000 - 13 Jan 2000 (125 posts) Archive Link: "First draft list of 2.3.x 'Things to fix'"
Topics: Disk Arrays: LVM, Disk Arrays: RAID, Disks: IDE, FS: NFS, FS: ReiserFS, FS: XFS, FS: ext2, FS: ext3, Ioctls, Networking, USB, Virtual Memory, VisWS
People: Alan Cox, Hans Reiser, Stephen C. Tweedie, Theodore Y. Ts'o, Linus Torvalds, Miquel van Smoorenburg, Chuck Lever, David L. Parsley, Matthew Wilcox, Pedro M. Rodrigues, Jamie Lokier, Andre Hedrick, Nathan Zook, Martin Mares, Chris Mason, David Weinehall, Donald Becker, Jes Sorensen, Wakko Warner, Peter Svensson

Alan Cox posted a long list of things that still needed to be done before 2.4 could go out. He added that the list was approximately in order of priority, and that he may have made some mistakes along the way. Here's his list:

  1. Multiwrite IDE breaks on a disk error
  2. Poll on > 16000 file handles fails
  3. Restore O_SYNC functionality
  4. Merge the network fixes - there is a ton of backed up stuff to do asap
  5. ISA DMA is no longer allocating correctly aligned data
  6. vmalloc(GFP_DMA) is needed for DMA drivers
  7. VM needs rebalancing
  8. NFSD fixes for path walking to regenerate dentries
  9. Fix eth= command line
  10. Check O_APPEND atomicity bug fixing is complete
  11. Protection on isize (sct)
  12. Merge 2.2.13/14 changes
  13. Get RAID 0.90 in
  14. PAE36 failures
  15. USB HID merge
  16. Mikulas claims we need to fix the getblk/mark_buffer_uptodate thing for 2.3.x as well
  17. PIII/Athlon/MMX/etc acceleration merge from 2.2.x-ac
  18. Merge arcnet update (DONE)
  19. Fix SPX socket code
  20. AHA152x isnt smp safe (FIXED)
  21. NCR5380 isnt smp safe
  22. isofs break on 4Gig disk (FIXED ?)
  23. Finish 64bit vfs merges (stat64 etc) (DONE ??)
  24. Make syncppp use new ppp code
  25. Fbcon races
  26. Fix all remaining PCI code to use new resources and enable_Device
  27. Stackable fs ?? (Erez)
  28. Get the Emu10K merged
  29. Test PMC code on Athlon
  30. Fix module remove race bug (-- not in open so why did I see crashes ??? --)
  31. Per Process rtsigio
  32. Maybe merge the ibcs emulation code
  33. VFS?VM - mmap/write deadlock
  34. initrd is bust
  35. rw semaphores on page faults (mmap_sem)
  36. kiobuf separate lock functions/bounce/page_address fixes
  37. per super block write_super needs an async flag
  38. address_space needs a VM pressure/flush callback
  39. per file_op rw_kiovec
  40. enhanced disk statistics
  41. Fix routing by fwmark
  42. put_user appears to be broken for i386 machines

Someone reported that the USB HID merge had been completed as of 2.3.36-6.

At some point in the conversation, David Weinehall suggested including ext3 in 2.4, but Alan pointed out, "There is a lot of work to be done to get the journalling layer nicely arranged to do the right things and to do them right for XFS, ext3 and Reiserfs - not 2.4 material by any means." Hans Reiser (author of ReiserFS) replied:

While there is a lot of work to be done to do things right, I don't think there is a lot of work to put ext3 and ReiserFS into 2.3. We are working on the port right now for ReiserFS, and I don't think we are far away.

If we wait for everything to be right, software never ships....

I think the thing to consider is that we can put journaling into 2.4, and then niceties like allocate on flush can be done later in 2.5.

I have some concern that you are suggesting that there should be only one journaling coding for both filesystems, and that is not only far away but far from clear to me. Chris Mason (ReiserFS journaling code author) may disagree with this, and is encouraged to comment. I'll just note that we envision re-engineering journaling now that the current journaling is stable. Users should use the current journaling, it should go into 2.4 because lack of journaling is keeping many users away from Linux, but journaling is not ready for the ANSI standards process.:-) Nor is it ready for sharing between many filesystems.... We have more than one filesystem, why not more than one journaling system?

Meanwhile, Stephen C. Tweedie (author of ext3) also replied to David, explaining, "It's not just a matter of how stable it is --- journaling requires a bit of extra help from the kernel for various reasons, and we need to do a bit more work to agree exactly how the journaling functionality will integrate into the VM before we can make it part of the official mainstream kernel."

David L. Parsley replied that he supposed ext3 and reiserfs were coded very differently from each other, and asked if anyone had done a comparison, to see if one handled the VM issues better than the other. Theodore Y. Ts'o replied, "My understanding is that reiserfs suffers from the same VM issues as ext3; both currently don't coexist well with soft-RAID (i.e. /dev/md), because of different assumptions about how the buffer cache works, and what devices like the MD devices are allowed to do." And Stephen explained:

There are other problems as well: on smaller machines, we don't have any way for the "pinned" buffers associated with transactions to be flushed early in response to memory pressure, for example.

There are a few such VM issues which have to be addressed before we can get rock-solid journaling in the kernel.

Meanwhile, Linus Torvalds also replied to David Weinehall's suggestion, predicting, "ext3 is not going in, but reiserfs might. Unlike ext3, reiserfs actually has gotten a lot of real-world testing: SuSE seems to be using it in production environments with good results. It's still 2.2.x-based, so it may not make it, but it is at least a potential thing."

Later in the same thread, Peter Svensson asked if reiserfs would be made compatible with RAID5 in the near future, and Hans Reiser replied, "We have funding to hire somebody to do it.... but no person yet...."

Coming back to the general issue, Wakko Warner asked if PCMCIA stuff would make it into 2.4, but Alan replied, "For now Im not even going try and track machine specific/user specific problems just generic core stuff."

Wakko had also added that he was using a NEC Versa SX laptop, to which Linus replied, "I'd LOVE to hear what happens on the box with 2.3.36. If your NEC has a cardbus controller (most modern laptops do), 2.3.36 has a completely rewritten cardbus layer, and I'll be jumping up and down with joy about reports about it." He offered, "To enable the new cardbus layer, please just say Y to the PCMCIA and cardbus questions and say N to i82365 support."

Coming back to Alan's original ToDo list, Miquel van Smoorenburg asked if merging the network fixes (item 4) would include the ringbuffer / dev_alloc_skb problems that a lot of ethernet drivers had, and which he had personally confirmed on tulip, eepro100, and epic100. The problem apparently would cause drivers to hang indefinitely on heavily loaded machines. He went on, "It looks like most if not all of Donalds drivers have this problem, but Donald hasn't posted one word (or patch) on this subject yet, alas." But he also added, "There is an alternative eepro100 driver that works fine (running it in production on a 2.2.x kernel - the stock one hangs about two times an hour, the new one runs for weeks and weeks), patches for the tulip have been posted to this list around Nov. 25, and it looks like the patches for the tulip are easy to apply to at least the epic100 driver as well." Jes Sorensen replied, asking for more information on what the problem actually was, and Mark van Walraven gave pointers to http://www.uwsg.indiana.edu/hypermail/linux/kernel/9911.3/0050.html and http://www.uwsg.indiana.edu/hypermail/linux/kernel/9911.3/0021.html. As for whether the fixes suggested by Miquel would be included, Alan replied, "No but the 2.2.14 merge will do tulip and eepro100 is on my .15 list."

Chuck Lever also came back to Alan's long list, suggesting possible additions:

  1. some or all of ingo's latency patches
  2. 32 bit uid/gid support
  3. madvise
  4. ibm's task struct cacheline fix for goodness()
  5. appropriate dynamic sizing of kernel hash tables

Alan pronounced that Chuck's item #2 would go in, and item #3 was a maybe. He asked for a copy of the patch for item #4, and to item #5 he replied, "Thats part of pushing the 2.2.13/14 fixes into 2.3.x probably."

David L. Parsley also suggested, in reply to Alan's long list, that LVM be included in 2.4. He gave a pointer to the Petition to get LVM into the Linux kernel (http://www.the-infinite.org/lvm_petition/index.phtml), and added that 1257 people had signed (as I write this, the number is up to 1366, a jump of over 100 people in 18 days). He went on, "With the proliferation of ext2 resizing tools, this sure would be sweet; LVM could have saved my butt a few times in the last few years," and asked, "Any numbers on the "minimal" performance loss of the extra layer?" Alan replied:

When I tried it for a bit in 2.2.x-ac I couldnt measure any.

The reason I gave up adding it to -ac was that I cleaned it up, I fixed it for the Coding Style document and I fixed some bugs. I got an update from the author that simply ignored all that work and reverted to wrong formats.

Every annoyance I personally have with the LVM code comes down to two things

  1. Not following the Coding Style
  2. General poor readability - lots of complex loops, huge ioctl functions

The main thing it made me wonder was if the ioctl interface layer was perhaps structured badly and perhaps using different ways to pass the data would avoid a lot of the mess in the existing code.

The actual remapping code is fast, clean and works. It also has about zero impact if the LVM is disabled.

Regarding the petition, Matthew Wilcox retorted, "Someone decided to petition Linus. I never saw a post from Linus asking whether anyone wanted it or not. Hopefully this lame effort will have no impact on whether or not LVM goes in. This sort of thing is more likely to prejudice people against it, to be honest." And Pedro M. Rodrigues added, "I agree. Even though i do have a need for LVM in my servers (they look kinda ugly next to those AIX machines) i felt awkward when i knew about the petition. Don't ask me why, but i feel that a lot of people asking for the same thing doesn't mean it's the right thing to do." But David replied:

I think you took my message the wrong way; my impression from the LVM site was "LVM meets the technical criteria, but Linus wants an idea of who is interested in support for this" - maybe Linus replied to a message and didn't CC lkml. Maybe not. Still, LVM made it into a couple of -ac patches, so I figure it's technically sound.

I agree that the idea of petitioning for a patch that Linus has rejected based on his design judgement is, well, lame. 'Petition' was maybe the wrong word, but not my choosing.

Going back to Alan's big list, Jamie Lokier replied with some suggested additions:

  1. memory detection is broken and may be causing fs corruption
  2. IDE DMA failures

Andre Hedrick replied, of the IDE DMA failures, "This I have fixed, and hope to submit something to night. There are a few more general features and fixes also." And Nathan Zook said of Jamie's item #1, "Actually, the "famous" memory detection break IS fixed. If I had been paying better attention, it would have been fixed months ago instead of weeks."

Martin Mares also had some things to add to Alan's big list; among other things, he asked, "Are we going to merge new drivers by Donald Becker? If so, merge his PCI scan code first and make it call 2.3 functions." Alan explained:

I tried a couple of times to sort the mess out. I never got the PCI scan code to work right, it kept not checking resources, registering I/O as memory and stuff. I at least don't think its ready. The drivers also seemed to have problems (not that the existing ones dont). Right now I'd rather 2.3.x picked up the fixes from 2.2.14 for tulip and acquired the eepro100 and other work people have been doing then the bug fixes from Donalds driver updates.

Its not that I don't think Don is on the right track, just I don't think he has the right implementation.

Martin offered, "If you want, I can take a look at his PCI probing code and fix the resource problems. It seems to be close to trivial." Alan didn't reply.

In the same exchange with Alan, Martin had also suggested in a separate threadlet, "Check whether VISWS compiles and if there's nobody to test it, mark it as experimental. It's probably got broken by many i386 changes." Alan replied, "Thats IMHO SGI's problem. Im assuming they will at some point resync with it and produce a small chunk of needed patches," and Martin said, "They say nobody is working on Linux VISWS stuff now."

 

2. /proc And sysctl()
6 Jan 2000 - 12 Jan 2000 (43 posts) Archive Link: "/proc guidelines and sysctl"
Topics: BSD, Executable File Format, FS: procfs, Ioctls, Networking
People: Benjamin Reed, Linus Torvalds, Marcin Dalecki, Alexander Viro, Andi Kleen, Theodore Y. Ts'o, Mark Lord

Benjamin Reed wrote a wireless ethernet driver that used /proc as its interface. But he was a little uncomfortable defining his own namespace under /proc, and asked if there were any conventions he should follow. He added, "And finally, what's up with sysctl? Are driver writers recommended to use that over extending /proc or is it deprecated? Again guide lines would be nice."

Linus Torvalds replied with:

The thing to do is to create a

/proc/drivers/<drivername>/

directory. The /proc/drivers/ directory is already there, so you'd basically do something like

create_proc_info_entry("driver/mydriver/status", 0, NULL, mydriver_status_read);

to create a "status" file (etc etc).

For the sysctl question, he added, "sysctl is deprecated. It's useful in one way only: it has some nice functions that can be used to add a block of /proc names. However, it has other downsides (allocating silly numbers etc - there should be no need for that, considering that the /proc namespace is already a perfectly good namespace)."

Marcin Dalecki flamed Linus:

Are you just blind to the never-ending format/compatibility/parsing/performance problems the whole idea behind /proc induces inherently? Oh yes they don't turn up that frequently any longer, since everybody learned in the time between don't touching anything there like a heap of shit. Instead of changing something, one leaves the broken /proc interface where it is and adds just another new file (or even dir) there.

My favorite examples for how broken they are

/proc/stat the information there is entirely *broken* misleading and incomplete. (leftover from early days.)
/proc/pci static data continuously reconstructed on the fly. (binary to string and then back string to binary in userland...) And now (2.3.xx) it's even binary only...
/proc/cpuinfo same here static data. uname is since the beginning the proper interface for this stuff.
/proc/ksyms entirely redundant and not used by the modutils.
/proc/modules entirely redundant to the module syscalls. *Not* used by lsmod.
/proc/version entirely static data with no apparent value
/proc/kmsg entirely redundant to syslog.

One could continue with no end...

root:/proc# cat meminfo
total: used: free: shared: buffers: cached:
Mem: 64577536 62787584 1789952 20643840 1339392 17186816
Swap: 139821056 36478976 103342080
MemTotal: 63064 kB
MemFree: 1748 kB
MemShared: 20160 kB
Buffers: 1308 kB
Cached: 16784 kB
SwapTotal: 136544 kB
SwapFree: 100920 kB

Wonderful!!!! The same data twice, albeit no one of them easily parsed! Easily parsed? By what? AWK? SED? or should the procps utilities being implemented in damn PERL? (Some losers who don't know C would appreciate this, certainly) !!!!! The only thing I'm missing is adding floating point formats to this...

And then there is the phenomenon of proliferation of /proc items. Just an example...

root:/proc/ide# find /proc/ide
/proc/ide
/proc/ide/drivers
/proc/ide/hdd
/proc/ide/ide1
/proc/ide/ide1/hdd
/proc/ide/ide1/hdd/capacity
/proc/ide/ide1/hdd/settings
/proc/ide/ide1/hdd/model
/proc/ide/ide1/hdd/media
/proc/ide/ide1/hdd/identify
/proc/ide/ide1/hdd/driver
/proc/ide/ide1/model
/proc/ide/ide1/mate
/proc/ide/ide1/config
/proc/ide/ide1/channel
/proc/ide/hda
/proc/ide/ide0
/proc/ide/ide0/hda
/proc/ide/ide0/hda/smart_thresholds
/proc/ide/ide0/hda/smart_values
/proc/ide/ide0/hda/geometry
/proc/ide/ide0/hda/cache
/proc/ide/ide0/hda/capacity
/proc/ide/ide0/hda/settings
/proc/ide/ide0/hda/model
/proc/ide/ide0/hda/media
/proc/ide/ide0/hda/identify
/proc/ide/ide0/hda/driver
/proc/ide/ide0/model
/proc/ide/ide0/mate
/proc/ide/ide0/config
/proc/ide/ide0/channel

Hell only God knows what they are good for! And there is no userland tool for this. This is the last thing Mark Lord added before ditching ide development.

root:/proc/sys# find /proc/sys | wc
208 208 7305

Don't tell me any sane admin will fiddle with ALL this... And in esp. any sane system doesn't need this degree of pseudo configuration flexibility.

And here my ABSOLUTE FAVORITE:

PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
21821 root 19 0 1032 1032 816 R 0 4.7 1.6 0:00 top
    *
   ***
  *****
 *******
*********
   ***
   ***
   ***
   ***
   ***
   ***

Yes reading files, walking dirtrees and parsing them is indeed very very time consuming. I would like to know how well this design will scale to an enterprise server with 32 CPU and X*10000 concurrent processes:

user:~/mysweethome: Message from root@localhost to user@localhost received... BLAH BLAH: "Please stop any intensive intermittent computational activity. Due to maintenance work I'm going to run ps auxw in 5 minutes. Thanks in advance for your understanding! Yours sincerely: root@localhost"

Oh don't tell me procps could have been done better, there were years of time for this and apparently nobody managed to get it right for practical reasons..

I think you don't write enough user-land code... (just a guess) go and just compare for example the ps/netstat utilities from *BSD just to see WHY /proc as it is, is a BAD design :-).

Maybe it appears cute as an idea to have something like this, but in practice something like this is inevitably going to result in a coding mess in esp. in an such uncoordinated effort like Linux.

And I didn't even tell a word about the bloat/mess/races inside the kernel code caused by this all...

Really man sysctl *is* much much saner and what should be "deprecated" is /proc

There was a bit of discussion, but Linus did not reply.
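
To give a rough idea of what the parsing complaint amounts to: the meminfo output in Marcin's post carries two layouts, a column table (the "Mem:" and "Swap:" rows) and a run of "Name: value kB" pairs. The following is only a minimal sketch in Python of a userland reader for the second layout. The function name parse_meminfo_kb and the SAMPLE string are made up for illustration (the numbers are copied from the quoted output), and nothing here reflects how procps actually reads the file.

```python
def parse_meminfo_kb(text):
    """Return {name: kilobytes} for lines shaped like 'MemTotal: 63064 kB'.

    Lines in any other shape, such as the older column table, are skipped.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Accept only '<name>: <integer> kB'.
        if sep and len(fields) == 2 and fields[1] == "kB" and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


# Sample text copied from the output quoted in the post (shortened).
SAMPLE = """\
total: used: free: shared: buffers: cached:
Mem: 64577536 62787584 1789952 20643840 1339392 17186816
Swap: 139821056 36478976 103342080
MemTotal: 63064 kB
MemFree: 1748 kB
SwapTotal: 136544 kB
"""

info = parse_meminfo_kb(SAMPLE)
```

In the quoted sample the two layouts do describe the same quantity: MemTotal is 63064 kB, and 63064 * 1024 is 64577536, the first figure on the "Mem:" row.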

Alexander Viro replied to Linus' statement that sysctl was deprecated. He burst out with:

Oh, please! All we need is sysctlbyname(2) - _not_ a problem, and closes all problems with numbers. And it should not work through mounted procfs - we can traverse the tree doing comparisons by name just fine. The fact that sysctl(8) needs mounted procfs is an artificial misfeature, nothing more.

What _is_ bogus is the idea of sysctl() doing more than read/write access to constant-sized variables. Or procfs entries doing ioctl(), for that matter - just look at /proc/mtrr, for one specimen.

sysctl() is a perfectly reasonable subset of pseudofs-type stuff, with well-defined semantics (unlike the rest ;-/). The rest is pretty much a maze of twisted little formats, none alike. IMO dissolving the thing is _not_ a good idea. You have the final word, indeed, but I think that sysctlbyname() may remove most of the problems.

Linus replied that he'd accept a patch to turn sysctl into a proc-only thing. He added, "The current problem is that sysctl tries to be more than proc, and has its own name-space etc. Not worth it." Andi Kleen proposed, "The nice thing of giving up the sysctl numbers is that it would be possible to use some ELF section based scheme for declaring sysctl variables in nice wrapper macros. You could get a sysctl variable with a single declaration. This would make them a lot more easy. Would you accept a patch for that?" Linus replied, "Show me the patch, and I can consider it. It would certainly be nicer than what it is now (the include/linux/sysctl.h file is EVIL, and a perfect example of the kind of idiotic brokenness we used to have in /proc before it was cleaned up)."

Theodore Y. Ts'o also replied to Alexander, saying, "I actually like the original sysctl() design --- including the use of reserved numbers. After all, we have system calls, and we don't try to look up system calls when we executed them by name..... why is this OK for system calls, but not OK for sysctl()?" Linus replied:

Because system calls are performance-sensitive.

And system calls are not clearly "hierarchical".

And system calls are supposed to be there regardless of what software and hardware configuration we have there.

In contrast, sysctl isn't all that performance-sensitive, AND they are extremely hierarchical, AND they depend on configuration and timing.

In short, sysctl NEEDS:

  • "naming": you cannot name the sysctl space with a number: it is much too dynamic for that. How do you enumerate drivers? Give them random numbers?
  • "listing": showing which sysctl's are there, in a hierarchical manner. Again, a listing is useless with a number.
  • "hierarchy". You have different devices, but they have the same controls. Do they get the same name? Yes. But in different places in the hierarchy.

In short, you NEED a filesystem. You need to be able to "ls" the thing. You need to be able to search the thing. You need to be doing all the things you can do with a real filesystem.

And flattening it out and trying to number it does not work. Never has, never will. It's not an enumerated space.
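
The three requirements Linus lists (naming, listing, hierarchy) can be illustrated with a toy tree keyed by name. This is only a sketch in Python under stated assumptions: the class SysctlTree, its methods, and the paths "dev/eth0/mtu" and "dev/eth1/mtu" are all hypothetical, and it says nothing about how the kernel's sysctl or procfs code is organized.

```python
class SysctlTree:
    """Toy hierarchical namespace: directories are dicts, leaves are values."""

    def __init__(self):
        self.root = {}

    def register(self, path, value):
        """Create an 'a/b/c' style entry, making intermediate directories."""
        *dirs, leaf = path.strip("/").split("/")
        node = self.root
        for name in dirs:
            node = node.setdefault(name, {})
        node[leaf] = value

    def lookup(self, path):
        """Walk the tree by name; raises KeyError if a component is missing."""
        node = self.root
        for name in path.strip("/").split("/"):
            node = node[name]
        return node

    def listing(self, path=""):
        """Return the sorted names directly under a directory, like 'ls'."""
        node = self.lookup(path) if path.strip("/") else self.root
        return sorted(node)


tree = SysctlTree()
# Two hypothetical devices expose the same control name in different places.
tree.register("dev/eth0/mtu", 1500)
tree.register("dev/eth1/mtu", 576)
```

Here both devices use the name "mtu", and what tells them apart is their position in the tree; tree.listing("dev") enumerates whichever devices happen to be registered, which a fixed table of numbers cannot do without assigning a number to every possible driver in advance.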

 

3. Block Device Interface Change And Related Pain
7 Jan 2000 - 11 Jan 2000 (52 posts) Archive Link: "[ANNOUNCE] block device interfaces changes"
Topics: FS: NFS, FS: devfs, FS: procfs, Ioctls, POSIX, Real-Time
People: Alexander Viro, Richard B. Johnson, Victor Khimenko, Gregory Maxwell, Rik van Riel, David Parsons, Alan Cox, Horst von Brand, Jamie Lokier, Theodore Y. Ts'o, David Lang, Donald Becker

Alexander Viro announced that the block device interface would be changing, and that some of these changes had made it into 2.3.38; he listed:

  1. New type (struct block_device) is defined. We have a cache of such objects, indexed by dev_t. struct block_device * is going to replace kdev_t for block devices. Handling of the cache is done in fs/block_dev.c
  2. They have methods (struct block_device_operations). Currently the set is { open, release, ioctl, revalidate, check_media_change }. For now (and it's going to change) types are the same as in file_operations. However, in the near future they are going to become
     int (*open)(struct block_device *bdev, mode_t mode, unsigned flags);
     int (*release)(struct block_device *bdev);
     int (*ioctl)(struct block_device *bdev, unsigned cmd, unsigned long arg);
     int (*revalidate)(struct block_device *bdev);
     int (*check_media_change)(struct block_device *bdev);
  3. ->revalidate() and ->check_media_change() disappeared from file_operations.
  4. register_blkdev() takes block_device_operations instead of file_operations now. For one thing, it means that block devices are more or less insulated from all future changes in file_operations (Good Thing(tm)). For another, it means that drivers should be modified. I did the change for all drivers in the main tree, see the patch for details. It's pretty easy.
  5. blkdev_open() doesn't change ->f_op. def_blk_fops has all needed methods (open, release and ioctl call the methods from block_device_operations, indeed).
  6. Inodes got a new field: i_bdev. Filesystems should not worry about it - just remember to call init_special_inode() when you are initializing device/fifo/socket in-core inode (in foo_read_inode() or in foo_mknod(); all filesystems in the tree are doing it now). Contents of this field: pointer to struct block_device if it is a block device inode, NULL otherwise.
  7. Superblocks got a new field: s_bdev. Handled by code in fs/super.c, points to the struct block_device if the mount is device-backed, NULL otherwise (i.e. for NFS, CODA, procfs, etc.).
  8. do_mount() first argument is struct block_device * now. It does the right thing for non-device mounts - just pass NULL and it will work (allocate the anonymous device, etc.)
  9. Instead of calling get_blkfops(), use ->bd_op in struct block_device. Moreover, better use blkdev_get()/blkdev_put()/ioctl_by_bdev() (see examples in mm/swapfile.c, drivers/char/raw.c, fs/super.c, fs/isofs/inode.c, fs/udf/lowlevel.c).
  10. Thing that is probably going to happen RSN: instead of struct gendisk per major we may want to go for struct gendisk per _disk_. It would mean that at some point near ->open() we will put the pointer to it into the struct block_device. One obvious consequence being that partitions-related ioctls() will become completely generic.

Notice that it is _not_ the same as devfs (and not a beginning of moving devfs into the main tree). It just provides the backplane - no namespace, no nothing. Inodes (either in normal filesystems or in devfs) point to such animals. That's it. Eventually things like ->b_dev, ->b_rdev, ->i_dev, ->rq_dev, etc. are going to become pointers to such objects, but it will be done step-by-step - otherwise we'll end up with a moby patch and moby breakage in bargain...

Character devices are not affected at all - IMO using the same type both for block and character device was a mistake. So their handling remains as-is. Probably something should be done for them too, but that's completely different story.
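
The two central ideas in the announcement, a cache of per-device objects indexed by device number and a separate method table registered by each driver, can be modelled in a few lines. What follows is a toy model in Python, not kernel code: the names get_block_device, register_driver and open_block_device are invented for this sketch, only two of the five methods are modelled, the major number 3 is an arbitrary example, and the 8-bit minor number used to derive the major is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class BlockDeviceOperations:
    """Per-driver method table (only open and release are modelled here)."""
    open: Callable
    release: Callable


@dataclass
class BlockDevice:
    dev: int                                   # device number, the cache key
    ops: Optional[BlockDeviceOperations] = None
    openers: int = 0


_cache: Dict[int, BlockDevice] = {}            # one object per device number
_drivers: Dict[int, BlockDeviceOperations] = {}  # method tables, by major


def register_driver(major, ops):
    """Record a driver's method table under its major number."""
    _drivers[major] = ops


def get_block_device(dev):
    """Return the cached object for dev, creating it on first use."""
    bdev = _cache.get(dev)
    if bdev is None:
        bdev = _cache[dev] = BlockDevice(dev)
    return bdev


def open_block_device(dev, mode=0, flags=0):
    """Attach the driver's methods to the cached object and call open."""
    bdev = get_block_device(dev)
    bdev.ops = _drivers[dev >> 8]              # assumes an 8-bit minor number
    return bdev.ops.open(bdev, mode, flags), bdev


def demo_open(bdev, mode, flags):
    bdev.openers += 1
    return 0


def demo_release(bdev):
    bdev.openers -= 1
    return 0


register_driver(3, BlockDeviceOperations(open=demo_open, release=demo_release))
status, bdev0 = open_block_device((3 << 8) | 0)
```

The point of the model is that the methods receive the block device object itself rather than a file or inode, so nothing in the driver-facing signatures depends on the file-related structures.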

Richard B. Johnson picked himself up off the floor and said:

Good grief Charley Brown! You, in a few key-strokes, just blew away major portions of the work done over the past few years by software engineers who ported their drivers to Linux. Linux will never be accepted as a 'professional' operating system if this continues.

It's enough of a problem putting one's job on-the-line convincing management to risk new product development to Linux. Once these products are in Production, and bugs are discovered in the OS, we must be able to get the latest version of the OS and have our drivers compile. If this is not possible, you do not have an operating system that is anything other than an interesting experiment.

For instance, there was a simple new change in the type of an object passed to poll and friends. This just cost me two weeks of unpaid work! Unpaid because I had to hide it. If anyone in Production Engineering had learned about this, the stuff would have been thrown out, the MicroCreeps would have settled in with "I told you so..", and at least three of us would have lost our jobs.

Industry is at war. You can't do this stuff to the only weapons we have. Once you claim to have a "Professional Operating System", its development must be handled in a professional way. If major kernel interface components continue to change, Linux is in a heap of trouble as are most all of those who are trying to incorporate it into new designs.

The industrial use of Linux is not at the desktop. It involves writing drivers for obscure things like machine controllers (read telescope controllers), Digital signal processors (read medical imaging processors), and other stuff you can't buy at the computer store. It doesn't matter if you fix all of Donald Becker's drivers to interface with the new kernel internals. You have still broken most everything that counts.

There were a number of replies to this. Alexander found Richard's post clueless and Monty-Pythonesque. On a serious (though annoyed) note, he explained, "one of the worst things about block drivers-to-kernel interface is that they share it with files. I.e. _any_ change in file_operations or in struct file or in struct inode and you are deep in it. Change the size of any field prior to ->i_dev and you are in for recompile. Change <gasp> device number bitness and even recompile may be of little help. Removing those dependencies (not all of them are removed yet, more will follow) is going to save _your_ ass a year later."
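
Alexander's remark about changing the size of a field ahead of ->i_dev can be made concrete with Python's ctypes module, which lays out structures the way a C compiler would. The two structures below are hypothetical three-field layouts invented for this sketch; they are not the real struct inode, and the field widths are assumptions chosen only to show the effect.

```python
import ctypes


class InodeOld(ctypes.Structure):
    # Hypothetical layout, NOT the real struct inode.
    _fields_ = [
        ("i_ino", ctypes.c_uint32),
        ("i_mode", ctypes.c_uint16),
        ("i_dev", ctypes.c_uint16),
    ]


class InodeNew(ctypes.Structure):
    # Same fields, but i_ino has been widened from 32 to 64 bits.
    _fields_ = [
        ("i_ino", ctypes.c_uint64),
        ("i_mode", ctypes.c_uint16),
        ("i_dev", ctypes.c_uint16),
    ]


old_offset = InodeOld.i_dev.offset   # where old binaries expect i_dev
new_offset = InodeNew.i_dev.offset   # where it lives after the change

# A driver built against the old layout, handed an object in the new layout,
# reads i_dev from the old offset and gets bytes that belong to i_ino.
new = InodeNew(i_dev=0x0301)
stale = InodeOld.from_buffer_copy(bytes(new)[:ctypes.sizeof(InodeOld)])
misread_dev = stale.i_dev
```

With these widths i_dev moves from byte offset 6 to byte offset 10, so code compiled against the old layout no longer finds the device number and has to be recompiled, which is the dependency the new interface is meant to remove.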

Also replying to Richard, Victor Khimenko said, "Drivers MUST be changed with new kernel release (and thus via development branch: development kernels are just snapshots of development process after all). It was true from the start and it'll be true tomorrow. It's true for most OSes available. It's ESPECIALLY true for Linux where drivers are linked directly in kernel. If you expected something other then you made wrong choice choosing Linux."

Gregory Maxwell said to Richard:

We all know your position on compatibility. :) Many people, including myself, usually understand and agree with it.

However, you are going a little far on this one.

The change is going into 2.3.x, and that *IS* the appropriate place to break interfaces. These kinds of changes should certainly not be introduced into 2.2.x.

This should cause you little difficulty, as your example of having to upgrade to fix a bug should not apply. When you upgrade to fix a bug then you should just be increasing patchlevel. If there is not a patch for a bug in 2.2.x which is fixed in 2.4.x then there is a bug in the Linux development process.

In order to move forward, we *must* break things. To make up for this we continue to maintain old versions. There are still bugfixes being made against 2.0.x and there will be bugfixes against 2.2.x. RedHat even still issues updates against RH4.2..

So if this were to have occurred within a stable kernel version, or if it had severely affected userspace, I would agree.

Rik van Riel put it this way to Richard:

Industrial use of Linux usually doesn't involve the kernels which are marked as `development', ie. where the `middle' version number is odd and where major things are expected to change.

People venturing out on that terrain can know what they're heading into (see http://kt.zork.net/) and shouldn't come whining when some actual development happens in the development branch of the kernel. They should only whine when development stops, not when useful changes are taking place...
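
The numbering rule Rik refers to, an odd middle number marking a development series, is simple enough to write down. This is a minimal sketch in Python; the function name is_development_kernel is made up for illustration, and the rule is taken only from the description above as it applies to the kernel versions discussed in this issue.

```python
def is_development_kernel(version):
    """True if the middle component of 'major.minor[.patch]' is odd."""
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError("expected at least 'major.minor': %r" % version)
    return int(parts[1]) % 2 == 1


# Versions mentioned in this issue.
dev_2_3_38 = is_development_kernel("2.3.38")
dev_2_2_14 = is_development_kernel("2.2.14")
```

Under this rule 2.3.38, where the block device changes landed, counts as a development kernel, while 2.2.14 does not.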

But David Parsons objected to Rik, "Except, of course, that when the changes go in they are never backed out so the interfaces remain stable for the production kernels. That's the *really* annoying thing about this line of argument; when else should someone complain that an interface has been turned into gravel? If you wait until the development tree has become a production tree, enough code will be modified to work with the New! And! Improved! interfaces that your complaints (cf: old-style fcntl locking) will be dismissed sight unseen by the Core Team." He added, "The big support providers are the ones who benefit from interface churning. It's the small shops that get bitten in the ass because they don't have enough money to buy programmers or enough time to do the patches." There was no reply to this.

Alan Cox also replied to Richard with the quote of the day, saying, "Linux isnt at war. War involves large numbers of people making losing decisions that harm each other in a vain attempt to lose last. Linux is about winning."

At some point, Richard posted again, having received many private emails in addition to the slew on the list. He said:

I have gotten a lot of mail on this so I will reply only once.

Many of the professional industrial uses of Unix were previously covered using Sun boards, boxes and SunOs. If you ever dial 10 before a long-distance number to get a cheaper rate, that's voice over IP and we make that stuff. This was developed on Suns, runs on them, but will soon be running on cheap Intel clones.

If you ever have to go to the hospital and have a CAT-Scan or a MRI, you are using equipment developed by us, even though the name on the box may be Phillips, General Electric, Toshiba, or various other companies. You can look http://www.analogic.com and see what we do for a living here.

The Sun driver interface has been constant. Unfortunately, you have to install it, meaning link it and reboot. When Installing a system, meaning the complete software package, the end-user's technician installs the OS from a CDROM. Then the application with its drivers are installed from another CDROM. This works on Suns and has been the De-facto standard way of doing things.

Linux was not suitable for the applications running on Suns until Linux provided the installable device driver. The ability to install a hardware-interface module into a kernel was my main selling point for using Linux to replace SunOs, and, indeed the whole Sun architecture.

Incidentally, the cost is the same. A CDROM for Solaris is essentially the same cost as a CDROM for Linux. Once you start distributing an operating system and supporting the distributors, a "free" operating system is no longer free.

By the time a decision was made to produce our new Exact Baggage Scanner, marketed by Lockheed-Martin, engineering management was dragging its feet on the use of Linux. They wanted something that was "everything to everybody", but didn't want the cost of using Suns. Further, it had to be completely under company control.

I was unable to convince anybody to use Linux so I had to write my own Operating System. It is called ARTOS (Analogic Realtime Operating System). Our Sky Computer Division, which produces the world's fastest (still) digital signal processor, made the high-speed stuff, a lowly Intel Pentium with my OS is used as the system controller, and an Alpha Workstation is used for the user interface.

When this was completed, we went on to producing our third generation CAT Scanner. This uses a Pentium as the main system controller and Linux as the operating system. The User Interface uses Windows-NT. It was felt that Linux was sufficiently well-hidden in the bowels of the machine so nobody would care.

The drivers in this machine comprise both block and character devices. One of major building blocks is the driver that interfaces to the Digital Signal Processor. This DSP board comprises up to 32 TMS-320C20 DSPs plus an i960 for interface. It is made by our CDA Division.

Completed data, available within a 32k window, a 512x512x16bit chunk, must be transferred to the User Interface within 1/4 second to make the specification. It does.

Now, our legal department has defined the criteria we must meet to use Linux. They presume that we will provide a "current distribution" of Linux to every end-user. They also defined that, since drivers may be deemed to modify the operating system, we have to provide driver source-code to the customer if they request it. Application code continues to be proprietary.

Changing the kernel interface to drivers is counter productive. In fact it makes the usual field installation impossible. The usual installation would automatically and transparently compile the interface modules, using the new Operating System. This is no longer possible because the compilation will fail.

Again, if Linux is to become other than an interesting experiment, one cannot change these interfaces without understanding the whole picture.

Distributors don't care. The more changes there are, the greater the obsolescence, the more money they make selling new boxes of CDROMs. Therefore there is no controlling negative feedback to be obtained from the distribution channel. You can reject what I say out-of-hand, and continue as an experiment, or you can listen and make a significant contribution to providing jobs worldwide.

It is, of course, possible to fragment Linux. A company could be started, called StableLinux that distributes only Linux n.n.n and performs bug-fixes and maintenance on that version only. This is not helpful to the greater Linux community. Instead, we need to minimize the changes that affect the interfaces to world-wide applications. Just as POSIX attempted to stabilize the API so that one could write "portable" code, the interface to hardware that hasn't even been invented yet has to be stable.

Chris Adams and Horst von Brand suggested that "current distribution" referred only to kernels with even-numbered minor versions. Horst expanded, "OK, "current distribution" means 2.2.x kernel today, and was 2.0 sometime back. It will be 2.4 in a few months time, and perhaps 2.6 in a year and a half. You are supposed to distribute the machine and source to drivers &c _when shipped_, I'd assume. Check the code, test it to breaking *and keep it*. Ship that to customers, and either offer upgrades to 2.4 if needed for some reason, or stay put."

Elsewhere, replying to Richard's original post, Jamie Lokier said, "If you need a stable API, you chose the wrong operating system. It's no secret that Linux APIs change. You can't blame the kernel developers for doing exactly what they said they will do. If you want, you can blame the people who incorrectly assumed the APIs would stay the same, for not investigating the obvious." And Ted added, "If you told your management that Linux kernel interfaces never change across versions, then you were sadly mistaken. However, the mistake is on your end, I'm afraid."

To this, Richard replied:

No. According to our Legal Department, to satisfy the GPL requirement that we provide source to the end-user, they required that we supply a "current" distribution of Linux if the end-user requests it.

This seemed, by them, to be an easy solution to possible problems. Unfortunately, for Engineering, this means that we have to keep everything "current" during development so that, by the time equipment is shipped, it will run with the "current" distribution (whatever this is).

The obvious solution, given these constraints, is that we just ignore all changes until shipping time, then attempt to compile with the latest distribution, fixing all the problems at once. However, we then end up shipping untested software which ends up being another problem. Checking to see if it "runs" isn't testing software in the cold cruel world of industry.

So, presently, I have 13 drivers I have to keep "current". Yesterday they all got broken again. A week before, half of them were broken because somebody didn't like a variable name!

That said, a major problem with changes that I see, is that the changes are made without the notion of a terminating condition. For instance, new parameters are being passed to existing interface functions.

If you are going to break an interface, you should plan on only breaking it once rather than opening the door for more changes and leaving it open. For instance, once you have to pass more than (depends upon the machine) about 3 parameters, it's best to put them all in a parameter- list (structure) and pass only the address of the parameter list (pointer).

From that time on, you only have to add structure members to the parameter list if you have to add changes. If I had seen these kinds of changes I would not have complained. It means I have to rework stuff only once.

So `read(f,.......)` should have been changed to `read(params *)` and you are done with it forever as long as you don't change structure member names and functions for kicks.
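The parameter-block idea Richard describes can be sketched in C as follows. This is only an illustration under stated assumptions: `demo_read_params`, `demo_read`, and `demo_old_caller` are hypothetical names invented here, not kernel interfaces, and a string stands in for the device being read.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parameter block; none of these names exist in the kernel. */
struct demo_read_params {
    const char *src;   /* NUL-terminated data standing in for a device */
    char *buf;         /* destination buffer, at least `count` bytes */
    size_t count;      /* maximum number of bytes to copy */
    size_t offset;     /* imagine this member was added in a later revision */
};

/* Copy up to p->count bytes from p->src + p->offset into p->buf.
 * Returns the number of bytes copied. */
size_t demo_read(const struct demo_read_params *p)
{
    assert(p != NULL);
    size_t len = strlen(p->src);
    if (p->offset >= len)
        return 0;
    size_t n = len - p->offset;
    if (n > p->count)
        n = p->count;
    memcpy(p->buf, p->src + p->offset, n);
    return n;
}

/* A caller written before `offset` existed. Designated initializers leave
 * unnamed members zeroed, so this source still compiles and behaves the
 * same after the struct grows. */
size_t demo_old_caller(char *buf, size_t count)
{
    struct demo_read_params p = {
        .src = "hello, driver",
        .buf = buf,
        .count = count,
    };
    return demo_read(&p);
}
```

With this shape, adding a member leaves the function signature alone, so existing callers need no source changes. They do still have to be recompiled, because the size and layout of the structure change.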

This time it was Alexander's turn to pick himself up off the floor. In response to the first paragraph of Richard's post, he said, "Oh. My. God. They are requiring you to do WHAT??? Do you mean that you really ship 2.3.x to your customers? Arrggh. "Source" == "source of what we are shipping". And not "anything that was written by other guys who started from the same source". It's utter nonsense. _No_ license can oblige you to include the modifications done by somebody else. Otherwise you'ld have those drivers in the main tree, BTW - _that_ much should be clear even for your LD." But David Lang put in, "he is not saying that he has to ship a 2.3 kernel, he is reacting to the fact that he will have to ship a 2.4 kernel. the blame for this lies squarly on the legal department who decided that they had to ship a "current" disto. There is some semblance of reason for this as they want to try and limit the support costs by not using "obsolete" versions, but given the way many of the major distros patch the kernel before shipping it you still may have problems. The answer is to figure out some way to educate the legal department to allow for a more gradual change."

 

4. TESO Security Alert
9 Jan 2000 - 11 Jan 2000 (10 posts) Archive Link: "Linux Kernel 2.0.x/2.2.x local Denial of Service attack"
Topics: BSD, Networking, Security
People: Sebastian KrahmerAlan CoxVictor KhimenkoAlexey Kuznetsov

Sebastian Krahmer from the TESO group posted another security advisory. This one included some code to exploit the hole, and summarized, "A weakness within the Linux 2.0.x and Linux 2.2.x kernels has been discovered. The vulnerability allows any user without limits on the system to crash arbitary processes, even those owned by the superuser. Even system crashes can be experienced."

Victor Khimenko replied that this problem had been known for at least a year and had been discussed many times on linux-kernel; he added unhappily that there was still no solution in the main tree. Alexey Kuznetsov replied that the problem had existed for at least 5 years, and Alan Cox added, "I know no single Unix like OS I can't bring down if I dont have resource limiting. Also for that matter Ive yet to meet one I can't kill even with resource limiting in place." Victor then quoted Sebastian as saying in private email:

Oh, I didn't knew that. I know that this is no common malloc() bomb problem, and we haven't heard about it, so we want to make it public, even if it is known to the kernel developers. A bit pressure to the admins side could not be wrong to use resource limits.

Btw, any BSD we tried on doesn't suffer from this or similar problems.

Sebastian objected to his private email being posted out of context to a public forum, and went on to explain:

my point was, I believe in both, full-non-disclosure (keeping something to just yourself or to a very small group) or full-disclosure (sharing it with everyone at the same time).

The point now is, that many Linux distributions ship with no resource limitations activated by default, and a lot of administrators don't know about them or how to enable them. By raising public attention to this problem you bring many administrators to raise the barrier by enforcing resource limits, which is good.

Creating this pressure is often seen in a bad way by both developers ("we developers want to be notified first to fix it") or by companies, of which some ignore security issues until enforced by customers to fix them (hi bill).

However, I think of this pressure as a necessary thing, and that was what I meant in the mail to Khimenko.

On the other hand, Unix wasn't build for DoS-users, and I'm sure Alan is able to crash mostly anything. But using resource limits anyway is a good thing and any admin should use them.

He continued, "Another thing which Khimenko "accused" me of was that this has been known on the kernel mailing list for ages. I remember on our first advisory that the TCP spoofing vulnerability in Linux <= 2.2.12 kernels was known to the developers also, but no one outside the mailing list ever noticed it, if we wouldn't have published an advisory about it (we found it independently though), no one would know it today. So please don't blame me that I cannot read 250++ mails per day just to ensure we don't release something already known to some people."

Alan agreed that raising public awareness of the problem was a good thing; and that TESO could not be expected to comb through linux-kernel just to make sure the exploits they discovered were truly unknown.
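For readers who have never set a resource limit, here is a minimal sketch of lowering one from C using the POSIX getrlimit() and setrlimit() calls. The helper name `cap_soft_limit` is invented for this example, and the thread does not say which limits blunt this particular exploit, so the choice of resource is left to the caller.

```c
#include <sys/resource.h>

/* Hypothetical helper: lower the soft limit of `resource` to at most `cap`.
 * It never raises a limit and leaves the hard limit untouched.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
int cap_soft_limit(int resource, rlim_t cap)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;

    /* Only lower: an unlimited or larger soft limit is replaced by cap. */
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > cap)
        rl.rlim_cur = cap;

    return setrlimit(resource, &rl);
}
```

A program might call, say, `cap_soft_limit(RLIMIT_NOFILE, 256)` early in main(); the limit is inherited by any children it forks. Administrators more commonly apply limits at login time, for instance with the shell's `ulimit` builtin, rather than in each program.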

 

5. linux-kernel Mailing List Problems
11 Jan 2000 - 12 Jan 2000 (4 posts) Archive Link: "[OT]mailing list delay"
People: Matthias AndreeDavid S. MillerAndreas Tobler

Andreas Tobler reported a 5 to 6 hour delay receiving linux-kernel mail. A couple of people confirmed experiencing the same thing, and Matthias Andree speculated, "I recall that vger deferred incoming mail because it was short of disk space some days ago. No inbound mails, no outbound mails." David S. Miller explained, "There was a disk space issue on vger, this clogged up the queue for about 12 hours yesterday."

 

6. Relaxing Of US Crypto Laws
11 Jan 2000 - 17 Jan 2000 (15 posts) Archive Link: "2.4 and Strong Cryptography..."
Topics: BSD: OpenBSD, Patents
People: Michael H. Warfield

Michael H. Warfield gave a pointer to the draft Encryption Export Regulations (http://www.cdt.org/crypto/admin/991217draftregs.shtml). He quoted the paragraphs relevant to Open Source:

SEC. 740.13 TECHNOLOGY AND SOFTWARE - UNRESTRICTED (TSU)
(e) Unrestricted Encryption Source Code

  1. Encryption source code controlled under 5D002 which would be considered publicly available under Section 734.3(b)(3) and which is not subject to an express agreement for the payment of a licensing fee or royalty for further commercial production or sale of any product developed with the source code is released from EI controls and may be exported or re-exported without review under License Exception TSU, provided you have submitted written notification to BXA of the Internet address (e.g. URL) or a copy of the source code by the time of export. Submit the notification to BXA and send a copy to ENC Encryption Request Coordinator (see Section 740.17(g)(5) for mailing addresses).
  2. You may not knowingly export or re-export source code or products developed with this source code to Cuba, Iran, Iraq, Libya, North Korea, Sudan or Syria.
  3. Posting of the source code on the Internet (e.g., FTP or World Wide Web site) where the source code may be downloaded by anyone would not establish "knowledge" as described in subparagraph (2) of this section. In addition, such posting would not trigger "red flags" necessitating the affirmative duty to inquire under the "Know Your Customer" guidance provided in Supplement No. 3 to Part 732.

He added:

You'll notice that the second paragraph is the stock "restricted countries" list and the third paragraph is a "safe haven" clause for ftp/http posting.

This basically says that crypto source code which is unencumbered may be exported merely by notifying them of the URL (mailto URL's????) where it is available from. No review, no approval, no license, no key length silliness, and no inherited encumberances. :-)

I won't post the whole $#@$#@ thing (since you can read it at the CDT site anyways) but for things like "Idea" and "RSA", which ARE encumbered by patents, similar clauses exist at 740.17(a)(5) which say basically the same thing.

This is scheduled to become finalized on January 14. Everything I have heard indicates that there will be no significant changes at this point and these will be the new regulations and will be finalized on schedule.

If these regs get finalized and are in the form we now expect them to be in, can we get the paperwork filled and get IPSEC (and other crypto goodies like ppdd) into the 2.4 kernel? KLIPS (from IPSEC) would be a wonderful win! That would put us up with OpenBSD with integrated IPSEC (OK, IKE, aka pluto, still needs improvement - but that's not a kernel issue).

We can also begin to lobby the distro makers for bundling hardened crypto like PGP, GPG, CFS, TCFS, SSH, etc, etc, etc, as quickly as possible. The faster it's there and the faster it spreads the better we can seal this deal and make it done!

 

We Hope You Enjoy Kernel Traffic
 

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License, version 2.0.