Kernel Traffic #89 For 16 Oct 2000

By Zack Brown


Mailing List Stats For This Week

We looked at 1145 posts in 4566K.

There were 373 different contributors. 171 posted more than once. 134 posted last week too.

 

1. 'soft updates', Journalling, Crashing Virtual Machines; Linus On Current VM vs. Classzone
1 Oct 2000 - 9 Oct 2000 (49 posts) Archive Link: "Soft-Updates for Linux ?"
Topics: BSD, Disk Arrays: RAID, FS: InterMezzo, FS: JFS, FS: ReiserFS, FS: XFS, FS: ext3, Ioctls, User-Mode Linux, Virtual Memory, Web Servers
People: Robert Redelmeier, Daniel Phillips, Albert D. Cahalan, Andreas Dilger, Alexander Viro, Lars Marowsky-Bree, Alan Cox, Erik Andersen, David Woodhouse, Christoph Hellwig, Bert Hubert, Rik van Riel, Linus Torvalds, Andrea Arcangeli, Andi Kleen

Robert Redelmeier asked if Kirk McKusick's 'soft updates' would make it into the kernel in 2.5; he explained, "S-U brings considerable benefits akin to JFS for crash protection to *BSD systems. Essentially, they are ordered disk writes that makes sure data gets on the disk before metadata is altered. They go through the buffering system, so performance isn't bad." Daniel Phillips replied that the soft updates algorithm looked like it would indeed improve filesystem robustness, but that he'd be focusing on ext3 for his own work. He added that for proper testing of soft updates, journalling, or anything else that tried to protect data across system crashes, it was important to reboot many, many times. He explained, "Otherwise, you are just fooling yourself and everybody else. How many crashes does it take to find that one little window of vulnerability that comes up every 10,000 crashes normally but suddenly starts coming up every time just because your customer uses their system a different way?" Robert replied that he still preferred soft updates because of its simplicity as compared with journalling ("with a full load of uncertainties"). He agreed very strongly with the idea of heavy testing via crashes, but added, "The problem is the reset button is only connected to the CPU and the hard disk will probably continue to write out sectors from it's hw buffer. OTOH, I don't like the idea of pulling the plug too often. It's very hard on the hardware. I'd expect a mechanical disk failure before 10,000 cycles."

Albert D. Cahalan suggested, "The nice way to develop this code is with a block device that discards all writes after a timer goes off." Andreas Dilger described a patch he'd written "to the loopback device which allows you to discard I/Os going to disk. You can either activate it via an ioctl from user space, or via a function call in the kernel." He went on:

You can also make reads fail, but this was not very useful for me, because it caused the ASSERTs in ext3 to oops. Also the read "failures" are not the same as the real thing, so it may not be a valid test. They only return a zero'd page, rather than really causing a non-up-to-date page.

I used it quite a bit when developing the orphan code for ext3, and for testing journal integration in InterMezzo. You can use it for testing a loopback file, or loopback mount a block device, but as with regular loopback devices, there is a 2GB limit.

I posted it to fsdevel a few months ago, but I have also uploaded it to: ftp://ftp.stelias.com/pub/adilger/loopdiscard-2.2.16.patch ftp://ftp.stelias.com/pub/adilger/loop_discard.c

The loop_discard.c program simply calls the ioctl to enable or disable I/O on the specific loop device. Unconfiguring the loop device also resets the I/O status.
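
To make the idea concrete, here is a minimal user-space sketch of a device that silently drops writes once a switch is flipped. It is not Andreas' loopback patch and it does not use any real ioctl; the class name, the `set_discard()` method, and the in-memory block list are all hypothetical stand-ins chosen for illustration.

```python
class DiscardingBlockDevice:
    """Toy in-memory block device that can be told to drop all writes.

    Hypothetical illustration only: set_discard() stands in for the
    ioctl (or in-kernel call) that toggles I/O on the loop device.
    """

    def __init__(self, num_blocks, block_size=512):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        self.discard_writes = False
        self.dropped = 0  # number of writes thrown away so far

    def set_discard(self, enabled):
        """Enable or disable write discarding."""
        self.discard_writes = enabled

    def write(self, block_no, data):
        """Store one block, unless discarding is active."""
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        if self.discard_writes:
            self.dropped += 1  # the caller is not told; the write just vanishes
            return
        self.blocks[block_no] = bytes(data)

    def read(self, block_no):
        """Return the last contents that actually reached the device."""
        return self.blocks[block_no]


def simulate_crash_run():
    """Write one block, 'crash', write another, then inspect the device."""
    dev = DiscardingBlockDevice(num_blocks=4)
    dev.write(0, b"A" * 512)   # reaches the device
    dev.set_discard(True)      # simulated crash point
    dev.write(1, b"B" * 512)   # silently lost
    dev.set_discard(False)     # "reboot": the device accepts I/O again
    return dev.read(0), dev.read(1), dev.dropped
```

A test harness built this way could flip the switch at a different point on each run and then check whether the filesystem under test recovers to a consistent state.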

Also in reply to Albert, Daniel suggested:

Somebody should write a crash-simulator along those lines and do some real head-to-head comparisons amongst the various candidates for the "most reliable filesystem" crown. Doing a million simulated crashes under controlled conditions might not catch every possible thing that could go wrong but it would sure help remove the subjective opinion element. I think it would spur on development too because we'd have a real yardstick to measure progress against.

The tricky part of the crash simulator would be recovering the resources the filesystem was using and convincing the VFS to let go of the the partition. If you could return the system to a stable state you could do many, many more test runs in the same time. Maybe VMWare could help here.

Andi Kleen replied that user-mode Linux might be a better choice, since it was easy to crash intentionally, and was also free, unlike VMWare. Daniel agreed, and added, "This looks like a really lightweight solution for 90% of my filesystem development. I'm starting to get a little bored with the crash and burn, repeat as necessary approach :-)" He commented, "I think VMWare is really a good product, and it's there now. They make it really easy for Windows users to get free. I support them. If they are smart they will open up eventually, and if they are not smart, well, that's what natural selection is for."

Elsewhere, Alan Cox replied to Robert's initial post, saying he thought everyone was heading toward journalling rather than soft updates or any other solution. But Alexander Viro replied, "Not everyone. S-u is not a subset of journalling and IMNSHO it's way cleaner. Kirk's code can't be directly ported - too different VFS behind it, but the concept is sound and well-explored (thanks to Kirk). It can be implemented right now, but I hate to introduce the third patch with large VM changes - 3-way merge (VM parts of s-u/ext3/reiserfs) will suck rocks through the straw. Let's wait until VM infrastructure will be in place."

Ernesto Vargas also asked which journalling filesystems were most likely to succeed in 2.5, and Lars Marowsky-Bree replied:

ext3 is stable on my laptop.

reiserfs is stable at SuSE on a 250 GB RAID with 2.2 million files.

XFS has IMHO the best chance to surpass both in server environments, and GFS is also a very good candidate for (some) clusters.

Alan also replied to Ernesto, "jffs is in 2.4 (but its a log structured fs for flash memory not generic) ext3 and reiserfs are both being used in production boxes as add ons" but Erik Andersen reported, "Unless someone fixed it recently, when jffs was ported to 2.4, it lost the ability to run on block devices and can currently only operate via the mtd layer. In the 2.0.x kernels, it could target block devices." David Woodhouse replied:

It's not that hard to fix it. People are already looking at implementing write gathering into 512-byte sectors _anyway_, in order to support NAND flash devices.

Patches willingly accepted, although I'm not intending to make any more changes to the stable JFFS code which is now in -test9. Except possibly to remove the CONFIG_EXPERIMENTAL dependency.

Elsewhere, also in reply to Robert's initial post, Christoph Hellwig said he didn't think soft updates would be a big player either. He explained, "There are lots of interesting journaling filesystems for Linux aviable (xfs, ext3, jfs, reiserfs, nwfs(?), vxfs(commercial)). And a least a few will be merged in 2.5. For people that disklike the disadvantages of journaling fses, Daniel Philips is working on tux2 which is in some respect similar to su (always writing comsitatn metadata to disk without a journal) but the implementation seems much more interesting ;)" And Bert Hubert reported that Linus Torvalds had told him he was not opposed to adding journalling filesystems to 2.4.x, though probably not in 2.4.0; he added, "If Rik gets some kind of memory pressure callback API in the kernel, there is no theoretical reasons why the journalling filesystems couldn't be merged safely."

Rik van Riel replied, "Once the VM is stable with the current feature set and OOM handling has been added, I'll probably look at the support code for the journaling filesystems. (the other VM things are nice too, but can wait until 2.5)" The rest of the discussion focussed on the virtual memory subsystem, and at one point Linus revealed some of his thinking on the "new VM versus Andrea Arcangeli's classzone VM" debate. While criticizing Rik's VM, saying, "The global freepages number is NOT USABLE. Never will be. Because it's fundamentally a non-valid number to use - it has nothing to do with any reality, and never will," he immediately went on to say, "This is exactly my argument and beef with Andreas zone patches. They simply cannot work in the end because they use global information that should not be used."

Both Andrea and Rik disagreed with Linus' assessment, saying that they were not relying too heavily on global data, and each gave code listings to illustrate their point. There was no reply.

 

2. Assigning 'nice' Values For Disk And Network Activity
1 Oct 2000 - 2 Oct 2000 (8 posts) Archive Link: "Disk priorities..."
Topics: Disks: SCSI
People: Linda Walsh, Rik van Riel, Marc Lehmann, Alexander Viro

Linda Walsh suggested applying 'nice' values to disk and network accesses. She explained, "I was on a server with 4 CPU's but only 2 SCSI disks. Many times I'll see 4 processes on disk wait, 3 of them at a cpu-nice of 19 while the foreground processes get bogged down by the lower priority processes due to disk contention." Rik van Riel loved the idea and asked for patches. He added that for disk niceness, "it would be trivial to adjust the maximum elevator sorting latency according to the niceness of the process. I have no idea how much this would help, though ..." Alexander Viro pointed out that by the time the elevator came into play, the process that initiated the disk activity could be dead, leaving no information as to its niceness value. Rik agreed this would be a problem, and they went back and forth on it a bit. At one point Marc Lehmann remarked, "OS2 had a lot of these things in their scheduler, but, according to subjective reports from a lot of people, it didn't seem to work very well (it slowed downt he scheduler considerably without ever working great)."
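
As a rough illustration of Rik's suggestion, the sketch below scales a maximum sorting latency by a process's niceness. The function name, the base latency of 64, and the linear scaling rule are assumptions made for this example and do not come from the thread or from any kernel code; the sketch also does nothing to address Alexander's objection that the originating process may no longer exist by the time the elevator runs.

```python
def max_elevator_latency(nice, base_latency=64):
    """Return a hypothetical maximum sorting latency for a request.

    nice ranges from -20 (highest priority) to 19 (lowest priority).
    A nicer process tolerates its requests being passed over more often:
    nice -20 gives half the base latency, nice 0 gives the base latency,
    and nice 19 gives roughly one and a half times the base latency.
    """
    if not -20 <= nice <= 19:
        raise ValueError("nice must be between -20 and 19")
    return base_latency * (nice + 40) // 40
```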

 

3. Virtual Memory Saga Continues
2 Oct 2000 - 4 Oct 2000 (6 posts) Archive Link: "[PATCH] fix for VM test9-pre7"
Topics: FS: ext2, SMP, Virtual Memory
People: Rik van Riel, Christoph Rohland

Rik van Riel announced:

The attached patch seems to fix all the reported deadlock problems with the new VM. Basically they could be grouped into 2 categories:

  1. __GFP_IO related locking issues
  2. something sleeps on a free/clean/inactive page goal that isn't worked towards

The patch has survived some heavy stresstesting on both SMP and UP machines. I hope nobody will be able to find a way to still crash this one ;)

A second change is a more dynamic free memory target (now freepages.high + inactive_target / 3), this seems to help a little bit in some loads.

If your mailer messes up the patch, you can grab it from http://www.surriel.com/patches/2.4.0-t9p7-vmpatch

Linus, if this patch turns out to work fine for the people testing it, could you please apply it to your tree?

Christoph Rohland reported, "When I do the same stresstest with mmaped file in ext2 the machine runs fine but the processes do not do anything and vmstat/ps lock up on these processes," but there was no real discussion.
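
The dynamic free memory target in Rik's announcement is plain arithmetic. The sketch below only restates the formula he quoted; the function name is made up, the sample values in the usage note are hypothetical, and integer division is assumed because kernel code works in whole pages.

```python
def free_memory_target(freepages_high, inactive_target):
    """Compute freepages.high + inactive_target / 3 in whole pages."""
    return freepages_high + inactive_target // 3
```

With hypothetical values of 768 pages for freepages.high and 3000 pages for inactive_target, the target comes out at 1768 pages.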

 

4. Pressure On The New VM
2 Oct 2000 - 4 Oct 2000 (5 posts) Archive Link: "TODO list for new VM (oct 2000)"
Topics: Big Memory Support, Clustering, OOM Killer, Virtual Memory
People: Rik van Riel, Linus Torvalds, Ben LaHaise

Rik van Riel posted his ToDo list for the new virtual memory subsystem:

for kernel 2.4, necessary:

  • out of memory handling [integrate the OOM killer, 10 minutes work]
  • fix the highmem deadlock, where the swapper cannot create low memory bounce buffers OR swap out low memory because it has consumed all resources [old bug, already reported with 2.4.0-test6, probably before]

for kernel 2.4, really wanted:

  • page->mapping->flush() callback in page_launder(), for easier integration with journaling filesystems and maybe the network filesystems [about 30 minutes of work on the VM side]

for kernel 2.4, wanted:

  • maybe rebalance the swapper a bit ... we do page aging now so maybe refill_inactive_scan() / shm_swap() and swap_out() need to be rebalanced a bit

for kernel 2.5: (maybe available as patch for 2.4 ???)

  • physical->virtual reverse mapping, so we can do much better page aging with less CPU usage spikes
  • better IO clustering for swap (and filesystem) IO
  • move all the global VM variables, lists, etc. into the pgdat struct for better NUMA scalability
  • (maybe) some QoS things, as far as they are major improvements with minor intrusion
  • thrashing control, maybe process suspension with some forced swapping ?
  • include Ben LaHaise's code, which moves readahead to the VMA level, this way we can do streaming swap IO, complete with drop_behind()

Linus Torvalds came down on him, with:

Why do you apparently ignore the fact that page-out write-back performance is horribly crappy because it always starts out doing synchronous writes?

I pointed out previously in a private email that page_launder() must be buggy as it stands now, you seem to have ignored that part (and the test-program that shows 1MB/s writeout speeds due to it) completely.

The whole _point_ of the new VM was performance. Without that, the new VM is pointless, and discussing TODO features is equally pointless.

Rik replied that the page-out write-back problem was fixed in a patch he'd sent in the day before, though he warned against applying the fix immediately, since he was working on an updated patch. Linus didn't reply.

 

5. Small Patches Rejected From Stable Series (For Now)
2 Oct 2000 - 5 Oct 2000 (4 posts) Archive Link: "[PATCH] 2.2.18pre13: Small patches from Andrea"
Topics: Virtual Memory
People: Chip Salzenberg, Alan Cox

Chip Salzenberg offered:

I suppose I should let Andrea submit these, but he has such a huge patch collection (thank you!) that I thought it might be useful to pick out some of the smaller ones that would be less controversial for inclusion in the main kernel.

  • nanosleep-4
    Provide nanosleep usec resolution so that a signal flood doesn't hang glibc folks that correctly trust the rem field to resume the nanosleep after a syscall interruption. (without the patch nanosleep resolution is instead 10msec on IA32 and around 1msec on alpha)
  • tsc-calibration-non-compile-time-1
    TSC calibration must be dynamic and not a compile time thing because gettimeofday is dynamic and it depends on the TSCs to be in sync.
  • IO-wait-2
    Avoid spurious unplug of the I/O queue.
  • buf-run_task_queue
    Avoid spurious unplug of the I/O queue. (again!)
  • account-failed-buffer-tries-1
    Account also for failed buffer tries during shrink_mmap.
  • overcommit-1
    Make sure to not understimate the available memory (the cache and buffers may be under the min percent).

Alan Cox replied, "Im intentionally avoiding these right now. The 2.2.18 kernel has a very large amount of updates to drivers/extra functionality. I don't want to mix any of that with core internal changes of any kind. The VM fixes in paticular look good but would be an invitation to disaster to merge this release."

 

6. More Work On Framebuffer Boot Logo For 2.2 And 2.4
2 Oct 2000 - 4 Oct 2000 (6 posts) Archive Link: "[patch] Make linux logo centered, add margins, etc. for 2.2.17"
Topics: Framebuffer
People: Torrey Hoffman, Jason McMullan

Torrey Hoffman announced his first Linux kernel patch submission:

This is a patch to linux-2.2.17.

As you all probably know, the current framebuffer driver (fbcon.c) displays an 80x80 pixel penguin logo at the top left of the screen.

This patch modifies fbcon.c to display the linux logo centered horizontally, with optional margins (LOGO_MARGIN) above and below. The boot console displays in the remaining space.

It also cleans up the code a little to move the LOGO_W and LOGO_H defines to the linux_logo.h header file, where they ought to be.

I have successfully used this code to display a large (472x320) logo with the vesa framebuffer on i386 during boot. That only requires replacing the include/linux/linux_header.h file.

This patch is currently untested on anything else, and I would be interested in bug reports.

Note that using large logos can dramatically increase the size of your zImage kernel. Also I'm not 100% confident the patch is correct, as I'm not a kernel guru (yet). (Why does fbcon_show_logo() have a loop that looks at smp_num_cpus?)

Anyway, if you find this interesting, you may also like to know that I have updated the "glogo" gimp plugin, (which is GPL'ed and copyright (C) 1998 Jens Ch. Restemeier <jrestemeier@currantbun.com>) to support:

  • this modified linux_logo.h header format
  • logos of variable size, instead of just 80x80 pixels
  • gimp 1.1.26

If you want my version of glogo, just ask me. If this patch is considered for inclusion in the official kernel, when I've recovered from the shock I'll try to coordinate with with Jens Restemeier to keep the "official" glogo up to date.

Jason McMullan replied:

I've also developed a patch (posted just now to lkml), that does nearly the same thing for the 2.4.x series kernels. The major difference is that I use a PPM file (and have a conversion utility in linux/scripts/ called ppmtolinuxlogo.c), and mine can be configured out of the kernel from a Config.in.

Please feel free to rip out the ppmtolinuxlogo.c and put it in your patch, if you would like. Also, you may want to look at the Makefile dependencies and Config.in/Configure.help changes to improve your patch.

I also have a 2.2.x version of my patch, but I can forward you that in private email if you would like. Very little changed.

Later, Torrey posted an updated patch to correctly handle multi-CPU machines. Jason also posted his patch against 2.4.
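
The layout Torrey described (logo centered horizontally, a margin above and below, console in the remaining space) comes down to a few lines of arithmetic. The sketch below is not taken from either patch; the function name, the `margin` parameter standing in for LOGO_MARGIN, and the 16-pixel font height are assumptions for illustration.

```python
def logo_layout(screen_w, screen_h, logo_w, logo_h, margin=0, font_h=16):
    """Return (logo_x, logo_y, console_rows) for a centered boot logo.

    logo_x, logo_y: top-left pixel position of the logo.
    console_rows:   text rows left for the boot console below the logo.
    """
    logo_x = (screen_w - logo_w) // 2       # center horizontally
    logo_y = margin                         # margin above the logo
    used_px = logo_h + 2 * margin           # logo plus both margins
    rows_used = -(-used_px // font_h)       # round up to whole text rows
    console_rows = screen_h // font_h - rows_used
    return logo_x, logo_y, console_rows
```

For the 472x320 logo mentioned above on a 640x480 screen with an 8-pixel margin, this gives a logo position of (84, 8) and 9 console rows under the assumed font height.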

 

7. Previously Used Filesystem Algorithms Patented By Vendor?
3 Oct 2000 - 6 Oct 2000 (25 posts) Archive Link: "Tux2 - evil patents sighted"
Topics: BSD: FreeBSD, Disk Arrays: RAID, Microsoft, Patents, Web Servers
People: Daniel Phillips, Marty Fouts, Chris Good, Jeff V. Merkey, Thomas Graichen, Victor Yodaiken

Daniel Phillips reported trouble on the horizon:

Thomas Graichen forwarded me some interesting information from the freebsd-fsdevel list regarding 3 patents held by Network Appliance, Inc., Santa Clara, CA that seem to describe much of the mechanism that underlies Tux2. I haven't heard anything from any representative of Network Appliance, which I find very curious because they must certainly have heard of Tux2 by now. But of course when I do hear from them they will want something, and I will want them to FOAD.

http://patent.womplex.ibm.com/details?&pn=US05819292__
http://patent.womplex.ibm.com/details?&pn=US05963962__
http://patent.womplex.ibm.com/details?&pn=US06038570__

It is important that all technology used in GPL software be free of patent restrictions. Unfortunately for Network Appliances, I developed all the essential concepts they describe in 1989 (the RAID optimization excepted, see below for what I think about that) and implemented them in a production system. In other words, I've got prior art; their patents are worthless. Furthermore, I developed the entire Tux2 design and implemented most of it before I ever even heard of their software, much less their patents. And on top of that, other people also have prior art (check out Auragen, if you don't know what it is, ask Victor Yodaiken).

OK, I sense there's going to be a fight, because Network Appliance is a profit-making corporation and they would be remiss if they didn't try to defend their IP. Did I mention that software patents are evil? Did I mention that software patents make people behave in evil ways?

I'm not going to change my course at all, I'm determined to bring this better idea to Linux in a free and open way. I will continue to develop it until it's finished. Oh, and the phase tree algorithm is fundamentally superiour to their WAFL algorithm, as I will demonstrate next week in Atlanta.

I invite anyone who's interested to email me and help out. Are you a patent lawyer that likes to work for free? *Especially you*, please email me.

Now let me state my position on patents:

  • Patents are evil
  • Software patents are especially evil
  • Patents, and especially software patents, constitute nothing less than government-sponsored theft of property that properly belongs to humanity.
  • If we did not have any form of patent, humanity would be better off.
  • If we did not have any form of patent, the world economy would benefit. Yes, that means corporations too.
  • If we did not have any form of patent, *most voters would benefit* <-- pay close attention to this one
  • Patents are anti-capitalist: they interfere with the proper functioning of the market economy. Patents on business methods are already rearing their ugly head.
  • It's getting worse. If the current trend continues, you will soon see the life of patents being extended, you will see patents being granted in areas that were previously considered off-limits, and you will see countries outside the U.S. being pressured into supporting the patent system in various ways.
  • We can't change the world overnight, but we do already possess the power, if we excercise it, to send the laws that gave birth to software patents back into the cesspool they crawled out of.
  • In spite of the popular myth about the lone inventor who strikes it rich, the only real beneficiaries of patents are corporations. Yes, a few lone inventors strike it rich, but not enough to undo the damage done to humanity in general. Most lone inventors just get ripped off by people who prey on them and their dreams.
  • If all patents were to vanish today and never come back research in general would accelerate, not slow down. Linux is proof of that.
  • Lawyers built the patent system. Tim O'Rielly once asked a patent lawyer how he would feel if other lawyers could patent legal arguments and charge him money to use those arguments in court. Though he tried to twist out of answering that one, eventually he had to admit that he had no answer. This lawyer IIRC is the director of the U.S. Trade and Patent office.

OK, I'll stop ranting now. I knew it was going to happen, and not only that, this is going to happen more and more until the evil patent system is uprooted and composted.

Now, the specifc discussion:

US patent 5,819,292 "Method for maintaining consistent states of a file system and for creating user-accessible read-only copies of a file system";

ApplDate 1993-06-03 <- Four years after I did my work.

"A method is disclosed for maintaining consistent states of a file system. The file system progresses from one self-consistent state to another self-consistent state. The set of self-consistent blocks on disk that is rooted by a root inode is referred to as a consistency point. The root inode is stored in a file system information structure. To implement consistency points, new data is written to unallocated blocks on disk. A new consistency point occurs when the file system information structure is updated by writing a new root inode into it. Thus, as long as the root inode is not updated, the state of the file system represented on disk does not change."

*** I did all of this in my database program, which in fact implemented a complete filesystem on top of the existing filesystem. This was 1989.

"The method also creates snapshots that are user-accessible read-only copies of the file system. A snapshot uses no disk space when it is initially created. It is designed so that many different snapshots can be created for the same file system. Unlike prior art file systems that create a clone by duplicating an entire inode file and all indirect blocks, the method of the present invention duplicates only the inode that describes the inode file. A multi-bit free-block map file is used to prevent data referenced by snapshots from being overwritten on disk."

*** They come close to winning one here, but their multibit free map loses it for them. I do it with a single bit, and I should patent the way I do that (GPL of course). Even better, I recently figured out how to make the Tux2 snapshots *read/write*, and even *exchange* and *rotate* snapshots. Take that you evil intellectual property barons.

patent 5,963,962 "Write anywhere file-system layout";

"The present invention provides a method for keeping a file system in a consistent state and for creating read-only copies of a file system. Changes to the file system are tightly controlled. The file system progresses from one self-consistent state to another self-consistent state. The set of self-consistent blocks on disk that is rooted by the root inode is referred to as a consistency point. To implement consistency points, new data is written to unallocated blocks on disk. A new consistency point occurs when the fsinfo block is updated by writing a new root inode for the inode file into it. Thus, as long as the root inode is not updated, the state of the file system represented on disk does not change."

*** Again, I did every last bit of this in 1989.

"The present invention also creates snapshots that are read-only copies of the file system. A snapshot uses no disk space when it is initially created. It is designed so that many different snapshots can be created for the same file system. Unlike prior art file systems that create a clone by duplicating the entire inode file and all of the indirect blocks, the present invention duplicates only the inode that describes the inode file. A multi-bit free-block map file is used to prevent data from being overwritten on disk. "

*** Same thing again. They lose.

US patent 6,038,570

"Method for allocating files in a file system integrated with a RAID disk sub-system" (which isn't a patent on WAFL by itself, it's a patent on the way we try to arrange to write as many blocks in a stripe as possible).

*** Well, I'd have to call this one particularly obvious. In any event, if they want to argue I think they'll find themselves in a position of having to kiss their main patents bye-bye.
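
The general idea in the abstracts quoted above (write new data to unallocated blocks, then move from one consistent state to the next by updating a single root) can be shown with a toy model. This is neither WAFL nor Tux2's phase tree algorithm: the class, its method names, and the flattening of the whole tree into one block per version are simplifications assumed purely for illustration.

```python
class CowStore:
    """Toy copy-on-write store with a single root pointer.

    Each "block" holds a complete version of the data as a dict.
    A real filesystem would share unchanged blocks between versions.
    """

    def __init__(self):
        self.blocks = {}               # block number -> dict of key/value pairs
        self.next_free = 0             # next unallocated block number
        self.root = self._alloc({})    # currently committed root

    def _alloc(self, contents):
        """Write contents to a fresh, previously unused block."""
        block_no = self.next_free
        self.next_free += 1
        self.blocks[block_no] = dict(contents)
        return block_no

    def prepare(self, updates):
        """Build the next version in new blocks; the root is not touched."""
        new_contents = dict(self.blocks[self.root])
        new_contents.update(updates)
        return self._alloc(new_contents)

    def commit(self, new_root):
        """Switch to the new version in one step (the consistency point)."""
        self.root = new_root

    def read(self, key):
        """Read from the committed version."""
        return self.blocks[self.root].get(key)

    def read_at(self, root, key):
        """Read from an older root, which acts as a read-only snapshot."""
        return self.blocks[root].get(key)
```

If a crash happens after `prepare()` but before `commit()`, the root still points at the old version, so the stored state is unchanged. Keeping an old root number around is what makes a snapshot cost nothing at creation time in this model.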

Marty Fouts replied:

I would refer anyone interested in 'prior art' in patents to http://www.ipmall.fplc.edu/ipcorner/bp98/welch.htm especially the brief discussion on what 'prior art' is to the patent office. Also, for those who believe that similar concepts will void patents, I would suggest a search of the IP literature on the topic of 'narrowly defined.'

As to whether or not Network Appliance's patents would hold up in court, I offer two contradictory opinions:

Factoid: 90% of all patents are never challenged, while 80% of those that are are overturned.

and

"Going into court is throwing the dice."

To Marty's factoid, Daniel replied, "In otherwords, 2% of patents are successfully defended, just enough to keep the serfs in line."
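
Daniel's 2% figure follows directly from Marty's two numbers: 10% of patents are challenged, and 20% of those survive. A small sketch of that arithmetic, with a made-up function name:

```python
def fraction_successfully_defended(never_challenged, overturned_when_challenged):
    """Fraction of all patents that are challenged and still upheld."""
    challenged = 1.0 - never_challenged
    upheld = 1.0 - overturned_when_challenged
    return challenged * upheld
```

Calling it with 0.90 and 0.80 gives approximately 0.02, that is, 2% of all patents.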

Chris Good also replied to Daniel's initial post, saying that it seemed the two algorithms were sufficiently different not to cause a problem; also that Network Appliances catered to the BSD market and would not like to alienate the free software community; and also, "Netapp are a pretty nice bunch and chasing someone doing GPL code isn't their style." Daniel replied, "Netapp may be great guys but they have still claimed to own something that properly belongs to me and the rest of humanity." But he added:

I retract anything I may have said that might reflect negatively on Netapp. I haven't met them, I know very little about them.

I wasn't calm yesterday. I suspected such patents might exist ever since I first heard of WAFL last spring. When I actually saw the patent abstracts I became angry, not angry at Netapps but at the whole barbaric patent system. Netapps had no choice but to apply for their patent and I have no choice but to confront it.

If they are really great guys then let them prove it by licencing their patents for unrestricted use in GPL-compatible code. They don't have a lot to lose: my work is superior and GPL, and directly based on work I did in 1989. They should realize that their main patents aren't worth much now except to lawyers, so they might as well collect some good karma by making their licence GPL-compatible.

It was suggested to me privately that I contact Netapp and show them my algorithm. That seems to me to be a very good idea.

Elsewhere at some point, Jeff V. Merkey said, "I am having Andrew McCullough review these patents to determine if there are any infringement issues that may affect us. Whomever is concerned her, if it would not be too much trouble, please forward what documentation and patent no.'s to wandrew@timpanogas.org and copy me at jmerkey@timpanogas.org and we will forward them to Malinkrodt & Malinkrodt in Salt Lake City. I'll pay them to do a patent infringment analysis, and post their analysis to interested/affected parties." Daniel gave links to the abstracts for Method for maintaining consistent states of a file system and for creating user-accessible read-only copies of a file system (http://164.195.100.11/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1='5819292'.WKU.&OS=PN/5819292&RS=PN/5819292) , Write anywhere file-system layout (http://164.195.100.11/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1='5963962'.WKU.&OS=PN/5963962&RS=PN/5963962) , and Method for allocating files in a file system integrated with a RAID disk sub-system (http://164.195.100.11/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=/netahtml/search-adv.htm&r=1&f=G&l=50&d=PALL&p=1&S1=6038570&OS=6038570&RS=6038570) . Later on, Jeff explained, "I've allocated $20,000 US for this, but i doubt you will use all of it. Linux IP issues affect all of us since we ship Linux, so I am happy to pick up the tab. Tux is hot stuff, and we plan to use it, along with all the other great Linux stuff. Consider it our part to help Linux. The structure of TRG is modeled a lot like Microsoft -- we are actually a law firm thinly disguised as a software company (just like MS) snicker... snicker .... :-)"

 

8. 'kapmd' Hogging The CPU?
3 Oct 2000 - 9 Oct 2000 (7 posts) Archive Link: "kapmd cpu usage"
People: Stephen Rothwell, Jeff Garzik

Geoffrey Gallaway noticed that in 2.4.0-test9, the 'kapmd' daemon seemed to be using 70% - 80% of his CPU. This didn't seem right to him, and he posted to the list. Stephen Rothwell and Jeff Garzik said this was normal, and didn't mean what he thought it meant. As Stephen put it, "the kapmd is doing the job of the idle loop. The processor is almost always asleep, but the time just gets accounted to kapmd." Jeff added, "Richard, please add this to the FAQ."

 

9. Support For CS89x0-Based PCMCIA Cards
5 Oct 2000 - 6 Oct 2000 (5 posts) Archive Link: "[PATCH] Support for CS89x0 based PCMCIA cards"
People: Peter De Schrijver, Andrew Morton, Jason Gunthorpe, David Hinds

Peter De Schrijver posted a patch and announced, "Attached you will find a patch which adds support for CS89x0 base PCMCIA cards such as the IBM EtherJet. The code is based on the existing CS89x0 driver. Tests where done on an IBM A20m with the IBM EtherJet card. Future improvements include better integration with the non PCMCIA CS89x0 code and patches for other cards based on the same chipset." Andrew Morton was very impressed, but added:

Did you know that Danilo Beuche has written a Card Services driver for this device? An old version of that driver currently resides somewhere in the `contrib' area of http://pcmcia-cs.sourceforge.net (sorry, I don't have the exact URL - the internet has stopped). His current page is at http://www.first.gmd.de/~danilo/pc-driver/ and his driver seems fairly current for kernel 2.2.

Could I suggest that you review Danilo's driver against yours? He may have additional device support, etc.

Once you've done that I'd suggest that we use a common header file between the ISA and PCMCIA drivers. I'm open to suggestions on whether you think a common .c file is appropriate.

Unfortunately your work is based on an older version of the 2.4 driver and it doesn't include some EEPROM and IRQ work from Jason Gunthorpe.

More unfortunately, the 2.4.0-test9 cs89x0.c isn't up-to-date. I have a few minor changes here which I wasn't planning on submitting until post-2.4.0.

I'd also suggest that we ask David Hinds to review your driver - I'm not at all familiar with the Card Services interfaces.

Anyway, I'll send you the diffs against the current cs89x0.c and we can work this off-list a bit, see where it ends up.

Peter replied, "I knew about the existance of Danilo's driver, but I didn't know there was a 2.4.x release. As I wanted to use the card under 2.4 and I thought it would be better to share as much code as possible between drivers for different cards based on the CS89x0 chipset, I made a driver based on the 2.4.0-test9 CS89x0 driver for ISA cards." He agreed with the rest of Andrew's comments, and they took the discussion off-list.

 

10. 'Phase Tree' Algorithm Defined In Preparation For Legal Action
5 Oct 2000 - 6 Oct 2000 (6 posts) Archive Link: "Phase tree algorithm defined"
Topics: FS: NTFS, FS: ext3, Patents, Web Servers
People: Daniel Phillips, Richard Moore

The introduction of a 'Phase tree' algorithm was first covered in Issue #80, Section #1 (5 Jul 2000: ext3-0.0.2f Released; Consistency Checkers; New "Phase Tree" Algorithm). This week, Daniel Phillips learned of patents that might affect his work (see Issue #89, Section #7 (3 Oct 2000: Previously Used Filesystem Algorithms Patented By Vendor?)), and announced:

I have finally produced something resembling a formal definition of the phase tree algorithm. As you will see, this algorithm is somewhat subtle, and not easy to express in clear simple terms. But I think that I have in fact expressed it clearly in simply. If I have not, I wish very much to be told so, and why.

You can get a copy here:

http://innominate.org/~phillips/tux2/phase.tree.algorithm.txt

Please, if you are especially anal and nasty and have little regard for anyone's feelings, read this and complain about every little thing that is wrong with it, and I will greatly appreciate that. I will also appreciate comments of the form 'you left out this or that', or 'this part sounds like so much bafflegab' and so on.

Richard Moore was impressed, but suggested adding some diagrams to the description, to illustrate the structure. He added, "Also, I'd like to understand how the Phase Tree differs from other tree schemes used by files systems, for example the Modified Patricia Tree used by HPFS and NTFS. It wasn't quite clear to me how the advantages of consistency are obtained, but diagrams might help." Daniel replied that he'd go back and try to make the document clearer, and added that he was currently preparing slides for the Atlanta Linux Symposium. As for how the phase tree algorithm might compare to those used by HPFS and NTFS, Daniel replied, "Um, I haven't got a clue because I don't know anything about the internals of either of them. Do you have some relevant details you can supply?" Richard wasn't sure of the details, but hoped to meet up with Daniel at ALS.

 

11. 'procfs' Interface
5 Oct 2000 (3 posts) Archive Link: "procfs info"
Topics: FS: procfs
People: Jeff Garzik, George Anzinger

George Anzinger asked if there were any docs on how to code for 'procfs', and Jeff Garzik replied:

There is no documentation for the -exported- procfs interface as far as I know. As for internal interfaces, who knows what you are asking...

Here's a rough outline: (maybe somebody should clean this up and stick it into Documentation/*)

* Drivers without MAJOR /proc interfaces should stick their procfs files/directories into /proc/driver/*

* Use proc_mkdir to create directories. For symlinks, proc_symlink, for device nodes, proc_mknod. Note that only proc_mknod takes a permission (mode_t) argument. If you need special permissions on directories, use create_proc_entry with S_IFDIR in mode_t arg. Otherwise directories will be mode 0755.

* Use create_proc_read_entry for your procfs "files." For anything more complex than simply reading, use create_proc_entry. If you pass '0' for mode_t, it will have mode 0644 (ie. normal file permissions).

* Use remove_proc_entry for removing entries.

* Pass NULL for the parent dir, if you are based off of /proc root.

* You don't need to keep around pointers to your procfs directories and files. Just call remove_proc_entry with the correct (full) path, relative, to procfs root, and the right thing will happen.

Cheesy init example:

if (!proc_mkdir("driver/my_driver", NULL))
     /* error */
if (!create_proc_read_entry("driver/my_driver/foo", 0, NULL, foo_read_proc, NULL))
     /* error */
if (!create_proc_read_entry("driver/my_driver/bar", 0, NULL, bar_read_proc, NULL))
     /* error */

Cheesy remove example:

remove_proc_entry ("driver/my_driver/bar", NULL);
remove_proc_entry ("driver/my_driver/foo", NULL);
remove_proc_entry ("driver/my_driver", NULL);

In the above examples, I'm pretty sure that the proc_mkdir call, and final remove_proc_entry, can be skipped, too....

 

12. Duplicate Messages Appearing On 'linux-kernel'
5 Oct 2000 - 6 Oct 2000 (8 posts) Archive Link: "Majordomo Problems?"
People: Matti Aarnio, Frank van Viegen, Erik Mouw, James W. Laferriere, Torben Mathiasen

Mark Post reported receiving multiple copies of email sent to 'linux-kernel', and James W. Laferriere, Sven Krohlas, Torben Mathiasen and Erik Mouw reported the same problem. Erik pegged the problem as a mail loop, and Matti Aarnio also said, "The list has been experiencing loop via somebody. The likely suspect is now deleted from the list, and it remains to be seen if that helped." Finally, Frank van Viegen reported, "Sorry, I caused this problem. The repeated messages were send out by a mail processing script I've actually been using for quite some time, but demonstrated to contain some 'hidden features' when confronted with LK-mail." He apologized for the inconvenience, and the thread ended.

 

13. 'minixfs' Exploit; Some Discussion Of Function Return Values
8 Oct 2000 (7 posts) Archive Link: "2.4.0-test9: minixfs causing oopsen when out of inodes"
Topics: FS: ext2
People: Russell King, Linus Torvalds, Daniel Phillips, Alexander Viro

Russell King reported a reproducible way to trigger an oops in 'minixfs'. He described:

I can cause them by doing the following:

  1. Create a small minix ramdisk
  2. mount it
  3. fill it up touching lots of files until it complains about "No space left on device"
  4. touch another file - it still complains about no space, and no oops.
  5. touch *the same file* again as you did in (4), and receive an oops for your efforts.

Note that you can repeat (4) as many times as you like, and it will always complain with the same error, but no oops. Also note that in step 5, if you stat the file, you also get an oops.

He traced the problem to an uninitialized return value, and posted a patch to fix it. Linus Torvalds replied:

Patch looks fine, and applied.

HOWEVER, it does show once again, that the way these errors are passed around is just fundamentally broken. It would probably be better to follow the _real_ rules for error handling, which are:

ALWAYS _RETURN_ THE ERROR.

Having return values passed by an argument pointer is broken. ALWAYS.

This, btw, is why Linux returns error numbers as -Exxx instead of using "-1" and "errno" - I dislike the latter enormously.

This is also why the VFS layer tends to use ERR_PTR/PTR_ERR/IS_ERR: it makes it very easy to pass back error information, and it makes it very hard to do it wrong. I suspect both minix and ext2 would be better off using that convention instead.

It's not worth changing at this point, but for future reference it would probably be much preferable to return the error code instead of the horrible "error value through pointer access" method, which is usually rather inefficient too.
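
For readers unfamiliar with the ERR_PTR/PTR_ERR/IS_ERR convention Linus mentions, here is a minimal user-space sketch of the idea. The three helpers are simplified stand-ins written for this illustration, not the kernel's own definitions (which may differ in detail), and the slot table and alloc_slot() function are purely hypothetical. The point is that a single pointer return value carries either a valid object or a negative errno code, so no separate error argument is needed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified stand-ins for the kernel helpers (assumption: the top
 * MAX_ERRNO addresses of the address space never hold a valid object).
 */
#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the negative errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if the pointer falls in the range reserved for error codes. */
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical fixed-size table standing in for a filesystem's inodes. */
struct inode_slot {
    int ino;
    int in_use;
};

#define NSLOTS 2
static struct inode_slot table[NSLOTS];

/*
 * Hypothetical allocator: returns a free slot, or ERR_PTR(-ENOSPC)
 * when the table is full. The error travels in the return value
 * itself rather than through an extra "int *err" argument.
 */
static struct inode_slot *alloc_slot(void)
{
    for (int i = 0; i < NSLOTS; i++) {
        if (!table[i].in_use) {
            table[i].in_use = 1;
            table[i].ino = i + 1;
            return &table[i];
        }
    }
    return ERR_PTR(-ENOSPC);
}
```

A caller checks IS_ERR() on the result and, on failure, propagates PTR_ERR() upward; there is no out-parameter that a forgotten assignment could leave uninitialized, which is the class of bug the minixfs patch fixed.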

Daniel Phillips suggested, "It would be nice if we could return a struct consisting of the error and result." There were several replies. Alexander Viro said it would be very inefficient on some platforms, and added, "Functions returning structures are in a very dark area of C - let somebody else break their necks debugging the compilers." Linus also replied to Daniel:

Struct returns are used by the page table handling, and it works. There it's used for other reasons (better encapsulation and type-checking).

However, it generates reasonably efficient code only for the single-value struct case (the case for most of the page tables out there), and the syntax ends up being absolutely horrible for multi-values. Not worth it.
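
To make the two cases in this exchange concrete, here is a rough C sketch. All names in it (demo_pte_t, demo_pgd_t, make_pte, pte_frame, lookup_result, lookup) are hypothetical and chosen for illustration; the 12-bit shift is likewise an assumption and is not taken from any kernel header. The first part shows a single-member struct used purely for type-checking, in the spirit of the page table types Linus refers to; the second shows the multi-value "error plus result" struct along the lines Daniel suggested.

```c
#include <assert.h>
#include <errno.h>

/*
 * Single-value structs: wrapping an integer in a struct makes the
 * compiler reject accidental mixing of the two types, even though
 * both hold just an unsigned long.
 */
typedef struct { unsigned long val; } demo_pte_t;
typedef struct { unsigned long val; } demo_pgd_t;

/* Build an entry from a frame number and flag bits (12-bit shift assumed). */
static inline demo_pte_t make_pte(unsigned long frame, unsigned long flags)
{
    demo_pte_t pte = { (frame << 12) | flags };
    return pte;
}

/* Extract the frame number again. */
static inline unsigned long pte_frame(demo_pte_t pte)
{
    return pte.val >> 12;
}

/*
 * Multi-value struct return: error and result travel together.
 * Every caller has to unpack both fields by name.
 */
struct lookup_result {
    int error;   /* 0 on success, negative errno on failure */
    long value;  /* meaningful only when error == 0 */
};

static struct lookup_result lookup(int key)
{
    struct lookup_result r = { 0, 0 };

    if (key < 0)
        r.error = -EINVAL;
    else
        r.value = key * 2L;
    return r;
}
```

Passing a demo_pgd_t to pte_frame() is a compile-time error, which is the encapsulation benefit. The multi-value form works too, but each call site needs a temporary and a field access for both members, which gives a flavor of the syntax cost Linus objects to.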

 

14. First Pre-Release Before 2.4.0-test10
9 Oct 2000 (1 post) Archive Link: "test10-pre1"
Topics: Disks: IDE, Kernel Release Announcement, OOM Killer, SMP, USB, Virtual Memory
People: Linus Torvalds, Arnaldo Carvalho de Melo, Ivan Kokshaysky, Rik van Riel, Randy Dunlap, Andrzej Krzysztofowicz, Vojtech Pavlik, Russell King, Roman Zippel, Brian Gerst, Torben Mathiasen, Andre Hedrick, Jean Tourrilhes, Jaroslav Kysela, Roger Larsson

Linus Torvalds announced 2.4.0-test10-pre1, saying:

Largely VM balancing and OOM things (get rid of the VM livelock that existed in test9), and USB fixes.

And a number of random driver fixes (SMP locking on network drivers, what not).

  • pre1:
    • Roger Larsson: ">=" instead of ">" to make the VM not get stuck.
    • Gideon Glass: brw_kiovec() failure case oops fix
    • Rik van Riel: better memory balancing and OOM killer
    • Ivan Kokshaysky: alpha compile fixes
    • Vojtech Pavlik: forgotten ENOUGH macro in via82cxxx ide driver
    • Arnaldo Carvalho de Melo: acpi resource leak fix
    • Brian Gerst: use mov's instead of xchg in kernel trap entry
    • Torben Mathiasen: tlan timer being added twice bug
    • Andrzej Krzysztofowicz: config file fixes
    • Jean Tourrilhes: Wavelan lockup on SMP fix
    • Roman Zippel: initdata must be initialized (even if it is to zero: gcc is strange)
    • Jean Tourrilhes: hp100 driver lockup at startup on SMP
    • Russell King: fix silly minixfs uninitialized error bug
    • (various): fix uid hashing to use "uid_t" instead of "unsigned short"
    • Jaroslav Kysela: isapnp timeout fix. NULL ptr dereference fix.
    • Alain Knaff: fdformat should work again.
    • Randy Dunlap: USB - fix bluetooth, acm, printer, serial to work with urb->dev changes.
    • Randy Dunlap: USB whiteheat serial driver firmware update.
    • Randy Dunlap: USB hub memory corruption and pegasus driver update
    • Andre Hedrick: IDE Makefile cleanup

There was no reply.

 

We Hope You Enjoy Kernel Traffic
 

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License, version 2.0.