Kernel Traffic #92 For 6 Nov 2000

By Zack Brown


Many thanks go to Ariel Faigon, for a patch to the KC compiler, to make the printer-friendly pages not display duplicate URLs from hrefs. Thanks, Ariel!

Mailing List Stats For This Week

We looked at 1144 posts in 4839K.

There were 366 different contributors. 174 posted more than once. 141 posted last week too.

The top posters of the week were:

1. Large Memory Support For Intel Systems

11 Oct 2000 - 25 Oct 2000 (31 posts) Archive Link: "large memory support for x86"

Topics: Big Memory Support, Virtual Memory

People: Tigran Aivazian, Richard B. Johnson, Jonathan George, Jeff Epler, Oliver Xymoron

Kiril Vidimce asked about Intel systems with greater than 4G of RAM. In particular, he wanted to know if a process could actually allocate such large quantities, and whether (since high-memory support was only an option in kernel compiles, and was not simply part of a standard compile) there was any performance penalty to consider. Tigran Aivazian replied, "Linux does support 64G of physical memory. My machine has 6G RAM and runs absolutely nice and smooth as it should. Everything "just works"." [...] "As for PAE, yes it does incur penalty of about 3-6% of performance "overall". By overall performance I meant the unixbench numbers. I have published the numbers comparing PAE/4G/nohighmem kernels on the same machine sometime ago... So, it only makes sense to enable PAE if you have more than 4G of memory." Oliver Xymoron also replied to Kiril, saying that allocating such large amounts of memory required doing some tricks, since Intel systems still used a flat 32-bit (4G) address space. Richard B. Johnson qualified, "per process. Which means, in principle, that one could have 100 processes that are accessing a total of 400 Gb of virtual memory." Jonathan George suggested to Kiril, "You can, of course, bank switch memory by using a shared segment and cloning additional process heaps. Obviously a _single_ 32bit address space can only access 4GB at a time." Jeff Epler offered his explanation:

Pointers are still 32 bits on x86, and the visible address space for any particular process is still somewhat less than 4G.

I believe that if you select Linux on Alpha that you can have more than 4G per process, but that may or may not be true.

What the support for >4G of memory on x86 is about, is the "PAE", Page Address Extension, supported on P6 generation of machines, as well as on Athlons (I think). With these, the kernel can use >4G of memory, but it still can't present a >32bit address space to user processes. But you could have 8G physical RAM and run 4 ~2G or 2 ~4G processes simultaneously in core.

There may or may not be some way to support an abomination like the old "far" pointers in DOS (multiple 4G segments), but I don't think it has been written yet.

2. Some Benchmarks Comparing 2.2 With 2.4

23 Oct 2000 - 24 Oct 2000 (7 posts) Archive Link: "LMbench 2.4.0-test10pre-SMP vs. 2.2.18pre-SMP"

Topics: Networking, Virtual Memory

People: Linus Torvalds, Jeff Garzik, Chris Evans

Jeff Garzik ran 'lmbench' to compare 2.4.0-test10-pre3 and 2.4.0-test10-pre4 with 2.2.18-pre17, and reported the results. Chris Evans took a look at the data and remarked that for local communication latencies, 2.4 was much slower in pipe/AF UNIX latencies, to the point of being broken. But for file and virtual memory latencies, mmap() latencies in 2.4 were only 1/7 of what they'd been in 2.2; and for local communication bandwidth, he noted that TCP bandwidth had improved in 2.4, while pipe bandwidth had slipped a bit.

Linus Torvalds replied to Chris' characterization of AF UNIX latencies as "broken", saying:

Not really.

The issue is that under 2.4.x, the Linux scheduler is very actively trying to spread out the load across multiple CPU's. It was a hard decision to make, exactly because it tends to make the lmbench context switch numbers higher - the way lmbench does context switch testing is to pass a simple packet back and forth between pipes, and it's actually entirely synchronous. So you get the best lmbench numbers when both the receiver and the sender stay on the same CPU's (mainly for cache reasons: everything stays on the same CPU, no spinlock movement, and no pipe data movement from one CPU to another).

This was something people (mainly Ingo) were very conscious of when doing scheduler changes: lmbench was something that we all ran all the time for this, and it was not pretty to see the numbers go up when you schedule on another CPU.

But in real life, the advantage of spreading out is actually noticeable. In real life you don't tend to have quite as synchronous a data passing: you have the pipe writer continue to generate data, while the pipe reader actually _does_ something with the data, and spreading out on multiple CPU's means that this work can be done in parallel.

Oh, well.

(I don't actually know why AF UNIX went up, but it might be the same issue. The new networking code is fairly asynchronous, which _really_ improves performance because it is able to actually make good use of multiple CPU's unlike the old code, but probably for similar reasons as the pipe latency thing this is bad for AF UNIX. I don't think anybody has bothered looking that much: a lot more work has been put into TCP than into AF UNIX).

Of the file and VM latency improvement, Linus said, "This is due to all the VM layer changes. That mmap latency thing is basically due to the new page-cache stuff that made us kick ass on all the benchmarks that 2.2.x was bad at (ie mindcraft etc)."

3. Cleaning Up Internal Data Structures

24 Oct 2000 - 30 Oct 2000 (37 posts) Archive Link: "PATCH: killing read_ahead[]"

Topics: SMP

People: Martin Dalecki, Rik van Riel, Jeff Garzik, Linus Torvalds, Ingo Oeser, Jeff V. Merkey, Alexander Viro

Martin Dalecki posted a lengthy patch, and remarked, "Please have a look at the following patch and feel free to be scared by the fact how UTTERLY BROKEN and ARBITRARY the current usage of the read_ahead[] array and during the whole past decade was! If you really care about clean internal interfaces this should be one of those prio number ONE targets to shoot at..." Jeff V. Merkey was skeptical that such a large patch might break things, especially this late in the development series. He tested it out and felt that there were some sort of problems on SMP machines. Elsewhere, Rik van Riel said, "Ideally we should (IMHO) get rid of all MAX_BLKDEV arrays. They take up too much memory on small systems and aren't big enough for big systems..." and Martin agreed. Jeff Garzik offered, "I agree with you and Rik that this array needs to go away... but ripping out the feature is not the answer, IMHO." Linus Torvalds replied, "Actually, the _real_ answer is to make fs/block_dev.c use the page cache instead - and generic_file_read() does read-ahead that actually improves performance, unlike the silly contortions that the direct block-dev read-ahead tries to do." Ingo Oeser replied, "If we had a paper about the page cache this would be easy." And Jeff V. M. said, "I hope we are not doing something stupid here, like breaking the f*!%cking page cache again. I've finally got all the bugs out of NWFS on 2.4.0-test9, and have waded through the breakage of the past two testX releases of 2.4." There followed a medium to long implementation discussion, with numerous patches and technical points, primarily involving Rik, Jeff V. M. and Alexander Viro.

4. Data Loss For Big Files Over NFS In 2.2 And 2.4

25 Oct 2000 - 26 Oct 2000 (13 posts) Archive Link: "nfsv3d wrong truncates over 4G"

Topics: FS: NFS

People: Matti Aarnio, Andrea Arcangeli, Trond Myklebust

Andrea Arcangeli noticed that in 2.2 and 2.4, nfsd kernel-based servers would wrongly truncate files when using offsets greater than 4G. He posted a short patch to the server-side code, but Matti Aarnio didn't think it was any of the server's business. He said, "Let the CLIENT to handle the O_LARGEFILE testing, and let the SERVER to just assume it being the situation." Andrea replied, "I don't follow. The patch avoids to lose the high 32bit of information during the setattr call. I'm not limiting anything, I'm just allowing the server to see the whole information that comes from the client." Matti had also asked if the O_LARGEFILE flag could be passed over NFSv3 in the first place, and Trond Myklebust replied, "All NFSv3 operations are 64-bit and LFS-compliant. There's therefore no need for an O_LARGEFILE flag."

5. 'Shared Memory' Listed In '/proc/meminfo' But No Value Calculated

26 Oct 2000 (6 posts) Archive Link: "2.4test9-pre5 shared memory?"

People: Craig Schlenter, Stephen Clark, Steven Cole

Stephen Clark noticed that in 2.4.0-test9-pre5, '/proc/meminfo' listed the value of shared memory as 0. Steven Cole confirmed this on 2.4.0-test10-pre5, and Craig Schlenter said, "This has been "broken" since early in 2.3.x (x==13 I think). Apparently it is costly to keep these sorts of stats so it's not done anymore." Stephen Clark suggested that shared memory shouldn't be listed at all in that case, and Craig replied, "Probably not. There may be tools that rely on it existing that may break if it goes away altogether." This made sense to Stephen Clark, and the thread ended.

6. Possible Bug In VIA vt82c686a Chip

26 Oct 2000 - 30 Oct 2000 (31 posts) Archive Link: "Possible critical VIA vt82c686a chip bug"

Topics: Disks: IDE, Disks: SCSI, PCI

People: Vojtech Pavlik, Bart Hartgers, Yoann Vandoorselaere, Martin Mares, Shane Shrybman, Richard B. Johnson, Crutcher Dunnavant

Vojtech Pavlik reported that his system time would go crazy during heavy disk activity, on his VIA SuperSouth (vt82c686a) chip (ISA bridge revision 0x12, silicon rev CD) on a FIC VA-503A rev 1.2; Yoann Vandoorselaere, Shane Shrybman, and Crutcher Dunnavant all confirmed the problem, and Vojtech asked for some information from the 'lspci' command. Yoann complied, and Vojtech remarked, "Oh, this is a newer revision than mine (silicon CF or CG) - I'd expect that one not to have the problem ... well, it seems it's more widespread than I expected."

Elsewhere, under the Subject: Re: Possible critical VIA vt82c686a chip bug (private question), some more information from Yoann (his was a SCSI system) led Vojtech to conclude, "If it's caused by SCSI as well (might be), then it's not caused by heavy IDE activity but rather than that it could be heavy BusMastering activity instead (The IDE chip does BM as well). I'm still wondering if it could be a Linux kernel bug (bad/concurrent accesses to the i8253 registers), this has to be checked." Bart Hartgers also suspected the chip might not be the problem, and remarked, "I ran into something similar a while ago, when I mixed the two arguments to an outb in a driver, and ended up writing MYPORT into the timer instead of 0x40 into MYPORT." He asked how sure Vojtech was that it was a chip bug, and Vojtech replied, "I'm *not* sure. It just looks like a reasonable explanation. It doesn't happen on Intel chips and older VIA chips, it only happens on new VIA chips, and the code is the same all the time. Also, it happens both with 2.2 and 2.4 kernels ..."

Richard B. Johnson gave a good try at identifying the problem, and thought he'd found a spot in the IDE code where the timer was getting confused, but Vojtech replied that that particular code was '#if'ed out by the preprocessor, "So this is not our problem here. Anyway I guess it's time to hunt for i8259 accesses in the kernel that lack the necessary spinlock, even when they're not probably the cause of the problem we see here." At one point, Yoann reported some test results, in which he found that the problem was present even when the IDE subsystem had been disabled. He concluded it was definitely not an IDE problem, and Vojtech said, "So now it seems that possibly enough PCI traffic / busmastering traffic can cause the problem ..." He wondered aloud whether the problem would also occur when there was plenty of RAM still available, and Yoann confirmed that it would indeed, remarking, "my system was loaded, but was usable (at least until the problem occurred)..."

Elsewhere, Martin Mares suggested, "what about trying to modify your work-around code to make it attempt to read the timer again? This way we could test whether it was a race condition during timer read or really timer jumping to a bogus value." But Vojtech eliminated that possibility, saying:

Actually if I don't reprogram the timer (and just ignore the value for example), the work-around code keeps being called again and again very often (between 1x/minute to 100x/second) after the first failure, even when the system is idle.

When reprogramming, next failure happens only after stressing the system again.

So it's not just a race, the impact of the failure on the chip is permanent and stays till it's reprogrammed.

At one point Yoann suggested that VIA must not be aware of the problem, since it persisted across so many different chip versions. He concluded that the problem would probably not be seen under Windows, since VIA would almost certainly have been made aware of it in that case. And Vojtech ended the thread with, "It can't happen under Windows, because Windows timer runs at 18 Hz (timer programmed to 65535), while Linux uses 100 Hz (timer programmed to approx 11920), so when the timer unprograms itself due to the bug to 65535, only Linux notices it, Windows can't."

7. Some Explanation Of Kernel Naming Conventions

27 Oct 2000 (2 posts) Archive Link: "Full preemption issues"

People: George Anzinger, Linus Torvalds

George Anzinger asked about naming conventions in the kernel. Specifically, he asked, "We note that the kernel uses "_" and "__" prefixes in some macros, but can not, by inspection, figure out when to use these prefixes. Could you explain this convention or is this wisdom written somewhere?" Linus Torvalds replied, "The "wisdom" is not written down anywhere, and is more a convention than anything else. The convention is that a prepended "__" means that "this is an internal routine, and you can use it, but you should damn well know what you're doing if you do". For example, the most common use is for routines that need external locking - the version that does its own locking and is thus "safe" to use in normal circumstances has the regular name, and the version of the routine that does no locking and depends on the caller to lock for it has the "__" version."

8. 'ide-patch' For 2.2.18

27 Oct 2000 - 29 Oct 2000 (5 posts) Archive Link: "[ANNOUNCE] ide-patch for 2.2.18(pre)"

Topics: Disk Arrays: RAID, Disks: IDE, FS: NFS

People: Bartlomiej Zolnierkiewicz, Andre Hedrick, Adrian Bunk

Bartlomiej Zolnierkiewicz announced:

I have ported ide-patch to 2.2.18-17 and I'm now backporting 2.4.0 changes. New VIA, SLC, OSB4 drivers and MANY other things are already there. I hope that final 2.2.18-ide-patch will have IDE functionality equal to this in 2.4.0-test10...

Here is a snapshot (it's not thoroughly audited and tested):

And please cut that bullshit about ide-patch 2.2.x being unmaintained. I don't use 2.2.x kernels anymore so I don't do ide-patches for pre kernels. But there will be patches for stable 2.2.x. (Although it's a real pain - I hate doing backporting instead of new stuff).

Eyal Lebedinsky asked how this would behave with the RAID patch, but there was no reply. Adrian Bunk modified Bartlomiej's patch to go cleanly against 2.2.18pre18, and gave a pointer to his version, but there was no reply. Someone else asked if this could be backported to 2.2.17, but there was no reply.

Andre Hedrick replied to Bartlomiej's initial post with, "The point is that I have stopped with the backport because of 2.4.0 push, and I was waiting on you to pick it up again." There was no reply, but for more on the situation of the IDE patch, see Issue #91, Section #3 (15 Oct 2000: 'IDE-patch' In 2.2).

9. Possible GPL Violations In Kernel Source

28 Oct 2000 - 30 Oct 2000 (8 posts) Archive Link: "Linux-2.4.0-test9 not Open Source"

Topics: Networking, Patents

People: Mark Spencer, David Woodhouse, Gregory Maxwell, Jeff V. Merkey, Alan Cox

Mark Spencer found this suspicious-looking text in the NFTL (NAND Flash Translation Layer) driver source:

The contents of this file are distributed under the GNU Public Licence version 2 ("GPL"). The legal note below refers only to the _use_ of the code in some jurisdictions, and does not in any way affect the copying, distribution and modification of this code, which is permitted under the terms of the GPL.

Section 0 of the GPL says:

"Activities other than copying, distribution and modification are not covered by this License; they are outside its scope."

You may copy, distribute and modify this code to your hearts' content - it's just that in some jurisdictions, you may only _use_ it under the terms of the licence below. This puts it in a similar situation to the ISDN code, which you may need telco approval to use, and indeed any code which has uses that may be restricted in law. For example, certain malicious uses of the networking stack may be illegal, but that doesn't prevent the networking code from being under GPL.

In fact the ISDN case is worse than this, because modification of the code automatically invalidates its approval. Modification, unlike usage, _is_ one of the rights which is protected by the GPL. Happily, the law in those places where approval is required doesn't actually prevent you from modifying the code - it's just that you may not be allowed to _use_ it once you've done so - and because usage isn't addressed by the GPL, that's just fine.


LEGAL NOTE: The NFTL format is patented by M-Systems. They have granted a licence for its use with their DiskOnChip products:

"M-Systems grants a royalty-free, non-exclusive license under any presently existing M-Systems intellectual property rights necessary for the design and development of NFTL-compatible drivers, file systems and utilities to use the data formats with, and solely to support, M-Systems' DiskOnChip products"

A signed copy of this agreement from M-Systems is kept on file by Red Hat UK Limited. In the unlikely event that you need access to it, please contact for assistance.

Mark felt that the ISDN case was not a GPL violation, "because the authors of the code do not place any additional restrictions on the GPL whatsoever, they simply bring it to your attention that using an un-certified ISDN stack may be illegal in some countries." But he went on:

let's look at the rest of the NFTL restriction. I've already brought this to the attention, of course, of RMS and ESR.

Richard believes that this violates the GPL because it places additional restrictions not found in the GPL.

In any case, it seems pretty obvious that this restriction violates section 6 of the Open Source Definition which states:

"The license must not restrict anyone from making use of the program in a specific field of endeavor...."

In this case, the field of endeavor is to use it with another vendor's product.

David Woodhouse (the author of the code in question) and Alan Cox explained that the text was not an additional restriction, and was only informational. As David put it:

I am the author of the code in question. I have placed it under GPL. I do not place any additional restrictions on the GPL whatsoever, I simply bring it to your attention that using it on some hardware may be illegal in some countries.

Besides, you misquoted. What the GPL in fact says is:

"You may not impose any further restrictions on the recipients' exercise of the rights granted herein."

So in order for it to be a problem:

  1. _I_ must impose the restriction. Not your local laws.
  2. The right in question (in this case, the right to use the NFTL code) must be 'granted [t]herein'.

So the NFTL is fine on both counts, then, because:

  1. I don't.
  2. It isn't (see below...)

Reading on will explain my answer to #2...

"Activities other than copying, distribution and modification are not covered by this License; they are outside its scope."

That's good. Not only is the right to use not explicitly granted therein, but the GPL is fairly explicit about the fact that it's not relevant.

Immediately after that quote, it does say:

"The act of running the Program is not restricted, ..."

Note the wording "is not", not "must not be" - it is saying that the GPL does not of itself place restrictions on the act of running the program - which serves to verify what it said above, about being _only_ to do with "copying, distribution and modification".

He went on:

I have discussed this with Richard before. I disagree, for the reasons stated above:

  1. The licence does _not_ restrict the use of the code. Your local laws do. This is _precisely_ the same as the ISDN situation. Also
  2. The right to use isn't 'granted [t]herein' anyway.

The last conversation I had with Richard on this topic, IIRC, ended in him all but admitting that the GPL doesn't in fact prevent this. Of course he didn't actually _say_ that, but he did fall back to claiming that it was his _intention_ that was legally binding, not the text of the licence.

AFAIK his statement is incorrect, except where the actual text of the licence is ambiguous. I see no ambiguity in the sentence "Activities other than copying, distribution and modification are not covered by this License; they are outside its scope."

David continued, "As Linus accepted the code after I notified him of the situation, I infer that he shares my opinion. Note that even in the case of an ambiguity in the text of the GPL, I believe in this case that it is Linus' intention, not RMS', which would be relevant." And concluded:

Don't get me wrong - I detest the practice of patenting software, but I don't believe that the existence of a patent should prevent us from using the algorithm either

  1. Where it is legal to do so - which is either in the Free World where the insane patent doesn't apply, or on DiskOnChip hardware, for which M-Systems have granted permission. or
  2. Even where it is not legal, if someone wishes to challenge the patent or 'call the bluff' of the patent-holder.

Gregory Maxwell felt that the code was indeed in violation of the GPL, claiming, "There is a clear ability here for the author of the driver and m-systems to conspire to retroactively revoke anyone's privilege to use, modify, or distribute the stock kernel because of this code." Alan replied that he didn't see how this could be done, and Jeff V. Merkey agreed, saying:

Alan is right. The way it's worded, they could never show a case for "irreparable harm" to any sitting judge in the US. This means they could say "we've changed our minds, and revoke this person's or that person's license" but given it's been released with this language, no case for harm or damages, or even a petition for injunctive relief would have a snowball's chance in hell of succeeding.

If you release code under the GPL, you basically waive any rights to seek enforcement because in the US, you must be able to show "irreparable harm" from some parties use of the code. It's tough to do this if you've given the code away (which the GPL does) without a contractual requirement for "consideration" ($$$). The GPL in the strictest legal sense is the ultimate IP legal virus because it not only removes the basis for damages claims for use of present code released under its terms, but since it covers derivative works, its effect contaminates all future incarnations of the code.

It's true with how this is worded, the party could come back and attempt to modify the scope, but they would be hard pressed to make a case for an injunction to halt someone's use of the code.

Elsewhere, David asked Gregory, "I'll assume for the moment that I'm liable to suffer some form of brain hæmorrhage and go along with this dastardly plan - so enlighten me. How would I conspire with M-Systems to do so?" And Alan replied, "Not a lot. Even if you joined M-Systems you could make no difference. In fact as it stands now M-Systems are the one bunch of people in the world who cannot in fact distribute the code because they would be placing their own restriction on it or have to grant patent rights ;)"

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.