Kernel Traffic #111 For 16 Mar 2001

By Zack Brown


Introduction

Many, many thanks go out to Stephane Miller, Nick Moffitt, Pierre F. Maldague, and Noel Koethe. Nick has been hosting KT/KC on kt.zork.net as a mirror for quite a while, and then as the primary site when I left Linuxcare. Zork in turn sits on Stephane's network. Lately the traffic has become a bit too much for her connection, and it was clear that something had to be done. Noel, already running his own KT/KC mirror, and Pierre both volunteered to participate in a DNS round-robin, which is now in full effect. Yay! These folks are donating bandwidth, time, and real effort to keeping this site up and running. THANKS!!!

If you have some bandwidth to spare and would like to participate, please contact me (mailto:zbrown@tumblerings.org) and we'll work out the details.

Mailing List Stats For This Week

We looked at 1146 posts in 5081K.

There were 437 different contributors. 203 posted more than once. 172 posted last week too.

 

1. Patch To Improve Virtual Memory Throughput
27 Feb 2001 - 6 Mar 2001 (26 posts) Archive Link: "[patch][rfc][rft] vm throughput 2.4.2-ac4"
Topics: BSD: FreeBSD, Virtual Memory
People: Mike Galbraith, Rik van Riel, Chris Evans, Linus Torvalds, Marcelo Tosatti

Mike Galbraith posted a patch to remove some code from the Virtual Memory subsystem, after he noticed bad throughput in code that tried to avoid input/output. He suggested, "IMHO, any patch which claims to improve throughput via code deletion should be worth a little eyeball time.. and maybe even a test run ;-)" Rik van Riel replied soberly:

Before even thinking about testing this thing, I'd like to see some (detailed?) explanation from you why exactly you think the changes in this patch are good and how + why they work.

IMHO it would be good to not apply ANY code to the stable kernel tree unless we understand what it does and what the author meant the code to do...

Mike agreed completely, and said he hadn't meant that his patch should be blindly integrated. He gave some technical explanation of what his patch did, and added:

What the patch does is simply to push I/O as fast as we can.. we're by definition I/O bound and _can't_ defer it under any circumstance, for in this direction lies constipation. The only thing in the world which will make it better is pushing I/O.

If you test the patch, you'll notice one very important thing. The system no longer over-reacts.. as badly. That's a diagnostic point. (On my system under my favorite page turnover rate load, I see my box drowning in a pool of dirty pages.. which it's not allowed to drain)

Marcelo Tosatti replied skeptically that Mike's idea of pushing I/O as fast as possible would help with I/O-bound cases, but not necessarily with other cases. Mike replied that he suspected other cases wouldn't be harmed, but Marcelo wasn't convinced. Also, Rik was reluctant even to test a patch that "throws the random junk at the elevator all the time, while my code only bothers the elevator every once in a while." Rik felt his own method "should make it possible for the disk reads to continue with less interruptions."

At this point Chris Evans accused Rik of doing extremely speculative design work for the VM ("Oh dear.. not more "vm design by waving hands in the air"."). Rik replied, "Actually, this was more of "vm design by looking at what the FreeBSD folks did, why it didn't work and how they fixed it after 2 years of testing various things"." At one point Linus Torvalds said:

Note that the Linux VM is certainly different enough that I doubt the comparisons are all that valid. Especially actual virtual memory mapping is basically from another planet altogether, and heuristics that are appropriate for *BSD may not really translate all that better.

I'll take numbers over talk any day. At least Mike had numbers, and possible explanations for them. He also removed more code than he added, which is always a good sign.

In short, please don't argue against numbers.

Rik replied, "I'm not arguing against his numbers, all I want to know is if the patch has the same positive effect on other workloads as well..." After this the thread slowed down and gradually petered out, with Mike planning to test some ideas/patches from Rik and Marcelo.

 

2. Linux Vs. FreeBSD Networking Performance
28 Feb 2001 - 5 Mar 2001 (22 posts) Archive Link: "What is 2.4 Linux networking performance like compared to BSD?"
Topics: BSD: FreeBSD, FS: NFS, Networking, SMP, Virtual Memory
People: Hans Reiser, Todd Underwood, Nathan Dabney, David Weinehall, Alan Cox, Tigran Aivazian

Hans Reiser voiced a concern:

I have a client that wants to implement a webcache, but is very leery of implementing it on Linux rather than BSD.

They know that iMimic's polymix performance on Linux 2.2.* is half what it is on BSD. Has the Linux 2.4 networking code caught up to BSD?

Can I tell them not to worry about the Linux networking code strangling their webcache product's performance, or not?

Todd Underwood replied that in his experience, TCP and UDP networking under 2.4 showed a dramatic improvement over 2.2; he offered some numbers, "with the acenic gig-e driver on PIII-933 UP (66MHz x 64bits PCI) we are getting 993 Mb/s with 2.4.0 with jumbo frames (about 850 Mb/s with standard ethernet frames). the best number we got with 2.2 was about 650 with jumbos and 550 with standard." Hans replied, "The problem is that I really need BSD vs. Linux experiences, not Linux 2.4 vs. 2.2 experiences, because the webcache industry tends to strongly disparage Linux networking code, so much better isn't necessarily good enough." Nathan Dabney recommended checking out http://www.swelltech.com/pengies/joe/squidtuneup/t1.html, which "contains some decent squid performance hints for 2.2+Squid." Hans again pointed out that there were no Linux vs. FreeBSD numbers on that site; David Weinehall threw up his hands and said, "You know Hans, both Linux v2.4 and *BSD are free. Install a copy of each and run a couple of benchmarks. I seem to recall that you have a knack for running benchmarks... You can't always rely on having others getting all the information for you." And Alan Cox said to Hans:

I dont think raw network data helps. 2.2 and FreeBSD are basically the same speed for raw networking in the general case. So if someone was seeing real Linux/BSD differences Im concerned it might be a driver but also that it might not have been networking differences but perhaps VM or disk I/O performance. Clearly they saw something since its rather hard to mess up that kind of measuring. I wonder if it was networking though.

The extreme answer to the 2.4 networking performance is the tux specweb benchmarks but they dont answer for all cases clearly.

Elsewhere, Tigran Aivazian said to Hans, "exactly what you want to measure? I have UP, 2way-SMP and 4way-SMP machines all of which have at least Linux+FreeBSD installed. All my tests so far (e.g. comparing NFS servers or filesystems etc) showed Linux (2.4) to be a lot faster than FreeBSD in all areas. However, to get specific answers you need to ask specific questions. Ask and you shall receive." Later, Hans thanked Tigran, saying, "you helped me move the client past the Linux vs. BSD issue."
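For a sense of scale, the acenic figures Todd quoted work out as follows. This is only back-of-the-envelope arithmetic on the numbers in his post (which he gave as approximate); the helper names are made up for illustration, and the PCI figure is the theoretical peak of a 66MHz x 64-bit bus, not a measured value. Note that it compares 2.4 against 2.2 only, and says nothing about BSD.

```python
# Throughput figures quoted by Todd Underwood, in Mb/s (acenic gig-e, PIII-933 UP).
results = {
    "jumbo frames": {"2.2": 650, "2.4.0": 993},
    "standard frames": {"2.2": 550, "2.4.0": 850},
}


def speedup(old_mbps, new_mbps):
    """Return the new throughput as a multiple of the old one."""
    return new_mbps / old_mbps


def pci_peak_mbps(clock_mhz, width_bits):
    """Theoretical peak of a parallel PCI bus in Mb/s (one transfer per clock)."""
    return clock_mhz * width_bits


for frame_type, r in results.items():
    print(f"{frame_type}: {speedup(r['2.2'], r['2.4.0']):.2f}x over 2.2")
print("PCI bus peak:", pci_peak_mbps(66, 64), "Mb/s")
```

Both frame sizes come out at roughly one and a half times the 2.2 throughput, and the jumbo-frame result is close to gigabit line rate while staying well under what the bus could carry in theory.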

 

3. Memory Allocation Design In 2.4
1 Mar 2001 - 5 Mar 2001 (18 posts) Archive Link: "Kernel is unstable"
People: Andrea Arcangeli, Alan Cox, Linus Torvalds

Ivan Stepnikov posted a bug report, and after some debugging discussion Andrea Arcangeli said, "it's pretty obvious the clever vma merging is broken in 2.4." But he replied to himself four minutes later, saying, "It's not broken, it's not there any longer as somebody dropped it between test7 and 2.4.2, may I ask why?" Alan Cox replied, "Linus took it out because it was breaking things." Andrea countered that it might have had bugs, but was still a useful, well-designed feature worth fixing. But Linus Torvalds replied:

The locking order was rather nasty in it (mapping->i_shared_lock and mm->page_table_lock), and made a lot of the code much less readable than it should have been. And because none of the callers could know whether the vma existed after being merged, they ended up doing strange things that simply aren't necessary with the much simpler version.

This, coupled with the fact that many merges could be done trivially by hand (much faster), made me drop it. There were a few places where it was used where I couldn't make myself be sure that the locking was right: I could not prove that it was buggy, but I couldn't convince myself that it wasn't, either.

Note how do_brk() does the merging itself (see the comment "Can we just expand an old anonymous mapping?"), and that it's basically free when done that way, with no worries about locking etc. The same could be done fairly trivially in mmap too, but I never saw any real usage patterns that made it look all that worthwhile (The only "testing" I did was really running normal applications and then checking how many merges could be done on /proc/*/maps. Under normal load I did not see very many at all - I had something like six missed merges while running my normal set of applications (X, KDE etc). Others can obviously have very different usage patterns.). Handling the mmap case the same way do_brk() does it would fix the behaviour of this pathological example too..

Also note that the merging tests were not free, so at least under my set of normal load the non-merging code is actually _faster_ than the clever optimized merging. That was what clinched it for me: I absolutely hate to see complexity that doesn't really buy you anything noticeable.
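Linus's informal test, counting how many merges could have been done on /proc/*/maps, can be approximated with a short script. The following is only a rough sketch and certainly not what Linus ran: it assumes the usual maps line layout (address range, permissions, offset, device, inode, optional path), and it only counts adjacent anonymous mappings with identical permissions, ignoring file-backed mappings entirely. The function names and the sample data are invented for illustration.

```python
def parse_maps_line(line):
    """Parse one maps line into (start, end, perms, inode, path)."""
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    perms = fields[1]
    inode = int(fields[4])
    path = fields[5].strip() if len(fields) > 5 else ""
    return start, end, perms, inode, path


def count_missed_merges(maps_text):
    """Count neighbouring anonymous mappings that could have been one vma."""
    missed = 0
    prev = None
    for line in maps_text.splitlines():
        if not line.strip():
            continue
        cur = parse_maps_line(line)
        if prev is not None:
            adjacent = prev[1] == cur[0]          # previous end == current start
            same_perms = prev[2] == cur[2]
            both_anon = (prev[3] == 0 and cur[3] == 0
                         and not prev[4] and not cur[4])
            if adjacent and same_perms and both_anon:
                missed += 1
        prev = cur
    return missed


# Made-up sample: only the second and third anonymous areas are mergeable.
sample = """\
08048000-0804c000 r-xp 00000000 03:01 12345 /bin/cat
0804c000-0804d000 rw-p 00003000 03:01 12345 /bin/cat
40000000-40001000 rw-p 00000000 00:00 0
40001000-40003000 rw-p 00000000 00:00 0
40003000-40004000 r--p 00000000 00:00 0
"""
print(count_missed_merges(sample))
```

To try it on a live process, pass the contents of /proc/self/maps (or any other pid's maps file) to count_missed_merges().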

 

4. Intel's E1000 Ethernet Card Under 2.0
1 Mar 2001 - 5 Mar 2001 (15 posts) Archive Link: "Intel-e1000 for Linux 2.0.36-pre14"
Topics: BSD, Networking
People: Ofer Fryman, Richard B. Johnson, Matthew Jacob, Jes Sorensen

Ofer Fryman compiled the driver for Intel's E1000 ethernet card, and loaded it successfully into 2.0.36-pre14. But he reported, "With the E1000_IMS_RXSEQ bit set in IMS_ENABLE_MASK I get endless interrupts and the computer freezes, without this bit set it works but I cannot receive or send anything." Someone said that Intel refused to provide documentation for any of the ethernet cards, and Richard B. Johnson replied, "Intel has been a continual contributor to Linux and BSD. Somebody is not getting to the right person. There are lazy people at all companies." He posted some email addresses of Linux-friendly Intel employees, and added, "Maybe you can ask one of them for the information you need? You just need to find an advocate at a big company." But Matthew Jacob said:

Sorry, I don't believe that that this is correct in this case. I spoke on the telephone with the "Manager for Open Source Systems", and the concept of releasing documents to that a driver could be written whose source would be available was a concept too far. He kept on asking about NDAs- I kept on saying, yes, I'll sign an NDA (presumably so knowledge of advanced features, such as VLAN taggging, e.g., would not be released if they did not want it to be)- but the basic driver source would have to be OPEN! (this was for *BSD, but that's the same as linux in this case- we *all* want the damned source open). No meeting of minds. I have been trying this on and off for two years so that I can properly support the Wiseman && Livengood chipsets in *BSD. No luck, ergo, reverse engineering of what little they release with the Linux driver is the order of the day still. The Linux driver, btw, is pretty clearly a port of an NT driver- which is quite amusing.

FWIW.....I just think that the overall company policy within Intel, much like that of NetApp and others, is, "Open Source? Well, maybe, err,umm.. "... It's just not that important to them (as a company, they think). That said- if you can get access to said documentation (which I understand comes in a certain notebook that indicates releasing outside of Intel is a firing offense)- more power to you!

Richard replied:

The way I've gotten so-called proprietary information in the past is to let the world know that "boneserver.analogic.com" 204.178.40.210 is an open ftp site in which I don't even log what's uploaded and downloaded.

I check it once or twice a week to see if somebody has sent me anything of interest. Sometimes, persons unknown to me, have deposited information that I need.

Now I seem to need some programming information on the Intel e-1000. I'll keep you informed if anything turns up.

Several folks also suggested that Ofer upgrade to 2.2; as Jes Sorensen put it, "the scalability of 2.0.x means there is really no good reason to spend time porting GigE drivers to it." Ofer took this advice, but did not have any appreciable success. However, a few days later he actually got the driver working under 2.0! At one point he remarked, "You are right there are no specs on Intel's web site, nor did anyone in Intel answered any of my e-mails or returned any of my calls."

 

5. Swap Minimums And Swap Partition Size Limits On Big RAM Systems
2 Mar 2001 - 5 Mar 2001 (11 posts) Archive Link: "2.4 and 2GB swap partition limit"
People: Matt Domsch, Rogier Wolff, Christoph Rohland, Matti Aarnio, Andries Brouwer, Stephen Tweedie

Matt Domsch from Dell asked, "Linus has spoken, and 2.4.x now requires swap = 2x RAM. But, the 2GB per swap partition limit still exists, best as we can tell. So, we sell machines with say 8GB RAM. We need 16GB swap, but really we need like an 18GB disk with 8 2GB swap partitions, or ideally 8 disks with a 2GB swap partition on each. That's ugly. Is the 2GB per swap partition going to go away any time soon?" William T Wilson saw no need for a swap = 2x RAM requirement, and Matt gave a link to Kernel Traffic's coverage of a previous discussion, in Issue #104, Section #2 (7 Jan 2001: Greater 2.4 Swap Requirements). He added, "We've also seen (anecdotal evidence here) cases where a kernel panics, which we believe may have to do with having 0 < swap < 2x RAM. We're investigating further." Rogier Wolff had mentioned earlier, "Actually the deal is: either use enough swap (about 2x RAM) or use none at all," and Matt now replied in this same post, "If swap space isn't required in all cases, great! We'll encourage the use of swap files as needed, rather than swap partitions. But, if instead you *require* swap = 2x RAM, then the 2GB swap size limitation must go." Christoph Rohland replied:

No it is not strictly required.

But still the 2GB limit is annoying and together with the arch-independent maximum number of swap partitions/files it is pretty dumb.

So I would propose to first make a small patch to make MAX_SWAPFILES arch-dependent and bigger. (x86 would allow a much higher MAX_SWAPFILES)

For 2.5 we could perhaps think about a new swapfile layout which allows bigger partitions.
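To see why the two limits together are so awkward, here is the arithmetic behind Matt's example as a small sketch. It assumes the swap = 2x RAM rule of thumb and the 2GB per-partition limit from the discussion, and takes MAX_SWAPFILES to be 8, the value Matti Aarnio cites elsewhere in this thread; the function names are hypothetical.

```python
GIB = 1024 ** 3
MAX_SWAPFILES = 8  # value cited in the thread for include/linux/swap.h


def swap_partitions_needed(ram_bytes, swap_factor=2, partition_limit=2 * GIB):
    """Number of swap partitions needed to reach swap_factor x RAM."""
    swap_bytes = ram_bytes * swap_factor
    return -(-swap_bytes // partition_limit)  # integer ceiling division


def fits_in_swap_table(ram_bytes, max_swapfiles=MAX_SWAPFILES):
    """True if the required partitions fit under the swap-area limit."""
    return swap_partitions_needed(ram_bytes) <= max_swapfiles


for ram_gib in (4, 8, 16):
    ram = ram_gib * GIB
    print(f"{ram_gib} GB RAM -> {swap_partitions_needed(ram)} partitions, "
          f"fits: {fits_in_swap_table(ram)}")
```

Under these assumptions Matt's 8GB machine needs exactly 8 partitions, which is the most the table allows, and anything larger cannot reach 2x RAM at all. That is the case for raising either the per-partition limit or MAX_SWAPFILES.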

The discussion began to peter out around here, but not before Matti Aarnio said:

The i386 actually support up to 4*16 = 64 swap files (or partitions) with this SWP_TYPE() definition, while include/linux/swap.h does define MAX_SWAPFILES to be 8 ... If that were a pointer array to kmalloc()ed blocks, the limit could be much higher. Indeed I think this is the only *static* limit anywhere in the current swap code.

Similarly it supports 2^24 PAGES of swap at i386 per file/partition. ( 16 million pages of 4k each = 64 GB -- should be enough ;) ) ( That would require vmalloc() to allocate 32 MB block, though. That might not be possible at every occasion -> swapon may fail. )

The more I read the documentation (= source and its comments), the more I am inclined to think that the beast *will* work with swap-partitions (and files!) larger than 2G.

Stephen Tweedie did this 'SWAPSPACE2' work for 2.4 series, what he might tell ? Is it really just a matter of fixing the mkswap utility ? Was Stephen just conservative saying: "Don't go over 2G" (I haven't tested it)

Reviewing thru the architectural definitions of these SWP_***() macroes, the shifts used for SWP_OFFSET seem to vary in between 7-12 and for Alpha and MIPS64: 40. Indeed things are not very easy to understand with 64 bit architectures. It looks like those architectures use the low 32 bits of swp_entry_t for something, while most use at most couple of bits.

Oh, even those 64-bit system seem to give at least 24 bits for PAGE_SIZE 'offset'. The lowest bitcount for 'offset' seems to be at s390 which gives "only" 2^20 * 4k pages, or 4 GB per swap file/partition. (SWP_OFFSET() shifts with 12, which is same as PAGE_SHIFT for the machine. Why SPARC64 uses PAGE_SHIFT in its own unique way, that I don't know.)

Somehow I suspect that the makers of each architecture port have not quite understood what the swp_entry_t bits are used for, and have blindly presumed them to be related to PAGE_SIZE ...
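The limits Matti describes fall out of how a swap entry packs a swap-area number ("type") and a page offset into one word. The sketch below models that packing with the bit widths from his post (6 bits of type, 24 bits of offset on i386). The exact shift positions and the 2 bytes of swap map per page are assumptions on my part, the latter inferred from his 32 MB figure; this is an illustration, not the kernel's macros.

```python
TYPE_BITS = 6       # 2**6 = 64 swap areas, the "4*16 = 64" in the post
OFFSET_BITS = 24    # 2**24 pages per swap area on i386, per the post
TYPE_SHIFT = 1      # assumed bit position of the type field
OFFSET_SHIFT = 8    # assumed bit position of the offset field


def swp_entry(swap_type, offset):
    """Pack a swap area number and a page offset into one integer."""
    return (swap_type << TYPE_SHIFT) | (offset << OFFSET_SHIFT)


def swp_type(entry):
    """Extract the swap area number."""
    return (entry >> TYPE_SHIFT) & ((1 << TYPE_BITS) - 1)


def swp_offset(entry):
    """Extract the page offset within the swap area."""
    return entry >> OFFSET_SHIFT


def max_swap_bytes(offset_bits, page_size):
    """Largest swap area addressable with the given offset width."""
    return (1 << offset_bits) * page_size


def swap_map_bytes(offset_bits, bytes_per_page=2):
    """Size of a per-page swap map (2 bytes per page is an assumption)."""
    return (1 << offset_bits) * bytes_per_page


GIB, MIB = 1024 ** 3, 1024 ** 2
entry = swp_entry(5, 1234)
print(swp_type(entry), swp_offset(entry))
print("i386 areas:", 1 << TYPE_BITS)
print("i386 max per area:", max_swap_bytes(OFFSET_BITS, 4096) // GIB, "GB")
print("i386 swap map:", swap_map_bytes(OFFSET_BITS) // MIB, "MB")
print("s390 max per area:", max_swap_bytes(20, 4096) // GIB, "GB")
```

With these widths the numbers match the post: 64 swap areas, 64 GB per area on i386 with a 32 MB map to go with it, and 4 GB per area on s390.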

Matti concluded that the existing format was fine, but Andries Brouwer replied:

No, the present definition is terrible.

Read the mkswap source. A forest of #ifdefs, and still sometimes user assistance is required because mkswap cannot always figure out what the "pagesize" is.

There are two main problems:

  • "new" swap is hardly larger than "old" swap
  • the unit in which new swap is measured is a mystery

So, the next swap space has (i) a signature "SWAPSPACE3", (ii) (not strictly necessary) a size given as a 64-bit number in bytes. Moreover, the swapon call must not refuse swapspaces that are larger than the kernel can handle.
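Andries's complaint about the unit being "a mystery" is easier to appreciate with an example. As far as I understand the existing format, the signature ("SWAP-SPACE" for old swap, "SWAPSPACE2" for new) sits in the last 10 bytes of the first page, so a tool that does not know the page size has to guess where to look. The sketch below does exactly that guessing; the list of candidate page sizes and the function name are my own assumptions, and this is not how mkswap itself is written.

```python
SIGNATURES = (b"SWAP-SPACE", b"SWAPSPACE2")
CANDIDATE_PAGE_SIZES = (4096, 8192, 16384, 32768, 65536)  # assumed candidates


def probe_swap_signature(first_bytes):
    """Guess the page size by looking for a signature at each page end.

    Returns (page_size, signature) for the first match, or None.
    """
    for page_size in CANDIDATE_PAGE_SIZES:
        sig = first_bytes[page_size - 10:page_size]
        if sig in SIGNATURES:
            return page_size, sig.decode("ascii")
    return None


# Build a fake swap header as it might look on an 8k-page machine.
header = bytearray(65536)
header[8192 - 10:8192] = b"SWAPSPACE2"
print(probe_swap_signature(bytes(header)))
```

A size stored as a plain 64-bit byte count, as Andries proposes for "SWAPSPACE3", would make this kind of probing unnecessary for working out how big the swap space is.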

The thread ended inconclusively.

 

6. PPP Over Ethernet
5 Mar 2001 (2 posts) Archive Link: "How-To for PPPoE in v2.4.x?"
Topics: Networking
People: Jeremy Jackson

Steve Snyder couldn't find much clear information on getting PPP over Ethernet working under 2.4, and Jeremy Jackson replied:

I have been using PPPoE in the 2.4.0 kernel for about 2 months now. It's very nice. I used

http://www.math.uwaterloo.ca/~mostrows/

just grab the tarball and compile. I bet it will work under 2.4.2 also.

End of thread (tm).

PPP over Ethernet was first covered (very briefly) in KT in Issue #33, Section #17 (24 Aug 1999: PPP Over Ethernet). Some patches emerged for 2.3 in Issue #35, Section #28 (11 Sep 1999: Ramdisk Fix For 2.3.18). It came up again in Issue #37, Section #15 (26 Sep 1999: PPP Over Ethernet) and was not mentioned again until Issue #65, Section #3 (15 Apr 2000: Status Of PPP Over Ethernet).

 

7. Status Of Hot-Plugging PCI Adaptors
5 Mar 2001 (2 posts) Archive Link: "Kernel support for hot-plugging PCI adapters"
Topics: Hot-Plugging
People: Jeff Garzik

For a recent discussion on this issue, see Issue #107, Section #8 (6 Feb 2001: Hotplugging With Regular PCI Cards). This week, Duane Grigsby asked about the status of hot-plugging PCI adaptors, and Jeff Garzik explained:

For devices, the support is already there. See Documentation/pci.txt. Look for 'probe', 'id_table', etc.

I don't think there is support in the current tree for a controller that supports physical hotplugging of PCI adapters, yet. Compaq has a driver outside the tree to do such a thing (needing only very minor kernel patches), see http://opensource.compaq.com
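The 'probe' and 'id_table' mechanism Jeff refers to boils down to this: a driver publishes a table of the device IDs it can handle, and when a device appears, the core looks for a driver whose table matches and calls its probe routine. The toy model below illustrates only that matching idea. It is not the kernel API; the class, the wildcard stand-in, the driver name and the device IDs are all hypothetical (0x8086 is Intel's PCI vendor ID, the rest is arbitrary).

```python
PCI_ANY_ID = None  # stand-in for a wildcard table entry; illustrative only


class Driver:
    """Toy driver with an ID table and a probe hook."""

    def __init__(self, name, id_table):
        self.name = name
        self.id_table = id_table  # list of (vendor, device) pairs
        self.bound = []

    def matches(self, vendor, device):
        """True if any table entry matches, honouring wildcards."""
        return any(
            (v is PCI_ANY_ID or v == vendor) and (d is PCI_ANY_ID or d == device)
            for v, d in self.id_table
        )

    def probe(self, vendor, device):
        """Claim the device; a real driver would initialise hardware here."""
        self.bound.append((vendor, device))
        return True


def device_added(drivers, vendor, device):
    """Called when a device shows up: bind it to the first willing driver."""
    for drv in drivers:
        if drv.matches(vendor, device) and drv.probe(vendor, device):
            return drv.name
    return None


drivers = [Driver("example_nic", [(0x8086, PCI_ANY_ID)])]
print(device_added(drivers, 0x8086, 0x1234))  # matched by the wildcard entry
print(device_added(drivers, 0x1111, 0x2222))  # no driver claims this one
```

Because binding is driven by the table rather than by the driver scanning the bus at load time, a device that appears later can be matched the same way, which is why drivers written in this style are ready for hot-plugging.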

 

8. Massive Filesystem Corruption In 2.4.3
5 Mar 2001 - 8 Mar 2001 (20 posts) Archive Link: "Linux 2.4.3"
Topics: Disks: SCSI
People: Linus Torvalds, David Weinehall, Richard B. Johnson

Richard B. Johnson reported massive filesystem corruption under 2.4.3; the kernel had even trashed filesystems that had not been mounted at the time. He urged folks owning BusLogic SCSI controllers to avoid 2.4.3; Linus Torvalds took a look, and replied:

Anybody who has any ideas or input, please holler. There are no actual BusLogic controller changes in the current 2.4.3-pre kernels at all, so there's something else going on.

There's a new aic7xxx driver there - did you enable support for that? I wonder if there could be some inter-action: the aic7xxx driver tries to probe every PCI SCSI controller because they are basically hard to ID any other way (no single vendor/id combination, or even a simple pattern). But it has some rather careful internal logic to filter out all non-aic7xxx controllers, so this really doesn't look likely.

If you didn't compile aic7xxx in, the only other SCSI change (apart from a lot of spelling fixes in comments etc) is some trivial error handling, like changing scsi_test_unit_ready to not have a result buffer (because it doesn't have a result except for the regular sense buffer). Which again certainly shouldn't be able to matter at all.

A few folks reported similar problems, and there was a bit of peripheral discussion. After a few days passed with nothing conclusive coming through the list, David Weinehall asked for an update. Richard indicated that the problem had not been solved, and the thread ended.

 

9. Still Not Ready For 2.5
6 Mar 2001 (9 posts) Archive Link: "Patch submissions"
Topics: Virtual Memory
People: Alan Cox, Rik van Riel, Jeff Garzik, Kurt Garloff, Lars Marowsky-Bree

Alan Cox announced:

I'm getting a notable increase in people sending me patches that do major things and should be 2.5 stuff. Please if you want to rewrite the VM completely, redesign the scsi layer and the like wait until 2.5.

Right now I'm only collecting patches that are driver bugfix/updates, arch specific updates/fixes or bugfixes (not feature adds) for the core kernel code.

Anything else goes in the bitbucket

Lars Marowsky-Bree asked when 2.5 would fork, but no one gave an estimate. Rik van Riel suggested in a different vein:

VM folks can post their patches to linux-mm@kvack.org, where we can play with things until 2.5 is forked.

I agree with Alan that we should keep all experimental stuff out of 2.4, probably even out of linux-kernel ...

Kurt Garloff objected to this, saying he wanted experimental stuff to keep coming to linux-kernel. He also drew a distinction between experimental stuff in the basic subsystems, which he agreed was not 2.4 material, and experimental drivers or devices that were not supported before. In the latter case, he said, the new code could quite possibly be added to the kernel during 2.4; Rik agreed with this, but pointed out that if all discussions of experimental code were moved onto linux-kernel from the various smaller mailing lists, it would triple the list volume. Jeff Garzik replied, "Every patch doesn't need to go to lkml, but keeping linux-kernel folks updated on experimental issues is always IMHO a good idea. Otherwise, interested folks who don't have time to find out about and subscribe to 1000 other lists are kept informed."

 

10. Reinitializing Modules After APM Suspend
6 Mar 2001 - 7 Mar 2001 (6 posts) Archive Link: "Forcible removal of modules"
People: Thomas Hood, John Fremlin, Alan Cox, Jeff Garzik

Thomas Hood pointed out, "Sometimes modules need to be reloaded in order to cause some sort of reinitialization (of the driver or of the hardware) to occur. Sometimes this has to be done every time a machine is suspended." John Fremlin replied, "Why not set up the device driver to handle" [power management] "events itself. See Documentation/pm.txt under Driver Interface. I have a race free version of pm_send_all if you want it." Jeff Garzik pointed to a similar feature in 2.4.3-pre3, and John replied, "Looks like Alan Cox got his version in kernel first." And Alan also replied to Jeff, "Mine is race free for the basics, his is a far far more elegant solution to the whole problem space. It might be 2.5 stuff but its definitely a good idea."
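John's suggestion, letting the driver handle power management events itself rather than unloading and reloading the module, can be pictured with a toy model. In the sketch below drivers register a callback, a suspend or resume event is sent to all of them, and a refused suspend is rolled back by resuming the drivers that had already agreed. The class and method names are invented for illustration, and the rollback behaviour is my assumption about what a careful "send to all" routine would do, not a description of either patch discussed above.

```python
PM_SUSPEND = "suspend"
PM_RESUME = "resume"


class PMRegistry:
    """Toy registry of power-management callbacks."""

    def __init__(self):
        self.callbacks = []

    def register(self, callback):
        self.callbacks.append(callback)

    def send_all(self, event):
        """Deliver event to every callback; undo a refused suspend."""
        notified = []
        for cb in self.callbacks:
            if cb(event):
                notified.append(cb)
                continue
            if event == PM_SUSPEND:
                # Someone vetoed: wake up everyone who already suspended.
                for done in reversed(notified):
                    done(PM_RESUME)
            return False
        return True


class ExampleDriver:
    """Hypothetical driver that reinitialises its hardware on resume."""

    def __init__(self):
        self.init_count = 0
        self.hw_init()

    def hw_init(self):
        self.init_count += 1

    def pm_callback(self, event):
        if event == PM_RESUME:
            self.hw_init()  # the step that otherwise needs a module reload
        return True


registry = PMRegistry()
driver = ExampleDriver()
registry.register(driver.pm_callback)
registry.send_all(PM_SUSPEND)
registry.send_all(PM_RESUME)
print(driver.init_count)
```

The point of the design is that the reinitialisation Thomas wanted happens inside the driver's own resume handling, so nothing needs to be forcibly removed and reloaded after each suspend.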

 

We Hope You Enjoy Kernel Traffic

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License, version 2.0.