Kernel Traffic #183 For 8 Sep 2002

By Zack Brown

Mailing List Stats For This Week

We looked at 1469 posts in 7526K.

There were 420 different contributors. 219 posted more than once. 167 posted last week too.

1. IRQ Balancing For Various Systems

13 Aug 2002 - 29 Aug 2002 (42 posts) Subject: "[PATCH] NUMA-Q disable irqbalance"

Topics: SMP

People: Martin J. Bligh, Linus Torvalds

Martin J. Bligh posted a patch, and explained, "This patch is from Matt Dobson. It disables irq_balance for the NUMA-Q and makes it a config option for everyone else. This is needed for NUMA-Q to work, since the irq_balance code assumes a logical flat apic addressing mode that's not true in all cases. We created a config option since irq_balance makes performance significantly worse for some workloads. Says it's against 2.5.25, but applies to and works on 2.5.31" Linus Torvalds was unhappy with a negative configuration option that asked to turn something off rather than on; but he added, "since IRQ balancing is practically required on P4-SMP, I really don't think a CONFIG option works. It needs to be configured in on any kernel that expects to use P4's in an SMP configuration." Martin felt this wouldn't work because it would be difficult to get it to work properly on all P4 systems. In a later post he added, "Forcing it on for every machine just because P4s are borked sounds wrong" [...] "we really do need to disable it for many machines. Getting rid of the negative config option is easy though." But Linus disagreed, saying the need for IRQ balancing should be determined dynamically at run-time, not as an option. Martin replied, "But if it's good for P4, and bad for P3 (at least for some workloads), surely this leads to the conclusion that it should be a config option (probably defaulting to being on)? If you can see another way to solve the conundrum ...." and Linus replied:

But this is exactly the kinds of cases that config options do _not_ work well for.

There are tons of reasons to run the same kernel on a multitude of machines, even ignoring the issue of things like installers etc.

We had this CONFIG_xxxx disease when it came to SSE, we had it when it came to TSC, etc. And in every case it ended up being bad, simply because it's not the right interface for _users_.

So this is why I think the IRQ balance code has to be there, regardless, and then it gets turned on dynamically for when it is needed (or turned off when it hurts, whatever). But it should _not_ be a CONFIG option.

Martin posted a very short patch and said, "OK ... you're right ;-) This is bad, especially for distribution kernels. So I just need something to stop the NUMA-Q crashing. Can I have the appended? Please, please, please? ;-) It just adds an if switch to irq_balance which the compiler optimises away anyway. Not a whiff of a config option. Tested on 2.5.31."
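The approach Linus argues for, with the balancing code always built in and a run-time check deciding whether it runs, can be illustrated with a minimal sketch. It is written in Python rather than kernel C, and the platform strings and both function names are hypothetical; it is not the logic of Martin's actual patch.

```python
# Hypothetical sketch: the balancing code is always present, and a run-time
# check on the detected platform decides whether it is used. All names here
# are invented for illustration.

def needs_irq_balance(platform, cpu_count):
    """Decide at run time whether IRQ balancing should be active."""
    if cpu_count < 2:
        return False   # nothing to balance on a uniprocessor
    if platform == "numa-q":
        return False   # the flat logical APIC assumption does not hold here
    return True


def balance_irq(platform, cpu_count, irq, current_cpu):
    """Return the CPU that should service `irq` next."""
    if not needs_irq_balance(platform, cpu_count):
        return current_cpu                  # leave the interrupt where it is
    return (current_cpu + 1) % cpu_count    # trivial round-robin stand-in
```

Because the decision is made from what is detected at boot, the same kernel image could serve both kinds of machine, which is the property Linus wanted for distribution kernels.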

2. Status Of IPv6

18 Aug 2002 - 29 Aug 2002 (37 posts) Subject: "2.4 and full ipv6 - will it happen?"

Topics: Networking

People: Tomasz Torcz, David S. Miller, Alexey Kuznetsov

Tomasz Torcz remarked, "Some time ago Linux was first OS to have full RFC compliant IPv4 stack. Linux still has superior networking, but protocol of the future is IPv6. IPv6 stack in mainline is currently far from perfect. There is a hope, however. Full IPv6 stack is being maintained by USAGI project. It's clear, that USAGI's project will be integrated into mainline kernel. What worries me - it's planned for 2.7, what is _BAD_ and late. IMO, it can be included in any time. The sooner is better. Marcelo - would you include full IPv6 stack in 2.4.20 if you get patches? Please - it's important for Linux to be network OS choice in future. It's barely possible with current IPv6 implementation." David S. Miller replied:

based upon previous attempts to get them to merge their work into the mainline, we believe at this point that they actually enjoy being a totally separate project and not merging completely is a feature for them.

USAGI may only accept that comment, and the only way they may disprove it is to merge their code to us as we have continually requested them to do so.

In my opinion, USAGI has been given more than adequate opportunities to merge their entire work into the mainline. Alexey Kuznetsov has asked them repeatedly over the years to merge with him, yet they always fail to do so completely. Occasionally one or two trivial bug fixes they are able to merge, but otherwise their efforts always fall short.

They claim they wish to merge so badly, yet act in opposite manner. It is almost disgraceful and I am so tired of this continual public propaganda that tries to make it look as if Alexey and myself are to blame for this.

Tomasz replied that his impression was that Alexey Kuznetsov and USAGI had been cooperating, but David said shortly, "Alexey is asking USAGI folks for patches, they are not responding."

Elsewhere, someone said the IPv6 code had been working well on their system for years. David replied:

The keyword is "you", you are using it locally at your site.

There are zero backbone ipv6 routers, everyone is still tunneling or has a custom network layout for their usage.

A number of folks pointed out that there were some backbone IPv6 routers in use at the current time; and the discussion skewed off into the various possible futures, whether IPv6 would become popular, whether there were other alternatives that might delay things. The bottom line seemed to be that no one really knew, but that everyone was in favor of IPv6.

3. Adding 'localconfig' To Automate .config Choices

24 Aug 2002 - 2 Sep 2002 (17 posts) Subject: "[RFC] make localconfig"

Topics: Kernel Build System

People: Andrew Rodland, Zwane Mwaikambo, Giacomo Catenazzi

Someone suggested a 'localconfig' Makefile target that would analyze the currently running system and produce a .config file to match the current hardware as closely as possible. They added that the recommendation would be to run 'make menuconfig' afterwards just to confirm the accuracy of the .config file. Andrew Rodland replied, "It turns out that the autoconfigure script included in CML2 is actually an adaptation of kautoconfigure (Giacomo Catenazzi), just tweaked to use CML2 and python... a slightly older version that uses sh (well, bash) is still available. The ruleset is something like 8 months old by now, but the features provided are really pretty nifty. I used it once and it worked very nicely. I don't know if it was the, erm, downfall of CML2 that killed this project, but I wouldn't mind seeing it come back."

Elsewhere, Zwane Mwaikambo remarked, "For this kind of thing, code talks. Otherwise no one will take heed." A couple posts later, Giacomo Catenazzi replied, "I have some bash code (to probe hardware), an hardware/driver database and a python script to partly generate the database direct from kernel sources. It should be in and in"
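To make the idea concrete, here is a minimal sketch of how probed hardware might be turned into .config lines. It is an illustration only: the `DRIVER_DB` table and the `local_config` function are hypothetical and are not taken from kautoconfigure or from any posted code, and a real tool would generate its database from the kernel sources, as Giacomo describes.

```python
# Hypothetical driver-to-option table. A real tool would derive this from
# the kernel sources instead of hard-coding a few entries.
DRIVER_DB = {
    "8139too": "CONFIG_8139TOO",
    "e100": "CONFIG_E100",
    "ext3": "CONFIG_EXT3_FS",
}


def local_config(detected):
    """Build .config text that enables one option per recognised driver."""
    options = sorted({DRIVER_DB[name] for name in detected if name in DRIVER_DB})
    unknown = sorted({name for name in detected if name not in DRIVER_DB})
    lines = [f"{option}=y" for option in options]
    # Unrecognised hardware is reported rather than silently dropped, so the
    # user knows what to check by hand in 'make menuconfig'.
    lines += [f"# no known option for: {name}" for name in unknown]
    return "\n".join(lines) + "\n"
```

Listing the unrecognised entries as comments fits the original suggestion that the generated file be reviewed with 'make menuconfig' afterwards.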

4. Comparing 2.4 VMs With VM Regress

25 Aug 2002 - 29 Aug 2002 (5 posts) Subject: "2.4.19 Vs 2.4.19-rmap14a with anonymous mmaped memory"

Topics: Virtual Memory

People: Mel Gorman, Daniel Phillips

Mel Gorman reported:

I ran a brief series of tests on a small crash box. The intention was to see what sort of figures and conclusions could be gathered with VM Regress in its current public release. VM Regress is the beginnings of a tool that ultimately aims to answer questions about the VM by testing and benchmarking individual parts of it. The conclusions drawn here are extremely ad-hoc so take them with a very large grain of salt.

4 tests were run on each machine, each related to anonymous memory used in a mmaped region. Two reference patterns were used: smooth_sin and smooth_sin-random. Both sets show a sin curve when the number of times each page is referenced is graphed (See the green line in the graph Pages Present/Swapped). With smooth_sin, the pages are referred to in order. With smooth_sin-random, the pages are referenced in a random order but the number of times each page is referenced is the same.

Both patterns are tested with 2,000,000 page references made to a mmaped region. The first memory mapped region is 25000 pages large, about the size of physical memory on the machine. The second was with 50000. Unfortunately detailed statistical information is unavailable, but some conclusions can still be drawn. Statistical information is aimed to be available at least by 0.9.

Test 1 - smooth-sin_25000

Behaviour is pretty much comparable. The average page access times look roughly the same so at the very least the performance is similar. rmap14a did perform faster but the test wasn't long enough to be conclusive. All in all, when enough physical memory is available, rmap14a and stock will perform roughly the same with a linear reference pattern.

Test 2 - smooth-sin-random_25000

Here, the average performance remains roughly the same. It is interesting to note that rmap14a had periodic large access times to pages and it's unclear why. Despite this, rmap14a still completed the test faster. So again, with enough memory available, the performance remains roughly the same even with a relatively random page reference pattern.

Test 3 - smooth_sin_50000

This test is interesting. Remember that the references are linear in memory. At about the 1,000,000 page reference, physical memory is exhausted. Both tests completed in the same time so in "raw performance" they would appear the same but not so. The time access graph shows that for most of the test, rmap14a performed much better on average except for the occasional large spikes. At the end, it degrades very quickly but is still faster than the stock kernel, about 300000 microseconds to access a page, which the unscaled graphs show.

This would appear consistent with reports that the stock kernel degrades slowly where rmap seems to fall apart really quickly in some situations.

It is suspected that the large periodic spikes are where the proper page to select out is found but it's pure guesswork and VM Regress is not at the point where it can investigate more.

The second point of note is the present pages at the end of the test. stock makes no attempt to keep certain pages in memory. When physical memory is out, it swaps out entire processes unconditionally. rmap14a tries to keep the proper pages in memory and the page reference vs presence graph shows that it did. stock has a large block of pages present, rmap14a had swapped out some pages from the beginning of the test.

In this case, stock just happened to swap out correctly because the pages removed were not going to be used again in this particular case.

Test 4 - smooth_sin-random

With this test, the page references are in random order so determining which page to remove is much more difficult. rmap14a completed this test almost 10 minutes quicker than stock.

The average time for the stock kernel is consistently bad. I am guessing that this is because the kernel consistently ends up swapping out the entire process. rmap has periods of quick accesses with unfortunately large spikes because it is trying to keep the right pages in memory and a lot of the time gets it right. This is better than stock kernel which never keeps the right pages in memory.


It is hard to draw solid conclusions because large gaps still exist in the data but some can be drawn. I am sure an experienced VM developer will be able to draw much more reliable conclusions :-)

First, when enough physical memory is available, rmap and stock perform more or less the same so appreciable overhead is not introduced for normal anonymous memory use.

Second, when memory is tight, the type of memory reference behaviour will determine how good or bad the two will perform. With a strictly linear pattern, stock will perform better because it just dumps all the old pages en masse. I seriously doubt this reference is common.

For other patterns with large anonymous page use, rmap is more likely to perform better because it tries to keep anonymous pages in memory. Even with a totally random pattern, it'll perform reasonably well.

Lastly, it is obvious from the tests that for deciding which page to swap, age is more important than frequency but that is already known. The page age graphs are on the way and will be available in VM Regress 0.7

Daniel Phillips asked, "Could you please provide pseudocode, to specify these reference patterns more precisely?" Mel replied:

Rather than providing pseudo code, here is a link to the actual function that generates the smooth_sin references

It is really crude and written to generate any type of data until I found the time to generate more realistic data which is a project in itself. Anyone who wants to generate better data only has to edit the file.

It takes three inputs

references - number of references to generate
range - the size in pages of the region to reference
output - the output filename

the function has three parts

part 1: Plot a sin wave so that the sum of all the integer values of each part of it would generate enough references to satisfy at least half of the requested number
part 2: Starting at the beginning of the range, reference each page in a linear pattern until all the required references are generated
part 3: Dump all references to disk

now that I think of it, it would have made more sense to begin with the linear reference pattern and then generate the sin curve but seeing as this pattern is nothing resembling real life, I didn't worry about it too much. It is probably something I should change as it would illustrate better what pages are kept in memory.
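As an editorial illustration of the three parts described above (not Mel's code, which was linked from the original post), the generator might look roughly like the following sketch. How the sine wave is scaled is an assumption, as are the names `smooth_sin_refs` and `write_refs`; the description only says that the integer values of the wave must add up to at least half of the requested references.

```python
import math


def smooth_sin_refs(references, pages):
    """Return `references` page indices: a sine-weighted part, then a linear sweep."""
    if pages < 2 or references < 0:
        raise ValueError("need at least two pages and a non-negative count")

    def counts(scale):
        # Integer height of a half sine wave above each page index.
        return [int(scale * math.sin(math.pi * i / pages)) for i in range(pages)]

    # Part 1: grow the wave until its integer heights sum to at least half
    # of the requested references, then emit each page that many times.
    scale = 1
    while sum(counts(scale)) < references / 2:
        scale += 1
    refs = []
    for page, n in enumerate(counts(scale)):
        refs.extend([page] * n)
    refs = refs[:references]  # guard for very small requests

    # Part 2: sweep linearly from page 0 until the request is satisfied.
    page = 0
    while len(refs) < references:
        refs.append(page)
        page = (page + 1) % pages
    return refs


def write_refs(refs, output):
    """Part 3: dump all references to disk, one page index per line."""
    with open(output, "w") as handle:
        handle.writelines(f"{page}\n" for page in refs)
```

With this ordering, pages near the middle of the range are referenced most often, and the trailing linear sweep adds roughly equal counts to every page.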


This is a perl script for randomizing an input file. It takes an input file generated by the smooth_sin function and outputs a randomized version of it. It is pretty simple

  1. For each input reference, output a random number between 0 and range followed by the input reference
  2. Sort the file numerically with sort. This will effectively randomize the input
  3. Reread the randomized input and strip away the generated random number
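The three steps above are a decorate, sort, undecorate shuffle. A minimal sketch in Python (the original was a perl script combined with sort; the function name and the `seed` parameter are assumptions added for illustration):

```python
import random


def randomize_refs(refs, page_range, seed=None):
    """Shuffle `refs` by tagging each entry with a random key and sorting on it."""
    rng = random.Random(seed)
    # Step 1: prefix each reference with a random number between 0 and range.
    keyed = [(rng.randint(0, page_range), ref) for ref in refs]
    # Step 2: sort numerically on the random key only.
    keyed.sort(key=lambda pair: pair[0])
    # Step 3: strip the random key, keeping the references in their new order.
    return [ref for _, ref in keyed]
```

The output contains exactly the same references as the input, so each page is still referenced the same number of times. One caveat of this sketch: when there are many more references than possible keys, ties are frequent, and Python's stable sort leaves tied entries in their input order, so the result is only approximately random.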

Daniel replied, "The perl script that writes tables isn't too informative without knowing how the tables are used. Pseudocode that says exactly what your final reference pattern is would be a lot more useful. Just leave out the part about generating the tables and express it as if you were computing the distribution at the same time as generating the references, unless it's really impossible to do that. I don't think it's impossible to do that in this case." Mel said:

I guessed that after I thought about it for a while and reworked the algorithm for 0.7. To make things easier again, I added a new graph to the reports which is in 0.7 called "Page Index Reference over Time"


It is the second graph. At the beginning, it is at the 0th page and it moves through the address space over time. A totally random one would make this graph look like noise. The graph should give a good idea how memory was referenced.

In 0.6 and with these tests, it would have been a similar curve except the last page would have been hit around 40000 references before the end of the test. After that, the pages were referenced in a linear pattern which was a mistake after reviewing it a bit.

If people are still interested, I'll run a full set of tests again on 2.4.19 and 2.4.19-rmap14a with 0.7 and post up the results complete with the page reference information so you don't have to guess this time. It takes about a full day to run a complete series. Any taker?

5. Extending The Kernel API To Handle 64-Bit Values

27 Aug 2002 - 29 Aug 2002 (9 posts) Subject: "atomic64_t proposal"

People: Dean Nelson, David S. Miller, Andi Kleen

Dean Nelson proposed the creation of an 'atomic64_t' type, to be a 64-bit version of the existing 'atomic_t' type, and to enable the various macros in the kernel to operate on either type. Andi Kleen felt that it would be much cleaner to just provide a common kernel API across all architectures, by adding some new macros that dealt explicitly with 64-bit values. David S. Miller also didn't like to see the data types handled transparently. Dean replied, "Your point about a common kernel api (across all architectures) is valid and leads me to reconsider the use of common macros for the two atomic types. So I guess I would lean in the direction you suggested of separate macros (atomic64_add/sub/read etc.) for the atomic64_t type." But he added, "I have no plans on implementing this for anything but the IA-64 linux kernel. But its api should be discussed and approved (or disapproved) by this list. The implementations for the other platforms can come as other people feel so moved to do them."
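The shape of the separate interface that Dean leaned toward can be modelled with a toy sketch. Only the names atomic64_add, atomic64_sub and atomic64_read come from the discussion; the `Atomic64` class, the use of a lock in place of hardware atomic instructions, and the signed 64-bit wraparound are assumptions made so the example can run in Python.

```python
import threading

_MASK64 = (1 << 64) - 1


class Atomic64:
    """Toy 64-bit counter; a lock stands in for hardware atomic operations."""

    def __init__(self, value=0):
        self._lock = threading.Lock()
        self._bits = value & _MASK64


def atomic64_read(v):
    """Return the current value, interpreted as a signed 64-bit integer."""
    with v._lock:
        bits = v._bits
    return bits - (1 << 64) if bits >> 63 else bits


def atomic64_add(i, v):
    """Add i to v as one indivisible step, wrapping at 64 bits."""
    with v._lock:
        v._bits = (v._bits + i) & _MASK64


def atomic64_sub(i, v):
    """Subtract i from v as one indivisible step."""
    atomic64_add(-i, v)
```

Keeping the 64-bit operations under their own names, rather than overloading the existing atomic_t macros, means a caller always states which width is intended, which is the point Andi and David raised.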

6. Kernel 2.5.32 Announced; IDE Breakage; Keyboard Beep Breakage

27 Aug 2002 - 31 Aug 2002 (36 posts) Subject: "Linux v2.5.32"

Topics: Disks: IDE, Disks: SCSI, FS, Hyperthreading, Kernel Release Announcement, USB

People: Linus Torvalds, Udo A. Steinberg, Alexander Viro, Andre Hedrick, Vojtech Pavlik, Gerhard Mack, Jos Hulzink, Alan Cox, Randy Hron, Mikael Pettersson

Linus Torvalds announced 2.5.32 (see ChangeLog), and explained:

Delayed by various issues (including a HT-only MTRR bug that Ingo finally chased down and that kept me chasing shadows for days). As a result, this is fairly big..

Most noticeable is the (already discussed) IDE revert, and the threading updates. The input layer switch-over may also end up being a bit painful for a while, since that not only adds a lot of config options that you have to get right to have a working keyboard and mouse (we'll fix that usability nightmare), but the drivers themselves are different and there are likely devices out there that depended on various quirks.

The AIO core code from Ben got merged, and Al worked on cleaning up the gendisk stuff from a number of drivers that were missed last time. And the usual USB updates..

Oh, and various architecture updates (sparc64, ppc64, ia-64).

Udo A. Steinberg took a look, and reported, "It looks like the kernel is trying to read partition tables on IDE cdrom drives in SCSI emulation mode - and failing." Andre Hedrick said he'd get to it as soon as he and Alan Cox finished some work they were doing for the 2.4 kernel. And Alexander Viro said that the 2.5 IDE merge was broken with regard to partitioning. He'd put up a patch-set to fix it, but was waiting for an acknowledgement from Alan. Alan replied that he was too busy with the Red Hat beta to work on 2.5 stuff for another few days, but that he'd look at it then. Alexander replied:

OK. Here's the contents of patchset, patches will go in followups (and not Cc'd to l-k)

  1. moves stuff from ide_register_subdriver() (associating drive with high-level driver) to ide-probe.c, so that remaining stuff can be safely called in parallel with IO on other drives [Andre]
  2. finishes introduction of ->reinit() - Jens had missed MOD_DEC_USE_COUNT on several exits from ide-cd one and forgot to remove the loop from ide-floppy ide-tape and ide-scsi ones ;-) (->reinit() is the body of loop in ->init() - stuff that should be done one drive; in 2.5.32 ide-disk one is OK, ide-cd is OK modulo minor bugs and in the rest it's a copy of ->init())
  3. puts drives on cyclic lists - per-driver ones for drives that had been claimed by high-level drivers and ata_unused for unclaimed drives. We put drives on ata_unused in the very end of ideprobe_init() and then move them to drivers' lists as they are claimed.
  4. checks for media type, ->driver_req, etc. are moved from the ide_scan_devices() to ->reinit(). ide_scan_devices() had lost first two arguments (it will completely disappear later).
  5. duplicate calls of ide_cdrom_init(), idedisk_init(), etc. are removed from ide_init_builtin_drivers() (they were called both from there (i.e. from ide_init()) and later as module_init() for high-level drivers).
  6. loops in ide_cdrom_init()/ide_cdrom_exit(), etc. are pulled into ide_register_module()/ide_unregister_module() resp.
  7. ->owner added to ide_driver_t. MOD_INC_USE_COUNT/MOD_DEC_USE_COUNT taken out of ->reinit(). ide_reinit_drive() turned into "call ->reinit() for all high-level drivers that are registered until somebody claims the drive" (instead of open-coded variant in 2.5.32; cleaner and works correctly for modular drivers).
  8. That's the central part of the series:

    • ide_reinit_drive() turned into ata_attach(). Said beast takes a drive, tries to feed it to high-level drivers and drops it on the ata_unused if nobody claims the sucker. IOW, that's what ide_register_module() used to do, but for a single drive.
    • ideprobe_init() calls ata_attach() instead of putting on ata_unused.
    • ide_register_module() eliminated. Some of the callers do not need it anymore, some (ide_replace_subdriver()) actually want ata_attach(drive).
    • ide_scan_devices() is gone. There were two remaining callers - in ide_register_module() and ide_unregister_module(). The former had been turned into "put driver on the list, empty ata_unused into temporary list and call ata_attach() on all drives there". The latter is "remove driver from the list, call ->cleanup() and ata_attach() for all drives" (->cleanup() gives the drive up, ata_attach() gives the remaining drivers a shot for that drive; if nobody claims it - it's put on ata_unused).

  9. ->init() for high-level drivers is never called (other than as module_init() when they are initialized). Method removed, instances cleaned up.
  10. instead of messing with ide_module_t, we put ide_driver_t themselves on a (cyclic) list - said list being the only use of ide_module_t for high-level drivers. ide_register_module()/ide_unregister_module() takes ide_driver_t now (renamed to ide_register_driver()). /proc/ide/drivers switched to use of that cyclic list and uses seq_file instead of old home-grown code.
  11. 2.5 bits:

    • add_gendisk()/del_gendisk() moved into ->reinit() and ->cleanup() of ide-{disk,cd,floppy} - i.e. moments when high-level driver claims/gives up a drive.
    • register_disk() also shifted into ->reinit().
    • consequently, revalidate_drives() is gone (it did messy postponed rereading of partition tables; not needed anymore). Ditto for ide_geninit().
    • regular 2.5 changes in ->revalidate() and BLKRRPART handling - same as all other block devices.

The reason why we couldn't do just #11 and be done with that is simple - high-level drivers were rude and considered drives fair game as soon as they had been probed. That is *wrong* - we might be still doing "unsafe" work with the interface (or related interfaces) and any regular IO at that point is a Bad Thing(tm). As the result we had to use very odd logics in partition handling, registering, etc.

New variant lets the probing code to decide when it's safe to put the drives in circulation - no high-level driver will see a drive until ata_attach() is called. Which puts the knowledge of ordering between configuring and normal IO into the probing code where it belongs. High-level drivers don't have to think about it anymore - as soon as drive is given to them it's safe to do IO on it.

Ordering issues between configuring different interfaces, etc. still remain where they were - that's a separate story and it belongs to the low-level driver cleanups. Moreover, place where we are calling ata_attach() is very conservative - we do _all_ configuring of interfaces and then call ata_attach() on everything we'd found. Eventually, low-level drivers should be able to do "configure our group of interfaces, then call ata_attach() on their drives", but that, again, is a separate story - one I'd happily leave to the folks who do cleanup of low-level drivers. All ordering issues with high-level drivers are reduced to one rule: don't call ata_attach() on a drive before it's safe to get IO on it.

The patchset doesn't fix all problems with the driver - code that went into 2.5 had been derived from 2.4.19 and several megabytes of fixes went into 2.4.20-ac since then. However, these fixes are mostly in low-level drivers, so they shouldn't cause problems with porting and I'll happily leave that fun to Jens when he comes back.
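The attach logic described in item 8 can be modelled in a few lines. This is a simplified Python sketch, not the kernel code: the names ata_attach, ata_unused, reinit and ide_register_driver follow the description above, while the `Driver` and `AtaCore` classes, `ide_unregister_driver`, and the media-type matching rule are assumptions made for illustration.

```python
class Driver:
    """Stand-in for a high-level driver such as ide-cd or ide-disk."""

    def __init__(self, name, media):
        self.name = name
        self.media = media
        self.drives = []

    def reinit(self, drive):
        """Claim the drive if this driver handles its media type."""
        if drive["media"] != self.media:
            return False
        self.drives.append(drive)
        return True


class AtaCore:
    def __init__(self):
        self.drivers = []      # registered high-level drivers
        self.ata_unused = []   # drives nobody has claimed

    def ata_attach(self, drive):
        """Offer the drive to each driver; park it on ata_unused if none claims it."""
        for driver in self.drivers:
            if driver.reinit(drive):
                return driver
        self.ata_unused.append(drive)
        return None

    def ide_register_driver(self, driver):
        """Add the driver, then give it a shot at every unclaimed drive."""
        self.drivers.append(driver)
        pending, self.ata_unused = self.ata_unused, []
        for drive in pending:
            self.ata_attach(drive)

    def ide_unregister_driver(self, driver):
        """Remove the driver and re-offer its drives to the remaining drivers."""
        self.drivers.remove(driver)
        released, driver.drives = driver.drives, []   # stands in for ->cleanup()
        for drive in released:
            self.ata_attach(drive)
```

In this model a driver only ever sees a drive through `ata_attach()`, which mirrors the single ordering rule stated above: the probing code decides when a drive is safe for IO, and calls `ata_attach()` only then.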

Randy Hron reported no more breakage using this patch-set, and Andre added, "Yep, that has been verified and there are more extensions needed to bring up support for all archs. I will send them to Al and Alan first and post them here too shortly I hope."

Elsewhere, Mikael Pettersson reported that 2.5.32 wouldn't give a keyboard beep with CONFIG_KEYBOARD_ATKBD=y. Vojtech Pavlik replied, "2.5.32 still has quite complex input core config options - sorry, my fault, and I'll fix it soon. You have to enable CONFIG_INPUT_MISC and CONFIG_INPUT_PCSPKR." Mikael confirmed that this worked, but Gerhard Mack asked, "That begs the question: How do I input using the PC speaker ?" And Jos Hulzink said, "Easy :) A speaker is also a microphone...2.5.32 will go into the history books as the kernel that implemented voice recognition for all AT class computers..."

7. Some IDE Developer Interaction

27 Aug 2002 - 30 Aug 2002 (19 posts) Subject: "ide-2.4.20-pre4-ac2.patch"

Topics: Disks: IDE, PCI

People: Andre Hedrick, Alan Cox, Joel Becker, Roman Zippel, Alexander Viro, Jeff Garzik

Andre Hedrick announced:

This is out and has been forwarded to AC for review.

Joel Becker
Nick Bellinger
Alan Cox
Peter Denison
Jeff Garzik
Benjamin Herrensch
Roman Zippel
Alexander Viro

Others helped with ideas and concepts.

This should work on all archs in IDE, there may be other issues which causes compile failures but should not be related to IDE.

This shall have something in it soon, as I am reviewing the pieces to pick up and play catch up in 2.5 learning curve as to beat the Halloween DEAD-DATE.

Alan Cox replied:

Rejected. I found several errors, a couple of strange reverts and some files being moved to clearly wrong places. It also mixes up multiple changes.

Andre to make this work I need

For example I've got files you moved and changed, looking at that in diff is a right pita. I've got a big diff with errors in it (eg gayle in ppc) I can't easily be sure I can cleanly drop parts of.

Lets start with the file moving. Send me a diff for the Config/Makefile and a list of the files to move and where. Gayle I think should be m68k not ppc (actually I'm pretty sure), CMD640 is PCI so why file it in legacy. "legacy" I took to mean pre PCI rather than "I think its junk" 8)

Andre said:

Deal, undoing the moves.

Parsing out all the submitted stuff first for send. Then the breakdown of the rest.

Gemme a bit to catch on to your request, Viro is trying to teach the ways of mad patcher and not the patch bomber.

8. ATI Framebuffer Problems In 2.5.31

27 Aug 2002 - 30 Aug 2002 (5 posts) Subject: "still ati fb errors with 2.5.31, thought patch applied"

Topics: Framebuffer

People: Paul Mackerras, James Simmons

Clemens Schwaighofer reported compile-time errors when compiling 2.5.31, when the compile reached the aty128fb driver. James Simmons replied that this driver had not been ported to the new API, and would need to be ported before it would work again. Paul Mackerras said he'd sent in a patch that should have been included already, but then replied to himself, "But of course those error messages were *with* my patch. I just cross-compiled a kernel for i386 and got the same errors. Here is a patch to go on top of my other patch which should fix things, though I haven't tried running it on an x86 box yet." Clemens replied that Paul's new patch allowed the kernel to compile and boot, but had other problems such as strange font colors. The discussion ended there though.

9. Anycast Support For IPv6

28 Aug 2002 - 30 Aug 2002 (4 posts) Subject: "[PATCH] anycast support for IPv6, linux-2.5.31"

Topics: Capabilities, Networking, SMP

People: David Stevens, Pekka Savola

David Stevens of IBM announced:

Below is a patch relative to the mainline 2.5.31 code for an implementation of anycast support for IPv6. This code was submitted and accepted in the USAGI tree last Fall. Below is a high-level description of the implementation:

  1. The API:

    Although the RFC's liken anycasting to ordinary unicasting, I think it's more appropriate to tie it closely to particular applications, so I've chosen an API similar to multicasting. So, rather than having a permanent anycast address associated with the machine, particular applications that use anycasting can join or leave "anycast groups," and the machine will recognize the anycast addresses as its own when one or more applications have joined the group.

    So, for example, someone using anycasting for DNS high availability can add a join to the anycast group in the server and as long as the DNS server is running, the machine will answer to that anycast address. But the machine will not respond to anycasts when the service that's using it isn't available, so a broken server application that has exited won't deny that service if there are other working members of the anycast group on other hosts.

    I don't know if that's controversial or not-- the RFC's are written more from the external context, but seem to imply a model along the lines of using "ifconfig" to add anycast addresses. I think that model doesn't fit the best uses of anycasting, but I'd like to hear your thoughts on it.

    The application interface for joining and leaving anycast groups is 2 new setsockopt() calls: IPV6_JOIN_ANYCAST and IPV6_LEAVE_ANYCAST. The arguments are the same as the corresponding multicast operations. The kernel keeps a reference count of members; when that goes to zero, the anycast address is not recognized as a local address. While nonzero, the host listens on the solicited node for that address, sends advertisements in response to solicitations (with override=0) and delivers packets sent to the anycast address to upper layers.

    There's also an in-kernel interface described below, which is used by IPv6 mobility, for example.

  2. Security Model:

    RFC 2373 states:

    "An anycast address must not be assigned to an IPv6 host, that is, it may be assigned to an IPv6 router only."

    This patch violates this in 1 special case, and I'll explain why.

    a) The restriction on host use of anycast is to avoid carrying individual host routes for anycast addresses spread out among multiple physical networks. I think the initial application sets are exactly things that won't be on off-the-shelf routers (high availability servers (DNS, http, etc) and mobile IPv6) and the particular cases don't have the problem of requiring host routes or participation in the routing system. They use anycast addresses with a prefix common to a unicast address on the system, so ordinary routing gets you to the right network, anyway, and there's no external penalty on the routing system for using those types of anycast addresses. For that reason, I allow anycast addresses that match an existing unicast prefix even on hosts.

    Finally (for security considerations), I had to choose whether anycast should require root privilege or not. Multicasting does not, but it'd obviously be a spoofing issue if an application joined an "anycast" that was actually the unicast address of another machine on that network. On the other hand, it's handy for non-root users to be able to make use of anycasting where that use doesn't pose any security risks.

    The code below allows non-root users to join anycast groups that have matching prefixes (don't require special-route propagation) with existing unicast addresses, and require root (really "CAP_NET_ADMIN") and a router for off-link anycasts (disallowed completely on hosts). I think that should be extended to require CAP_NET_ADMIN for any anycasts (even on-link ones) that are not well-known anycasts (to avoid the spoofing of on-link unicast addresses).

  3. The Implementation:

    The code maintains a list of anycast addresses that are in use for a given interface. The code is a modified version of the existing multicast code, with some things cleaned up, and operations on the anycast list instead of the multicast list. Because the anycast address list is separate from the ordinary address list, anycast addresses in general won't be selected as a source address, or available for inappropriate uses. Protocols (like ICMP ECHO) that respond by swapping the source and destination address have a separate check for anycasts and set the source to zero in that case-- allows IPv6 to choose the outbound source address.

    The code has the setsockopt() interface for joining and leaving anycast groups, but does not yet have changes needed for UDP and TCP to work with them. TCP is problematic, because the PCB lookup mechanism relies on the destination address which must change-- it should be disallowed initially. UDP may work with an INADDR_ANY-bound listener, but I haven't made changes to support it yet. It will probably use the anycast address as the source, so it'll need a modification similar to what I've done with ICMP, but should be straightforward. Ultimately, I think we want to allow binding to anycast addresses as well.

    Our immediate application is mobile IPv6, so this patch doesn't include any of the upper-layer changes that may be needed for general application support.

    For in-kernel use, applications (like mobile IPv6) can call join and drop functions for anycast addresses, and a function that checks if a device is in an anycast group (if dev == 0, checks if any device is in that group).

    They are (similar to multicast functions):

    int ipv6_dev_ac_inc(struct net_device *dev, struct in6_addr *addr)
    - add "addr" as an anycast address on "dev"
    int ipv6_dev_ac_dec(struct net_device *dev, struct in6_addr *addr)
    - remove "addr" as an anycast address on "dev"

    these use reference counts, so only the first call to "inc" for a particular address will add a new address, and only when all references are removed via "dec" will the address be removed as a local address.

    The function:

    int ipv6_chk_acast_addr(struct net_device *dev, struct in6_addr *addr)

    returns true if "addr" is an anycast address on "dev", false otherwise. If "dev" is 0, it searches all devices for "addr".

    Those 3 functions provide the in-kernel interface.
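The reference-counting behaviour of these three functions can be modeled in a few lines. The sketch below is Python, not the kernel implementation; the class `AnycastModel` and its method names are hypothetical and only mirror the semantics described above (first "inc" adds, last "dec" removes, a missing device means "search all devices").

```python
class AnycastModel:
    """Userspace model of per-device, reference-counted anycast lists."""

    def __init__(self):
        self._refs = {}  # device name -> {address: reference count}

    def inc(self, dev, addr):
        """Take a reference; return True if the address was newly added."""
        refs = self._refs.setdefault(dev, {})
        refs[addr] = refs.get(addr, 0) + 1
        return refs[addr] == 1

    def dec(self, dev, addr):
        """Drop a reference; return True if the address was removed."""
        refs = self._refs.get(dev, {})
        if addr not in refs:
            raise KeyError("%s is not an anycast address on %s" % (addr, dev))
        refs[addr] -= 1
        if refs[addr] == 0:
            del refs[addr]
            return True
        return False

    def chk(self, dev, addr):
        """True if addr is anycast on dev; dev=None searches all devices."""
        if dev is None:
            return any(addr in refs for refs in self._refs.values())
        return addr in self._refs.get(dev, {})
```

Two joins followed by one drop leave the address in place; only the second drop removes it.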

  4. Things of Note:

    I think we want ipv6_addr_type() to check *only* the well-known anycasts, since it seems inappropriate to me that that function should be searching linked lists of anycast addresses. It would also need a "dev" argument it doesn't have now, since anycast addresses, like unicast and multicast addresses, in this implementation are associated with particular devices. Use of those addresses on other devices should not return type ANYCAST, but should for the device that has the anycast address. So, in most cases, ipv6_chk_acast_addr() and not ipv6_addr_type() will be more appropriate.

    ipv6_addr_type(), with modifications included for reserved anycast addresses, will still be useful for cases where the address is known to *always* be an anycast (for example, disallowing reserved anycasts through "ifconfig" being set as an ordinary address), but for the lower-level code, it'll usually need a per-device check. So, I recommend we keep both, and use ipv6_chk_acast_addr() to answer if it is a configured anycast address, use ipv6_addr_type() to answer if the address is reserved for anycast (whether configured or not).

    That's what this code does.

  5. Testing:

    I wrote programs to join and leave anycast groups and I checked through the /proc/net interface (file "anycast6") the presence of the groups. I've used network sniffers to watch the neighbor discovery sequence and verify the override bit is cleared, and I've tested with multiple hosts in the anycast group talking to an unmodified host that pings the anycast address. I also verified that the existing code handles "override=0" correctly (it does).

    In addition, our mobile IPv6 team has used the code to test the use of anycasting for Dynamic Home Agent address discovery, with several different topologies and configurations.

    We've done tests with uniprocessor and SMP kernels on multiprocessor machines.

  6. TODO:

    I think the next steps are to flesh out the UDP part so ordinary user-level applications can make full use of anycasting.

Pekka Savola suggested first writing an Internet Draft describing the proposed API, and David replied:

I don't disagree with that, for informational purposes, but it doesn't conflict with the RFC's, which of course don't cover API's, and don't specify any interface for anycasting.

However, my primary goal is to get anycasting support with an in-kernel interface in 2.5 before the freeze. :-) I used the setsockopt() API for testing, and left it in the patch for others to do the same. Though I think it's the right approach, for the reasons I mentioned, I'd rather see that portion pulled from the patch if it's controversial, than have the in-kernel interface and anycasting proper delayed over that.

The one use of anycast I'm aware of right now is for IPv6 mobility, which needs the in-kernel interface. The user-level interface is important for future applications, and a reference-counted setsockopt() interface doesn't mean we can't also have an ip/ifconfig interface for permanent anycast addresses, too (the required anycast addresses in this patch are permanent, for example). So I don't see it as committing to one choice, but having in-kernel anycast support (soon) I think is the more important first step.

10. Prefix List Support For IPv6

28 Aug 2002 - 29 Aug 2002 (4 posts) Subject: "[PATCH] IPv6 Prefix List support for 2.5.31"

Topics: Networking

People: Krishna Kumar

Krishna Kumar announced:

This patch implements Prefix List support in IPv6. The reasons for the patch are :

This code has both been tested within IPv6 and with Mobile IPv6. It has also been integrated into the USAGI kernel.

11. Keyboard-Activated Screen Blanking

28 Aug 2002 - 29 Aug 2002 (2 posts) Subject: "Blank now key"

People: Pavel Machek

Pavel Machek posted a patch and explained, "Being able to "blank now" is very important for handheld devices (where screen can eat more than 50% of total power), and it is just nice everywhere else (also saves a little power). Please apply." Someone else liked the patch and said it would also be good for systems that were usually up with the monitor powered down all the time.

12. Status Of i845mp Chipset Support In 2.4

28 Aug 2002 - 29 Aug 2002 (2 posts) Subject: "i845mp support: 82845 (Brookdale) 82801BAM/CAM"

Topics: PCI

People: Alan Cox

Andreas Kerl asked when support for i845mp would be adopted into the kernel, and Alan Cox replied, "Eventually. I have to get Marcelo all the pci updates and a couple of pci bug fixes before I can feed him the pci_enable_bars ide fix. He has some of the bits now."

13. Status Of e1000 In 2.4

29 Aug 2002 (5 posts) Subject: "e1000 in 2.4?"

People: David S. Miller, Roy Sigurd Karlsbakk

Roy Sigurd Karlsbakk asked if anyone was working on porting the Intel e1000 driver from 2.5 back to 2.4, and David S. Miller replied that the code was already in the 2.4.20-pre tree; and Adriano Galano gave a link to the Sourceforge project page.

14. VM Regress 0.7 Released

29 Aug 2002 (1 post) Subject: "VM Regress 0.7"

Topics: SMP, Virtual Memory

People: Mel Gorman

Mel Gorman announced:

Project page:

This is the fourth release of VM Regress. It is a regression, benchmarking and test tool for the Linux VM in its early stages. This will be the last release for a while as funding in my University is a bit tight these days. Mine is due to run out in a few months and I have to re-prioritise what I'm doing unfortunately. I hope to get back working on this once I have secured external funding to work full time on VM management in general and Linux in particular.

This release has at least one major bug fix. It would have been triggered by an SMP machine running an alloc or page faulting validation test. I haven't heard any reports and I haven't triggered it myself but I'm pretty sure it would deadlock. The project will now compile cleanly against late 2.5.32 which is the first 2.5 kernel since 2.5.28 it compiled against. I haven't managed to test with a 2.5.x kernel but there is no reason it shouldn't work. I'd be interested in hearing any success/failure stories with 2.5.x

Perl scripts are now provided to run each test and benchmark, produce a report, graph vmstat output etc so running tests is a lot easier. It will load/unload modules as necessary to run the test. This reduces a lot of the drudge work involved with setting up a test. There are also scripts available for replotting graphs to a given scale so comparing vm's is a bit easier. man pages and online help are available for each of them.

The mmap module will now run read or write benchmarks on either anonymous or file backed maps. It produces graphs showing age of pages, reference counts, page present/swapped, what pages were referenced over time, vmstat output and some timing information. It still doesn't do statistical analysis but that was in the works for 0.9. The data files are all preserved as .data files so any stats tool that can import space separated files can be used. Links to sample test output are on the webpage.

If I get back working on this, 0.8 will have the simulated webserver originally outlined by Rik Van Riel. Most of what is needed is already there with the mmap module uses.

Documentation is reasonably up to date and provided with the package. If people have suggestions or reports, send them on and I'll add them to the ToDo list. I'll continue to work on this periodically.

Full changelog for 0.7

Version 0.7

15. Porting Sound Drivers To New Locking System

29 Aug 2002 - 30 Aug 2002 (43 posts) Subject: "1/41 sound/oss/maestro3.c - convert cli to spinlocks"

Topics: SMP, Sound: OSS

People: Alan Cox, Tomas Szepe

Someone submitted a large number of patches against 2.5.32, converting almost all remaining OSS sound drivers to use spin_lock_irqsave() and other functions instead of the outdated cli() under SMP. Tomas Szepe pointed out some formatting inconsistencies in the patches, but Alan Cox said, "When you've ported that much code to a new locking mechanism then you can moan. If he wants to take on the old OSS code and making it work in the 2.5 universe as far as I (as the ex OSS code maintainer) am concerned he can format it how he likes."

16. Linux 2.4.20-pre5-ac1 Released

30 Aug 2002 - 1 Sep 2002 (4 posts) Subject: "Linux 2.4.20-pre5-ac1"

Topics: Disks: IDE, I2C, PCI, Sound: i810

People: Alan Cox, Bjorn Helgaas, Manfred Spraul, Andreas Schwab, Steven Cole, Linus Torvalds

Alan Cox announced:

[+ indicates stuff that went to Marcelo, o stuff that has not, * indicates stuff that is merged in mainstream now, X stuff that proved bad and was dropped out, - indicates stuff not relevant to the main tree]

Resync and collect up the main stuff. The IDE stuff Andre sent me isn't in - it's going back for another debug phase before it's considered. Caution is still advised with the IDE and ide-scsi is known to cause crashes.

Linux 2.4.20-pre5-ac1

        Resync with 2.4.20pre5
o       Fix IDE compile                                 (me)
o       Update defconfig                                (Niels Jensen)
o       Various warning fixes                           (Niels Jensen)  
+       Remove epat debug printk that escaped           (Moritz Barsnick) 
o       Fix PPC build for pre4-ac                       (Ben Herrenschmidt)
o       Fix hang in Matrox DRM                          (Jonny Strom)
o       Backport 2.5 LDT allocation improvements        (Manfred Spraul)
+       Lp tidy and printk levels               (Lucas Correia Villa Real)
o       Update yenta region size patch                  (Manfred Spraul)
+       Fix an i2c bus leak on the acorn pcf8583        (Silvio Cesare)
+       Fix e100 phy build                              (Linus Torvalds)
o       Further i810 audio updates                      (Juergen Sawinski)
+       Tidy ver_linux output with gcc 3.x              (Steven Cole)
o       ppp_generic fixes for building on boxes         (Bjorn Helgaas)
        with out* as macros
o       pdc4030 updates                                 (Peter Denison)   
+       Forte sound driver updates                      (Martin Petersen)
o       Fix AMD7441 PCI ID error
o       Tighten asm-ia64 io macros                      (Andreas Schwab)

17. Configuration Requirements For USB Mice

30 Aug 2002 (2 posts) Subject: "USB mouse in 2.4.19-pre4 vs later"

Topics: USB

People: Tim Habermann, Mario Mikocevic

Mario Mikocevic reported that his USB mouse stopped working in X Windows after he upgraded beyond 2.4.19-pre4 or -pre5. Tim Habermann replied, "I had the same issue migrating from 2.4.18 to plain 2.4.19. Activating the new option "HID input layer support" under USB support fixed this."

18. Linux 2.2.22-rc2 Released

30 Aug 2002 - 31 Aug 2002 (3 posts) Subject: "Linux 2.2.22rc2"


People: Alan Cox, Tomas Szepe, Julian Anastasov, Keith Owens

Alan Cox announced:

This is going straight to rc1 because it contains a lot of security fixes for local security problems found by Silvio's audit, Solar Designer and a couple of other folks. The other stuff is minor and is the entire 2.2 pending queue anyway.

Special thanks go to Openwall who did pretty much all of the security backporting work. This is mostly their kernel update not mine.

o       Fix isofs over loopback problems                (Balazs Takacs)
o       Backport 2.4 shutdown/reset SIGIO from 2.4      (Julian Anastasov)
o       Fix error reporting in OOM cases                (Julian Anastasov)
o       List a 2.2 maintainer in MAINTAINERS            (Keith Owens)
o       Set atime on AF_UNIX sockets                    (Solar Designer)
o       Restore SPARC MD boot configuration             (Tomas Szepe)
o       Multiple further sign/overflow fixes            (Solar Designer)
o       Fix ov511 'vfree in interrupt'                  (Mark McClelland)

Krzysiek Taraszka liked the patch but reported that PowerPC was still broken. He offered to send some quick but ugly patches while he worked on the better solution.

19. Various ARM Patches

30 Aug 2002 (1 post) Subject: "Patches..."

Topics: PCI

People: Russell King

Russell King announced:

I'm about to send out 8 patches:

These are patches that are in the ARM tree, and I consider them to be useful to others, bug fixes or compilation fixes that have been collected. All the above have been found not to be in 2.5.32.

Where applicable, they're copied to maintainers or Rusty's trivial patch address. However, if people want to pick off any of these patches and integrate them into their trees, and eventually push them towards Linus, that's fine by me.

Any that aren't picked up will be re-mailed at some point in the future (seems like its about once every 3 weeks to a month at the moment.)

In subsequent posts he explained that the keyboard patch was to handle the extra '#' key on the ARM numeric keypad. And the rdunzip patch ensured that the kernel reported failures when unzipping ramdisks.

20. PCI Ops Cleanups For 2.5.32

30 Aug 2002 (17 posts) Subject: "[BK PATCH] PCI ops cleanups for 2.5.32-bk"

Topics: FS: driverfs, Hot-Plugging, PCI, Version Control

People: Greg KH, Hanna Linder, David Brownell

Greg KH announced:

Here are the pci_ops cleanups that were discussed on lkml last week. It removes a lot of code from the arch specific implementation of pci_*_config* functions, and removes lots of code from the pci_hotplug core (yes, the pci_hotplug code is still broken, I'm working on that next...)

These patches include fixups for almost all of the different architecture specific code. I have a number of patches that I will send to some of the arch maintainers directly, that are not included in this bk tree.

I would like to thank Matt Dobson and Hanna Linder for doing lots of this work.

This series also includes a driverfs pci pool patch from David Brownell (as long as we are making pci changes...)

Pull from:

21. New VFS inode Cache Lookup Function

31 Aug 2002 - 3 Sep 2002 (9 posts) Subject: "[BK-PATCH-2.5] Introduce new VFS inode cache lookup function"

Topics: FS: NTFS, FS: ReiserFS, Version Control, Virtual Memory

People: Anton Altaparmakov, David Woodhouse, Nikita Danilov

Anton Altaparmakov announced:

The below ChangeSet against Linus' current BK tree adds a new function to the VFS, fs/inode.c::ilookup().

This is needed in NTFS when writing out inode metadata pages via the VM dirty page code paths as we need to know whether there is an active inode in icache but we don't want to do an iget() because if the inode is not active then there is no need to write it... - I can just skip onto the next one instead... - If there is an active inode then I need to get the struct inode in order to perform appropriate locking for the write out to happen.

If there is something you don't like about this patch please let me know what it is, preferably with what you want instead so I can modify it...

Without such icache lookup functionality it is impossible to write inodes via the VM page dirty code paths AFAICS. - The only alternative I can see is to duplicate the whole icache private to NTFS so that I can perform the lookup internally but I think that is silly considering the VFS already keeps the inode cache...
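The difference between the two lookups can be shown with a toy model. This is a Python sketch of the idea only, not the VFS code: `InodeCache`, `write_out`, and the `read_inode` callback are hypothetical, and locking is omitted. The point is that a lookup-only call lets the write-out path skip inodes that are not active instead of instantiating them.

```python
class InodeCache:
    """Toy inode cache keyed by inode number (hypothetical model)."""

    def __init__(self, read_inode):
        self._read_inode = read_inode  # called on a cache miss by iget()
        self._active = {}

    def iget(self, ino):
        """Return the inode, reading it in first if it is not cached."""
        if ino not in self._active:
            self._active[ino] = self._read_inode(ino)
        return self._active[ino]

    def ilookup(self, ino):
        """Return the inode only if it is already cached, else None."""
        return self._active.get(ino)


def write_out(cache, dirty_inos):
    """Write back only inodes that are active; skip the rest."""
    written = []
    for ino in dirty_inos:
        inode = cache.ilookup(ino)
        if inode is None:
            continue  # not active: nothing to write, move to the next one
        # A real implementation would lock the inode here before writing.
        written.append(ino)
    return written
```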

David Woodhouse added that JFFS2 also needed this, and Nikita Danilov said the same of ReiserFS.

22. Watchdogging Out-Of-Filehandle Conditions

31 Aug 2002 (4 posts) Subject: "[patch 2.4.19] reboot on out-of-file handles"

Topics: FS: sysfs

People: Dr. David Alan Gilbert, Alan Cox

Dr. David Alan Gilbert announced:

Please find below a patch that adds the ability to panic if you run out of file handles (by setting /proc/sys/fs/file-max-panic to non-zero). When combined with reboot-on-panic this means that a server might be able to get out of a service-gone-mad situation. It calls show_state before panic'ing to log what was going on - adding something similar which listed open filehandles would probably be advantageous.

The patch is against a clean 2.4.19 and was tested on sparc64.

Alan Cox replied that this was already possible in user-space as part of watchdog daemon processing. David agreed this was best, though a user-space solution seemed to preclude logging the state of the system at the time of failure. But Alan said, "If your daemon keeps a few open handles to reuse and the log file it can maybe do that when it spots the problem occurs and isnt bumping the watchdog."
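A user-space check of the kind Alan describes might look roughly like the sketch below. It is written in Python as an illustration and makes assumptions the thread does not state: that the daemon reads /proc/sys/fs/file-nr (three numbers: allocated handles, unused handles, maximum), and that a fixed headroom is a sensible trigger. The function names and the headroom value are hypothetical.

```python
def parse_file_nr(text):
    """Parse the contents of /proc/sys/fs/file-nr into three integers."""
    allocated, unused, maximum = (int(field) for field in text.split()[:3])
    return allocated, unused, maximum


def handles_nearly_exhausted(text, headroom=64):
    """True when fewer than `headroom` file handles remain (assumed policy)."""
    allocated, unused, maximum = parse_file_nr(text)
    in_use = allocated - unused
    return maximum - in_use <= headroom
```

A watchdog daemon could call this on each pass with the file's contents, log the system state through descriptors it opened at startup, and stop bumping the watchdog when the check returns true.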

23. Syscalltrack 0.74 Released

31 Aug 2002 (1 post) Subject: "ANN: syscalltrack 0.74, "Hyperactive Iguana" released"

Topics: FS: sysfs, SMP, User-Mode Linux

People: Muli Ben-YehudaMuli

Muli Ben-Yehuda announced:

syscalltrack-0.74, the 10th _alpha_ release of the Linux kernel system call tracker, is now available. syscalltrack supports version 2.4.x of the Linux kernel on the i386 and UML architectures. 2.5.x kernel versions should work as well, but did not receive the same extensive testing. Kernel 2.2.x is NOT supported in this release, due to technical difficulties. This release contains support for almost all system calls - more than 100 have been added since the last release.

Happy hacking and tracking!

New in version 0.74, "Hyperactive Iguana"

24. Linux 2.5.33 Released

31 Aug 2002 (4 posts) Subject: "Linux v2.5.33"

Topics: Disks: IDE, FS: JFS, FS: NTFS, Networking, USB

People: Linus Torvalds

Linus Torvalds announced Linux 2.5.33 (see the ChangeLog), saying:

There's a fair amount of stuff in here again, but I'd personally like to have people who actually use that d*ng floppy driver please test it out. I finally broke down and tried to fix it, since it's been broken in 2.5.x for longer than most people care to remember.

I don't even have floppies to test with, I just verified that I could read two old backup disks, and one seemed fine, and the other read 90% of the thing, which was a lot more than I expected since they are both at least five years old. I've never had good luck with those unreliable 3.5" things, I'd rather have as little to do with them as possible.

Anyway, apart from floppies, this has the IDE organizational cleanups by Al, another merge with Andrew, and some new networking stuff (TCP segmentation offload onto network cards, and initial cut of SCTP support).

And NTFS, JFS, and of course USB updates. Oh, and some of the keyboard input stuff should fix some random breakage in the input switchover.

25. NTFS 2.1.0a For Linux 2.4.19 And 2.4.20-pre-BK

1 Sep 2002 (1 post) Subject: "ANN: NTFS 2.1.0a for Linux 2.4.19 and 2.4.20-pre-BK"

Topics: FS: NTFS, FS: ext3, Version Control

People: Anton Altaparmakov, Richard Russon

Anton Altaparmakov announced:

The new NTFS driver 2.1.0(a) is now available for kernel 2.4.19. NTFS 2.1.0(a) implements the first steps towards file overwrite support.

Full and incremental patches are available from the Linux NTFS download page:
and from the Sourceforge project page (also older patches here):

If you use bitkeeper, you can get NTFS 2.1.0a by pulling from our bitkeeper repository (note this is based on Marcelo's current bitkeeper tree so it is at 2.4.20-pre5 at the moment and will move forward as Marcelo's repository moves forward):

The current code is relatively well tested both for mmap(2) and write(2) both using existing applications to randomly write to files and using custom programs to do specialized writes to test boundary conditions.

Still the code has only been run on two machines, so people trying it, please have backups! I am confident it won't eat your data, but I am not willing to guarantee it! I have put in an appropriately very scary config help message to scare off the casual user for the moment...

Features of NTFS 2.1.0(a)

It is now possible to write over existing files both with mmap(2) and write(2).

It is now possible to setup a loopback on an NTFS file and then you have full read/write access to the loopback device. You can create a Linux fs on the loop device for example and mount it.

This has been a much requested feature because it allows installation of Linux on an NTFS partition using the loopback trick, i.e. from windows one creates a large file on NTFS, then one boots Linux (from installation CD, rescue floppies or whatever) and as root does:

mount -t ntfs -o rw /dev/hda1 /mnt/ntfs
losetup /dev/loop0 /mnt/ntfs/some_dir/preprepared_large_file
mke2fs -j /dev/loop0
mount -t ext3 /dev/loop0 /mnt/new_root
mkdir /mnt/new_root/old_root
<install Linux into /mnt/new_root>
umount /mnt/new_root
losetup -d /dev/loop0
umount /mnt/ntfs

From now on, you can boot Linux and using a minimal ramdisk loaded via floppy for example, one just needs to have something similar to the following done:

mount -t ntfs -o rw /dev/hda1 /mnt/ntfs
mount -t ext3 -o loop /mnt/ntfs/some_dir/preprepared_large_file /mnt/new_root
cd /mnt/new_root
pivot_root . old_root
exec chroot . sh <dev/console >dev/console 2>&1
umount /old_root

[Note you probably cannot umount /old_root but it doesn't matter. It doesn't disturb anyone... You could always hide it inside root/old_root or something so users don't see it.]

I haven't actually tried to install Linux in the above way but Richard Russon (flatcap) tested the loopback/mke2fs/read-write stuff and it worked fine for him.

Limitations of NTFS 2.1.0(a) overwrite abilities

Anyone who tries this new code please let me know how you get on...

26. uCLinux Update

1 Sep 2002 (1 post) Subject: "[PATCH]: linux-2.5.33uc0 (MMU-less support)"

People: Greg Ungerer

Greg Ungerer announced:

The latest MMU-less patch set is up at:

Mostly changes to support 2.5.33. Much cleaning up of the old cli/sti (now gone).

27. New Kernel Debugging Code For x86

1 Sep 2002 (1 post) Subject: "[PATCH] kprobes for 2.5.33"

People: Rusty Russell

Rusty Russell announced, "This patch allows trapping at almost any kernel address, useful for various kernel-hacking tasks, and building on for more infrastructure. This patch is x86 only, but other archs can add support as required."

28. IPv4 Route Cache Lookup Enhancements

2 Sep 2002 (1 post) Subject: "[BKPATCH] lockfree rtcache lookup using RCU"

Topics: Networking, Real-Time, Version Control

People: Dipankar Sarma

Dipankar Sarma announced:

The lockfree ipv4 route cache lookup code is now available for pulling from -

It speeds up route cache lookup by 30-50% approximately as measured with synthetic benchmark suggested by Dave.

The entire discussion is in -

Changes from Linus' tree are -

ChangeSet@1.500, 2002-09-02 12:34:24+05:30,
Implemented lockfree lookup of routes from ipv4 route cache using RCU.

ChangeSet@1.499, 2002-09-02 11:54:55+05:30,
Add a lightweight read barrier (read_barrier_depends()) for data dependent reads. Also, add more explicit versions of barrier names like read_barrier()/write_barrier().

ChangeSet@1.498, 2002-08-27 00:27:13+05:30,
Add RCU infrastructure based on rcu_poll in -aa kernels with support for preemption and per-CPU queues.

29. Adding XFS To 2.5

3 Sep 2002 (1 post) Subject: "[PATCH] XFS filesystem support"

Topics: FS: XFS, Version Control

People: Christoph Hellwig

Christoph Hellwig announced:

The following patch adds the Config/Makefile/Documentation infrastructure for XFS to the current BK tree (or 2.5.33). The actual XFS code is too big to be posted as a patch to lkml so I've uploaded a tarball at


that can be just unpacked in the toplevel kernel source directory. The patch and the unpacked tarball give a fully functional XFS filesystem driver, but no additional features that would require VFS changes.

This XFS code is very different from the latest official release (XFS 1.1). Namely it uses the generic I/O path, has lots of dead code removed that was needed in IRIX but superseded by VFS checks in Linux (e.g. the famous rename checks).

30. x86 BIOS Enhanced Disk Device (EDD) Polling

3 Sep 2002 - 4 Sep 2002 (7 posts) Subject: "[RFC][PATCH] x86 BIOS Enhanced Disk Device (EDD) polling"

Topics: Disk Arrays: RAID, Disks: IDE, Disks: SCSI, FS: driverfs, I2O, Ioctls, PCI, Serial ATA, USB, Version Control

People: Matt Domsch, Andre Hedrick

Matt Domsch said:

x86 systems suffer from a disconnect between what BIOS believes is the boot disk, and what Linux thinks BIOS thinks is the boot disk. BIOS Enhanced Disk Device Services (EDD) 3.0 provides the ability for disk adapter BIOSs to tell the OS what it believes is the boot disk. While this isn't widely implemented in BIOSs yet (thus shouldn't be completely trusted), it's time that Linux received support to be ready as BIOSs with this feature do become available.

EDD works by providing the bus (PCI, PCI-X, ISA, InfiniBand, PCI Express, or HyperTransport) location (e.g. PCI 02:01.0) and interface (ATAPI, ATA, SCSI, USB, 1394, FibreChannel, I2O, RAID, SATA) location (e.g. SCSI ID 5 LUN 0) information for each BIOS int13 device. The patch below creates CONFIG_EDD, that when defined, makes the calls to retrieve and store this information. It then exports it (yes, another /proc, glad to change it to driverfs or whatever else when that makes sense) in /proc/edd/{bios-device-number}, as such:

# ls /proc/edd/
80 81 82 83 84 85

# cat /proc/edd/80
host_bus_type: PCI 02:01.0 channel: 0
interface_type: SCSI id: 0 lun: 0

Warning: Spec violation. Key should be 0xBEDD, is 0xDDBE
Warning: Spec violation. Padding should be 0x20, is 0x00

# cat /proc/edd/81
host_bus_type: PCI 04:00.0 channel: 0
interface_type: SCSI id: 0 lun: 0

Warning: Spec violation. Device Path checksum invalid (0x4b should be 0x00).

In the above case, BIOS int13 device 80 (the boot disk) believes it is on PCI 02:01.0, SCSI bus 0, ID 0 LUN 0 (in this case it's an Adaptec 39160 add-in card). Likewise, device 81 believes it's at PCI 04:00.0, channel 0, ID 0, LUN 0 (a Dell PERC3/QC card). In both cases the BIOS vendors have some cleanup work to do, so I warn when they don't adhere to the spec.
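The two kinds of warning in the sample output can be reproduced with a short sketch. This is a Python illustration, not the patch's code: it assumes the device path checksum rule is "all bytes sum to zero modulo 256", which is consistent with the "0x4b should be 0x00" message above, and the function names are hypothetical.

```python
EXPECTED_KEY = 0xBEDD  # key value the warnings above say the spec requires


def swap16(value):
    """Swap the two bytes of a 16-bit value."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)


def check_device_path(key, path_bytes):
    """Return a list of warning strings for one BIOS device (sketch)."""
    warnings = []
    if key != EXPECTED_KEY:
        note = " (byte-swapped)" if swap16(key) == EXPECTED_KEY else ""
        warnings.append(
            "Key should be 0x%04X, is 0x%04X%s" % (EXPECTED_KEY, key, note)
        )
    # Assumed rule: the bytes of the device path sum to 0 modulo 256.
    total = sum(path_bytes) & 0xFF
    if total != 0:
        warnings.append(
            "Device Path checksum invalid (0x%02x should be 0x00)" % total
        )
    return warnings
```

The key reported for device 80 (0xDDBE) is simply 0xBEDD with its bytes in the wrong order, which the sketch flags explicitly.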

It's possible to query device drivers from user-space (either via a SCSI ioctl, or IDE /proc/ide/*/config), to compare results to determine which disk is the boot disk.
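A user-space tool doing that comparison would first need to read the /proc/edd entries. The following Python sketch parses the format shown in the sample output above; the function name `parse_edd` and the dictionary keys are hypothetical, and the format is taken only from that sample, so other bus or interface types may need different handling.

```python
import re

# Matches the two record lines in the sample /proc/edd output.
_LINE = re.compile(r"^(host_bus_type|interface_type):\s+(\S+)\s+(.*)$")


def parse_edd(text):
    """Parse one /proc/edd/<device> file into a dict (sketch)."""
    info = {"warnings": []}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Warning:"):
            info["warnings"].append(line[len("Warning:"):].strip())
            continue
        match = _LINE.match(line)
        if not match:
            continue
        key, kind, rest = match.groups()
        info[key] = kind  # e.g. "PCI" or "SCSI"
        tokens = rest.split()
        if key == "host_bus_type":
            info["bus_location"] = tokens[0]  # e.g. "02:01.0"
            tokens = tokens[1:]
        # Remaining tokens come in "name: value" pairs.
        for name, value in zip(tokens[0::2], tokens[1::2]):
            info[name.rstrip(":")] = int(value)
    return info
```

The resulting bus location, channel, id and lun can then be matched against what a driver reports for each disk to decide which one the BIOS treats as device 80h.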

At most 6 BIOS devices are reported, as that fills the space that's left in the empty_zero_page. In general you only care about device 80h, though for software RAID1 knowing what 81h is might be useful also.

The major changes implemented in this patch:
arch/i386/boot/setup.S - int13 real mode calls store results in empty_zero_page
arch/i386/kernel/setup.c - copy results from empty_zero_page to local storage
arch/i386/kernel/edd.c - export results via /proc/edd/

If you use this, please send reports of success/failure, and the adapter types and BIOS versions, to I'm keeping a tally at If built as CONFIG_EDD=m, please 'modprobe edd debug=1' and send those results - it's more verbose.

Patch below applies to BK-current 2.5.x. Also available in BitKeeper at

Andre Hedrick replied, "WOOHOO! This looks like some serious fun to make it go! Matt, how about a location for a normal patch for those of us who do not believe in BK." Matt said sure, and gave a link.

31. gcml2 Version 0.7.1 Released

3 Sep 2002 (1 post) Subject: "ANNOUNCE: gcml2 version 0.7.1"

Topics: Kernel Build System

People: Greg Banks, Randy Dunlap

Greg Banks announced:

gcml2 is (among other things) a Linux kconfig language syntax checker. Version 0.7.1 is available at:


This is a bugfix release of gcml2. Thanks to Randy Dunlap in particular for reporting problems. Future announcements of minor releases will be on the kbuild-devel list only.

Here is a brief change log.

32. Kernel 2.5 Status For September 4, 2002

3 Sep 2002 (1 post) Subject: "[STATUS 2.5] September 4, 2002"

People: Guillaume Boissiere

Guillaume Boissiere announced:

The latest and greatest status update is available at:

Of note this week is the inclusion of SCTP (Stream Control Transmission Protocol) in 2.5.33.

33. Problem Report Status

3 Sep 2002 - 4 Sep 2002 (5 posts) Subject: "2.5 Problem Report Status"

Topics: Disks: IDE, FS: JFS, FS: driverfs, FS: ext2, Feature Freeze, Forward Port, Version Control

People: Thomas Molina, Robert Love, Axel Siebenwirth, Andre Hedrick, Linus Torvalds

Thomas Molina reported:

The latest version of the following problem report status page can be found at:


               2.5 Kernel Problem Reports as of 04 Sep
   Problem Title                  Status                Discussion
   schedule() with irqs disabled! open                  03 Sep 2002
   schedule in interrupt          No further discussion 2.5.31
   JFS oops                       No further discussion 2.5.31
   unmount oops                   No further discussion 2.5.31
   usb problem                    No further discussion 2.5.31
   pte.chain BUG                  No further discussion 2.5.31
   cciss broken                   proposed fix          2.5.31
   qlogicisp oops                 open                  01 Sep 2002
   qlogic error                   No further discussion 2.5.31
   kmap_atomic oops               No further discussion 2.5.31
   swap problem                   No further discussion 2.5.31
   oops in gpm.c                  No further discussion 2.5.31
   page allocation failure        No further discussion 2.5.31
   driverfs oops                  No further discussion 2.5.31
   2.5.32 reboot oops             open                  30 Aug 2002
   ext2 umount oops               open                  30 Aug 2002
   DEBUG_SLAB oops                open                  30 Aug 2002
   2.5.32-mm1 problems            open                  30 Aug 2002
   soft suspend problem           open                  30 Aug 2002

Andre Hedrick was pleased about the IDE reports, saying he'd been worried about migrating the code too quickly from 2.4 to 2.5; Robert Love said that the "schedule() with irqs disabled!" problem had a fix in Linus Torvalds' BitKeeper tree that would appear in 2.5.34, so it could be marked closed. He added, "Note this was never a problem - it was an informative debugging message that unfortunately happens much more often than anticipated." About the JFS oops, Axel Siebenwirth said:

Okay. My JFS oops which was about the same style as the one that occurred in 2.5 has gone away. Unfortunately I have not figured out yet how to get my completely normal PS/2 keyboard to work with current kernel 2.5.33 because of input driver options weirdness. So I could not test it yet.

But then I guess the 2.5 JFS oops should have gone as well. I cannot clearly state whether it was something that got fixed in JFS tree or if it was my upgrade of gcc, i.e. something got fixed in gcc and I do not want to test that.

Thomas replied, "I expect things to break, get fixed, and break again in a development series. I've seen it in the 2.1, 2.3 as well as 2.5. I expect it to smooth out once feature freeze happens."

34. Leonard Zubkoff Killed

4 Sep 2002 (1 post) Subject: "Leonard Zubkoff killed"

People: Larry M. Augustin

Larry M. Augustin reported that Leonard Zubkoff, a longtime contributor to free software, had died. Larry said:

Many of you may know Linux kernel developer Leonard Zubkoff (BusLogic and DAC960 maintainer, among other contributions). Leonard was killed recently in a helicopter crash. See,1413,163%257E6883%257E834332,00.html. Leonard was one of the smartest people that I know, and I consider myself lucky enough to have been privileged to work with him. He will be missed.







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.