Kernel Traffic

Kernel Traffic #284 For 17 Nov 2004

By Zack Brown

Table Of Contents

Mailing List Stats For This Week

We looked at 2980 posts in 16339K.

There were 583 different contributors. 332 posted more than once. 246 posted last week too.

The top posters of the week were:

1. Linux 2.6.9 Released

18 Oct 2004 - 31 Oct 2004 (145 posts) Archive Link: "Linux v2.6.9..."

Topics: Disks: IDE, FS: JFS, FS: XFS, Ioctls, Kernel Release Announcement, SMP, User-Mode Linux

People: Linus Torvalds, Jeff V. Merkey, Ingo Molnar, David Weinehall, Alexander Viro

Linus Torvalds announced Linux 2.6.9, saying:

despite some naming confusion (explanation: I'm a retard), I did end up doing the 2.6.9 release today. And it wasn't the same as the "-final" test release (see explanation above).

Excuses aside, not a lot of changes since -rc4 (which was the last announced test-kernel), mainly some UML updates that don't affect anybody else. And a number of one-liners or compiler fixes. Full list appended.

Jeff V. Merkey said:

Although we do not work with them and are in fact on the other side of Unixware from a competing viewpoint, SCO has contacted us and identified with precise detail and factual documentation the code and intellectual property in Linux they claim was taken from Unix. We have reviewed their claims and they appear to create enough uncertainty to warrant removal of the infringing portions.

We have identified and removed the infringing portions of Linux for our products that SCO claims were stolen from Unix. They are:

JFS, XFS, All SMP support in Linux, and RCU.

They make claims of other portions of Linux which were taken, however, these other claims do not appear to be supported with factual evidence.

Many, many kernel developers either laughed at him, insulted him, or told him they wouldn't relicense their code or remove the parts he claimed SCO was entitled to. The discussion meandered all over the place, Jeff trading blow-for-blow on many fronts, even carrying on part of the debate with Alexander Viro in the Cherokee tongue.

The vast majority reiterated that SCO had no legitimate claim and that Jeff was nuts. David Weinehall even quoted Judge Schoefield's written opinion:

In fact, however, Merkey is not just prone to exaggeration, he also is and can be deceptive, not only to his adversaries, but also to his own partners, his business associates and to the court. He deliberately describes his own, separate reality.

At some point in the course of the thread, Linus said:

can you please stop Cc'ing me on this thread?

No, nobody I know (certainly not me) is willing to re-license Linux under anything else than the GPL. Quite frankly, I suspect you'll have an easier time just rewriting the whole thing.

And no, the only offer from SCO I'm interested in is a public apology from Darl McSwine. Their made-up stories about copyright ownership weren't really that amusing a year ago, and now they're boring and stale.

So please just remove me from the cc, ok?

Close by, Ingo Molnar also said:

Jeff, you seem to have proven once more that you live in a fantasy world that has its own private rules of physics, ethics and rule of law. While this appears to be a dangerous phenomenon, it is fortunately a relatively rare one.

Linus has been intentionally, deliberately and maliciously lied to, smeared and misled for more than 1.5 years. Linus has not misled anyone, let alone lied to anyone. The so-called 'contamination' accusations that you repeated are just that: unfounded accusations. A simple question: do you know the concept of "truth"? Another simple question: do you even care about it? In the world I live in, SCO owes Linus more than just a simple apology. I personally find it admirable that the only thing Linus expects of SCO is a simple apology.

It's of note that the thread covered last issue (see Issue #283, Section #7 (25 Oct 2004: A Little Bit Of SCO Status)) actually took place after this thread, so that summary may be seen as the later result of the thread covered in this issue. Apparently Linus and Jeff spoke off-line and came to some sort of understanding.

The reason the threads were covered in the wrong order is that the one from last week was only a single post, from October 25, while the thread covered this week had an unrelated debugging thread attached to it as a reply, that continued well beyond October 25. Since I like to wait for threads to end before summarizing them, the earlier thread ended later than the later thread, and so was covered later in KT.

In theory this sort of thing can happen whenever a long thread and a short thread are close to the cut-off point for each week's Kernel Traffic. But in practice, it only really matters when the two threads are on the same or similar topics. If a thread about ioctl reorganization is summarized incorrectly before a thread on an IDE driver, chances are there will be no time-sensitive information between them. In cases like the current summary, however, with multiple threads on the same issue being covered in the wrong order, it can be confusing.

2. Status Of 'arch' Revision Control For The Kernel

19 Oct 2004 - 1 Nov 2004 (161 posts) Archive Link: "BK kernel workflow"

Topics: BSD, Version Control

People: Linus Torvalds, Andrea Arcangeli, Jeff Garzik, Miles Bader, Roman Zippel

In the course of discussion, Andrea Arcangeli argued that the proprietary BitKeeper license had a negative impact on the ability of Linux to develop. He felt it likely that if all developers had access to distributed revision control, the situation would be much better than it was. Linus Torvalds challenged:

nobody knows how the universe would look if the speed of light wasn't constant.

Your point is pointless. No such distributed revision control system exists. And without BK, the people who have worked on them wouldn't largely even understand what's wrong with CVS.

In fact, I find that people largely _still_ don't understand what's wrong with CVS, and are still trying to just make another CVS thing.

So give Larry the credit he deserves, even if you dislike the license.

Andrea pointed out that arch (tla) "exists and it's exactly as distributed as BK." Linus replied:

And I looked at it before starting BK. Trust me, it was nowhere _near_ usable, which was my point. Nothing you have described has existed for three years. Except for BK.

I doubt arch is there today either, but hey, if it displaces CVS, I certainly won't complain. How are the gcc people doing with it?

Andrea replied:

gcc people are stuck with CVS AFAIK. Apparently CVS is good enough for them.

arch isn't ready for prime time with the kernel. It would be ready if we were OK to limit it to, say, 5000 changesets and to obsolete the older changesets once in a while. The backend needs a rewrite to handle that.

Thanks to various improvements we did (I only did one that allows caching with hardlinked trees; Chris and others did more), arch would probably already be way faster than BK in a daily checkout, checkin and cloning (nobody on the open source side can verify since we cannot use BK; AFAIK Miles tried to buy a copy of BK but Larry refused to sell it, but I seriously doubt BK has such an advanced hardlinking cache mechanism like arch), but the very first setup on a new machine would be very inefficient (if compared to CVS) and the local copy of the repository would take more space (again if compared to CVS).

The user interface isn't nice either, it'd be nicer at least to avoid overlaps between commands.

I believe this all can be fixed, it just needs a critical mass of users and some big initial pain.
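The hardlinked-tree cache Andrea refers to rests on a basic Unix property: a hard link shares its target's inode, so "cloning" a source tree by linking every file costs almost no disk space or I/O, and data is only duplicated when a file is replaced outright. It is the same trick behind the classic `cp -al` kernel-patching workflow. A minimal sketch in Python (this is not arch's actual code; the helper name is made up for illustration):

```python
import os
import tempfile

def hardlink_clone(src_dir, dst_dir):
    """Clone a tree by hard-linking every file (like `cp -al`).
    Directories are created for real; only file data is shared."""
    for root, dirs, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        target = os.path.join(dst_dir, rel)
        os.makedirs(target, exist_ok=True)
        for name in files:
            os.link(os.path.join(root, name), os.path.join(target, name))

# Demonstrate that the clone shares inodes with the original.
base = tempfile.mkdtemp()
src = os.path.join(base, "tree")
os.makedirs(src)
with open(os.path.join(src, "file.c"), "w") as f:
    f.write("int main(void) { return 0; }\n")

dst = os.path.join(base, "clone")
hardlink_clone(src, dst)

orig = os.stat(os.path.join(src, "file.c"))
copy = os.stat(os.path.join(dst, "file.c"))
print(orig.st_ino == copy.st_ino)  # → True: same inode, no data copied
```

The catch, which tools must handle, is that editing a linked file in place would change both copies; a hardlink-aware tool breaks the link (copy-on-write by hand) before modifying a file.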

Jeff Garzik also commented that arch "doesn't scale or merge as well as BK though. I've told Larry that, if both BK and <open source tool> were completely equal in terms of function, I'd use the open source tool. Neither arch (scalability) nor subversion (scalability + stability) are there yet."

Miles Bader disputed the comment that arch didn't scale or merge as well as BitKeeper. He said:

Scalability I'm not sure about; BK's "you must inform BK before you change a file" model gives it a potential for being very quick at "tree-compare" operations -- but makes it more annoying for the user.

Merging also seems a bit hard to judge. From what I understand of BK, it has a much more limited merging model than arch does; to do a reasonable comparison, you'd have to see how well arch did if you limited yourself to that restricted model, and then give arch some more points for not forcing you to do so.

BK no doubt wins on the rough-edges-sanded-down front though; there are a _few_ advantages to commercial software...

Throughout the discussion, Linus showed no sign of considering anything other than full support of BitKeeper. At one point he said angrily:

Andrea, shut up.

It's not _your_ decision to make, or your decision to complain about. It's the developers' decision. It was mine, Jeff's, David's, Andrew's... Not yours.

Your decision is to not use BK. Fine. But then complaining when people decide to use the best tool available is fricking impolite. Not just to Larry, but to the people who made the choice.

You whine about BK taking rights away, but the fact is, BK is an _option_ for people to use. _YOU_ are the one trying to limit what people are supposed to do.

In short, BK isn't the problem. You are.

A couple of posts later he said, "Andrea doesn't actually do anything constructive when it comes to SCM. He just complains every time somebody says something positive about a product that (a) he didn't do anything for and (b) nobody forces him to use, and (c) there are no real alternatives for today (much less the three years ago he was whining about)." In the same post, he added:

No SCM is _ever_ going to be a quality manager.

And I also claim that people who think that "processes" are quality management (see iso9000, or Dilbert) are seriously mistaken too.

The thing that keeps up the general quality is _people_. Good people, who take pride in the quality. They end up being maintainers, not because they chose the job, but because people ended up choosing them, for better or for worse.

And the way to help those people is to make the day-to-day job easy, so that they can spend as much time as possible on the thing that matters: upholding good taste (and in the process keeping quality up). And that's where an SCM comes in - not as a primary source of quality, but as a way to keep track of the details, so that people can concentrate on what is important.

And the SCM doesn't have to be anything really fancy. It can be a few scripts to keep track of patches (that tends to grow and become slightly more sophisticated over time). I'm not saying that BK is "it". There's a number of BK users there, but clearly there are other ways to maintain patches too - and people use them.

But complaining when a maintainer uses a tool that suits him is _stupid_. It's arrogant to think that you can tell me how to do my work, but it's really stupid when you can't give any reasonable alternatives that would help me do it as efficiently.

And that's what Andrea is doing. Sure, BK is commercial, but dammit, so is that 2GHz dual-G5 too and that Shuttle box in my corner. They happen to be the tools I use for what I do. If Andrea told me that I should use a slower machine because that's what most people use, I'd consider him a total idiot. Similarly, when he complains that people use software tools that clearly _do_ make them more productive, I consider his complaint stupid.

There are other tools I use to make myself more productive. Many of them are open source. Some I wrote myself. But I still use "uemacs" and "pine" as part of my tool-chest, for example - and last I saw, they weren't open source either (but I hear that the uemacs author stopped caring, so that one might have been re-licensed).

Should I (or anybody else) ask Andrea's permission before we start using non-opensource tools? No. If Andrea were complaining about my "pine" usage, he'd be laughed off the planet. It may be ass-backwards and old and text-only, but the fact is, it's really none of his damn business, even though he can see the effects in every email I write in the headers.

Similarly, Andrea can see some of the effects of me using BK when he looks at the tar-balls and patches - syntactic markers that show that they have been generated by a person who uses BK. It's really _no_ different from the fact that I use pine to communicate. And no, neither BK nor pine are under an open source license. Deal with it.

Can Andrea point me to open-source tools and ask me politely whether I've considered them as alternatives? Hell yes. I encourage him to do so when something appears.

Miles corrected Linus' assertion that Andrea did nothing constructive. Miles said, "This is not true. Andrea has given some very useful input on the gnu-arch mailing list. He's definitely doing more than just complaining about BK." Roman Zippel also remarked to Linus, "nobody cares what you are using privately, but your decisions as kernel maintainer have an effect on other people, may this be the patches you include in the next release or the tools you distribute them with. In the end it's your decision what tools you use, if you think the advantages outweigh the license which goes contrary to the open development process, that's fine, but so have other people the right to disagree with that decision. Maybe you could make some suggestion on how to articulate this more politically correct? Linus, what disturbs me here is that I don't see that you don't even try to acknowledge that the bk license might be part of problem, you don't mention the bk license with a single word. Nobody hates bk, that's a myth I'd expect from Larry but not from you. bk is a rather innocent and certainly useful tool, the annoying part are the business practices of its owner, who tries to push a licence into an environment, where it has to provoke rejection." Linus replied:

You don't like it, you don't use it. It's literally that simple.

This is the same thing as with the GPL. I absolutely _detest_ people who whine about the GPL - and there are more GPL haters out there than BK haters. It's _their_ problem.

EXACT SAME THING. Nobody has the right to whine about another person's choice of license. You have a choice: use it or don't. Complaining about the license to the author isn't part of it.

Larry can tell you that we've discussed the BK license in private, and he definitely knows that I'd really like for it to be an open source license. But I also suspect that Larry will tell you that I haven't been whining about it - I've been trying to come up with ways it could work out for him, considering that he's got employees to take care of, and I haven't been able to come up with anything that would convince him. Fair enough.

Because it really is his choice. Not mine. Not yours. Not Andrea's.

And dammit, that choice is as close to "sacred" as anything can get in software development as far as I'm concerned.

To paraphrase Voltaire - "I may disagree with your choice of license, but I shall defend to the death your right to choose it". That goes for Larry, and for the BSD people and for all the people who write software for a living using some really nasty licenses.

And the same thing goes for users. Anybody who tells me I can't use a program because it's not open source, go suck on rms. I'm not interested. 99% of what I run tends to be open source, but that's _my_ choice, dammit.

The flaming went on and on...

Personally, I have two remarks to make on this issue:

3. Better SMP Process Migration

20 Oct 2004 - 29 Oct 2004 (10 posts) Archive Link: "[PATCH, 2.6.9] improved load_balance() tolerance for pinned tasks"

Topics: SMP

People: John Hawkes, Nick Piggin, Ingo Molnar

John Hawkes said, "A large number of processes that are pinned to a single CPU results in every other CPU's load_balance() seeing this overloaded CPU as "busiest", yet move_tasks() never finds a task to pull-migrate. This condition occurs during module unload, but can also occur as a denial-of-service using sys_sched_setaffinity(). Several hundred CPUs performing this fruitless load_balance() will livelock on the busiest CPU's runqueue lock. A smaller number of CPUs will livelock if the pinned task count gets high. This simple patch remedies the more common first problem: after a move_tasks() failure to migrate anything, the balance_interval increments. Using a simple increment, vs. the more dramatic doubling of the balance_interval, is conservative and yet also effective." Ingo Molnar signed off on the patch, and Nick Piggin worked with John on a revised version of the patch. It was clear that one version or the other would be accepted.
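John's fix amounts to a linear back-off on fruitless balance attempts: each time move_tasks() migrates nothing off the "busiest" CPU (because every task there is pinned), the interval before the next attempt grows by one rather than doubling. The real change is C code in the scheduler; the sketch below is a Python illustration of the idea, with made-up names and a simplified reset-on-success rule that stands in for the kernel's actual interval handling:

```python
def next_balance_interval(interval, nr_moved, max_interval=100):
    """After one load_balance() pass: if nothing could be migrated
    (all candidate tasks were pinned), back off linearly, per John
    Hawkes' patch; on success, reset (simplified for illustration)."""
    if nr_moved == 0:
        # Conservative single increment instead of doubling, so a
        # transiently pinned CPU is not ignored for too long.
        return min(interval + 1, max_interval)
    return 1  # illustrative reset once migration succeeds again

interval = 1
for _ in range(5):  # five fruitless passes over a fully pinned CPU
    interval = next_balance_interval(interval, nr_moved=0)
print(interval)  # → 6
```

The effect is that hundreds of CPUs repeatedly failing against the same pinned runqueue gradually stop hammering its runqueue lock, which is exactly the livelock John described.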

4. Linux 2.4.28-rc1; Straggling Patches Considered For Inclusion

22 Oct 2004 - 28 Oct 2004 (23 posts) Archive Link: "Linux 2.4.28-rc1"

Topics: Big Memory Support, Disk Arrays: RAID, Disks: IDE, FS: devfs, Kernel Build System, Networking, PCI, Serial ATA, USB

People: Marcelo Tosatti, Özkan Sezer, Thomas Gleixner, Michael Frank, Roger Luethi, Geert Uytterhoeven, Eric Sandeen, Andre Hedrick, Andrey Borzenkov, Robert White, Ivan Kokshaysky, Joshua Kwan, David Vrabel, Eric Uhrhane, Hilko Bengen, Corey Minyard, Dave Jones, Jakub Bogusz, Willy Tarreau

Marcelo Tosatti announced Linux 2.4.28-rc1, saying:

Here goes the first release candidate of v2.4.28.

It contains a small number of changes from -pre4, a couple of libata bugfixes, a PIIX IDE driver DMA bugfix, USB fixes, and some tmpfs corrections.

Özkan Sezer said:

There are many lost/forgotten patches posted here on lkml. Since 2.4.28 is near and 2.4 is going into "deep maintenance" mode soon, I gathered a short list of some of them. There are surely many more of them, but here it goes.

I think they deserve a re-review and re-consideration for inclusion.

The "list":

Marcelo took some of these on faith, but asked the specific authors for confirmation in a number of cases. Some folks confirmed, and Marcelo said the next -pre release would have a bunch of these updates.

5. Some Discussion Of The 2.6 Development Model

22 Oct 2004 - 29 Oct 2004 (121 posts) Archive Link: "My thoughts on the "new development model" (A bit late tho)"

People: Espen Fjellvær Olsen, Lee Revell, Hua Zhong, Diego Calleja, Paul Fulghum, Alan Cox, Adrian Bunk, Willy Tarreau, William Lee Irwin III, Rik van Riel, Andrew Morton

Espen Fjellvær Olsen disagreed with the entire direction of the Linux development model in 2.6; instead of adding new features, he said, "I think that 2.6 should be frozen from now on, just security related stuff should be merged." He added that a 2.7 branch should be forked off for new features. Clemens Schwaighofer agreed completely, but William Lee Irwin III felt that folks should just write code and do real work on the kernel, instead of arguing over numbering systems. Lee Revell added:

Part of the reasoning behind the new development model is that if you want a stable kernel, there are many vendors who will give you one. The new dev model is partially driven by vendors' and developers' desire to get their features into mainline quicker. There is an inherent stability cost associated with this, but the price is only paid by users who want stability AND the latest kernel. The big players all seem to agree that the new development model better suits users and their own needs. The distros are in a better position to determine what constitutes a stable kernel anyway; they have millions of users to test on. Let the vendors AND the kernel hackers do what they are each best at.

We need to continue the rapid pace of development because although Linux rules in the small to mid server arena there are other areas where MS and Apple are clearly ahead. If you want to make an omelette you have to break some eggs...

Elsewhere, Hua Zhong remarked, "The fact is, these days nobody wants to be a stable-release maintainer anymore. It's boring." But Diego Calleja came back with, "I doubt it. People like Alan Cox or Marcelo have done it in the past, and I bet many others could do it." Paul Fulghum felt that folks like Alan and Marcelo "probably suffer emotional scars from the process. Taming the patch stream must be like drinking from a fire hose while herding angry, computer literate cats. Wearing, but not boring." Alan Cox confirmed, "For 2.2 certainly and I suspect for 2.4 it's also like that. The 2.6.x.[1-n] is more like distribution maintenance; it's about careful analysis and minimal changes." Close by, when Hua reiterated that no top developer would take on the task of maintaining a stable 2.6 tree, Alan replied, "I'll do it if Linus wants", but nothing came of this.

Elsewhere, Adrian Bunk remarked, "2.6 is currently more a development kernel than a stable kernel." And added, "Andrew+Linus should open a short-living 2.7 tree soon and Andrew (or someone else) should maintain a 2.6 tree with less changes (like Marcelo did and does with 2.4)."

Elsewhere, Willy Tarreau said, "Linux already got its reputation of a stable system from its production kernels, 2.0, 2.2 and 2.4 which are largely used in sensible environments. 2.6 is stable enough for most desktop usage and for end-users distros to ship it by default. This will encourage many more people to test it, send reports back and finally stabilize it so that one day it can finally be used in production environments. At first I was a bit angry that it had been declared "stable" a bit too early, but now, judging by the amount of people who use it only because their distros ship with it, I realise that indeed, it should have been declared "stable" earlier so that all the bug fixes you see now would be fixed by now." But William Lee Irwin III said, "The freezes from kernels past led to gross redundancy. Distros all froze at different points in time with numerous patches atop the then-mainline release. The mainline freeze was meaningless because the distros were all completely divorced from it, resulting in numerous simultaneously frozen trees with no outlet for forward progress." Elsewhere, Rik van Riel said he liked the way the 2.6 tree was currently being handled.

In general, developers closest to the development process were happiest with the status quo, while developers closer to the user end of things felt there should be changes. Linus did not weigh in, but several times Alan did volunteer or hint that he would be willing to maintain a stable 2.6 tree. No one mentioned that Andrew Morton is still technically the official maintainer, even though Linus puts out all the releases.

6. Linux 2.6.10-rc1 Released: The 'Woozy Numbat'; Some Numbering Considerations

22 Oct 2004 - 29 Oct 2004 (103 posts) Archive Link: "The naming wars continue..."

Topics: Disks: SCSI, Kernel Release Announcement, Serial ATA

People: Linus Torvalds, Matt Mackall, Con Kolivas, Nick Piggin, Bill Davidsen

Linus Torvalds announced Linux 2.6.10-rc1, saying:

I thought long and hard about the name of this release (In other words, I had a beer and watched TV. Mmm... Donuts), since one of the main complaints about 2.6.9 was apparently the release naming scheme.

Should it be "-rc1"? Or "-pre1" to show it's not really considered release quality yet? Or should I make like a rocket scientist, and count _down_ instead of up? Should I make names based on which day of the week the release happened? Questions, questions..

And the fact is, I can't see the point. I'll just call it all "-rcX", because I (very obviously) have no clue where the cut-over-point from "pre" to "rc" is, or (even more painfully obviously) where it will become the final next release.

So to not overtax my poor brain, I'll just call them all -rc releases, and hope that developers see them as a sign that there's been stuff merged, and we should start calming down and seeing to the merged patches being stable soon enough..

So without any further ado, here's 2.6.10-rc1 in testing. A fair number of patches that were waiting for 2.6.9 to be out are in here, ranging all over the map: merges from -mm, network (and net driver) updates, SATA stuff, bluetooth, SCSI, device models, janitorial, you name it.

Oh, and the _real_ name did actually change. It's not Zonked Quokka any more, that's so yesterday. Today we're Woozy Numbat! Get your order in!

Several developers felt there was value in putting out a 'pre' series before a 'rc' series, because the 'rc' series was by definition a 'release candidate'. Matt Mackall said, "the cut-over should be when you're tempted to rename it. If you have no intention (or hope) of renaming 2.6.x-rc1 to 2.6.x, it is not a "release candidate"" He added, "What's the point? It serves as a signal that a) we're not accepting more big changes b) we think it's ready for primetime and needs serious QA c) when it gets released, the _exact code_ has gone through a test cycle and we can have some confidence that there won't be any nasty 0-day bugs when we go to install on a production machine." Con Kolivas agreed with this, but added, "I have this feeling Linus is laughing at us when he debates these arguments." William Lee Irwin III felt the whole discussion was pointless, and that Linus had enough to do without worrying about these issues. To this, Linus replied:

Hey guys, calm down, I meant "naming wars" in a silly kind of way, not the nasty kind.

The fact is, Linux naming has always sucked. Well, at least the versioning I've used. Others tend to be more organized. Me, I'm the "artistic" type, so I sometimes try to do something new, and invariably stupid.

The best suggestion so far has been to _just_ use another number, which makes sense considering my dislike for both -rc and -pre.

However, for some reason four numbers just looks visually too obnoxious to me, so as I don't care that much, I'll just use "-rc", and we can all agree that it stands for "Ridiculous Count" rather than "Release Candidate".

More importantly, maybe we could all realize that it isn't actually that big of an issue ;)

Nick Piggin said:

Linus I agree it isn't a huge issue. The main thing for me is that I could just give a _real_ release candidate more testing - run it through some regression tests, make sure it functions OK on all my computers, etc. I expect this would be helpful for people with large sets of regression tests, and maybe those maintaining 'other' architectures too.

I understand there's always "one more" patch to go in, but now that we're doing this stable-development system, I think a week or two weeks or even three weeks to stabilize the release with only really-real-bugfixes can't be such a bad thing.

2.6.x-rc (rc for Ridiculous Count) can then be our development releases, and 2.6.x-rc (rc for Release Candidate) are then closer to stable releases (in terms of getting patches in).

Optionally, you could change Ridiculous Count to PRErelease to avoid confusion :)

Other than that I don't have much to complain about... so keep up the good work!

Bill Davidsen replied, "I do agree that the pre and rc names gave a strong hint that (-pre) new features would be considered or (-rc) it's worth doing more serious testing. If Linus doesn't like this any more, perhaps some other way to indicate the same thing would be desirable. I admit that the kernel has gotten so good that I only try -rc (by whatever name) kernels; I'm not waiting for the next big thing. I think that's really good, actually." And Linus replied:

Well, I actually do try to _explain_ in the kernel mailing list announcements what is going on.

One of the reasons I don't like "-rcX" vs "-preX" is that they are so meaningless. In contrast, when I actually do the write-up on a patch, I tend to explain what I expect to have changed, and if I feel we're getting ready for a release, I'll say something like



trying to make ready for the real 2.6.9 in a week or so, so please give this a beating, and if you have pending patches, please hold on to them for a bit longer, until after the 2.6.9 release. It would be good to have a 2.6.9 that doesn't need a dot-release immediately ;)


which is a hell of a lot more descriptive, in my opinion.

Which is just another reason why the name itself is not that meaningful. It can never carry the kind of information that people seem to _expect_ it to carry.

7. Different Perspectives On The Status Of Real-Time

23 Oct 2004 - 28 Oct 2004 (21 posts) Archive Link: "[RFC][PATCH] Restricted hard realtime"

Topics: Microkernels: Adeos, Real-Time: RTAI, Real-Time: RTLinux, SMP

People: Paul E. McKenney, Ingo Molnar, Dimitri Sivanich, Thomas Gleixner, Jon Masters, Karim Yaghmour, Bill Huey, Andrew Morton

The quest for a real-time kernel has been going on for years, with much work and many contributions by tons of people. And there are always new people coming up with ideas for how to do it better. This week, Paul E. McKenney proposed a mechanism to create real-time SMP systems by off-loading system-calls and other time-consuming operations to other CPUs. He offered up a partial patch to illustrate his ideas, acknowledging that there were many shortcomings: it hadn't been merged with existing real-time work by folks like Ingo Molnar; it hard-coded various things like which CPU was the designated real-time CPU; it only handled system calls, and not exceptions or traps; it was completely untested. But it was real code, and he concluded, "the idea is to provide an evolutionary path towards hard realtime in Linux. Capabilities within Linux can be given hard-realtime response separately and as needed. And there are likely a number of capabilities that will never require hard realtime response, for example, given current technological trends, a 1MB synchronous write to disk is going to take some time, and will be subject to the usual retry and error conditions. This approach allows such operations to keep their simpler non-realtime code."

The patch received a mixed reception. For one thing, as Ingo pointed out, "this has been implemented in a clean way already: check out the "isolcpus=" boot option & scheduler feature (implemented by Dimitri Sivanich) which isolates a set of CPUs via sched-domains for precisely such purposes. The way to enter such a domain is via the affinity syscall - and balancing will leave such domains isolated." Paul was happy to see this work, and rushed off to look at it. After a day of digging through Dimitri Sivanich's code, he said, "I haven't proven to myself that the isolcpus code gets rid of all of the cross-runqueue lock acquisitions, but it certainly gets rid of a large number of them. It doesn't seem to do system-call or exception-handler offload, but it does help me see how to do this sort of thing cleanly." He offered a minor patch to Dimitri, to remove an unnecessary #ifdef, and Dimitri replied, "this specific code wasn't part of my original patch, but after looking at it briefly, I believe this patch should make sense."
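The entry path Ingo describes, the affinity syscall, is visible from user space: a task enters an isolcpus= domain by shrinking its own CPU mask down to the isolated CPU. Python wraps the same Linux syscalls as os.sched_getaffinity() and os.sched_setaffinity(), so the mechanism can be sketched without writing C. The CPU choice below is illustrative, and on a machine booted without isolcpus= this only demonstrates the syscall, not an actually isolated domain:

```python
import os

# Which CPUs may this process currently run on? (pid 0 = self)
allowed = os.sched_getaffinity(0)
print(len(allowed) >= 1)  # → True

# Pin ourselves to a single CPU. With isolcpus=<n> in effect and
# target == n, this is exactly how a task enters the isolated
# sched-domain: the balancer then leaves it alone. We just pick
# the lowest allowed CPU for the demonstration.
target = min(allowed)
os.sched_setaffinity(0, {target})
print(os.sched_getaffinity(0) == {target})  # → True
```

No privilege is needed because a process may always restrict its own mask to a subset of what it already has; widening it back out is equally a plain sched_setaffinity() call.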

Elsewhere, Thomas Gleixner was skeptical about the whole thing. For one thing, he said, "I haven't seen an embedded SMP system yet. Focussing this on SMP systems is ignoring the majority of possible applications." He offered various existing uniprocessor alternatives; such as the dual kernel approach of RTLinux, the domain approach of RTAI/Adeos, and the in-kernel approach of KURT/Libertos. Paul found none of these satisfactory, and also offered a fourth possibility, of running "something like the Xen VMM, and have it provide a single OS with the illusion that there are two CPUs. As you say, the OS cannot be allowed to really disable interrupts, instead, the underlying VMM must track whether the OS thinks it has interrupts disabled on a given "CPU", and refrain from delivering the interrupt until the OS is ready. Of course, on a multithreaded CPU or SMP system, the VMM is not required." He added, regarding the whole idea of an embedded SMP system, "Seeing SMP support for ARM lead me to believe that this was not too far over the edge." Jon Masters replied:

They have an SMP reference implementation; however, many folks don't actually want to go the dual-core approach right now for embedded designs (apparently the increased design complexity isn't worth it). I've had protracted discussions about this very issue quite recently indeed. Others will disagree; I'm only basing my statement upon conversations with various engineers. I think your idea eventually becomes interesting, but now is not the right moment to be pushing it. People still don't want this now.

Talk to smartphone manufacturers who currently have dual ARM core designs, one running Linux and the other running an RTOS for the GSM and phone stuff, and they'll say they actually want to reduce the design complexity down to a single core. Talking to people suggests that multicore designs are good in certain situations (such as in the case above), but in general people aren't yet going to respond to your way of doing realtime :-) Yes you do have only one OS in there, maybe that would change opinion, but we're not quite at the point where everything is multicore so you're not going to convince the masses.

Having said all that, for a different perspective, I hack on ppc (Xilinx Virtex II Pro) kernel and userspace stuff for some folks that make high resolution imaging equipment, involving extremely precise control over a pulsed signal and data acquisition (we're talking nanosecond/microsecond precision). Since Linux obviously isn't capable of this level of deterministic response right now we end up farming out work to a separate core - it's unlikely your approach would convince the hardware folks, but I guess it might be tempting at some point in the future. Who knows.

Elsewhere, Karim Yaghmour said:

I've been trying not to get too involved in this, though I've been personally very interested in the topic of obtaining deterministic response times from Linux for quite some time. Ingo's work is certainly gathering a lot of interest, and he's certainly got the brains and the can-do mindset that warrant a wait-and-see attitude.

I must admit though that I'm somewhat skeptical/worried. The issue for me isn't whether Linux can actually become deterministic. This kernel has reached heights which many of its detractors never believed it could, it has come a long way. So whether it _could_ better/surpass existing RT-Unixes (such as LynxOS or QNX for example) in terms of real-time performance is for me in the realm of the possible.

That the Linux development community has to answer the question of "how do we provide deterministic behavior for users who need it?" was, as for the kernel developers of most popular Unixes, just a matter of time. And in this regard, this is a piece of history that is yet to be written: What is the _correct_ way to provide deterministic response times in a Unix environment?

Like in most other circumstances, the Linux development community's approach to this has been: show me the code! In that regard (and this is in no way criticism of anyone's work), Ingo's work has gathered a lot of interest not because it is breaking new ground in terms of the concepts, but largely because of its very rapid development pace. Let's face it, no self-respecting effort that has ever labeled itself as wanting to provide "hard real-time Linux" has been active on the LKML on the same level as Ingo (though many have concentrated a lot of effort and talent on other lists.)

Yet, I believe that this is a case where the concepts do actually matter a lot, and that no amount of published code will erase the fundamental question: What is the _correct_ way to provide deterministic response times in a Unix environment? I keep highlighting the word "correct" because it's around this word's definition that the answer probably lies.

Here are a number of solutions that some have found to be "correct" for their needs over time, in chronological order of appearance:

  1. Master/slave kernel (ex.: RTLinux)
  2. Dual-CPU (there are actually many examples of this, some that date back quite a few years)
  3. Interrupt levels (ex.: D.Schleef, B.Kuhn, etc.)
  4. Nanokernel/Hypervisor (ex.:Adeos)
  5. Preemption
  6. Uber-preemption and IRQ threading (a.k.a. preemption on acid) (ex.: Ingo, TimeSys, MontaVista, Bill)

My take on this has been that the "correct" way to provide deterministic response times in a Unix environment should minimize in as much as possible:

  1. the modifications to the targeted kernel's existing API, behavior, source code, and functionality in general;
  2. the burden for future mainstream kernel development efforts;
  3. the potential for accidental/casual use of the hard-rt capabilities, as this would in itself result in loss of deterministic behavior;

Also, it should be:

  1. architected in a way that enables straightforward extension of the real-time capabilities/services without requiring further modifications to the targeted kernel's existing API, behavior, sources, and functionality in general;
  2. truly deterministic, not simply time-bound by some upper limit found during a sample test run;
  3. _very_ simple to use without, as I said above, having the potential of being accidentally or casually used (such a solution should strive, in as much as possible, to provide the same API as the targeted Unix kernel);
  4. easily portable to new architectures, while remaining consistent, both in terms of API and in terms of behavior, from one architecture to the next;

From all the solutions that have been put forth over the years, I have found that the nanokernel/hypervisor solution fits this description of correctness best. The Adeos/RT-nucleus/RTAI-fusion stack is one implementation I have been promoting, as it has already reached important milestones. All that is needed for it to work is the necessary hooks for Adeos to hook itself into Linux by way of an interrupt pipeline; the latter being very simple, portable and non-intrusive, yet could not accidentally/casually be used without breaking. This interrupt pipeline is all that is required for the rest of the stack to provide the services I have alluded to in other postings by means of loadable modules, including the ability to transparently service existing Linux system calls via RTAI-fusion for providing applications with hard-rt deterministic behavior.

One argument that has been leveled against this approach by those who champion the vanilla-Linux-should-become-hard-rt cause (many of whom are now in the uber-preemption camp) is that it requires writing separate real-time drivers. Yet, this argument contains a fatal flaw: drivers do not become deterministic by virtue of running on an RTOS. IOW, even if Linux were to be made a Unix RTOS, every single driver in the Linux sources would still have to be rewritten with determinism in mind in order to be used in a system that requires hard-rt. This is therefore a non-issue.

Which brings me back to what you said above: "The problem is that the entire OS kernel must be modified to ensure that all code paths are deterministic." There are two possible paths here.


a) Most current kernel developers intend to eventually convert the entire existing code-base into one that contains deterministic code paths only, and therefore impose such constraints on all future contributors, in which case the path to follow is the one set by the uber-preemption folks;

b) Most current kernel developers intend to keep Linux a general-purpose Unix OS which mainly serves a user-base that does not need deterministic hard-rt behavior from Linux, and therefore changes for providing deterministic hard-rt behavior are acceptable only if they are demonstrably minimal, non-intrusive, yet flexible enough for those that demand hard-rt, in which case the path to follow is the one set by the nanokernel/hypervisor folks;

So which is it?

Bill Huey gave a lengthy response to many of Karim's points, but the upshot was, "This is a non-issue. The uber-preemption folks will continue to do what they've/we've been doing and it just opens up more opportunities for dual-domain RT folks. One doesn't exclude from the other." Andrew Morton also replied to Karim, suggesting:

uber-preemption is the chosen way for the mainline kernel mainly because its mechanisms can be largely hidden inside (increasingly ghastly) header files and most developers just don't have to worry about it.

I have a sneaking suspicion that the day will come when we get nice sub-femtosecond latencies in all the trivial benchmarks but it turns out that the realtime processes won't be able to *do* anything useful because whenever they perform syscalls, those syscalls end up taking long-held locks.

Which does lead me to suggest that we need to identify the target application areas for Ingo's current work and confirm that those applications are seeing the results which they require. Empirical results from the field do seem to indicate success, but I doubt if they're sufficiently comprehensive.

Ingo offered a technical rebuttal to the idea that real-time processes would have problems that the benchmarks wouldn't reveal; the upshot being that he was confident the problems could be solved; though he admitted his email ignored some more difficult issues.

8. New Virtual cputime For Micro-Second Accounting

27 Oct 2004 - 28 Oct 2004 (5 posts) Archive Link: "[patch] cputime: introduce cputime."

Topics: User-Mode Linux, Version Control

People: Martin Schwidefsky, Rik van Riel, Andrew Morton

Martin Schwidefsky said, "after the three timer-header-cleanup patches have hit bitkeeper it's time for the next step: the cputime_t patch. We've been using this patch and the s/390 exploitation patch for micro-second based cpu time accounting for some time now and it seems rock solid. I didn't get a single bug report for it so far. Good for s/390, but now the question is what does the patch do to all the other architectures? 2.6.9 plus the cputime_t patch works fine on my thinkpad. Could you add this to -mm for broader testing please? The patch is cut against 2.6.10-rc1-mm1." His proposed Changelog entry said:

This patch introduces the concept of (virtual) cputime. Each architecture can define its method to measure cputime. The main idea is to define a cputime_t type and a set of operations on it (see asm-generic/cputime.h). Then use the type for utime, stime, cutime, cstime, it_virt_value, it_virt_incr, it_prof_value and it_prof_incr and use the cputime operations for each access to these variables. The default implementation is jiffies based and the effect of this patch for architectures which use the default implementation should be negligible.

There is a second type cputime64_t which is necessary for the kernel_stat cpu statistics. The default cputime_t is 32 bit and based on HZ; this will overflow after 49.7 days. This is not enough for kernel_stat (imho not enough for a process either), so it is necessary to have a 64 bit type.

The third thing that gets introduced by this patch is an additional field for the /proc/stat interface: cpu steal time. An architecture can account cpu steal time by calls to the account_stealtime function. The cpu which backs a virtual processor doesn't spend all of its time for the virtual cpu. To get meaningful cpu usage numbers this involuntary wait time needs to be accounted and exported to user space.
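For architectures that keep the default, the changelog boils down to a small type-and-macro layer. The sketch below is a stand-alone, user-space approximation; the names mirror the description and asm-generic/cputime.h, but the harness, the HZ value, and the exact macro set are illustrative assumptions, not Martin's actual patch:

```c
/* Illustrative sketch of the default jiffies-based cputime_t
 * described above; the real code lives in the kernel's
 * include/asm-generic/cputime.h. HZ=1000 as in stock 2.6 kernels. */
#define HZ 1000

typedef unsigned long cputime_t;        /* 32 bits on i386 and friends */
typedef unsigned long long cputime64_t; /* for kernel_stat; no early wrap */

/* In the default implementation cputime is just jiffies, so the
 * conversions are identity casts and arithmetic is plain +/-. */
#define jiffies_to_cputime(j)   ((cputime_t)(j))
#define cputime_to_jiffies(ct)  ((unsigned long)(ct))
#define cputime_add(a, b)       ((a) + (b))
#define cputime_sub(a, b)       ((a) - (b))
#define cputime64_add(a, b)     ((a) + (b))

/* The 49.7-day figure in the changelog: a 32-bit counter of 1000Hz
 * ticks wraps after 2^32 / 1000 / 86400 ~= 49.7 days. */
```

The 64-bit cputime64_t exists precisely because the per-cpu kernel_stat counters must keep counting past that wrap.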

Rik van Riel replied approvingly, "This will be useful for User Mode Linux, Xen and iSeries too." Andrew Morton also gave the patch a try, and liked it.

9. Some Discussion Of Binary Firmware

27 Oct 2004 - 28 Oct 2004 (7 posts) Archive Link: "Intel also needs convincing on firmware licensing."

Topics: BSD: OpenBSD

People: Han Boetes, Gene Heskett, Dax Kelson, Timo Jyrinki, Denis Vlasenko

Han Boetes said:

The people from the OpenBSD project are currently lobbying to get the firmware for Intel wireless chipsets under a license suitable for Open Source.

Since this will not only benefit BSD but also the Linux Project (and even Intel) I would like to mention the URL here for people who want to help writing to Intel.

Gene Heskett replied:

Please be aware that for the so-called "software radio" chips/chipsets, the FCC and other similar regulating bodies in other countries have made access to the data quite restrictive, in an attempt to keep the less ruly among us from putting them on frequencies they aren't authorized to use, or setting the power levels above what's allowed. These restrictions can vary from governing body to governing body, so the software is generally supplied according to where the chipset is being shipped. The potential for mischief and legal/monetary repercussions is sufficiently great that I have serious doubts that Intel will budge from their current position unless we can prove, beyond any doubt, that the regulatory limitations imposed will not be violated.

Since open source, where anyone who can read the code can see exactly what the limits are and 'adjust to suit', virtually guarantees misuse, sooner if not later, for no other reason than it's human nature to experiment, Intel/moto/etc has very good reasons to treat its chip<->software interface as highly secret & proprietary.

That's not to say that they might not at some point furnish a 'filter' that presents the rest of the world with a usable API to control it, but the filter will see to it that attempted illegal settings are ignored. The only way I can see that actually working is to put that filter inside the chip, customized for the locale it's being shipped to. The radio control portion of the chip itself wouldn't even be bonded out to external-world pins or BGA contacts, just the port of the filter that the outside world talks to.

I'd rather doubt they want to make 20 to 40 different filtered versions of the same chipset just to satisfy TPTB in some 3rd-world country that's less than 1% of the total sales. Even the relatively dense market where Han lives is probably less than 5% of the total for a popular chipset.

I'm a broadcast engineer who has been dealing at times with the FCC for over 40 years, so you could say I'm biased. But that's not real bias, it's just from being fairly familiar with the regulatory territory.

I'd like to see an open source solution to this problem myself, but just because it's open source we are asking for, with the attendant liabilities that implies, I would not hold my breath till it happens.

If you do, you'll probably be talking to the rest of the world through a Ouija board.

Denis Vlasenko pointed out that binary firmware didn't really hide anything, because the code could just be disassembled and its secrets revealed. But Dax Kelson replied:

Who cares what the secrets in the firmware are.

Again, it does not execute on your computer's CPU. It does not taint the kernel. The Linux kernel driver is 100% GPLd, no binary blobs.

Nearly all the devices in your computer have firmware. Your keyboard, your CDROM drive, your graphics card. It is hypocritical to clamor for the source code to the IPW2100/2200/etc while not clamoring for the source code to all the other firmwares in your computer.

It is unfortunate that the firmware isn't stored onboard the Intel card, and instead needs to be loaded, however, this is a pretty minor inconvenience.

After Kernel Traffic was published, I got an email from Timo Jyrinki, saying:

I would like to just point out the matter that has been apparently confused even on the lkml: OpenBSD folks are not trying to have the firmwares open-sourced, they are trying to have the binary blobs re-distributable. If the firmwares aren't freely re-distributable, none of the WLAN adapters will work on any *BSD or Linux distribution without manually going to the Internet (how to do that without a network adapter?), accepting some "interesting" EULA and finding out where to put the firmware so that the driver will find it.

Just in case the discussion continues, so that you may add some editorial comment if no one still gets it.

I've compiled a web page of the matter, though I haven't yet "published" it: - I will keep it and probably improve it, I'm just not sure if it contains (yet) enough useful information or all the points of view.

I think the confusion is because of the mentioned false assumptions, but perhaps also because some folks don't care if it can be re-distributed if it's binary-only anyway. The answer to the latter one is (partly) that we already have firmwares on each of our network cards etc., these new WLAN cards just need it to be uploaded on each boot. Though I also understand that saving the few cents of not having the firmware permanently onboard is just plain stupid anyway.

10. Cross-Compilation HOWTO

27 Oct 2004 - 30 Oct 2004 (8 posts) Archive Link: "massive cross-builds without too much PITA"

Topics: FS: NFS, FS: ext2, SMP, Serial ATA, Version Control

People: Alexander Viro, Geert Uytterhoeven, Alessandro Amici

Alexander Viro said:

Contrary to popular beliefs, crosscompiling and full builds for a bunch of platforms are not hard and not particularly time-consuming. Below are my notes on the setup and practices; I hope they will be useful and I _really_ hope that people will start doing similar things.

On my boxen full rebuild for 6 platforms (allmodconfig on each) takes about an hour and normally is done once per new upstream tree; build-and-compare after making changes usually takes less than a minute total (again, for all these targets). IOW, it's fairly tolerable.

Requirements to sane setup:

  1. source tree should be common for as many platforms as possible; and I mean physically common, not just cp -rl'ed. Rationale: applying a patch (or using emacs and similar inferior editors) would break links. Propagating fixes between a bunch of partially shared trees is a PITA.
  2. there should be an easy way to get diff'able build logs before and after a change and do that without heavy massage of logs and without forcing full rebuild. There should also be an easy way to carry a patchset _and_ get new changes into the patchset without having them turn into a huge lump that would need to be split afterwards.

    A way to do that: have a forest of cp -rl'ed trees, starting from the baseline one, carrying the changes we'd done to source tree. Common sequence of events is

    <work with source tree>
    diff -urN <last tree in forest> <source> > delta
    cp -rl <last tree in forest> <new tree>
    (cd <new tree> && patch -p1 -E) <delta

    Note that this gives a fast way to see build log changes:

    cd <source>
    patch -p1 -E -R <delta
    make # now it's uptodate
    patch -p1 -E <delta
    make <whatever arguments> >../log-new 2>&1
    patch -p1 -E -R <delta
    make <whatever arguments> >../log-old 2>&1
    patch -p1 -E <delta

    and now we have build logs for *exact* *same* *part* *of* *tree* in old and new trees. Ready to be compared. Of course, for multiplatform work we want to do each of these for all platforms involved. It's not going to be a lot of rebuilds, though - we are only rebuilding the stuff we'd been changing.

  3. change of baseline version should be easy. Note that aforementioned forest takes care of most of these problems - we can easily convert it to sequence of patches (script doing diff between cp -rl'ed trees; fast enough) and we can easily convert a series into the forest (cp -rl + patch done by another script). Porting to new baseline mostly consists of folding the forest into patchset and applying it to new tree.
  4. there should be an easy way to spread the builds between several boxen _and_ keep the trees in sync. What I'm doing looks so:

    1. master box where I'm doing most of the work on patchset (not a particularly fast CPU, preferably enough memory and fast disk).
    2. slave boxen where the builds are done - these are heavily CPU-bound [see below] and where editing is going on. Layout:

      • clean tree (right now - RC10-rc1-bk6) on each.
      • base - clean + combined patchset applied to it (RC10-rc1-bk6-base; cp -rl'ed from clean)
      • forest (RC10-rc1-bk6-<name>; cp -rl'ed starting from base, that's where additions to patchset go)
      • source (RC10-rc1-bk6-current; originally cp -a from base, that's where editing and builds are done)
      • linux-<arch> - object trees
      • patches/... - NFS-exported by master and shared by all slaves.

      It's easier to move stuff around that way (scp gets annoying real soon)

    Note that sharing the trees between slaves (e.g. by NFS) is not practical - too damn slow. Propagating changes is not hard, provided that slaves are few.

    *NOTE*: one thing you definitely want to do is to turn CONFIG_DEBUG_INFO off. That changes the builds from IO-bound to CPU-bound and saves a *lot* of space. When you work with several platforms the last part gets really important - we are talking about nearly 2Gb per target.

    Splitup of initial patchset lives on master; no point duplicating potentially very large forest on slaves. What we have on slaves is the tail of patchset - the stuff we'd added. From time to time we can move the beginning of that tail to master, collapsing more stuff on slaves.

  5. cross-toolchains themselves and not wearing your fingers out on nightmare make invocations. I've done a trivial script that takes target name as its first argument, sets ARCH, O, CROSS_COMPILE and CHECK (for cross-sparse) depending on it and passes them and the rest of arguments to make. It also sanitizes stderr a bit (see for what I'm using right now). That takes care of the make side of that mess - something like $ for i in i386 alpha ppc; do kmk $i C=2 drivers/net/ >../$i-net15 & done is done a lot (and history helps enough to make further scripting a non-issue).

    Building cross-toolchain is surprisingly easy these days; I'm using debian on build boxen and cross-compilers are not hard to do:

    apt-get build-dep binutils
    apt-get build-dep gcc-3.3
    apt-get install dpkg-cross
    apt-get source binutils
    get binutils-cross-... patch from
    cd binutils-...
    apply patch
    TARGET=<target>-linux fakeroot debian/rules binary-cross
    cd ..
    dpkg -i binutils-<target>-....deb
    get linux-kernel-headers, libc6, libc6-dev and libdb1-compat for target
    dpkg-cross -a <target> -b on all of those
    dpkg -i resulting packages
    apt-get source gcc-3.3
    cd gcc-3.3-...
    GCC_TARGET=<gcc_target> debian/rules control
    GCC_TARGET=<gcc_target> debian/rules build
    GCC_TARGET=<gcc_target> fakeroot debian/rules binary
    cd ..
    dpkg -i resulting packages.

    One note: <target> here is debian platform name (e.g. ppc), but <gcc_target> is *gcc* idea of what that bugger is called (e.g. powerpc).

    IIRC, Nikita Youshchenko had pre-built debs somewhere, but they were not for the host I'm using (amd64/sid).

    In any case, that's a one-time work (OK, once per target). Major limitation here is that one needs several debs for target (AFAICS all we really need is a bunch of headers). In practical terms it means no ppc64. However, it's not too bad a problem - breaking ppc64 means breaking a box Linus is playing with, so problems on that platform are noticed fast enough as it is ;-)

  6. cross-sparse: sparse snapshots live on; I'm probably doing more work than necessary since I build a separate binary for each target. What it means is

    1. editing pre-process.h to point to cross-gcc headers (e.g. /usr/lib/gcc-lib/alpha-linux/3.3.4/include) and
    2. editing target.c (probably not needed these days)
      make CFLAGS=-O3
      mv ~/bin/sparse-<arch>{,-old}
      cp check ~/bin/sparse-<arch>
      does the build-and-install.

    When switching to new version of sparse (and it's changing pretty fast): make for all platforms to make sure that no compiling will get in a way, followed by

    kmk <arch> C=2 CHECK=sparse-<arch>-old >../<arch>-sparse-old
    kmk <arch> C=2 >../<arch>-sparse-new

    for everything (works fine in parallel), followed by comparing logs and looking for regressions. About 20 minutes for all (well, 20 minutes plus whatever it takes to fix the breakage if we get one, obviously).

  7. useful tricks:

    1. I'm carrying a patch that allows adding to CHECKFLAGS from the command line (CF=-Wbitwise in make/kmk arguments instead of editing the makefile). It's probably worth merging at some point.
    2. I'm carrying a kludge that teaches allmodconfig to take a given set of options from a file and pin them down; Roman has a cleaner patch and IIRC he was going to merge it at some point. Anyway, that allows doing such things as "allmodconfig on i386, but have CPU type set to K7 and disable SMP first, so we'll get all UP-only drivers into the build". Or "do PPC build for that sub-architecture", etc.

      See* for that stuff.

  8. Random notes:

    Of course, all that stuff can be done on a single box; however, having a bunch of compiles run in parallel will get painful since too many of them == guaranteed way to trash all caches around. I've ended up using a two-years-old K7 box as a master (since that was where I was doing most of the kernel work anyway) and put two amd64 3400+, both with 512Mb and 10krpm SATA as slaves. I considered spreading build on other local boxen; maybe I'll do that when I add more targets to active set, but that's not obvious. The main problem here is doing resyncs between the boxen without too much PITFingers - and getting around to doing it, of course. So far existing setup had been more than enough for my needs - so much that all further scripting, etc. remains theory.

    IME this stuff is *heavily* CPU-bound. Parallel builds on the same source make IO load almost a non-issue, as long as you are not spewing gigabytes of crap all over the disk (== have CONFIG_DEBUG_INFO turned off; again, it's a must-do unless you are reading this posting by mistake, having confused your linux-kernel and masochists-r-us mailboxen).

    ext2 works fine for build boxen - you are not dealing with hard-to-recreate data there (diffs are going to master and you want them carved into small chunks from the very beginning anyway). So journalling, etc. is a pointless overhead in this situation. Keep in mind that forest of cp -rl'ed kernel trees gets hard on caches once it grows past ~60 copies regardless of the fs involved; if your patchset gets bigger than that, fragment it and do porting, etc. group-by-group.

    Currently i386, amd64, sparc32, sparc64, alpha and ppc all survive allmodconfig with relatively few patches; amount of new breakage showing up is not too bad and so far didn't take much time to deal with. Bringing in new targets... hell knows - parisc probably will be the next one (which will mean adding delta between Linus' and parisc trees into -bird), arm going after it (that will mean untangling the mess around drivers/net/8390.c first ;-/) After getting the target to build (and barring the acts of Cthulhu or Ingo) it doesn't add a lot of overhead...

    Hardware: well, whatever you have, obviously. Parallel builds *do* scale nicely, so SMP with relatively slow CPUs can do fine. Out of something recent... I went with UP amd64, simply because a couple of UP boxen was actually cheaper than equivalent SMP one (~$600 per box, counting disks and cases) and everything else would be too far overpriced.

Geert Uytterhoeven remarked, "Just in case you ever want to start doing m68k as well: I already have a few sparse-related cleanups at*". Alexander asked, "Hrm... How far is Linus' tree from building on m68k these days? I hadn't looked at the delta since 2.6.7 or so, but it used to be fairly invasive in some places..." And Geert replied:

For, the different task models (cfr. the thread titled `Re: Getting kernel to build for m68k?' on lkml last September) is the big problem. If you apply the patch from that thread to plain, it'll build fine!

2.6.9 introduced a new problem with the signal handling (for which BTW we don't have a fix yet).

Alessandro Amici also replied to Alexander's original post, saying, "I happen to be learning how to cross-compile on Debian right now, so I can testify that building the cross toolchain 'The Debian Way' is even easier than you describe ;)." He went on:

Detailed instructions are at:

The short story is:
# apt-get install toolchain-source dpkg-cross autoconf2.13 fakeroot
# tpkg-install-libc <target>-linux # grabs, converts and installs the headers
$ tpkg-make <target>-linux # no need to patch anything
$ cd binutils...
$ debuild -us -uc # no magic env variables
# dpkg -i ../binutils...deb
$ cd ../gcc...
$ debuild -us -uc
# dpkg -i ../gcc...deb

If I'm not mistaken, that's all.

But Alexander pointed out, "gcc and binutils in there tend to get out of sync with native ones. Which is a killer, as far as I'm concerned..."

11. Kprobes Updates

28 Oct 2004 (12 posts) Archive Link: "[0/3] PATCH Kprobes for x86_64- 2.6.9-final"

Topics: Assembly, SMP

People: Prasanna S. Panchamukhi, Andi Kleen

Prasanna S. Panchamukhi said, "Below are the Kprobes patches ported to x86_64 architecture. I have updated these patches with suggestions from Andi Kleen. Thanks Andi for reviewing and providing your feedback. These patches can be applied over 2.6.9-final." The first patch modified Kprobes to support porting it to other architectures. Of the second patch (and really the whole project), he said:

Helps developers to trap at almost any kernel code address, specifying a handler routine to be invoked when the breakpoint is hit. Useful for analysing the Linux kernel by collecting debugging information non-disruptively. Employs single-stepping out-of-line to avoid probe misses on SMP and may be especially useful in aiding debugging elusive races and problems on live systems. More elaborate dynamic tracing tools can be built over the kprobes interface.

Sample usage:

        To place a probe on __blockdev_direct_IO:

        static int probe_handler(struct kprobe *p, struct pt_regs *regs)
        {
                ... whatever ...
                return 0;
        }

        struct kprobe kp = {
                .addr = (kprobe_opcode_t *) __blockdev_direct_IO,
                .pre_handler = probe_handler
        };


A special kprobe type which can be placed on function entry points, and employs a simple mirroring principle to allow seamless access to the arguments of a function being probed. The probe handler routine should have the same prototype as the function being probed.

The way it works is that when the probe is hit, the breakpoint handler simply irets to the probe handler's rip while retaining register and stack state corresponding to the function entry. After it is done, the probe handler calls jprobe_return() which traps again to restore processor state and switch back to the probed function. Linus noted correctly at KS that we need to be careful as gcc assumes that the callee owns arguments. We save and restore enough stack bytes to cover argument space.

Sample Usage:

        static int jip_queue_xmit(struct sk_buff *skb, int ipfragok)
        {
                ... whatever ...
                jprobe_return();
                return 0;
        }

        struct jprobe jp = {
                {.addr = (kprobe_opcode_t *) ip_queue_xmit},
                .entry = (kprobe_opcode_t *) jip_queue_xmit
        };

And the third patch also consisted of minor changes to facilitate porting. Andi Kleen replied to the set of them, saying:

The patch is not ready to be applied yet. You didn't address some issues from the last review.

Like I still would like to have the page fault notifier completely moved out of the fast path into no_context (that i386 has it there is also wrong). Adding kprobes_running() doesn't make a difference.

And the jprobe_return_end change is wrong, my suggestion was to move it into the inline assembler statement. Adding asmlinkage doesn't help at all (I think i386 gets this wrong too)

Prasanna explained regarding the page fault notifier, "The kprobes fault handler is called if an exception is generated for any instruction within the fault-handler or when Kprobes single-steps the probed instruction. AFAIK kprobes does not handle page faults in the above case and just returns immediately resuming the normal execution." Andi replied, "Ok. It's ugly, but ok. Can you remove the bogus kprobes_running() then please? It's unnecessary. With that change it would be ok to merge from my side."

12. Linux 2.6.10-rc1-mm2 Released

29 Oct 2004 - 2 Nov 2004 (29 posts) Archive Link: "2.6.10-rc1-mm2"

Topics: Kernel Build System, Kernel Release Announcement, Virtual Memory

People: Andrew Morton

Andrew Morton announced Linux kernel 2.6.10-rc1-mm2.

13. Linux 2.6.9-ac5 Released

29 Oct 2004 - 31 Oct 2004 (8 posts) Archive Link: "Linux 2.6.9-ac5"

People: Alan Cox, Nuno Silva, Greg Louis

Alan Cox announced a new release of his own kernel patch set, 2.6.9-ac5. He said, "This update adds some of the more minor fixes as well as a fix for a nasty __init bug. Nothing terribly pressing for non-S390 users unless they are hitting one of the bugs described or need the new driver bits." Nuno Silva replied, "Thank god someone started to maintain a stable 2.6 kernel!" Greg Louis concurred, saying, "I was going to wait till at least 2.6.10 -- need reliable operation, and all the "this-and-that-major-function-is-broken-again" messages were putting me off -- but Alan can be trusted."

14. Automated Correctness Checking

30 Oct 2004 (4 posts) Archive Link: "Sparse "context" checking.."

Topics: SMP, USB

People: Linus Torvalds, Roland Dreier, Greg KH

Linus Torvalds said:

I just committed the patches to the kernel to start supporting a new automated correctness check that I added to sparse: the counting of static "code context" information.

The sparse infrastructure is pretty agnostic, and you can count pretty much anything you want, but it's designed to test that the entry and exit contexts match, and that no path through a function is ever entered with conflicting contexts.

In particular, this is designed for doing things like matching up a "lock" with the pairing "unlock", and right now that's exactly what the code does: it makes each spinlock count as "+1" in the context, and each spinunlock count as "-1", and then hopefully it should all add up.

It doesn't always, of course. Since it's a purely static analyser, it's unhappy about code like

        int fn(int arg)
        {
                if (arg)
                        spin_lock(&lock);
                ...
                if (arg)
                        spin_unlock(&lock);
        }
because the code is not statically deterministic, and the stuff in between can be called with or without a lock held. That said, this has long been frowned upon, and there aren't that many cases where it happens.

Right now the counting is only enabled if you use sparse, and add the "-Wcontext" flag to the sparse command line by hand - and the spinlocks have only been annotated for the SMP case, so right now it only works for CONFIG_SMP. Details, details.

Also, since sparse does purely local decisions, if you actually _intend_ to grab a lock in one function and release it in another, you need to tell sparse so, by annotating the function that acquires the lock (with "__acquires(lockname)") and the function that releases it (with, surprise surprise, "__releases(lockname)") in the declaration. That tells sparse to update the context in the callers appropriately, but it also tells sparse to expect the proper entry/exit contexts for the annotated functions themselves.

I haven't done the annotation for any functions yet, so expect warnings. If you do a checking run, the warnings will look something like:

          CHECK   kernel/resource.c
        kernel/resource.c:59:13: warning: context imbalance in 'r_start' - wrong count at exit
        kernel/resource.c:69:13: warning: context imbalance in 'r_stop' - unexpected unlock

which just shows that "r_start" acquired a lock, and sparse didn't expect it to, while "r_stop" released a lock that sparse hadn't realized it had. In this case, the cause is pretty obvious, and the annotations are equally so.

A more complicated case is

          CHECK   kernel/sys.c
        kernel/sys.c:465:2: warning: context imbalance in 'sys_reboot' - different lock contexts for basic block

where that "different lock contexts" warning means that sparse determined that some code in that function was reachable with two different lock contexts. In this case it's actually harmless, since what happens in this case is that the code after rebooting the machine is unreachable, and sparse just doesn't understand that.

But in other cases it's more fundamental, and the lock imbalance is due to dynamic data that sparse just can't understand. The warning in that case can be disabled by hand, but there doesn't seem to be that many of them. A full kernel build for me has about 200 warnings, and most of them seem to be the benign kind (ie the kind above where one function acquires the lock and another releases it, and they just haven't been annotated as such).

The sparse thing could be extended to _any_ context that wants pairing, and I just wanted to let people know about this in case they find it interesting..

Roland Dreier replied, "Do you have a plan for how to handle functions like spin_trylock()? I notice in the current tree you just didn't annotate spin_trylock()." Linus replied:

Actually, the _current_ tree does actually annotate spin_trylock() (as of just before I sent out the email). It looks like

        #define spin_trylock(lock)      __cond_lock(_spin_trylock(lock))

where __cond_lock() for sparse is

        include/linux/compiler.h:# define __cond_lock(x)        ((x) ? ({ __context__(1); 1; }) : 0)

ie we add a "+1" context marker for the success case.

NOTE! This works with sparse only because sparse does immediate constant folding, so if you do

        if (spin_trylock(lock)) {

sparse linearizes that the right way unconditionally, and even though there is a data-dependency, the data dependency is constant. However, if some code does

        success = spin_trylock(lock);
        if (success) {

sparse would complain about it, because sparse doesn't do any _real_ data flow analysis.

So sparse can follow all the obvious cases, including trylock and "atomic_dec_and_lock()".

Greg KH also said to Linus, "Nice, I like this a lot. Already found some bugs in the USB drivers that have been there forever."

15. man-pages Maintainership

31 Oct 2004 (2 posts) Archive Link: "[OT] man-pages-1.70, new maintainer"

People: Andries Brouwer, Rob van Nieuwkerk

After maintaining the man-pages project for 9 years, Andries Brouwer said, "Just released man-pages-1.70. Find it the usual places. Due to a decreasing amount of time and increasing RSI, maintaining the man-pages package became difficult. Fortunately Michael Kerrisk has accepted to take over. Send corrections and additions to" Rob van Nieuwkerk said, "Thanks a lot Andries for all your great man-pages work!"

16. Migration To New Argument-Passing Method For Some Assembly Interfaces

2 Nov 2004 - 3 Nov 2004 (5 posts) Archive Link: "RFC: avoid asmlinkage on x86 traps/interrupts"

Topics: Assembly

People: Linus Torvalds

Linus Torvalds said:

Here's the second part of a gradual change from using stack argument passing to using register argument passing for a number of assembly interfaces. As covered in previous discussions, this has the advantage of having the caller/callee agree on the ownership of the arguments (well, at least the three first ones), and thus gcc won't occasionally possibly corrupt the stack frame that assembly code believes it owns.

It also removes a few instructions when we can pass arguments in registers in most places. In other places it adds a "movl %esp,%eax", though, as some cases used to just rely on knowing the saved stack layout and use that directly as the arguments.. So it's not really a big win either way, and the real motivation for this is to move away from the argument ownership questions.

No other architecture should care, since for most of them "asmlinkage" vs "fastcall" is a no-op, and when that isn't true (like on ia64) as far as I can tell all the actual call-sites in this patch were all in C code for the routines that were changed. But architecture maintainers should probably take a quick look to verify.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.