Kernel Traffic
Latest | Archives | People | Topics
Home | News | RSS Feeds | Mailing Lists | Authors Info | Mirrors | Stalled Traffic

Kernel Traffic #177 For 28 Jul 2002

By Zack Brown

A lot of folks have been requesting an RSS feed for Kernel Traffic, so I've put one up for KT and each of the Cousins. They are linked in the top navigation bar, but the one for KT is Please let me know if there are any problems. If folks let me know where they're using it, I'll set up a page to link to those pages.

Mailing List Stats For This Week

We looked at 1424 posts in 7240K.

There were 371 different contributors. 185 posted more than once. 172 posted last week too.

The top posters of the week were:

1. Gang Scheduling In Linux

16 Jul 2002 - 22 Jul 2002 (16 posts) Archive Link: "Gang Scheduling in linux"

Topics: Scheduler

People: Ingo Molnar, Sam Mason, Hubertus Franke, Richard Gooch, William Lee Irwin III

Someone asked if there was support for 'gang scheduling' in Linux, and if so, what its status was. Ingo Molnar replied:

yes - the 'synchronous wakeup' feature is a form of gang scheduling. It in essence uses real process-communication information to migrate 'related' tasks to the same CPU. So it's automatic, no need to declare processes to be part of a 'gang' in some formal (and thus fundamentally imperfect) way.

(another form of 'gang scheduling' can be achieved by binding the 'parent' process to a single CPU - all children will be bound to that CPU as well.)
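Ingo's parenthetical about binding the parent can be illustrated from user space. The sketch below is not from the thread: it assumes a Linux system where Python's `os.sched_setaffinity` is available, and the helper name `child_affinity_after_pinning` is made up for illustration. It pins the current process to one CPU, forks, and has the child report the CPU set it inherited.

```python
import os

def child_affinity_after_pinning():
    """Pin this process to one CPU, fork, and return (cpu, child's CPU set)."""
    original = os.sched_getaffinity(0)      # CPUs we are currently allowed on
    cpu = min(original)                     # pick any one allowed CPU
    os.sched_setaffinity(0, {cpu})          # bind the "parent" to it
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                            # child: report the inherited set
        os.close(read_fd)
        cpus = ",".join(str(c) for c in sorted(os.sched_getaffinity(0)))
        os.write(write_fd, cpus.encode())
        os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        reported = pipe.read()
    os.waitpid(pid, 0)                      # reap the child
    os.sched_setaffinity(0, original)       # restore the parent's CPU set
    return cpu, {int(c) for c in reported.split(",")}

if __name__ == "__main__":
    cpu, inherited = child_affinity_after_pinning()
    print(f"parent pinned to CPU {cpu}; child may run on {sorted(inherited)}")
```

The child is free to change its own affinity afterwards, so this only demonstrates the default inheritance across `fork`.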

William Lee Irwin III also gave a link to An Overview Of Gang Scheduling in PDF format. Ingo also added, "the Linux scheduler does not enable classic gang-scheduling: where multiple processes are scheduled 'at once' on multiple CPUs," though he wasn't sure there was any real-world case that could benefit from it. Sam Mason replied, "It's mainly used for programs that need lots of processing power chucked at a specific problem, the problem is first broken down into several small pieces and each part is sent off to a different processor. When each piece has been processed, they are all recombined and the rest of the calculation is continued. The problem with this is that if any one of the pieces is delayed, all the processors will be idle waiting for the interrupted piece to be processed, before they can process the next set of pieces." A bit later, he added, "The important thing to remember is that this isn't a normal scheduling method, it's used for VERY specialised software which is assumed to have (almost) complete control of the machine. Gang scheduled processes would have the highest priority possible and would get executed before any other processes. This works because the software knows what it's doing and assumes that the user only ran one bit of gang scheduled software, if all of these are valid assumptions everything should work nicely."
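The cost Sam describes can be put into numbers with a toy model. This is only a sketch under simple assumptions (every CPU works on exactly one piece per step and then waits for the slowest piece); the function name and the timings are invented for illustration and do not come from the thread.

```python
def barrier_utilization(steps):
    """Return the fraction of CPU time spent doing useful work.

    `steps` is a list of steps; each step is a list with one busy time per
    CPU. A step ends only when its slowest piece has finished, so every
    other CPU idles for the difference.
    """
    cpus = len(steps[0])
    busy = sum(sum(step) for step in steps)    # useful work actually done
    wall = sum(max(step) for step in steps)    # each step lasts as long as its slowest piece
    return busy / (wall * cpus)

if __name__ == "__main__":
    undisturbed = [[1.0, 1.0, 1.0, 1.0]]
    one_delayed = [[1.0, 1.0, 1.0, 2.0]]       # one piece was interrupted
    print(barrier_utilization(undisturbed))
    print(barrier_utilization(one_delayed))
```

In this model a single piece that takes twice as long drags a four-CPU step from full utilization down to 5/8, which is the effect gang scheduling tries to avoid by running all the pieces at the same time.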

Hubertus Franke from IBM also said to Ingo:

I was involved in the gangscheduler implementation for the IBM 340 node SP2 cluster at Lawrence Livermore National Lab. Implementation aside, one can show that the total system utilization can be raised from ~60% to a ~90% when doing gang scheduling rather than FIFO scheduling, which one would otherwise do to get a dedicated machine. We got tons of papers on this.

For this it seems sufficient to simply STOP apps on a larger granularity and have that done through a user level daemon. The kernel scheduler simply schedules the runnable threads that given the U-Sched would always amount to a limited number of threads/tasks.
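A user-level daemon of the kind Hubertus describes might look roughly like the following. This is a minimal sketch, not the SP2 implementation: the function names, the fixed time slice, and the round-robin rotation are all assumptions made for illustration. It relies only on the standard SIGSTOP and SIGCONT signals, leaving the kernel scheduler to handle whichever processes are currently runnable.

```python
import os
import signal
import time

def stop_gang(pids):
    """Suspend every process in the gang."""
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)

def resume_gang(pids):
    """Make every process in the gang runnable again."""
    for pid in pids:
        os.kill(pid, signal.SIGCONT)

def rotate_gangs(gangs, slice_seconds, rounds):
    """Coarse-grained time slicing: only one gang is runnable at a time.

    `gangs` is a list of lists of pids. All gangs are left stopped when
    this returns; the caller decides what to resume afterwards.
    """
    for gang in gangs:
        stop_gang(gang)
    for i in range(rounds):
        gang = gangs[i % len(gangs)]
        resume_gang(gang)
        time.sleep(slice_seconds)          # let the whole gang run together
        stop_gang(gang)
```

Because the daemon only acts once per slice, the switching granularity is coarse, which is the trade-off between a pure user-level approach and kernel support.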

This made sense to Ingo, and the two of them began to talk about implementation, when Richard Gooch came in with, "A completely user-level solution may have some disadvantages, though, such as delays in scheduling on/off (say if some daemon is used to scan the process list). Perhaps we could add a small hack to the scheduler such that when a task is about to be scheduled off, a signal can be sent to a designated pid? Similarly, when a task is scheduled on, another signal may be sent. An application that wanted to have gang scheduling could then make use of this to STOP/CONT threads." Hubertus replied:

I am glad you brought this up. I'd love to have a generic callback for this. AIX used/has a process change handler that is being called on start/exit.

In Linux, this idea could be done through a generic hook settable through a module... that should be sufficient and would allow for other stuff to be handled as well. For instance in the presence of fast user level communication (e.g. user mapped windows to myrinet the current process could be marked in the communication adapter).

There was no reply.

2. Some Discussion Of The 2.6 Release Schedule

19 Jul 2002 - 21 Jul 2002 (14 posts) Archive Link: "Re: [2.6] Most likely to be merged by Halloween... THE LIST]"

Topics: FS, Feature Freeze, Forward Port, Release Scheduling

People: Hans Reiser, Andreas Dilger, Rik van Riel

In the course of discussion, Hans Reiser asked:

Is Halloween the deadline for submission of patches, or the deadline for inclusion? If I send in reiser4 on Halloween day according to some timezone;-), have I made the deadline for inclusion into 2.6 even if it takes Linus a few months to reach my place in the queue of patches sent to him on Halloween day?

I understand that earlier is better, and I will send it earlier if I can, but even if we do get the reiser4 core (that which does all that V3 does but faster and on top of a plugin infrastructure) done before Halloween, we will inevitably add a few features and tweaks after doing the core, and we will want to send those in at the last minute.

Andreas Dilger replied:

my understanding is that core changes that aren't in by Halloween are not going to be accepted until 2.7. By pre-announcing the deadline, it is hoped that people will have lots of time to submit things that are ready for inclusion, as opposed to rushing to submit when the "freeze" is announced all of a sudden.

If (as we all hope) the important features are added incrementally to the development kernel over the next few months, maybe all of the usage and testing that is going into the development kernel will not be totally lost when the entire kernel is morphed under a huge weight of patches.

It may even mean that there will not be an extra year of features (and bugs) being added to the "frozen" kernel, and we will be able to start 2.7 earlier.

As always, I imagine that as long as you have any core changes in 2.5 before the freeze, it will not be impossible to add self-contained things like filesystems and drivers after the freeze also.

Hans replied:

I, in my egocentrism, think it would make more sense to have a deadline for submission rather than a deadline for acceptance, as that would make things predictable for patch submitters, and avoid unintentional overlooking of good patches from obscure persons due to the crush of patches in October.

Pre-announcing the deadline is good, but having it be a deadline on something the patch submitters control (submission time not acceptance time) would be even better.

Andreas replied:

I would agree, except that this doesn't put any onus on the submitters to get their patches in early, and causes the thundering herd of patches problem the same way that not announcing the patch deadline does.

Note that "accepted" may be a bad term on my part - I can't say if this means that the patch has been received by Linus, or whether it actually has to be in the kernel tree at that date.

Note that I wasn't at the kernel summit myself, hence this is all just what I have heard from others.

But Rik van Riel interjected:

It's both. We all know Linus doesn't have the time to keep forward-porting our hundreds of patches so he can only include patches into his kernel that apply to the exact same tree he has at that day.

This (and the fact that Linus gets far too much email and patches to look at old ones) is bound to make the Halloween deadline stick for both submission and acceptance.

I hope.

A couple posts later, he added:

I hope the Halloween feature freeze really will be a feature freeze. Nothing is more frustrating than having a "stable kernel" broken every second release by yet another feature.

If we all restrain ourselves 2.6 will be stable soon and 2.7 will be started shortly after. Backporting "essential" features from 2.7 into a _stable_ 2.6 will be so much easier than trying to stabilise a 2.6-pre that's full to the brim of not-yet-stable new features.

3. New VM Subsystem Lieutenant

20 Jul 2002 - 22 Jul 2002 (14 posts) Archive Link: "[PATCH][1/2] return values shrink_dcache_memory etc"

Topics: Big Memory Support, Forward Port, Maintainership, Virtual Memory

People: Rik van Riel, Linus Torvalds, Andrew Morton, William Lee Irwin III, Martin J. Bligh, Ed Tomlinson

Rik van Riel announced:

this patch, against current 2.5.27, builds on the patch that let kmem_cache_shrink return the number of pages freed. This value is used as the return value for shrink_dcache_memory and friends.

This is useful not just for more accurate OOM detection, but also as a preparation for putting these reclaimable slab pages on the LRU list. This change was originally done by Ed Tomlinson.

Linus Torvalds replied:

I disagree with the whole approach of having shrink_cache() return the number of pages free.

The number is meaningless, since it has nothing to do with the actual memory zones that are under pressure (right now, the memory zone is almost always ZONE_NORMAL, which is correct, but that's just pure luck rather than anything fundamental).

I'd be much more interested in the "put the cache pages on the dirty list, and have memory pressure push them out in LRU order" approach. Somebody already had preliminary patches.

That gets _rid_ of dcache_shrink() and friends, instead of making them return meaningless numbers.
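The difference between the two approaches can be sketched with a toy model. Nothing below reflects the kernel's actual data structures: `LruList`, `touch`, and `reclaim` are hypothetical names, and pages are just strings. The point is only that, once every reclaimable page sits on one list, memory pressure evicts the oldest pages regardless of which cache owns them, so no per-cache shrink function needs to report a count.

```python
from collections import OrderedDict

class LruList:
    """Toy LRU list holding pages from several caches, oldest first."""

    def __init__(self):
        self._pages = OrderedDict()        # page -> owning cache

    def touch(self, page, owner):
        """Record a use of `page`, moving it to the most-recent end."""
        self._pages[page] = owner
        self._pages.move_to_end(page)

    def reclaim(self, count):
        """Evict up to `count` least recently used pages and return them."""
        freed = []
        while self._pages and len(freed) < count:
            freed.append(self._pages.popitem(last=False))
        return freed

if __name__ == "__main__":
    lru = LruList()
    lru.touch("pagecache:1", "pagecache")
    lru.touch("dcache:1", "dcache")
    lru.touch("pagecache:2", "pagecache")
    lru.touch("pagecache:1", "pagecache")  # used again, so now the newest
    print(lru.reclaim(2))
```

Here the two oldest pages are evicted, one from each cache, purely by age.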

Rik said he'd try forward porting Ed's code from 2.4 to 2.5, and Linus added:

Side note: while I absolutely think that is the right thing to do, that's also the much more "interesting" change. As a result, I'd be happier if it went through channels (ie probably Andrew) and had some wider testing first at least in the form of a CFT on linux-kernel.

[ Or has it already been in 2.4.x in any major tree? (In which case my testing argument is lessened to some degree and it's mainly just to verify that the forward-port works). ]

Rik agreed that wider testing would be good, adding that the code had not been in any major 2.4 tree. Andrew Morton also replied to Linus:

I'd suggest that we avoid putting any additional changes into the VM until we have solutions available for:

2: Make it work with pte-highmem (Bill Irwin is signed up for this)

4: Move the pte_chains into highmem too (Bill, I guess)

6: maybe GC the pte_chain backing pages. (Seems unavoidable. Rik?)

Especially pte_chains in highmem. Failure to fix this well is a showstopper for rmap on large ia32 machines, which makes it a showstopper full stop.

If we can get something in place which works acceptably on Martin Bligh's machines, and we can see that the gains of rmap (whatever they are ;)) are worth the as-yet uncoded pains then let's move on. But until then, adding new stuff to the VM just makes a `patch -R' harder to do.

William Lee Irwin III replied, "I'll send you an update of my solution for (6), the initial version of which was posted earlier today, in a separate post. highpte_chain will do (2) and (4) simultaneously when it's debugged." Andrew thanked him, but added, "OK. But we're adding non-trivial amounts of new code simply to get the reverse mapping working as robustly as the virtual scan. And we'll always have rmap's additional storage requirements. At some point we need to make a decision as to whether it's all worth it. Right now we do not even have the information on the pluses side to do this. That's worrisome." He and Rik and Martin J. Bligh continued discussing the implementation, and the thread petered out.

4. Strict VM Overcommit; Source File Comments

20 Jul 2002 (13 posts) Archive Link: "[PATCH] VM strict overcommit"

Topics: Source Tree, Version Control, Virtual Memory

People: Robert Love, Alan Cox, Linus Torvalds, Hugh Dickins

Robert Love announced:

The following patch implements VM strict overcommit for rmap. Strict overcommit couples address space accounting with a strict commit rule to ensure all allocated memory is backed and consequently we never OOM. All memory failures should be pushed into the allocation routines - a page access should never result in a process kill.

The new strict overcommit policies are implemented via sysctl.

This is relatively low-impact on other code and does not change the behavior of the system in the default overcommit policy. Rik has given his approval.

This is based on Alan Cox's work in 2.4-ac with some cleanup and a new overcommit mode for swapless machines. Hugh Dickins also contributed some fixes for shmfs.

Patch is against 2.5.27
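The idea of pushing failures into the allocation routines can be illustrated with a small accounting model. This is a sketch only and is not taken from Robert's patch: the class name, the `ram_percent` parameter, and the limit formula (swap plus a percentage of RAM) are assumptions chosen for illustration. Every reservation is checked against a commit limit up front, so a request either fails immediately or is guaranteed to be backed later.

```python
class CommitAccount:
    """Toy strict-overcommit accounting, measured in pages."""

    def __init__(self, ram_pages, swap_pages, ram_percent=50):
        # Assumed policy: all of swap plus a share of RAM may be committed.
        self.limit = swap_pages + ram_pages * ram_percent // 100
        self.committed = 0

    def reserve(self, pages):
        """Account for an allocation, failing up front if it cannot be backed."""
        if self.committed + pages > self.limit:
            raise MemoryError("commit limit exceeded")
        self.committed += pages

    def release(self, pages):
        """Return previously reserved pages to the pool."""
        self.committed -= pages

if __name__ == "__main__":
    acct = CommitAccount(ram_pages=1000, swap_pages=500)
    acct.reserve(900)
    try:
        acct.reserve(200)                  # would exceed the limit of 1000
    except MemoryError as err:
        print("allocation refused:", err)
```

The refused request leaves the account unchanged, which mirrors the goal that a later page access never has to kill a process.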

Alan Cox hated the patch, saying it was very different from the work he'd done, and was broken in a number of ways. He added, "I took the time to *measure* this stuff and test it in real world setups. Please don't randomly frob with it unless you are going to repeat the oracle test sets." He recommended the patch not be applied, or else that his earlier version be applied unchanged.

Robert replied that he'd emailed Alan about his changes before-hand, and asked why Alan was picking on him now. Alan denied getting any email, but when Robert offered to produce the mail along with Alan's reply, Alan said, "Ok I take that back. It merely never got as far into my brain as to stay stuck." After that they proceeded to discuss implementation details.

Elsewhere, Alan also exhorted:

The change on the mm/*.c headers for the files changes to include the GPL statement is part of my patch. The code submissions by Red Hat are GPL and the no warranty clauses are applicable. When you resubmit them please include the correct GPL headers I added, or a written guarantee that you will personally take liability for all defects, errors and so on. Also since you changed the code please credit yourself too.

The GPL no warranty clauses were added directly to the file because they are supposed to be there.

But Linus Torvalds jumped in, with:

That's a load of bull. They are _NOT_ supposed to be there.

If you want legal disclaimers etc, do them in files you created and you own 100%, not in places that others started and work on. Or put them to the bottom of the file where they aren't in the way. Or add a "read the GPL in the COPYING file", but don't start adding a ton of crap to core kernel files.

There is no "goodness" in being a lawyer in .c files.

Alan replied, "That's fine by me too. I just grabbed the usual boilerplate but see COPYING is just fine." And Linus said:

Good. I hate the fact that so many people seem to think that adding 15 lines of copyright notice to a file somehow makes it "more legal". All it does is to give some corporate lawyer a bone, and take up precious real-estate on the first thing you see when you open the file that could be used to actually say what the file does (and who has worked on it).

Slightly off-topic, but in the same vein: I also dislike having tons of changelogs that relate to matters that aren't relevant to the sources any more (because the changes _changed_ them, duh!). The changelogs are valid as a way to show who worked on what, of course, but some people seem to take them to be the beginning of their Great Novel.

I'm hoping that one of the things BK does is to make people less inclined to write change stories in the C files, and more inclined to explain them to me in email when they send the changes in. At which point they are there in a format where you can actually see the "before and after" picture, not just get a feeling that "it looked different before" - well, DUH!

Rik suggested a two-line description at the top of each source file. He asked if a patch to add that would be welcome, but there was no reply, and the thread ended.

5. Status Of Bluetooth PC Card Drivers In 2.5

21 Jul 2002 - 23 Jul 2002 (8 posts) Archive Link: "[PATCH] Bluetooth Subsystem PC Card drivers for 2.5.27"

Topics: Source Tree

People: Alan Cox, Dave Jones, Ingo Molnar, Marcel Holtmann

Marcel Holtmann posted a patch to update the PC Card drivers of the Bluetooth subsystem in the 2.5.27 kernel. Someone pointed out that Marcel had used EXPORT_NO_SYMBOLS, which was deprecated in 2.5 kernels. Alan Cox replied, "For 2.4 you want to use it whenever possible and a file exports no symbols. For 2.5 EXPORT_NO_SYMBOLS is the automatic default behaviour so you can lose the line." Marcel posted an updated patch, but Ingo Molnar asked why it was important to remove it. The first person who had replied to Marcel said that, since it had now become standard, all the extra occurrences would have to be removed at some point. But Dave Jones replied, "Completely removing it means driver maintainers need to keep separate 2.4/2.6 versions of their drivers. The extra EXPORT_NO_SYMBOLS is harmless, and allows single source, multi kernel version drivers." End of thread.

6. Warning: Serious Problems With 2.5 IDE Code

22 Jul 2002 - 24 Jul 2002 (35 posts) Archive Link: "please DON'T run 2.5.27 with IDE!"

Topics: Disks: IDE, Disks: SCSI

People: Andries Brouwer, Bartlomiej Zolnierkiewicz, Marcin Dalecki, Morten Helgesen

Bartlomiej Zolnierkiewicz said that the IDE 99 code, introduced in 2.5.27, contained a bug that could result in system lockups and data corruption. He asked people not to use that code. Andries Brouwer replied, referring to the thread covered in Issue #176, Section #2  (9 Jul 2002: 2.5 IDE Rewrite Interferes With Other Developments) , "On the other hand, thanks to Jens, I have been running 2.5.27 with 2.4 IDE now for two days without any IDE-related trouble."

Brad Littlejohn pointed out that in the 2.5.17 changelog, there was reference to IDE patch 99 and IDE patch 100, which indicated to him that patch 100 supplanted 99. Someone else reminded Brad that a bug introduced in patch 99 might not be fixed in 100, and Bartlomiej confirmed that indeed it had not been fixed, adding, "IDE 100 is a trivia patch indentation + initializers etc."

Elsewhere, Morten Helgesen asked for an elaboration on what was actually wrong with the code, and Marcin Dalecki said, a couple posts later, "The problem is of a somehow general nature. Many of the block devices *need* a mechanism to run commands asynchronously. The most preferred way to do this is of course to go by the already present request queue. However the generic queue handling layer doesn't give us any mechanism to actually stuff request from the driver and it doesn't behave well in boundary conditions where the queues are nearly full." He posted a temporary fix, adding that the proper fix would be modeled after what would be done in the SCSI code, or perhaps even by unifying both. Bartlomiej pointed out that actually, Marcin's 'quick fix' had been the default behavior prior to patch 99, and accused Marcin of hiding the facts. He said, "You have INTRODUCED a bug and now you try to pretend that it wasn't your fault and it was somehow broken before. Before 2.5.27 code had the same functionality as scsi version. And yes it will be useful to move it to block layer." Marcin objected to that characterization, saying, "Sure it was my fault I looked in the wrong direction I looked at the ide-tcq code, because I still don't like the idea that we pass a pointer for a struct on the local stack down. (It's preventing the futile hope to make this thingee somehow asynchronous from ever taking place.) I should have looked at SCSI in first place instead indeed." The argument ended right there, and folks went on to discuss various implementation details.

7. New Bitkeeper-To-CVS Gateway

24 Jul 2002 (1 post) Archive Link: "ANNOUNCE: bitkeeper to CVS gateway"

Topics: Version Control

People: Pavel Machek

Pavel Machek announced:

I've created some scripts useful for doing bitkeeper 2 cvs gateway. I'm currently running them on (as you can see from the close look). I have not yet figured out how to export resulting CVS to the world.

[These scripts are in CVS at]

Hopefully this will be useful to someone...

There was no reply.







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.