Kernel Traffic
Latest | Archives | People | Topics
Latest | Archives | People | Topics
Latest | Archives | People | Topics
Home | News | RSS Feeds | Mailing Lists | Authors Info | Mirrors | Stalled Traffic

Kernel Traffic #150 For 14 Jan 2002

By Zack Brown

Table Of Contents


Today is the three-year anniversary of Kernel Traffic. The first issue came out on January 14, 1999. I can safely say I had no idea what I was getting myself into. It started as a way to give back to the community of people that had enabled me to actually use my computer. Since then, it has been responsible for a 3000-mile relocation from New York to San Francisco, various stages of employment, many friends, and has just basically been a complete eye-opener from day one. Thanks for everything, folks.

Mailing List Stats For This Week

We looked at 2507 posts in 10941K.

There were 592 different contributors. 312 posted more than once. 173 posted last week too.

The top posters of the week were:

1. Reiserfs Problem On 2.4.17 Sparc64 Systems

27 Dec 2001 - 3 Jan 2002 (13 posts) Archive Link: "reiserfs does not work with linux 2.4.17 on sparc64 CPUs"

Topics: Debugging, FS: ReiserFS

People: David S. MillerAlan CoxNikita DanilovLuigi Genoni

Luigi Genoni reported that using 2.4.17 with Reiserfs on a Sparc64 would cause an oops at boot-time, when the system tried to mount his disks. On x86 machines, everything worked fine. Nikita Danilov asked him to reboot into single-user mode, mount the partition manually, and send the decoded oops to the list. Luigi did so, and David S. Miller commented, "Looks like some change in reiserfs in 2.4.17 has caused it to start doing {set,clear,change}_bit() operations on pointers which are not "long" aligned." Alexander Zarochentcev posted a patch, and Alan Cox remarked elsewhere, "The fix is definitely right - without it you'll scribble on other memory on any 64bit BE platform." Luigi also confirmed Alexander's fix, and promised to do some stability testing. End Of Thread (tm).

2. Status Of Framebuffer In 2.4 And 2.5

27 Dec 2001 - 7 Jan 2002 (38 posts) Archive Link: "Framebuffer, mmap(), hanging in D state, root FS unmount failure."

Topics: Debugging, FS, Framebuffer, Small Systems, Virtual Memory

People: Mark J RobertsAndrew MortonTimothy CovellLinus TorvaldsJames Simmons

Mark J Roberts posted a very short test program and reported, "When I run this on my 2.4.17rc2aa2 kernel with a Voodoo3000 framebuffer, the process hangs forever in D state. ps and top will then hang the same way when they read the /proc/pid files for the hung process. And my root filesystem won't unmount. It only happens when PROT_READ|PROT_WRITE is specified - when I use only PROT_WRITE, the program segfaults like you'd expect.... but once the PROT_READ|PROT_WRITE version has hung, PROT_WRITE-only versions will also hang." Andrew Morton diagnosed, "the framebuffer driver is failing to mark the mmapped vma as VM_IO, so the kernel is trying to dump the framebuffer device to the core file, takes a recursive fault and deadlocks." He said the simplest fix was to mark the framebuffer as not dumpable for x86 machines, but added, "However I don't see why _any_ architecture wants framebuffer contents to be included in core files. It sounds risky." At this point Timothy Covell lamented, "When X11 locks up, I can still kill it and my box lives. When framebuffers crash, their is no recovery save rebooting. Back in 1995 I thought that linux VTs and X11 implemenation blew Solaris out of the water, and now we want throw away our progress? I'm still astounded by the whole "oooh I can see a penquin while I boot-up" thing? Granted, frame buffers have usage in embedded systems, but do they really have to be so deeply integrated??" Linus Torvalds replied:

They aren't.

No sane person should use frame buffers if they have the choice.

Like your mama told you: "Just say no". Use text-mode and X11, and be happy.

Some people don't have the choice, of course.

James Simmons replied, "Try pretty much every platform except ix86. Plus now that M$ doesn't support DOS you are starting to see graphics card manufactures dropping VGA support. Even BIOS setup interfaces use the VESA graphics interface these days. So VGA text days are numbered. I agree the framebuffer/console layer really needs to reworked to do the right things." He added, "I plan to do that for 2.5.X."

3. Support For Large FAT Partitions

3 Jan 2002 - 5 Jan 2002 (8 posts) Archive Link: "kernel patch support large fat partitions"

Topics: Big File Support, FS: FAT32, FS: NTFS, Legacy Support, Microsoft

People: Vijay KumarH. Peter Anvin

Vijay Kumar posted a patch and said:

This is regarding a change I had to make to the kernel 2.2.17-14 to support really large drives. In our project we deal with huge drives, sometimes drives larger than 128GB. The file system is FAT32 and only one partition is create on any drive. During our testing we realized that linux fat implementation doesn't support partitions larger than 128GB(actually 64GB because of a bug in fat implementation).

This limitation is imposed by the data structures used by the linux fat implementation to read/write directory entries. A 'long' data type is used to access directory entries on the disk, of which only 28 bits are used to identify the sector that contains the directory entry and the rest 4 bits are used to specify offset of the directory entry inside the sector. Using 28 bits to identify a sector means we cannot access sectors beyond 128GB (2^28*512), thus limiting us from creating partitions larger than 128GB on large disk drives.

I have made changes to fat, vfat and msdos file system implementations in the kernel to use larger data types, thus allowing us to create larger partitions. As per the GPL I would like to make the patch available to everyone and also in case somebody has run into the same problem(who cares about fat in the linux world). The patch has been fairly well tested only on our systems(p3, 700MHz with FC). I truly appreciate if you & anybody in the kernel mailing list have any feedback about the changes.

The patch is applicable to only 2.2.17-14 kernel and may require changes to use with other kernel versions. As far as I know none of the kernel versions provide any fix for this problem.

Someone pointed out that FAT stored 28-bit cluster numbers, and that altering that size would break FAT compatibility. The poster suggested increasing cluster size instead. Vijay replied, "Using 28bit cluster numbers and 65536 cluster size I could go upto 16TB which is lot more than what I wanted. So right now I have no problem with the on-disk format of a fat partition. Nevertheless with hard drives prices going down dramatically it may not be too long before we hit the limit." H. Peter Anvin commented:

That's Microsoft's problem -- that's a fundamental limit of the format they defined. The fact that they defined it in the first place is part of the problem (the only way you can make a FAT filesystem work *well* is by loading the entire FAT into memory ahead of time, and "FAT32" breaks that), instead of creating a more sensible replacement.

(FWIW, the reason they used to justify FAT32 was "it would be too much work to make DOS handle NTFS", as if those were the only options...)

4. New Scalable Scheduler

3 Jan 2002 - 7 Jan 2002 (67 posts) Archive Link: "[announce] [patch] ultra-scalable O(1) SMP and UP scheduler"

Topics: Big O Notation, Efficiency, Networking, Real-Time, SMP, Scheduler

People: Ingo MolnarGeorge AnzingerRobert LoveDavide LibenziLinus Torvalds

Ingo Molnar announced:

now that new-year's parties are over things are getting boring again. For those who want to see and perhaps even try something more complex, i'm announcing this patch that is a pretty radical rewrite of the Linux scheduler for 2.5.2-pre6:

for 2.4.17:


The main goal of the new scheduler is to keep all the good things we know and love about the current Linux scheduler:

and the goal is also to add a few new things:


(those who find the following design issues boring can skip to the next, 'Benchmarks' section.)

the core of the new scheduler are the following mechanizms:

the split-array solution enables us to have an arbitrary number of active and expired tasks, and the recalculation of timeslices can be done immediately when the timeslice expires. Because the arrays are always access through the pointers in the runqueue, switching the two arrays can be done very quickly.

this is a hybride priority-list approach coupled with roundrobin scheduling and the array-switch method of distributing timeslices.

one of the toughest things to get right is good interactive feel during heavy system load. While playing with various scheduler variants i found that the best interactive feel is achieved not by 'boosting' interactive tasks, but by 'punishing' tasks that want to use more CPU time than there is available. This method is also much easier to do in an O(1) fashion.

to establish the actual 'load' the task contributes to the system, a complex-looking but pretty accurate method is used: there is a 4-entry 'history' ringbuffer of the task's activities during the last 4 seconds. This ringbuffer is operated without much overhead. The entries tell the scheduler a pretty accurate load-history of the task: has it used up more CPU time or less during the past N seconds. [the size '4' and the interval of 4x 1 seconds was found by lots of experimentation - this part is flexible and can be changed in both directions.]

the penalty a task gets for generating more load than the CPU can handle is a priority decrease - there is a maximum amount to this penalty relative to their static priority, so even fully CPU-bound tasks will observe each other's priorities, and will share the CPU accordingly.

I've separated the RT scheduler into a different codebase, while still keeping some of the scheduling codebase common. This does not look pretty in certain places such as __sched_tail() or activate_task(), but i dont think it can be avoided. RT scheduling is different, it uses a global runqueue (and global spinlock) and it needs global decisions. To make RT scheduling more instant, i've added a broadcast-reschedule message as well, to make it absolutely sure that RT tasks of the right priority are scheduled apropriately, even on SMP systems. The RT-scheduling part is O(1) as well.

the SMP load-balancer can be extended/switched with additional parallel computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs can be supported easily by changing the load-balancer. Right now it's tuned for my SMP systems.

i skipped the prev->mm == next->mm advantage - no workload i know of shows any sensitivity to this. It can be added back by sacrificing O(1) schedule() [the current and one-lower priority list can be searched for a that->mm == current->mm condition], but costs a fair number of cycles during a number of important workloads, so i wanted to avoid this as much as possible.

(i'm sure there are other details i forgot to explain. I've commented some of the more important code paths and data constructs. If you think some aspect of this design is faulty or misses some important issue then please let me know.)

(the current code is by no means perfect, my main goal right now, besides fixing bugs is to make the code cleaner. Any suggestions for simplifications are welcome.)


i've performed two major groups of benchmarks: first i've verified the interactive performance (interactive 'feel') of the new scheduler on UP and SMP systems as well. While this is a pretty subjective thing, i found that the new scheduler is at least as good as the old one in all areas, and in a number of high load workloads it feels visibly smoother. I've tried a number of workloads, such as make -j background compilation or 1000 background processes. Interactive performance can also be verified via tracing both schedulers, and i've done that and found no areas of missed wakeups or imperfect SMP scheduling latencies in either of the two schedulers.

the other group of benchmarks was the actual performance of the scheduler. I picked the following ones (some were intentionally picked to load the scheduler, others were picked to make the benchmark spectrum more complete):

[ i can test any other workload too that anyone would find interesting. ]

i ran these benchmarks on a 1-CPU box using a UP kernel, a 2-CPU and a 8-CPU box as well, using the SMP kernel.

The chat-server simulator creates a number of processes that are connected to each other via TCP sockets, the processes send messages to each other randomly, in a way that simulates actual chat server designs and workloads.

3 successive runs of './chat_c 10 1000' produce the following message throughput:


Average throughput : 110619 messages per second
Average throughput : 107813 messages per second
Average throughput : 120558 messages per second


Average throughput : 131250 messages per second
Average throughput : 116333 messages per second
Average throughput : 179686 messages per second

this is a rougly 20% improvement.

To get all benefits of the new scheduler, i ran it reniced, which in essence triggers round-robin batch scheduling for the chat server tasks:

3 successive runs of 'nice -n 19 ./chat_c 10 1000' produce the following throughput:


Average throughput : 77719 messages per second
Average throughput : 83460 messages per second
Average throughput : 90029 messages per second


Average throughput : 609942 messages per second
Average throughput : 610221 messages per second
Average throughput : 609570 messages per second

throughput improved by more than 600%. The UP and 2-way SMP tests show a similar edge for the new scheduler. Furthermore, during these chatserver tests, the old scheduler doesnt handle interactive tasks very well, and the system is very jerky. (which is a side-effect of the overscheduling situation the scheduler gets into.)

the 1-CPU UP numbers are interesting as well:


./chat_c 10 100
Average throughput : 102885 messages per second
Average throughput : 95319 messages per second
Average throughput : 99076 messages per second

nice -n 19 ./chat_c 10 1000
Average throughput : 161205 messages per second
Average throughput : 151785 messages per second
Average throughput : 152951 messages per second


./chat_c 10 100 # NEW
Average throughput : 128865 messages per second
Average throughput : 115240 messages per second
Average throughput : 99034 messages per second

nice -n 19 ./chat_c 10 1000 # NEW
Average throughput : 163112 messages per second
Average throughput : 163012 messages per second
Average throughput : 163652 messages per second

this shows that while on UP we dont have the scalability improvements, the O(1) scheduler is still slightly ahead.

another benchmark measures sched_yield() performance. (which the pthreads code relies on pretty heavily.)

on a 2-way system, starting 4 instances of ./loop_yield gives the following context-switch throughput:


  # vmstat 5 | cut -c57-
     system         cpu
   in    cs  us  sy  id
  102 241247   6  94   0
  101 240977   5  95   0
  101 241051   6  94   0
  101 241176   7  93   0


  # vmstat 5 | cut -c57-
     system         cpu
   in     cs  us  sy  id
  101 977530  31  69   0
  101 977478  28  72   0
  101 977538  27  73   0

the O(1) scheduler is 300% faster, and we do nearly 1 million context switches per second!

this test is even more interesting on the 8-way system, running 16 instances of loop_yield:


   vmstat 5 | cut -c57-
      system         cpu
    in     cs  us  sy  id
   106 108498   2  98   0
   101 108333   1  99   0
   102 108437   1  99   0

100K/sec context switches - the overhead of the global runqueue makes the scheduler slower than the 2-way box!


    vmstat 5 | cut -c57-
     system         cpu
    in      cs  us  sy  id
   102 6120358  34  66   0
   101 6117063  33  67   0
   101 6117124  34  66   0

this is more than 6 million context switches per second! (i think this is a first, no Linux box in existence did so many context switches per second before.) This is one workload where the per-CPU runqueues and scalability advantages show up big time.

here are the lat_proc and lat_ctx comparisons (the results quoted here are the best numbers from a series of tests):


./lat_proc fork
Process fork+exit: 440.0000 microseconds
./lat_proc exec
Process fork+execve: 491.6364 microseconds
./lat_proc shell
Process fork+/bin/sh -c: 3545.0000 microseconds


./lat_proc fork
Process fork+exit: 168.6667 microseconds
./lat_proc exec
Process fork+execve: 279.6500 microseconds
./lat_proc shell
Process fork+/bin/sh -c: 2874.0000 microseconds

the difference is pretty dramatic - it's mostly due to avoiding much of the COW overhead that comes from fork()+execve(). The fork()+exit() improvement is mostly due to better CPU affinity - parent and child are running on the same CPU, while the old scheduler pushes the child to another, idle CPU, which creates heavy interlocking traffic between the MM structures.

the compilation benchmarks i ran gave very similar results for both schedulers. The O(1) scheduler has a small 2% advantage in make -j benchmarks (not accounting statistical noise - it's hard to produce reliable compilation benchmarks) - probably due to better SMP affinity again.


i've tested the new scheduler under the aforementioned range of systems and workloads, but it's still experimental code nevertheless. I've developed it on SMP systems using the 2.5.2-pre kernels, so it has the most testing there, but i did a fair number of UP and 2.4.17 tests as well. NOTE! For the 2.5.2-pre6 kernel to be usable you should apply Andries' latest 2.5.2pre6-kdev_t patch available at:

i also tested the RT scheduler for various situations such as sched_yield()-ing of RT tasks, strace-ing RT tasks and other details, and it's all working as expected. There might be some rough edges though.

A lot of people tried the patch. Some had problems getting it working, others got it running right away. A lot of people were very impressed at Ingo's figure of 6,000,000 context switches per second. At one point in the discussion, Dieter Nitzel asked if the patch might be merged with the preemption patch. Ingo replied, "yes, fast preemption of kernel-mode tasks and the scheduler code are almost orthogonal. So i agree that to get the best interactive performance we need both." A couple posts later, George Anzinger said, "The two patches will are not compatable. When the time comes we will have to work out how to make them compatable as they both modify key parts of sched.c." Robert Love replied, "Ingo and I are working on a merged patchset that works. Yay."

Elsewhere, Davide Libenzi replied to Ingo's initial post, pointing out that Ingo had objected to a patch by Davide 4 years earlier, that did what Ingo's patch went to great lengths to accomplish now: achieve O(1) algorithmic efficiency. Davide said, "You said, just in case you do not remember, that the load average over a single CPU even for high loaded servers is typically lower than 5. So why the hell create 13456 queues to achieve an O(1) on the lookup code when 70% of the switch cost is switch-mm ? Yes, 70% of the cost of a context switch is switch-mm, and this measured with a cycle counter. Take a look at this if you do not believe :" Ingo replied, "yep, this might have been the case 4 years ago :) But i'd not be ashamed to change my opinion even if i had said it just a few months ago. Today we have things like the slashdot effect that will break down otherwise healthy and well-sized systems. People have started working around things by designing servers that run only a limited number of processes, but the fact is, why should we restrict application maker's choice of design, especially if they are often interested in robustness more than in the last bit of performance. There are a fair number of real-world application servers that start to suffer under Linux if put under realistic load." Davide replied, "No Ingo the fact that you coded the patch this time does not really change the workloads, once you've a per-cpu run queue and lock. The thing that makes big servers to suffer in the common queue plus the cache coherency traffic due the common lock. Look here for an 8 way system : In even _high_realistic_ workloads on these servers the scheduler impact is never more than 5-15% and by simply splitting the queue/lock you get a 30 up to 60 time fold ( see picture ), that makes the scheduler impact to go down to 0.2-0.3%. Why do you need to overoptimize for a 0.2% ?"

At one point Linus Torvalds said, "I don't think O(1) matters all that much, but it certainly won't hurt. UNLESS it causes bad choices to be made. Which we can only guess at right now." This satisfied Davide, and a bunch of folks proceeded with an implementation discussion.

5. Multiple Kernel Trees

4 Jan 2002 - 7 Jan 2002 (10 posts) Archive Link: "Linux Kernel-2.4.18-nj1"

Topics: Development Philosophy, FS: ReiserFS, FS: ext3, Project Forks

People: Willy TarreauTom Rini

Nathan Russell posted his own patch-set against 2.4.18, but Willy Tarreau objected:

Please don't take it bad, I don't want to flame you nor anybody, but I think that if everyone publicly announces his own tree with his own set of changes against the main kernel, many users will be lost quickly.

Once, we had "only" 3 main trees for the stable release :

Now that Alan is working on something else, I can easily understand that people need a branch like the one he maintained, even if the majority of his work has been merged into the main tree.

But there is now an -mjc tree, an -nj tree, perhaps others I missed, and many more not announced here (-wt, I remember of Mathias Andree's -ma for example who may have been the first one to merge ext3 and reiserfs into a same kernel). All of them include nearly the same set of fixes that have not yet got into the official kernel, plus a set of more-or-less stable, more-or-less interesting features (depending who the targets are). At least, each one should announce the goal of his kernel, and who it is intended to : developpers, production users, desktop users, testers, all of them ? With your announce, nobody even knows if he takes some particular risks using features from your kernel. Example: I believe that at least Andre Pavenis still has problems with Doug's i810_audio driver, so this cannot be annouced simply as a "fix" without a little note.

I sure know it takes a lot of time maintaining a set of patches against a mainstream kernel, and it even takes more time reading bug reports and determining what is stable and what isn't. Not to tell about the dozens of compilations before an announcement (because you compile them, don't you?), and occasional porting of some fixes to other architectures.

So I really think it would be more productive if people worked around a smaller set of trees, stopped editing patches by hand again and again to resolve the same conflicts and tried to be a bit more understandable for newbies who are a bit lost when they don't know what kernel to try.

Perhaps you could have sent your patches to Mickael Cohen to help him release an -mjc2 more quickly, like we all did with Alan ? Or perhaps you have a clear idea in mind about what your kernel tree will be, but in this case, please elaborate on this a bit more than this simple post, and prepare to cope with bug reports from people who will trust your tree. This last point may be the major reason why I chose to keep my kernels for my friends and I...

I hope you didn't take it as an attack, it wasn't really against you, nor to start a flame war, but just to make people think a bit about something more constructive.

Tom Rini felt that even the current number of trees was too many, saying, "When Alan picked up 2.2.x, there wasn't someone else doing an -ac'ish 2.2 release as well. Marcelo is doing 2.4.x now, and seems to be doing a good job of making sure stable stuff gets in, and other stuff doesn't. The only patches that won't make it into Marcelos tree in the very-near-term (Which is all I'll speculate about) are the preempt (and lock-break) patches. Please people, more trees are not always a better thing when you're all doing the same thing." Willy replied, "Perhaps people who have a solid personal tree would like to continue this discussion off-list and find an arrangement about a single test tree. Concerning stable trees, I think that both Marcello's and Andrea's are rock solid. Othe people may want to use their distributor's."

6. Source Tracking

5 Jan 2002 - 6 Jan 2002 (5 posts) Archive Link: "Binutils and the Linux kernel source finder"

People: Dr. David Alan Gilbert

Dr. David Alan Gilbert said:

I am the author of the 'Linux kernel source finder' web page that lists for each architecture the place to get appropriate Linux kernel patches - see:

I wish to extend this to include pointers to the best/latest/most appropriate binutils for each architecture. I've put links in for x86, Alpha and MIPS to H.J.Lu's ftp site, since he tests for those 3 platforms prior to release.

I'd appreciate recommendations and comments from those using binutils on Linux for other platforms, with links to ftp, cvs or web pages describing the solutions for those architectures.







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.