Table Of Contents
|1.||7 Jan 2002 - 13 Jan 2002||(15 posts)||Lightweight User-Level Semaphore Implementation|
|2.||8 Jan 2002 - 11 Jan 2002||(20 posts)||IDE Patch|
|3.||9 Jan 2002 - 11 Jan 2002||(9 posts)||Ensuring 2.4 Interface Stability Regarding ReiserFS|
|4.||11 Jan 2002||(7 posts)||Slight Developer Disconnect Over NCR5380 Maintenance|
|5.||11 Jan 2002 - 12 Jan 2002||(6 posts)||Specifying Module Licenses In The Code|
|6.||12 Jan 2002 - 13 Jan 2002||(11 posts)||Problems In 2.2 SMP Support|
|7.||13 Jan 2002 - 14 Jan 2002||(13 posts)||Alan To Continue -ac Tree Against 2.4|
|8.||13 Jan 2002 - 15 Jan 2002||(8 posts)||Merging Preemptive Kernel Patch With New Scheduler Code|
|9.||13 Jan 2002 - 14 Jan 2002||(7 posts)||User-Mode Linux And The New Scheduler Code|
I'd like to thank all the folks who emailed me with congratulations for the 3rd anniversary of Kernel Traffic. Even Slashdot picked it up, which was a nice surprise.
Oddly enough, one of the biggest complaints I see about KT in its rare moments in the limelight is that it isn't a complete summary of events on the list. I think one of the Slashdot comments was that KT sometimes leaves out important threads, while going into too much detail on irrelevant side discussions.
As far as leaving out important threads, that is certainly true. List traffic has averaged 5.5 megs per week over the time I've been covering it, and in my opinion almost all of that traffic is on topic. This does not include the dozens of auxiliary mailing lists devoted to specific parts of the kernel, or the never-ending IRC discussions in which much real development takes place. Kernel Traffic, and other kernel-related publications, can do little more than give the briefest flavor of what is going on in kernel development. If you really want a thorough understanding, you have no choice: you must subscribe to LKML and experience its tidal forces for yourself. Take it from me: it is well worth it.
My goal with KT is to present the threads that most interest me personally. And I am most interested in the way kernel development plays out as a process. How are decisions made? Who is involved? What is free software development? This development model was born with the Linux kernel. Before then, although the sources may have been available under the GPL etc., the universally accepted wisdom was that high-quality software could only be created by a small team of experts working for a long time in private, putting out new releases only after many months or years of effort. This method was found in the GNU project just as it was in proprietary software companies. Linus cracked that idea wide open, and the core essence of his methods is now found in the organization of virtually every open source software project out there. Even some commercial entities try to simulate it in-house, with greater or lesser success.
These development processes are themselves still under development, and I choose to search for them here, where it all began. Not all the threads I cover focus on this desire, because the broad landscape of the kernel project doesn't always reveal itself under tight focus. And some summaries are more news-related than anything else, just presenting portions of the kernel as they currently are.
Above all, I hope people find Kernel Traffic enjoyable and interesting. If its failings and insufficiencies provoke people to delve deeper into the real kernel development forums, I count that as a complete success.
Be well, folks.
Mailing List Stats For This Week
We looked at 2071 posts in 8812K.
There were 497 different contributors. 255 posted more than once. 213 posted last week too.
The top posters of the week were:
1. Lightweight User-Level Semaphore Implementation
7 Jan 2002 - 13 Jan 2002 (15 posts) Archive Link: "[PATCH][RFC] Lightweight user-level semaphores"
Topics: SMP, Virtual Memory
People: Matthew Kirkwood, Linus Torvalds, Alan Cox, Manfred Spraul, Rusty Russell
Matthew Kirkwood announced:
The patch below implements some of your design for a really quick user-level locking primitive, as explained here:
There's a user-level API and a couple of test programs in the attached tarball. I haven't bothered wih the vital security hash/signature thing yet.
It all seems to work (i686 UP and SMP), but isn't without issues:
I don't do the:
if (kfs->user_address != fs)
because it doesn't seem to add anything, and prevents putting these locks in a non-fixed file or SysV SHM map.
Is that a problem?
To this last, Linus Torvalds replied that he'd suggested the check mainly as another sanity check, and that it wasn't strictly needed. Regarding the time-out requirements, Linus said nothing existed for that as such in the kernel so far, though in theory all the needed infrastructure would be provided eventually. Finally, regarding the leaks, he said that attaching the refcounts to the VM mappings would be an acceptable way to make sure all memory was freed at the proper time. He added that he might "also require a flag at mmap time (MAP_SEMAPHORE - some other unixes have something like it already) to tell the OS about the consistency issues that might come up on some architectures (on x86 it would be a no-op)." He and Matthew exchanged a few words on how to implement the reference counting, and then Linus said:
Note that there are other, potentially cleaner solutions. In particular, some people like the "semaphore as file descriptor" approach, and I have to say that I think they may be right. Then you just pass the file descriptor along as the cookie, and you can do dup()/close() etc on it.
Mind trying that approach instead? It's not all that far off from your current setup, and it would certainly have none of the security implications..
After some off-list discussion, Matthew posted some code, and there followed a technical discussion with Manfred Spraul, Matthew Kirkwood, Alan Cox, and Rusty Russell.
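The summary above does not spell out the design itself, but the general idea behind a "really quick" user-level lock can be sketched: keep the count in memory that user space can update atomically, and enter the kernel only on contention. The following is a minimal sketch under that assumption, not Matthew's patch. The names `struct usem`, `usem_down()`, `usem_up()` and the two `usem_kernel_*()` stubs are hypothetical; the stubs merely count how often the slow path would be taken and do not block or wake anything.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-level semaphore; the count would live in shared memory. */
struct usem {
    atomic_int count;     /* > 0: free slots; < 0: waiters exist */
    int slow_path_calls;  /* illustration only: stubbed kernel entries */
};

/* Stand-ins for kernel entry points (assumptions, not a real API).
 * A real implementation would sleep or wake a waiter here. */
static void usem_kernel_wait(struct usem *s) { s->slow_path_calls++; }
static void usem_kernel_wake(struct usem *s) { s->slow_path_calls++; }

static void usem_init(struct usem *s, int initial)
{
    assert(initial >= 0);
    atomic_init(&s->count, initial);
    s->slow_path_calls = 0;
}

/* Acquire: returns 1 if the uncontended fast path sufficed, 0 otherwise. */
static int usem_down(struct usem *s)
{
    /* atomic_fetch_sub returns the value held before the decrement. */
    if (atomic_fetch_sub(&s->count, 1) > 0)
        return 1;
    usem_kernel_wait(s);
    return 0;
}

/* Release: returns 1 if nobody was waiting, 0 if a wake-up was needed. */
static int usem_up(struct usem *s)
{
    /* A negative previous value means at least one waiter is queued. */
    if (atomic_fetch_add(&s->count, 1) >= 0)
        return 1;
    usem_kernel_wake(s);
    return 0;
}
```

In the uncontended case both operations are a single atomic instruction and never leave user space, which is the whole attraction; everything discussed in the thread (time-outs, reference counting, security) concerns the slow path that is stubbed out here.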
2. IDE Patch
8 Jan 2002 - 11 Jan 2002 (20 posts) Archive Link: "IDE Patch (fwd)"
Topics: Disks: IDE, Forward Port
People: Rob Radez, Andre Hedrick, Andrew Morton, Oliver Xymoron
On the principle that success reports for a given patch will not result in actual inclusion in the kernel sources unless first lauded on LKML, Andre Hedrick forwarded some private praise from Rob Radez, regarding Andre's ide.2.4.16.12102001 patch. Rob said, "I'm using your ide.2.4.16.12102001 patch with a Promise PDC20269 controller and a Maxtor 160GB hard drive on 2.4.17, and I just wanted to tell you that it's working great so far." A lot of other people agreed that Andre's code was working perfectly, and urged inclusion in 2.4 and 2.5; at one point Andre remarked, "I know the driver is stable and effectively perfect in operations. So I do not understand the total ignore I receive about it." Elsewhere, Andrew Morton said, "I spent a couple of hours beating the crap out of it, and none actually came out. I'd vote for prompt inclusion in 2.5, and inclusion in 2.4.x-pre1 when it's shown to be stable." Oliver Xymoron put in, "I vote for doing the reverse. The 2.4 codebase is the more tested, the 2.5 is a forward-port. Given all the related block changes still settling out in 2.5, changing IDE might make block layer/IDE issues hard to sort out. Let's see it in the next 2.4.x-pre1."
3. Ensuring 2.4 Interface Stability Regarding ReiserFS
9 Jan 2002 - 11 Jan 2002 (9 posts) Archive Link: "[PATCH] UUID & volume labels support for reiserfs"
Topics: FS: ReiserFS
People: Chris Mason, Oleg Drokin, Andreas Dilger, Hans Reiser
Oleg Drokin posted a patch (originally by Andreas Dilger) against the 2.4 kernel, to reserve space for a volume label and UUID in the ReiserFS 3.6 superblock, and to generate a random UUID for volumes converted from 3.5 to 3.6 format by the kernel. He urged inclusion in the sources, but Chris Mason said, "This should not be applied until an updated (non beta) reiserfsprogs package that supports these features has been released." Oleg felt there was no need to wait for outside support before applying the patch. He said, "when actual reiserfsprogs and util-linux support will appear, people will just start to use these features." He also cautioned that if tools were released supporting kernel features that were not yet implemented, bad things could happen. He added that Hans Reiser also felt the time had come for the patch.
Chris replied that applying the patch would force changes in the userland tools, which should as policy never be done during a stable series. But he went on, "But, the progs are improving so quickly that we should bend this rule a little bit. Another example is the unlink truncate patch never should have been sent to Marcelo without a non-beta reiserfsprogs that understood it. Neither should this patch (even though it is a much smaller problem)."
Oleg pointed out that the patch would not force any changes on userland programs, although "if someone will update their progs voluntarily, we cannot forbit them to! ;))". Chris replied, "The point is that we should never add something to the kernel until our utils package understands it. Yes, this is a simple case, but if we want to call reiserfs stable, there are some basic rules we need to start following." Oleg replied that actually, the latest reiserfsprogs package did understand the new data organization, it just couldn't actually change the content of the new fields itself. Chris, looking over the code, didn't see how the tools were aware of the new design, or even what a UUID was. Oleg said, "It does not know about uuid per se, but it know in that area some text data is stored."
At this point Oleg noticed, "I see MArcello have not applied this patch to 2.4.18-pre3, so we have some more time to prepare reiserfsprogs ;)". End of thread.
4. Slight Developer Disconnect Over NCR5380 Maintenance
11 Jan 2002 (7 posts) Archive Link: "Big patch: linux-2.5.2-pre11/drivers/scsi compilation fixes"
Topics: Disks: SCSI
People: Alan Cox, Adam J. Richter, Linus Torvalds
Adam J. Richter posted a large patch to clean up the SCSI modules in 2.5, and said he'd post smaller incremental patches for Linus Torvalds unless there were objections. Alan Cox replied:
I specifically told people not to hack on the old NCR5380 driver. You've taken a semi broken driver, destroyed it completely and risked disk corruption for anyone who uses it.
What really annoys me is that I've already asked you specifically not to submit patches to that driver but to take the 2.4.18pre version of the driver and port that one forward if you must fiddle with it. Instead you've wasted your time, and tried to make the future merge harder.
Its absolutely obvious from the changes that you have no grasp how the locking in that driver is handled, nor what it depends upon. If you had understood that locking you'd have realised you were hacking on a driver version that was totally flawed.
How many other maintainers have you ignored trying to send in untested patches to their drivers ?
Adam objected that he'd never received such instructions from Alan. Going over his email archive, he couldn't find any email like the one Alan described. But he added, "Now that I am aware of your request regarding using the 2.4.18pre version of the NCR driver for future maintenance of the 2.5 driver, I am happy to follow it."
Alan apologized for confusing him with someone else, and suggested, "I think you'll find it" (the 2.4 code) "a lot easier to follow too. The thing to watch is that the queue of devices to process on an IRQ is not per host but driver global. The rest should be obvious, but watch the co-routine locking. If you get that wrong the driver does occasionally recurse down the stack and explode mysteriously." End of thread.
5. Specifying Module Licenses In The Code
11 Jan 2002 - 12 Jan 2002 (6 posts) Archive Link: "CIPE vs. GPLONLY_"
Topics: Version Control
People: Brian Litzinger, Alan Cox, Olaf Titz
Brian Litzinger saw the following error when trying to load the CIPE module:
/lib/modules/2.4.17/misc/cipcb.o: Note: modules without a GPL compatible license cannot use GPLONLY_ symbols
He pointed out that CIPE was itself GPLed, and asked, "I remember reading on l-k a few times some stuff about GPLONLY_ but I have no idea what to do now that I've run into whatever the problem is that is caused by this?" Alan Cox instructed:
Add MODULE_LICENSE("GPL"); to the cipe code and all will be well
Brian tried this with complete success, and Olaf Titz pointed out that the fix had already made it into the CIPE CVS tree.
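The note Brian saw comes from a license check made when a module is loaded: a module that does not declare a GPL-compatible license (in kernel source this is done with the MODULE_LICENSE macro) is refused access to GPLONLY_ symbols. As a rough userspace illustration of that kind of check, and not the actual loader code, the sketch below compares a declared license string against a short list. The struct, the function name and the exact list of accepted strings are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of a module as a loader might see it. */
struct module_info {
    const char *name;
    const char *license;  /* NULL when the module declares no license */
};

/* License strings treated as GPL compatible in this sketch (an assumption;
 * the authoritative list lives with the kernel and its module tools). */
static const char *const gpl_compatible[] = {
    "GPL",
    "GPL v2",
    "GPL and additional rights",
    "Dual BSD/GPL",
    "Dual MPL/GPL",
};

/* Returns 1 if the module may link against GPL-only symbols, else 0. */
static int module_may_use_gplonly(const struct module_info *m)
{
    size_t i;

    assert(m != NULL);
    if (m->license == NULL)
        return 0;  /* no declaration at all: treated as not compatible */
    for (i = 0; i < sizeof gpl_compatible / sizeof gpl_compatible[0]; i++) {
        if (strcmp(m->license, gpl_compatible[i]) == 0)
            return 1;
    }
    return 0;
}
```

This matches Brian's situation as reported: the code was GPLed, but with no license declared in the module the check had nothing to match against.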
6. Problems In 2.2 SMP Support
12 Jan 2002 - 13 Jan 2002 (11 posts) Archive Link: "Linux-2.2.20 SMP & Asus CUR-DLS: "stuck on TLB IPI wait (CPU#3)""
People: Andreas Haumer, Benjamin LaHaise, Alan Cox
Andreas Haumer reported, "I'm seeing a problem with SMP Linux-2.2.20 on an ASUS CUR-DLS motherboard. I noticed there were similar reports in the past few months and I got the impression the problem should already be fixed in 2.2.20, but seemingly it isn't." Benjamin LaHaise said this was fixed in 2.4, and Andreas asked if there would be a back-port into 2.2.21; Benjamin replied, "That's unlikely: the improvements in smp locking are what 2.4 was all about, so "backporting" them is basically reinventing 2.4." And Alan Cox also said to Andreas, "2.2 does not support VIA SMP, its probably not a good kernel to choose for the buggy VIA chipsets either."
7. Alan To Continue -ac Tree Against 2.4
13 Jan 2002 - 14 Jan 2002 (13 posts) Archive Link: "Linux 2.4.18pre3-ac1"
Topics: Virtual Memory
People: Alan Cox, Adam Kropelin, Benjamin LaHaise
Alan Cox announced:
People keep bugging me about the -ac tree stuff so this is whats in my current internal diff with the ll patch and the ide changes excluded.
Much of this is stuff just waiting to go to Marcelo but it has the 32bit uid quota that some folks consider pretty critical and the rmap-11b VM which I consider pretty essential
(Marcelo I'll be sending you stuff I've done from this anyway, if there is other stuff you want extracting just ask)
Adam Kropelin reported, "For the sake of completeness I ran my large inbound FTP transfer test (details in the "Writeout in recent kernels..." thread) on this release. Performance and observed writeout behavior was essentially the same as for 2.4.17, both stock and with -rmap11a. Transfer time was 6:56 and writeout was uneven. 2.4.13-ac7 is still the winner by a significant margin." Alan replied, "That is very useful information actually. That does rather imply that some of the performance hit came from the block I/O elevator differences in the old ac tree (the ones Linus hated ;)). Now the question (and part of the reason Linus didnt like them) - is why ?" Benjamin LaHaise said, "Iirc, Linus just didn't like the low/high watermarks for starting & stopping io. Personally, I liked it and wanted to use that mechanism for deciding when to submit additional blocks from the buffer cache for the device (it provides a nice means of encouraging batching). The problem that started this whole mess was a combination of the missing wake_up in the block layer that I found, plus the horrendous io latency that we hit with a long io queue and no priorities. The critical pages for swap in and program loading, as well as background write outs need to have a priority boost so that interactive feel is better. Of course, with quite a few improvements in when we wait on ios going into the vm between 2.4.7 and 2.4.17, we don't wait as indiscriminately on io as we did back then. But write out latency can still harm us. In effect, it is a latency vs thruput tradeoff."
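One way to read the low/high watermark mechanism Benjamin mentions is as simple hysteresis on queue depth: stop accepting new requests once the queue reaches a high mark, and start again only after it has drained to a low mark, which tends to batch submissions. The sketch below illustrates that reading only; it is not the -ac elevator code, and `struct io_queue`, `ioq_submit()` and `ioq_complete()` are hypothetical names.

```c
#include <assert.h>

/* Hypothetical request queue with start/stop watermarks. */
struct io_queue {
    int depth;      /* requests currently queued */
    int low;        /* resume accepting at or below this depth */
    int high;       /* stop accepting at or above this depth */
    int throttled;  /* 1 while submitters have to wait */
};

static void ioq_init(struct io_queue *q, int low, int high)
{
    assert(0 <= low && low < high);
    q->depth = 0;
    q->low = low;
    q->high = high;
    q->throttled = 0;
}

/* Try to queue one request: returns 1 if accepted, 0 if the caller must wait. */
static int ioq_submit(struct io_queue *q)
{
    if (q->throttled)
        return 0;
    q->depth++;
    if (q->depth >= q->high)
        q->throttled = 1;  /* high watermark reached: stop new I/O */
    return 1;
}

/* Account for one completed request and lift the throttle once drained. */
static void ioq_complete(struct io_queue *q)
{
    assert(q->depth > 0);
    q->depth--;
    if (q->throttled && q->depth <= q->low)
        q->throttled = 0;  /* low watermark reached: start I/O again */
}
```

The gap between the two marks is the tuning knob: a wide gap favors throughput through batching, while a long queue with no priorities is exactly where the latency problems described in the thread come from.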
8. Merging Preemptive Kernel Patch With New Scheduler Code
13 Jan 2002 - 15 Jan 2002 (8 posts) Archive Link: "[PATCH] update: preemptive kernel for O(1) sched"
Topics: Big O Notation, Scheduler, Virtual Memory
People: William Lee Irwin III, Ingo Molnar, Robert Love
Robert Love announced an update to allow his preemptive-kernel patch to be used with Ingo Molnar's O(1) scheduler. Several folks pounded on it, and William Lee Irwin III said, "I have at least run it on my laptop, together with rmap even. No pathological behavior that I can tell. Of course, the interactive response is wonderful, but I haven't precisely measured anything, as I have enough other things to measure precisely it's a bit far afield."
9. User-Mode Linux And The New Scheduler Code
13 Jan 2002 - 14 Jan 2002 (7 posts) Archive Link: "The O(1) scheduler breaks UML"
Topics: Big O Notation, SMP, Scheduler, User-Mode Linux
People: Jeff Dike, Ingo Molnar, Davide Libenzi
Jeff Dike reported:
The new scheduler holds IRQs off across the call to context_switch. UML's _switch_to expects them to be enabled when it is called, and things go badly wrong when they are not.
Because UML has a host process for each UML thread, SIGIO needs to be forwarded from one process to the next during a context switch. A SIGIO arriving during the window between the disabling of IRQs and forwarding of IRQs to the next process will be trapped on the process going out of context. This happens fairly regularly and causes hangs because some process is waiting for disk IO which never arrives because the process that was notified of the completion is switched out.
So, is it possible to enable IRQs across the call to _switch_to?
Davide Libenzi posted a fix which seemed to work, but Ingo Molnar pointed out that it was broken for SMP systems. Elsewhere, Ingo also replied to Jeff's request, saying:
unfortunately this cannot be done, due to exit(), ptrace() and other SMP races. On SMP, the 'previous' task is protected by the runqueue lock. If we do the context switch outside the runqueue lock then a task might be freed on another CPU while it's in fact still in use.
there are other heavy implications as well:
in 2.4 i've implemented irq-enabled context switches, and it was a major PITA. To do it correctly one has to do reintroduce __schedule_tail() and do a task_lock/task_unlock to get context-switch atomicity via other means than the local runqueue lock. On 2.4 i did this because global runqueue contention was such an issue for certain workloads that even the task-unlocking overhead was worth it. With the O(1) scheduler this is pretty much out of the question.
we could enable interrupts on UP - because UP is special, disabling interrupts there is in essence a cheap 'global interrupt lock'. But that doesnt help the SMP/UML situation much.
i'd suggest to find some other solution for UML, besides signals. __switch_to is a very internal function that can very well be called with spinlocks disabled, we just cannot guarantee that it will be called with irqs enabled. Signals are something that is often 'heavy', it cannot be done atomically in the generic case.
Jeff replied:
You suggest implementing interrupts with something other than signals? What else is there?
In any case, I stuck a little kludge in _switch_to which checks for pending SIGIO and, if there is one, hits the incoming process with a SIGIO. This seems to do the trick.
There was no reply.
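The kludge described above (check for a pending SIGIO and, if there is one, send SIGIO to the incoming process) can be approximated in ordinary userspace C with `sigpending()` and `kill()`. This is only a sketch of the idea, not UML's `_switch_to` code; the function name `forward_pending_sigio()` is hypothetical, and it assumes the caller has SIGIO blocked so that an arriving signal stays pending rather than being delivered.

```c
/* Ask glibc to expose the POSIX/BSD signal declarations even when the
 * compiler is run with a strict -std= flag. */
#define _DEFAULT_SOURCE

#include <assert.h>
#include <signal.h>
#include <unistd.h>  /* pid_t */

/*
 * If SIGIO is pending for the caller, re-send it to next_pid.
 * Returns 1 if a signal was forwarded, 0 if none was pending,
 * and -1 if a system call failed.
 */
static int forward_pending_sigio(pid_t next_pid)
{
    sigset_t pending;

    assert(next_pid > 0);
    if (sigpending(&pending) != 0)
        return -1;
    if (sigismember(&pending, SIGIO) != 1)
        return 0;  /* nothing arrived during the switch window */
    return kill(next_pid, SIGIO) == 0 ? 1 : -1;
}
```

Note that the check and the forward are two separate steps, so a signal arriving between them is still missed by this sketch; that is in keeping with Ingo's remark that signal handling cannot be done atomically in the generic case.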
Sharon And Joy
Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.