Kernel Traffic #175 For 14 Jul 2002

By Zack Brown

Table Of Contents

Mailing List Stats For This Week

We looked at 1136 posts in 4973K.

There were 388 different contributors. 176 posted more than once. 126 posted last week too.

The top posters of the week were:

1. Reducing Disk Spin When Running Off Battery Power

24 Jun 2002 - 5 Jul 2002 (18 posts) Archive Link: "Automatically mount or remount EXT3 partitions with EXT2 when a laptop is powered by a battery?"

Topics: FS: ext2, FS: ext3, Laptop Support, Ottawa Linux Symposium, Power Management: ACPI

People: Andrew Morton, Pavel Machek, Stephen C. Tweedie, Miles Lane

Miles Lane asked if there were any way to make a laptop switch its filesystem mount from ext3 to ext2, when running on battery power. Several people said this was not possible, and asked why Miles would want to do such a thing. Andrew Morton speculated, "If it's because of the disk-spins-up-too-much problem then that can be addressed by allowing the commit interval to be set to larger values." Miles confirmed that this was indeed why he'd mentioned it; and went on to ask if there were any way to set the commit interval automatically when switching to battery power. Andrew replied, "If the APM/ACPI stuff can report the transition to userspace then yes, that's something which their support scripts could do." Pavel Machek chimed in, with, "ACPI should be able to pass that info. Please make that patch go to Linus, it looks very useful to me." Elsewhere some folks discussed various implementation details, and Stephen C. Tweedie said he'd check a patch into the tree when he got back from the Ottawa Linux Symposium.
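The support-script approach Andrew describes can be sketched; here is a minimal, hypothetical power-event helper (the function name, mount point, and interval values are all illustrative, though `commit=` is the real ext3 mount option):

```shell
# Hypothetical helper for an APM/ACPI event script: choose a long
# ext3 commit interval on battery so the disk spins up less often,
# and the ext3 default (5 seconds) on AC power.
pick_commit_opts() {
    if [ "$1" = "battery" ]; then
        echo "remount,commit=600"   # flush the journal every 10 minutes
    else
        echo "remount,commit=5"     # ext3's default interval
    fi
}

# The event script would then run (as root, on an ext3 mount):
#   mount -o "$(pick_commit_opts battery)" /
```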

2. X86 Page Sizes

26 Jun 2002 - 5 Jul 2002 (7 posts) Archive Link: "x86 Page Sizes"

Topics: Big Page Size Support, Ottawa Linux Symposium

People: Dan Sturtevant, Peter Svensson, Steven Cole

Dan Sturtevant asked, "I know the x86 linux kernel has 4K pages in userspace and 4M pages in kernel space. These two sizes seem to be limitations of the intel architecture (I think). Does anyone know a way to increase the userspace page size above 4K? Are there any patches for a 4M userspace pagesize?" A couple posts down the line, Peter Svensson explained, "The x86 cpus can use 4K or 4M pages in the hardware. The 4M pages are restricted to the kernel in Linux due to various problems" ... "4M pages are useful to minimize tlb misses which can be costly for some algorithms." He referred Dan to an earlier discussion on the list, with the Subject: Have the 2.4 kernel memory management problems on large machines been fixed? (http://www.uwsg.iu.edu/hypermail/linux/kernel/0205.2/1201.html) Steven Cole also gave a link to the 2002 Ottawa Linux Symposium (http://www.linuxsymposium.org/2002/) page, which contained the full Proceedings as a PDF file. He warned that it was 631 pages long, and pointed to pages 573-593 as bearing on this subject.

3. Some Discussion Of Hardware Auto-Detection

27 Jun 2002 - 4 Jul 2002 (9 posts) Archive Link: "Multiple profiles"

Topics: FS, Hot-Plugging, USB

People: Jesse Pollard, Brad Hards

Gregory Giguashvili asked why Linux was unable to detect all hardware configurations automatically. After some confusion over his phrasing of the question, he clarified, and Jesse Pollard explained:

Most tapes/scanners/disks that are removable/detachable are using the USB.

If that is the case, then yes - they can be handled automatically.

You do have to setup the USB daemon and drivers. Once configured they should be connected automatically. Depending on the type of disk (hard disk, filesystem type, access authorizations) you run into additional complications.

Not everything SHOULD be automatically done. For instance - overriding authorizations on a disk drive can allow a workstation user to violate the security policy established for the disk drive. The same can be said for a tape or floppy. Such policies are NOT implementable inside the kernel (at least not portably).

This is one reason an automatic mount is not necessarily valid. That policy cannot be supported (or even identified) by the kernel.

Scanners and printers however, are more policy neutral - they don't inherently store data that is policy controlled. At least not in the US. These devices are usually immediately available after connection. (Though I'm still working on getting my HP G55 scanner/printer working - it is recognized by the USB subsystem as soon as it is attached).

I believe in other countries scanners are required to be able to label the data being scanned and/or printed to identify the source of the data (doesn't prevent tampering, but it is still a policy).

And Brad Hards also replied to Gregory:

We can do this, for some device types. Not just for boot, but for hotplug type devices as well. The kernel option is CONFIG_HOTPLUG, and it signals userspace to describe what went on.

It is not appropriate for the kernel to decide what goes on (eg, if you attach a USB scanner, whether you'd like to load the necessary kernel modules, start up KDE and kooka, start a scan and save to /tmp/pr0n; or just ignore it for now because the scanner is noisy, and you'll start it running overnight from a cron job). So we make such policy decisions in userspace. This is normally some shell script run as /sbin/hotplug (although you can change the script name using a /proc interface). Sample scripts can be downloaded from http://linux-hotplug.sf.net, which has lots more documentation on this.
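Brad's point about policy living in userspace can be sketched as a toy handler (this is illustrative, not one of the linux-hotplug project's actual scripts; the kernel invokes `/sbin/hotplug` with the subsystem name as `$1` and event details such as `$ACTION` in the environment):

```shell
# Toy /sbin/hotplug-style handler: the kernel only reports that an
# event happened; what to do about it is decided here, in userspace.
handle_hotplug() {
    subsystem="$1"
    case "$subsystem" in
        usb) echo "usb event: action=${ACTION:-unknown}" ;;
        *)   echo "ignoring subsystem: $subsystem" ;;
    esac
}

# The kernel would effectively invoke:  ACTION=add /sbin/hotplug usb
```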

4. SCHED_IDLE Implementation

30 Jun 2002 - 7 Jul 2002 (9 posts) Archive Link: "[announce] [patch] batch/idle priority scheduling, SCHED_BATCH"

Topics: Big O Notation, Real-Time, SMP, Scheduler, Security

People: Ingo Molnar, Nicholas Miell

Ingo Molnar announced:

the attached patch adds a feature that was pretty high on the scheduler features wishlist: it implements the functionality of SCHED_IDLE, in a safe way. Another desired scheduler feature was batch scheduling, the cache-friendly handling of lowprio, batch-like, CPU-bound, 100% noninteractive tasks. The new SCHED_BATCH scheduler policy implements both features.

the existing SCHED_IDLE patches floating around, despite their simplicity, had one major flaw that prevented their integration into the scheduler: if an unprivileged SCHED_IDLE process uses normal kernel functionality, which happens to grab a critical kernel resource such as the root directory's semaphore, and schedules away still holding the semaphore, then there is no guarantee that the task will run again in any deterministic amount of time - keeping the critical resource potentially forever - deadlocking every other process that attempts to use that critical resource. This property, while being a source for soft lockups even during ordinary use, also makes SCHED_IDLE an easy DoS exploit.

as the size of the patch suggests, the safe solution is not simple. The basic concept is the identification of user-space preemption via a special scheduler upcall: one safe point to delay a task's execution indefinitely is when the task is preempted in pure user-space mode - if this happens then the lowlevel kernel entry code calls the schedule_userspace() function, instead of schedule(). In every other case the task needs to stay in the 'normal' scheduler queues, to guarantee prompt processing of kernelspace code. Furthermore, such batch-mode tasks need to be scheduled if they get a signal delivered - otherwise it would not be possible to eg. kill them.

other properties: SCHED_BATCH also triggers much longer, batch-like timeslices - the default SCHED_BATCH timeslice is 1.5 seconds. Nice values still have a meaning for SCHED_BATCH processes as well - they determine the relative percentage of idle CPU time allocated to SCHED_BATCH processes. If the SCHED_BATCH process is in kernel-mode then the nice value is used as the normal priority when preempting (or not preempting) other, non-SCHED_BATCH processes.

put in another way: whenever a SCHED_BATCH process is in kernel-mode, it's "elevated" into the SCHED_NORMAL priority domain - which guarantees timely execution of kernel-space code. When the SCHED_BATCH process is executing user-space code then it can be put into the batch-queue, and can be delayed indefinitely.

Timeslice distribution is a modified/simplified version of SCHED_NORMAL scheduling: SCHED_BATCH processes are scheduled in a roundrobin way, timeslices are distributed based on the nice value. SCHED_BATCH tasks that use up their timeslices get suspended until all other SCHED_BATCH tasks on that CPU exhaust their timeslices - at which point a new turn begins. SCHED_NORMAL, SCHED_RR and SCHED_FIFO tasks preempt SCHED_BATCH processes immediately. All this functionality is implemented in an O(1) way. (The interactivity estimator is active for SCHED_BATCH processes as well - this has an effect if the task is in kernelspace mode. This also makes sure that no artificial priority boost can be achieved by switching in/out of SCHED_BATCH mode.)

on SMP there are per-CPU batch queues - which enables the use of hundreds or thousands of SCHED_BATCH processes, if desired. A new, independent load-balancer is used to distribute SCHED_BATCH processes: SCHED_BATCH processes will populate CPUs depending on the CPU's "10 seconds history of idleness". The more idle a CPU, the more SCHED_BATCH processes it will handle. The weighting is done in a way to make the global distribution of SCHED_BATCH timeslices fair. The load-balancer also honors caching properties and tries to reduce unnecessary bouncing of SCHED_BATCH processes. (The balancing, like in the SCHED_NORMAL case, is not intended to be 100% 'sharp' - some statistical fuzziness is allowed to keep overhead and complexity down.)

(to see the SMP SCHED_BATCH load-balancer in action, start up multiple SCHED_BATCH processes on an SMP box - they populate all available CPUs evenly. Then start up a single CPU-intensive, non-SCHED_BATCH process - after a few seconds all SCHED_BATCH processes will migrate off to the remaining CPUs, and the SCHED_NORMAL task will get 100% CPU time of a single CPU.)

(design sidenote: initially i tried to integrate SCHED_BATCH scheduling into the existing scheduler and SCHED_NORMAL balancer somehow, but gave up on this idea. While that worked for RT scheduling, SCHED_BATCH scheduling is quite different, and is 100% orthogonal to all the other scheduling domains. Eg. the balancing of non-SCHED_BATCH processes *must not* be influenced by the way SCHED_BATCH processes are distributed amongst CPUs. The distribution of timeslices must be completely separated as well. So since all the queues and state has to be separate, they can as well be in separate (and simplified) data structures.)

i've also attached setbatch.c, which is a simple utility to change a given PID's scheduling policy to SCHED_BATCH. One straightforward way of using it is to change one shell to be SCHED_BATCH:

./setbatch $$

and start various commands from this SCHED_BATCH shell - all forked children inherit the SCHED_BATCH setting.

the load generated by multiple SCHED_BATCH processes does not show up in the load average - this is the straightforward solution to not confuse load-average-sensitive applications such as sendmail.

the runtime performance impact of SCHED_BATCH is fairly minimal. There is a (pretty light) branch and function call cost in the entry.S preemption codepath. Otherwise the SCHED_BATCH code triggers in slowpaths only: eg. when we would otherwise switch to the idle thread.

the patch was tested on x86 systems. non-x86 systems should still work with the patch applied, but no SCHED_BATCH process will in fact be suspended. For batch-suspension to work the architecture needs to call schedule_userspace() instead of schedule(), when pure userspace code is preempted.

the attached patch is against 2.5.24, it was tested on SMP and UP systems as well, but keep in mind that this is the first version of this patch, so some rough edges might be present. The patch can also be downloaded from my scheduler patches homepage:

http://redhat.com/~mingo/O(1)-scheduler/batch-sched-2.5.24-A0

There were only a few comments on this. Nicholas Miell had some questions about how Ingo's patch would jibe with the IEEE standard, but it turned out his objections would be best solved in user-space. However, Nicholas pointed out, "Keep in mind that someday, someone who is looking for the implementation of the SCHED_OTHER policy will be thoroughly confused by the kernel's complete lack of reference to SCHED_OTHER. And they'll be asking you for clarification. Or, you could make some note in the source that SCHED_OTHER is SCHED_NORMAL and eliminate any source of confusion now." Ingo did this.

5. Status Of O(1) Scheduler In 2.4

1 Jul 2002 - 7 Jul 2002 (28 posts) Archive Link: "[OKS] O(1) scheduler in 2.4"

Topics: Big O Notation, Feature Freeze, Scheduler, Virtual Memory

People: Ingo Molnar, Bill Davidsen, Tom Rini

Bill Davidsen asked what the holdup was for getting the O(1) scheduler into 2.4. Various distributions had been using it for a while, he said, with no problems. Ingo Molnar replied, "well, the patch is barely 6 months old. A new scheduler changes the 'heart' of the kernel and something like that should not be done for the stable branch, especially since it has finally started to converge towards a state that can be called stable ..." Bill replied that there were very invasive changes being done to the 2.4 Virtual Memory subsystem that would be more likely to break things than the new scheduler. But no one replied to him.

Elsewhere, Tom Rini also pointed out that the current 2.4 kernel was 2.4.19-rc1, and any major change would make it impossible to move the release-candidate into the official 2.4.20 kernel. Also, he said the 2.4 tree was not supposed to receive such huge changes, as it was theoretically a stable tree. Bill replied:

Since 2.5 feature freeze isn't planned until fall, I think you can assume there will be releases after 2.4.19... Since it has been as heavily tested as any feature not in a stable release kernel can be, there seems little reason to put it off for a year, assuming 2.6 releases within six months of feature freeze.

Stable doesn't mean moribund, we are working Andrea's VM stuff in, and that's a LOT more likely to behave differently on hardware with other word lengths. Keeping inferior performance for another year and then trying to separate 2.5's other unintended features from any possible scheduler issues seems like a reduction in stability for 2.6.

Tom replied that it was important to stop feature-creep, and that O(1) would be a major change to the core kernel. It was too invasive, he felt, for the 2.4 series. A little later he added, "I don't think the low-latency, preempt or O(1) should make it into 2.4. And since Ingo, who wrote this, doesn't think it should go into 2.4 right now, it hopefully won't." Elsewhere, Ingo commented:

it might be a candidate for inclusion once it has _proven_ stability and robustness (in terms of tester and developer exposure), on the same order of magnitude as the 2.4 kernel - but that needs time and exposure in trees like the -ac tree and vendor trees. It might not happen at all, during the lifetime of 2.4.

Note that the O(1) scheduler isnt a security or stability fix, neither is it a driver backport. It isnt a feature backport that enables hardware that couldnt be used in 2.4 before. The VM was a special case because most people agreed that it truly sucked, and even though people keep disagreeing about that decision, the VM is in a pretty good shape now - and we still have good correlation between the VM in 2.5, and the VM in 2.4. The 2.4 scheduler on the other hand doesnt suck for 99% of the people, so our hands are not forced in any way - we have the choice of a 'proven-rock-solid good scheduler' vs. an 'even better, but still young scheduler'.

if say 90% of Linux users on the planet adopt the O(1) scheduler, and in a year or two there wont be a bigger distro (including Debian of course) without the O(1) scheduler in it [which, admittedly, is happening already], then it can and should perhaps be merged into 2.4. But right now i think that the majority of 2.4 users are running the stock 2.4 scheduler.

Bill replied that the O(1) scheduler had proven itself to be at least as good as the stock scheduler, but Ingo came back with, "this is your experience, and i'm happy about that. Whether it's the same experience for 90% of Linux users, time will tell."

6. Some Discussion Of Major Version Release Scheduling

1 Jul 2002 - 7 Jul 2002 (19 posts) Archive Link: "[OKS] Kernel release management"

Topics: Feature Freeze, Release Scheduling, Virtual Memory

People: Bill Davidsen, Dave Jones, Rob Landley, Russell King, Adrian Bunk, Alan Cox

Bill Davidsen suggested that 2.7 be forked as soon as Linus released 2.6; he said, "I think developers will maintain the 2.6 work out of pride and desire to have a platform for the "next big thing." And their code can always be placed on hold for 2.7 until they clarify their thinking on 2.6, if that's really needed. Most of the developers take pride in what they did in the recent past and would certainly not be a problem if a fix were needed. And if there is a reasonable -rc process there shouldn't be any major bugs of the "start over" variety." Dave Jones replied, "Unfortunately, there's the possibility of people thinking "I'll fix it properly in 2.7, and backport", during which time, 2.6 doesn't get fixed any faster. People diving into 2.7 development and leaving 2.6 to those that actually care about stabilising it was Linus' concern if I understood correctly at the summit." Rob Landley replied:

And leaving stabilization to the people who care about stabilization would be a bad thing why? 2.4's first ten releases are a marvelous counter-example to the "stonewall new development to speed up bugfixing" theory of software development. The musical rotating feature freeze/thaw/slush/slurpee halfway through development cycles haven't been that effective either.

Linus ain't so good at maintenance, and he has said as much on this list. Linus's kernel sets the direction for Linux evolution, but he couldn't get the 2.4.0 VM stabilized and Alan Cox did. (Better than mainline, anyway.) If Linus had handed over the stable series to Alan right after 2.4.1, taken a month long vacation, and then opened a new branch that was a bit selective at first about what it took and from who, does anybody think 2.4 would have taken any longer to properly stabilize than it wound up doing? (Did Jens's bio patches really need to wait on the VM stabilization work? Did Jens help stabilize the 2.4 VM?)

We live in a world of multiple Linux kernel trees already, each with a different maintainer who is good at different things. Linus is a brilliant architect who is great at plucking the best ideas from the cream layer of the churning mass of Sturgeon's Law flung at him on a daily basis. When presented with four ways to do something, he'll spot the hidden fifth better way like nobody else can. But saying no in such a way as to promote stability is a different skill, and last time Linus went into big time "saying no" mode he wound up dropping VM stabilization patches from the then VM maintainer. And the feature freezes haven't historically been remarkably effective at producing a stable kernel soon after either.

A "stabilization fork" off of the development series could be done, as an experiment, during the next "feature slush". A maintainer who specializes in stabilizing code (You, Alan, and Marcelo are all doing a decent job at this now: it's not a common skill but not as rare as being a brilliant architect like Linus) can fork a "fixes only" tree that may or may not become 2.6, and see how it goes.

If it works, great; if it doesn't work, fine. You already maintain a fork off of Linus's tree, and Alan maintains one off of Marcelo's tree. Red Hat and SuSE maintain their own forks as well. The existence of such a fork, with a competent maintainer and its own user base, is not inherently disruptive to the rest of the world. Feeding patches from one tree into another and dropping the rest until they're merged is what you and Alan do normally anyway, so the down side of it NOT working (giving up after a few months and going "shucks, people just won't listen to anyone but Linus") isn't exactly catastrophic. As long as the maintainer is competent at merging to clean up the fork afterwards, and if they're not they can't effectively maintain their own tree in the first place anyway.

An explicit stabilization-only fork could even be a tool to help Linus's fork stabilize (if that is or becomes the goal), by tracking down bugs and performance tuning in a less turbulent environment while trying hard to introduce as few new problems as possible, and that being the ONLY goal of the fork. Lots of bugs have been tracked down in -dj or -ac and the fix then ported to the appropriate mainline later.

If the stabilization fork DOES become 2.6, then 2.6 can START with a new maintainer, like Marcelo for 2.4 and Alan for 2.2. Stable branch maintainers aren't normally expected to make major new architectural decisions anyway, that's what development kernels are for. :)

And if nothing else, it reduces the likelihood of development being stuck in a nebulous "no new features, well, okay, one more but that's it" mode for most of a year.

Yes, in theory 2.5 should BECOME a stabilization fork, under Linus, during the feature freeze. It might even happen this time. But how would hedging the bet hurt?

Russell King asked:

Think about who will do the stabilisation. Do you really think Alan or Marcelo will pick up 2.6 when it comes out? Or do you see someone else picking up 2.6?

One of the fundamental questions that needs to be asked along side the "fork 2.7 with 2.6" problem is _who_ exactly is going to look after 2.6. Dave Jones? If Dave, who's going to do Daves job of making sure fixes get propagated between stable and development trees?

Dave Jones pointed out that Marcelo Tossatti was not averse to taking over 2.6 once 2.7 came out. Dave said, "as he's done a pretty good job so far in 2.4, he seems to be the ideal guy for the job (time permitting). At the time we get to 2.6.0, 2.4 should have slowed down sufficiently that he'll be looking for something else to do." As far as his own future role, Dave added, "When we get to 2.6, I'll do 2.6-dj's until the important bits are all pushed to $maintainer, and keep the leftovers until 2.7-dj."

Elsewhere, Adrian Bunk was very much against forking 2.6 and 2.7 at the same time. He said:

If 2.7 doesn't start before 2.6 is _really_ stable everyone who wants to have a new development tree is more interested in making 2.6 a really good kernel instead of focussing immediately on 2.7 .

Bill replied:

Seems the reason this is being suggested is that lots of new stuff got shoved into 2.2 and 2.4 in the early stages, and they were NOT stable. Since far more influential people than I are suggesting this, obviously at least some of the folks feel it's worth trying something different.

The maintainer can always push really new stuff into 2.7, and Linus can always refuse to take a feature into 2.7 until something else is fixed in 2.6. Looking at how hard people are working to backport things from 2.5 to 2.4 I have faith that extra effort will be taken.

Russell King came in at this point, with:

I'm maintaining the 2.5 and 2.4 ARM trees here in parallel, and it is *really* tough to handle. There are several problems:

  1. finding the time to build and test each kernel version on hardware reasonably well.
  2. keeping track of what has been applied to which kernels
  3. getting down-stream developers to produce patches for the stable and development kernels generally doesn't happen.

The net effect is I have more support for various ARM machines in 2.4 at present than in 2.5, but 2.5 only contains my new features.

If 2.6 and 2.7 appear at the same time, you _will_ run into the same problems across the community. Unless people are willing to put lots of work in to making patches apply to two widely different kernel source trees, you could end up in the same situation. And it's no fun to be there.

The discussion went on for a few more posts, then petered out inconclusively.

7. Device Model Docs

2 Jul 2002 - 5 Jul 2002 (6 posts) Archive Link: "Device Model Docs"

Topics: Documentation, Ottawa Linux Symposium, Version Control

People: Patrick Mochel, Arnd Bergmann

Patrick Mochel announced:

I have had a chance to make a pass over the device model documentation. I've removed most gross inaccuracies, though there may still be a few glowing warts. You can get at them at

http://kernel.org/pub/linux/kernel/people/mochel/doc/

The OLS paper and presentation (in Open Office format) are also there.

Everyone is encouraged to have a look. Feel free to send me comments, corrections, or patches.

Additionally, I've also exported my BK tree, which can be found at

bk://ldm.bkbits.net/linux-2.5/

Currently, the only thing committed so far is the above mentioned documentation and the fixing of the existing documentation.

Arnd Bergmann took a look at the documentation, and the two of them went over some of the details.

8. Localizing #include Directives

3 Jul 2002 - 5 Jul 2002 (5 posts) Archive Link: "[patch,rfc] make depencies on header files explicit"

Topics: Source Tree

People: Tim Schmielau, Stephen Rothwell, Sandy Harris

Tim Schmielau remarked, "It seems to be quite common to assume that sched.h and all the other headers it drags in are available without declaration anyways. Since I aim at invalidating this assumption by removing all unnecessary includes, I have started to make dependencies on header files included by sched.h explicit. This is, again, just a small start, a patch covering the whole include/ subtree would be approximately 25 times as large. However, before I'll dig into this further, I'd like to make sure I haven't missed some implicit rules about which headers might be assumed available, or should be included by the importing .c file, or something like that. So any comments about this project are welcome." Stephen Rothwell thought this was a super plan, and added, "IMHO any source file (and here I include header files) should include all the header files it depends on. This gives us at least some chance of keeping the headers consistent with their usage." Sandy Harris replied:

I thought conventional wisdom was that header files should never #include other headers, and .c files should explicitly #include all headers they need.

Googling on "nested header" turns up several style guides that agree:

http://www.cs.mcgill.ca/resourcepages/indian-hill.html

http://www.doc.ic.ac.uk/lab/secondyear/cstyle/node5.html

and others that say it is controversial, can be done either way: http://www.eskimo.com/~scs/C-faq/q10.7.html

Am I just off base in relation to kernel coding style? Or would getting rid of header file nesting be a useful objective?

Stephen replied that "conventional wisdom" varied depending on who was asked, but he did add, "I just find it a real pain sometimes trying to figure out what other include files I need to when all I really want is one or two definitions in one particular include file. The same holds true when I am removing or moving stuff from one place to another (especially when trying to clean up some of the current mess)." But Tim Schmielau also put in, "Avoiding nested headers certainly results in the smallest set of header files actually #included. However, I think it's just not feasible with the kernel: many files would start with a list of some hundred includes, and I can't imagine a reasonable way to document the dependencies between them." EOT.

9. User-Mode Linux Security

5 Jul 2002 - 9 Jul 2002 (8 posts) Archive Link: "user-mode port 0.58-2.4.18-36"

Topics: Capabilities, FS, Security, Software Suspend, User-Mode Linux

People: Jeff Dike, Pavel Machek

Jeff Dike announced:

This is the fourth release of the 2.4.18 UML.

The major changes in this release include:

It is now possible to attach the UML gdb to sleeping threads. This is done by detaching gdb from the in-context thread and attaching it to the host pid of the sleeping UML process. UML may be continued by reattaching to the in-context thread. This feature was sponsored by Cluster File Systems, Inc.

There is a /proc/exitcode, which allows a UML process to set the eventual UML exit code.

Fixed some segfaults caused by calling openpty, which has an unusually large stack frame, overflowing the UML kernel stack.

The tty logging patch is integrated. This allows UML honeypots to log all tty traffic to a host file. This logging can't be detected or interfered with by root inside the UML.

UML now has a "hardware" watchdog.

The UML binary now lives in its own physical memory. This makes it easier for the swsusp patch to be ported to UML.

Fixed a bug with lots of zombies causing a UML panic.

It is now possible to move backing files and update the COW files with ubdx=cow-file,new-backing-file. Note that you must preserve the modification time when moving a backing file with something like 'cp -p' or 'tar p'.

Added support for kernel watchpoints. They can be mixed with watchpoints in gdb inside UML.

Fixed the bug which was closing file descriptors which should have been left open. This was most often seen as a panic during UML shutdown, although it also appeared in other places.

The mconsole driver now sends panic notifications to mconsole clients.

A number of smaller bugs were fixed and features added.

The project's home page is http://user-mode-linux.sourceforge.net

Downloads are available at http://user-mode-linux.sourceforge.net/dl-sf.html

Pavel Machek asked, "what prevents uml root from inserting rogue module (perhaps using /dev/kmem) and escape the jail?" Jeff explained, "That's prevented by the admin taking basic precautions and turning on 'jail', which refuses to run if module support is present and which also disables writing to /dev/kmem." Pavel pointed out that Jeff had disabled /dev/kmem writes by turning off CAP_SYS_RAWIO, and that this might interfere with operations that needed CAP_SYS_RAWIO. He felt that UML should report CAP_SYS_RAWIO as an unplugged hole, instead of simply disabling it. Jeff agreed that a user might be surprised to find CAP_SYS_RAWIO disabled when they'd expected it to be available, but he added, "I haven't seen anything that cares about CAP_SYS_RAWIO being off. That was the simplest way I could find to disable writing to /dev/kmem." Pavel felt that disabling writes to /dev/kmem was a strange way to architect the code, but Jeff pointed out that this, along with disabling CAP_SYS_RAWIO, were only done when 'jail' was turned on.
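The backing-file move Jeff describes works because `cp -p` keeps the modification time the COW file is validated against; a quick demonstration (file names and the fixed timestamp are illustrative):

```shell
# Move a UML backing file while preserving its modification time,
# which the COW file checks when it is next opened.
touch -t 200207050000 backing.img      # stand-in backing file, old mtime
cp -p backing.img moved-backing.img    # -p preserves the timestamp

# Both files now report the same mtime; UML would then be booted with:
#   linux ubd0=cow-file,moved-backing.img
```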

10. Linux 2.5.25 Announced

5 Jul 2002 - 8 Jul 2002 (11 posts) Archive Link: "linux 2.5.25"

Topics: Disk Arrays: EVMS, Disk Arrays: LVM, Disks: IDE, Disks: SCSI, FS: driverfs, Kernel Build System, Kernel Release Announcement, USB

People: Linus Torvalds, Matthias Andree, Joe Thornber, Alexander Viro, Heinz Mauelshagen, Patrick Caulfield

Linus Torvalds announced 2.5.25, saying:

More merges all over the map - ppc, scsi, USB, kbuild, input drivers etc.

And both Al and Andrew have been busy again.

This also introduces the support for non-100 Hz internal kernel times on x86, while still exporting the old interface to user space (ie anybody who exported raw jiffies before should be exporting "clock_t", which on x86 continues to be a 100 Hz clock, regardless of whatever the internal kernel frequency is).

Right now the x86 timer frequency is set to 1kHz, but that's just another random number. It could be a config option if people really care, but I'd rather just have people argue for or against specific internal frequencies and we'll find something most people are happy with. It's easy to change without user space even noticing now.

The other thing that we should sort out eventually is the unified naming for disk devices, now that both IDE and SCSI are starting to have some support for driverfs. Let's make sure that we _can_ have sane ways of accessing a disk, without having to care whether it is IDE or SCSI or anything else.

Matthias Andree asked, "Did the LVM guys (are you listening?) tell anything if they were about to go fix the current 2.5 LVM breakage? Or does EVMS work on 2.5 instead?" And Joe Thornber replied:

I'll say this yet again:

Heinz Mauelshagen is maintaining LVM1.0.x on 2.4 kernels. This is for bug fixes only, no new features will be added.

Alasdair Kergon, Patrick Caulfield and myself are working on the more generic device-mapper driver for both 2.4/2.5. Initially we have concentrated on 2.4, this driver is now very stable IMO (I would certainly trust my data to it in preference to LVM1).

I will post a URL to the 2.5 patch at some point this week.

There is no intention to maintain the broken design that is LVM1 in the 2.5 series - we do not have the spare resources to waste.

Alexander Viro replied, "All right. So how about removing it from the tree? It's broken; it won't be fixed; it's abandoned by maintainers (and $DEITY witness, there is a lot of very good reasons for that); there's nobody who would be able and willing to pick it. What's the point of keeping the damn thing in the tree? Could you (or Heinz) submit the patch removing it from 2.5?" There was no reply.

11. New rmap Patch For The VM Subsystem

5 Jul 2002 - 11 Jul 2002 (11 posts) Archive Link: "[PATCH][RFT](2) minimal rmap for 2.5 - akpm tested"

Topics: Big Memory Support, SMP, Virtual Memory

People: Rik van RielSebastian DroegeCraig KulesaLinus TorvaldsAndrew Morton

Rik van Riel linked to a patch (http://surriel.com/patches/2.5/2.5.25-rmap-akpmtested), and said:

This patch is based on Craig Kulesa's minimal rmap patch for 2.5.24, with a few changes:

It should be mostly ready for being integrated into the 2.5 tree, with the note that pte-highmem support still needs to be implemented (some IBMers have been volunteered for this task, this functionality can easily be added afterwards).

Right now this patch needs testing and careful scrutiny.

Andrew Morton and Linus Torvalds discussed a bug related to (though not directly within) Rik's patch. Elsewhere, Sebastian Droege reported, "after running your patch some time I have to say that the old VM implementation and the full rmap patch (by Craig Kulesa) was better. The system becomes very slow and has to swap in too much after some uptime (4 hours - 2 days) and memory intensive tasks... Maybe this happens only to me but it's fully reproducable" Rik explained:

It's a known problem with use-once. Users of plain 2.4.18 are complaining about it, too.

This is something to touch on after the rmap mechanism has been merged, Linus has indicated that he wants to merge the thing in small bits so that's what we'll be doing ;)

12. Status Of ATM/SONET Maintainership

9 Jul 2002 (4 posts) Archive Link: "who is the ATM/SONET maintainer?"

Topics: Maintainership, Networking, Version Control

People: Chris FriesenJoe PerchesWerner AlmesbergerMitchell Blank JrJeff Garzik

Chris Friesen wanted to contribute to the ATM and SONET drivers, but couldn't find a maintainer. He asked where he should send his patches, and Jeff Garzik replied that the drivers currently had no maintainer. He offered Chris the job, but Chris said, "<shudder> I don't think so...I've seen how much work it can be. Although I suspect ATM would be a lot less churn than, say, ethernet." Elsewhere, Joe Perches also replied to the original question, saying:

I believe Linux-ATM on sourceforge http://linux-atm.sourceforge.net/

I noticed you posted this same question there a month ago...

Developers of Linux-ATM on SourceForge

Werner Almesberger (almesber at users.sourceforge.net)
Mitchell Blank Jr (mblank at users.sourceforge.net)
Paul B. Schroeder (paulsch at users.sourceforge.net)

You could sign up as a developer to that project on sourceforge and make CVS changes there.

Of course you could always send the patches to the lk list and/or you could become the maintainer.

End of thread.

13. Preemption During Disabled Interrupts

11 Jul 2002 (5 posts) Archive Link: "Q: preemptible kernel and interrupts consistency."

Topics: Real-Time, Scheduler

People: Robert LoveOleg Nesterov

Oleg Nesterov noticed that, according to Documentation/preempt-locking.txt, disabling interrupts would prevent preemption, unless TIF_NEED_RESCHED were set by the running process. Robert Love replied:

Yes, you are right, if need_resched is set under you, you will preempt off the last unlock, even if interrupts are disabled.

However, the only places that set need_resched like that are the scheduler and they do so also under lock so we are safe.

Also, in your example, being in an interrupt handler bumps the preempt_count so even the scenario you give will not cause a preemption. If we did not bump the count, then your example would give a lot of "scheduling in interrupt" BUGs so we would know it ;-)

All that said, there is a bug: the send_reschedule IPI can set need_resched on another CPU. If the other CPU happens to have interrupts disabled, we can in fact preempt. I have a patch for this I will submit shortly.

Oleg did not agree that the code was safe. He said, "if process does not hold any spinlock and interrupts disabled, then any distant implicit call to resched_task() silently enables irqs. At least, this must be documented." Robert said:

If interrupts are disabled, where is this distant implicit call from resched_task() coming from?

That was my point, aside from interrupt handlers all the need_resched-touching code is in sched.c and both Ingo and I verified everything is locked.

If interrupts are disabled, there are no interrupts handlers. And if you are in an interrupt handler, preemption is already disabled.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.