Kernel Traffic #58 For 13 Mar 2000

By Zack Brown

Table Of Contents

Mailing List Stats For This Week

We looked at 1425 posts in 6129K.

There were 473 different contributors. 226 posted more than once. 196 posted last week too.


1. /proc Vs. devfs
22 Feb 2000 - 1 Mar 2000 (92 posts) Subject: "[patch-2.3.47] /proc/driver/microcode -> /dev/cpu/microcode"
Topics: Executable File Format, FS: devfs, FS: procfs, Ioctls
People: Tigran Aivazian, Richard Gooch, H. Peter Anvin, Jeff Garzik, Horst von Brand, Alan Cox, Theodore Y. Ts'o, Jes Sorensen, Eric W. Biederman, Werner Almesberger

In the course of posting and discussing a patch to the microcode driver, Tigran Aivazian suggested that perhaps /proc/driver/rtc should be moved to /dev/cpu/rtc. He said, "I know that /dev/rtc is already in /dev/misc/rtc but that is the actual driver whilst the human readable regular file could happily live in /dev/cpu/rtc (although it is not strictly "on cpu" but only a couple of inches away :)"

Richard Gooch replied:

I think we should be strict with the new devfs namespace. If it's not actually part of the CPU, it doesn't belong in /dev/cpu. If we're not strict, we end up with the same ad-hockery as /proc.

I'll (reluctantly) be the gatekeeper for devfs names if it means the namespace is kept sane.

H. Peter Anvin volunteered, with, "If you don't want to do it, I guess I could. At some point I will work on changing the Linux Device List to include the new names; I can't really promise to do it immediately, though." There was no reply to this, but Tigran replied to Richard with his explanation, "in addition to the binary-data device /dev/rtc (which is in /dev/misc/rtc of devfs). The /proc/driver/rtc is the human-readable dump of RTC data and I thought it should find its proper place in devfs instead of /proc for the same reasons why we moved microcode from /proc/driver to /dev/cpu."

Jeff Garzik felt that devfs was not suited for text output such as /proc/driver/rtc, but Richard, also replying to Tigran, felt that such a thing did belong somewhere in devfs. Jeff replied again, this time to Richard and more forcefully, "Devfs _is not_ the place for generic, driver-specific text output to userspace. The RTC data dump is in /proc and is not a /dev device for a reason..." But Richard asked what that reason was, and asserted that /proc should be only for processes. There were a number of replies.

Jeff explained:

RTC is a standard driver, and devfs is an experimental driver defaulted to off? :) Once devfs makes it into Mandrake, RedHat, ... then it will be time to consider moving EXISTING interfaces from /proc -> devfs.

Until then, moving existing interfaces to devfs requires the user to run devfs for functionality which was previously available without. Since installing devfs is not a "flip a switch and forget" change, you cannot move existing interfaces to devfs without due consideration, and giving devfs time to filter out to the userbase at large.

Patience, grasshopper. :) I agree with you in principle, but late in the 2.3.x cycle is NOT the time to break existing interfaces.

Richard agreed with this logic, and added, "I'm not trying to push removal of all the procfs interfaces at this point. But it would be nice to see patches in my inbox which added devfs interfaces :-)"

To Richard's previous assertion that /proc should be only for processes, Horst von Brand also replied:

Non sequitur. I mostly agree with the /proc bit, but I also agree that /dev is not the place for random text output. Simply because many people have no use for devfs, but need the data being shown.

So, the ways out AFAIKS are:

  • /sys directory with system data (a replacement for groveling around in kmem)
  • Clean up /proc, and put it back in there.

Any alternative needs a HPA that just vetoes stuff in bad format or with stupid names.

And Alan Cox added, "Peter happily vetoes and fixes naming even in things like IBM's naming choice for their S/390 port. I think we are in safe hands 8)"

Werner Almesberger also replied to Richard's statement that /proc should be only for processes, saying that the people actually putting stuff into /proc, at least, felt the data should be there. Tigran replied that the benefit of devfs was a rational namespace, "instead of just putting a lot of stuff in the root of /proc and some things grouped into subdirectories." He concluded, "After reading devfs/README and the code I am starting to be on Richard's side that "tough, use devfs" is not such a bad idea, after all :)" Horst replied that the thing to do was to just clean up /proc; and H. Peter responded to the "tough, use devfs" statement (which had been coined not by Tigran or Richard, but in a different sub-thread altogether), with, "It would be a *very* bad idea for some applications to have a system where we couldn't run without devfs, unless there is considerable benefit to such an arrangement. So far, there is no evidence of that, so having a system that works with a traditional /dev should be a requirement." But he added, "However, that's *NOT* an argument for putting things in /proc."

Elsewhere, Richard commented that given the choice, "I'd advocate we move from a "tough, use procfs" attitude to a "tough, use devfs" attitude," which also got a number of replies. Jeff pointed out that realistically, disabling devfs would be a lot more common than disabling /proc. Horst reiterated his position to Richard, saying, "What you are proposing is just moving a dungheap from one place to the other."

George Greer also felt that existing solutions worked fine, but Tigran replied, "Once upon a time we had a working (not perfect but working) aout-based Linux system. What was the point of switching to ELF? Lots of points. So it is with devfs - it solves a lot of problems, all nicely documented in Documentation/filesystems/devfs/README." A little later, Alan commented, "Your 68K of devfs does actually get you a whole pile of things beyond /dev. The trouble is none of them are things that embedded people care about."

Theodore Y. Ts'o also responded to Richard's comment about preferring "tough, use devfs" over "tough, use /proc":

You know, it wasn't that long ago that you said that using devfs should be a choice, and not something that would ever be forced. Now you're saying "tough, use devfs". I guess your earlier statements were just made to persuade people to accept it into the kernel, and now you're changing your mind?

As far as devfs not having to be mounted over /dev, if we *are* going to move to a world where for certain functionality devfs is mandatory, it would be useful to standardize using a standard pathname for accessing devfs --- say, /devfs. If you do want to mount devfs over /dev, then /devfs can be a symlink to /dev. If you don't want to mount devfs over /dev, then devfs can just be mounted on top of /devfs.

This way, application programs that need fixed, compiled-in paths can just use /devfs and be guaranteed to work on both kinds of systems.

There were several replies to Ted as well. Jes Sorensen agreed with him, and added, "I always had the impression that if devfs ever went into the official kernel it was going to be as an option, leaving the system functional without enabling it. If devfs is going to be mandatory I would like to see a statement about this from Linus." Eric W. Biederman replied:

Who do you think is silently pushing devfs?

  • Linus has argued that sysctl is bad (hardcoded magic numbers).
  • Linus has argued that ioctl is bad (not the UNIX way; all could/should go through read/write)
  • Others have argued procfs is bad (you can't handle device permissions...).
  • Richard Gooch argued devfs seems to handle these issues best...
  • The next day devfs was in the development kernel.

But I do agree there is no need to rush deployment or change. However we are certainly moving in a devfs direction.

Tigran replied to Ted's claim that Richard had changed his story from advocating devfs as a choice to advocating it as a requirement, "Theodore, with all due respect, I think you are being a bit tough on Richard. It is inevitable that initial acceptance of an idea requires some smooth talking and convincing whilst when the idea is accepted one could go on and gently push humans towards doing the right thing as a norm as opposed to as "optional". There is nothing wrong about that. Richard pushed me a bit towards converting microcode to devfs and my only resistance (as I found out) was due to ignorance - when that is overcome I am only grateful to him that he cared enough to convince me." And Richard added, "It's also because in private email Linus has pushed me further than I was intending to go. He basically took off the reins and gave me a nudge :-)"

 

2. Kernel Panic In 2.3.47
23 Feb 2000 - 4 Mar 2000 (5 posts) Subject: "2.3.47 kernel panic with serial.o"
Topics: MAINTAINERS File
People: Ian Peters, Theodore Y. Ts'o

Yau Man NG discovered that in 2.3.47, the following simple commands (as root) would hang the final command unkillably (and could crash the system -- see below):

shell# /sbin/insmod serial
shell# /sbin/rmmod serial
shell# /sbin/insmod serial
shell# /sbin/rmmod serial

The system would still work, but the final 'rmmod' wouldn't terminate, and would be unkillable, and would use a lot of CPU time. Other modules could be 'insmod'ed or 'rmmod'ed without a problem. However, doing a 'cat /dev/ttyS0' would cause a kernel panic. The problem was not repeatable in 2.2.13; Ian Peters replied:

Just a heads up -- this problem, as reported by myself and several others during February for kernels 2.3.45 and up (and possibly 2.3.44) persists up through 2.3.48. So far, all of the people who have reported problems have reported the exact same symptoms (well, I'd never tried to access the serial device after trying to unload the module, but doing so just now gave me an instant reboot, no oops message).

I've been unable to find anyone listed in MAINTAINERS who would seem to be responsible for the generic serial driver; if there is such a person, could someone point me in the right direction?

Theodore Y. Ts'o claimed maintainership of the serial driver, and asked for more detailed information about the hardware involved and the error messages observed. Ian provided this, and Ted posted a one-line patch, and explained:

I found the problem; it's not actually a bug in the serial driver, but in the rewrite of the bottom-half drivers to use tasklets.

When you remove the serial driver, it calls remove_bh(SERIAL_BH), which calls tasklet_kill(). Tasklet_kill() uses a test_and_set synchronization loop on the TASKLET_STATE_SCHED bit of t->state to make sure that tasklet isn't being scheduled, and then waits for the tasklet to finish running. The problem is that it doesn't clear the bit afterwards!

This means that after calling remove_bh(), the TASKLET_STATE_SCHED bit is stuck on, which permanently disables that particular bottom-half handler from ever working properly. It also means that the next time remove_bh() is called for that bottom-half handler, it will get stuck in an (unkillable) infinite loop.

End Of Thread (tm).

 

3. Loading A New Kernel From A Running System
23 Feb 2000 - 1 Mar 2000 (18 posts) Subject: "Load linux...from linux?"
Topics: Access Control Lists, FS: NFS, FS: ext2, FS: ext3, Small Systems, User-Mode Linux
People: Lars Kellogg-Stedman, Dan Malek, Werner Almesberger, Ronald G. Minnich, Jeff Garzik

Lars Kellogg-Stedman asked if a Linux equivalent of 'loadlin' were possible. He explained, "For instance, it would be nice if the last action when rebooting a linux system were (optionally) to load and execute a new kernel image, rather than reset the system. Or better yet, shutting down a linux system would return one to a boot loader prompt from which one could select a new kernel (kind of a poor-man's open firmware)." Jesper Juhl gave a pointer to User-Mode Linux (http://user-mode-linux.sourceforge.net/), which he acknowledged was not exactly what Lars had asked about, but was at least in the same ballpark. Jeff Garzik gave a pointer to Linuxbios (http://www.acl.lanl.gov/linuxbios/), and suggested grepping the page for "LOBOS". Dan Malek also replied to Lars' question, and explained in response to the possibility of having the last action of a running system be to load and execute a new kernel:

I do this quite frequently on embedded systems. Rather than try to write boot roms with network protocols, I just put a Linux kernel in the flash rom. They can boot up using diskless NFS (or whatever you want), and are hacked to reserve a chunk of contiguous physical memory for loading the new kernel. I have a rather trivial program that will mmap() the memory and copy a kernel file there. Then I modified the machine_restart() to take a reboot command with a starting address. The "gorom" function then disables interrupts, mmu, cache, and jumps to the starting address of the new code.

The kernels I boot are all Linux/PPC compressed zImage. Since they relocate themselves and uncompress the kernel to the proper memory location, it doesn't matter where they are initially loaded in physical memory. Works very well.

I chose to reserve a big piece of physical memory because it was easy to do. There are several variations on allocating memory for the new kernel, with associated modifications to boot loader wrappers. I am sure you could write a driver that would coalesce memory for the new image that would work with any kernel, so from one kernel you could just boot another.

Elsewhere, Werner Almesberger said that, before finding LOBOS, he'd spent half a weekend trying to implement Lars' suggestion. He explained, "All the paging stuff in LOBOS looks quite elegant, but I'm a little confused by how it does the final copy. Seems to me that you may burn holes into your kernel image this way ... (My code maintains a list of "sacred" pages, including the code page, which it moves away when it has to overwrite their location. Basically the same procedure as in FiPaBoL (part of the linux-7k boot code), but less of a mind twister, because it's in C. Also, I leave all the setup to user space, so the kernel doesn't even know what it is booting.)"

Ronald G. Minnich, also developing along the same lines, replied peripherally, "using LOBOS and ext3 I can chop my laptop reboot time from 90 seconds or so to 43. We'd like to see it down to 5 seconds, much time is lost on daemon startup and that initial mount of ext2 where it seems to like to do a full-surface-scan of the disk :-)" and there followed an implementation discussion.

 

4. ext3 Status And Discussion
25 Feb 2000 - 3 Mar 2000 (17 posts) Subject: "ext3 status?"
Topics: FS: NTFS, FS: ext2, FS: ext3
People: Stephen C. Tweedie, Jeff V. Merkey, Andreas Dilger, Riley Williams, Matthias Andree

Matthias Andree noticed that the only available ext3 version was 0.0.2c, which was already 3 months old and against 2.2.13; he asked if development was still proceeding. Stephen C. Tweedie replied:

0.0.3 should be out within the week. The journal abort code is complete now for response to fatal errors such as EIO in the journal.

In doing this work I've found that there are a number of options for the future in terms of duplicating some journaling information to make it robust against IO errors during recovery, but I'll not hold up the 0.0.3 release to do that --- that will have to come in the future, and there will be a journal format change involved. (Migration between the journal types will be simple, of course.)

Jeff V. Merkey came in with a question of his own. He said, "We are working on the on-disk conversion from Netware to EXT2 on disk formats. I have not looked at the EXT3 code (but will soon). Were any of the changes to the on-disk formats too radical that we should be aware of?" Stephen replied, "No, none at all right now. The journal is in a separate inode referenced by a currently-unused superblock field. There are a couple of new superblock compatibility flags used. Nothing else has changed: all the clever stuff goes on inside the journal itself."

Jeff replied, asking if ext3 would automatically create the journal file if it wasn't already present, or if an empty journal should be created by hand, or what (he added, "This is what I am doing at present with NetWare2NTFS conversion (create it empty) and the first time Windows 2000 mounts the converted volume, the journal and the first 16 meta files are created when the volume is mounted." ). Riley Williams replied that if no journal were present, ext3 would simply drop through to ext2; and Stephen confirmed this. Nearby, Andreas Dilger also explained, "There are comments in the ext3 code to the effect that this" [auto-creating the journal file] "will be done in the future, but currently you are required to create a 10000 block file from userspace and record the inode number (it will be #13 on a new filesystem). Your conversion utility would have to do the same." Jeff had also asked if there were any other meta-files he should be aware of, other than the journal file itself, and there followed what appeared to be some misunderstandings of ext3 from some of its users:

Andreas replied to Jeff, "The first 12 reserved inodes at the start of an ext2 filesystem (with appropriate data, if applicable) are also required for ext3." But Stephen said simply, there were no other meta-files. Nearby, Riley had said, "I believe ext3 requires that the journal file occupies 12k (three 4k memory pages) of CONSECUTIVE disk space, and your conversion would need to locate and allocate that for ext3 conversion to work." But Stephen replied, somewhat alarmed, "No. I've no idea where you got that idea from! 12k isn't a special size at all to journaling. The journal will work more efficiently if it is contiguous, but you can make it any size you want, anywhere on the disk. It should be at least 1024 blocks long."

 

5. Proposed Scheduler Changes
25 Feb 2000 - 3 Mar 2000 (26 posts) Subject: "[PATCH] proposed scheduler enhancements and fixes"
Topics: Assembly
People: Dimitris Michailidis, Linus Torvalds, Pavel Machek, Artur Skawina

Dimitris Michailidis posted a patch to the process scheduler. One of his changes, he described, "adds a new scheduling policy SCHED_IDLE. Processes of this type run on CPUs that would otherwise be idle. Useful for apps like SETI@Home, code crackers, etc. Implementation of this feature is extremely lightweight. Among the scheduling functions only schedule() is SCHED_IDLE-aware and the overhead for non-idle CPUs is at most 1 instruction per schedule() invocation." Linus Torvalds rejected the patch, and explained:

So why do you think your implementation of this doesn't have the deadlock that every other implementation has?

The deadlock is due to priority inversion, where an "idle-priority" task gets a resource (say the directory semaphore), goes to sleep, and never wakes up again because there is some slightly more important process running all the time.

In short, this has been tried before, and it has ALWAYS been a serious security hole full of denial-of-service kind of attacks.

Not applied

Pavel Machek digressed historically, "Once upon a time, there was trick which enabled slow for SCHED_IDLE processes (by abusing flags -- something like PTRACE), then did priority boost in slow path. I even remember it made fast path slightly faster by assembly level optimalization. Unfortunately, that clever trick is not present in this sched patch." Artur Skawina replied, "if you're referring to my sched_idle hack, then it still exists, but never really made it past the proof of concept stage. [somebody else was interested in maintaining it though]. Anyway, even w/o any special handling, the deadlock is no worse than the SCHED_OTHER vs SCHED_{RR,FIFO} one; it's just that it's more likely to get triggered in a typical setup. I will probably be dropping the ptrace hack anyway, as it may prevent some kernel entry optimizations. you should be able to work around the deadlock by having a kernel thread periodically check the thread queue and raise the idle threads priority temporarily. [you'll need a thread anyway for reasonable signal handling] The current scheduler doesn't really support many static priorities so that any sched_idle solution will be a hack. [and this is good, a more complex scheduler would create more problems than it would solve]."

 

6. Suggestion: /proc/nzombie Zombie Counter
26 Feb 2000 - 3 Mar 2000 (38 posts) Subject: "/proc/nzombies"
Topics: SMP
People: Jos Visser, Ricky Beam, Walter Hofmann, Alan Cox, Albert D. Cahalan, Frank v Waveren, Mike A. Harris

Jos Visser proposed a trivial change to exit.c, to create '/proc/nzombies', a file to contain the current number of zombie processes on the system. Mike A. Harris replied that instead of this, the thing to do was send fixes to the maintainers of zombifying programs. Assuming the apps were Open Source, he felt most fixes would be simple one-liners. Jos replied, "Before I can fix the problem, or pound on the vendor, I have to be aware that the problem exists. I need automated system monitoring for that, and the system monitors must have a quick and efficient way to get at the information." Mike took the point and agreed that something was needed, perhaps even an in-kernel feature, but he felt it would bloat the mainstream kernel, and should only be maintained as an outside patch. But Ricky Beam protested, ""bloat"? It's two lines of code in the process exit path. That adds, what, four cpu cycles to process termination? That's not bloat. The price of tracking process zombies in the kernel is far, far less than tracking it in user land. I see no reason why this shouldn't be added as a configurable feature. If you don't want it or need it, then don't compile it in." Mike tested the patch, and decided that it didn't unnecessarily bloat the kernel at all. He put in his vote for including it in the official sources. But Walter Hofmann objected:

It certainly is bloat.

The process exit path is critical--it is called often and every cycle is important. Take a webserver: It can fork a process per request. Consider an SMP system: If you write to the same counter repeatedly with all CPUs you will get cache ping-pong effects which slow you down severely. All these things have to be considered.

OTOH, counting zombies can be easily done in user space. A monitoring tool doesn't need to do it more often than, say, once per minute. If it needs to scan all processes for other purposes as well then counting zombies is nearly free.

Jos replied that calling the feature "bloat" was an exaggeration, because, "If I do a zombie count in userland every 5 minutes, the number of cycles this is going to burn is tremendously larger than performing the count inline in exit()." Alan Cox replied, "However that doesnt interfere with the 99.9999999999999% of the rest of us who don't give two hoots about your zombie count, putting it in a mainstream kernel does." And Albert D. Cahalan added, "Normal systems don't have such a high zombie population, or maybe you are just less tolerant of a few zombies. Millions of Linux users would have to spend CPU time so that you can more quickly count zombies."

There was a bit more discussion, and then elsewhere Jos posted his summary:

Jos Visser (that's me) proposed to put a configurable option in the kernel to count the number of zombies in the system and export that information through a file in /proc (/proc/nzombies). The underlying reason was to make this particular system information quickly available to applications that want to monitor the system's health.

Some people objected to this because it would "bloat" exit() with an inc++ and release() with a dec--. I think that calling this "bloat" in an era where web servers are sinking from userland into the kernel is stretching it a wee bit (however, others disagree with me here). Furthermore I would suggest that since it is (would be) configurable, everybody can choose whether or not to incur this penalty.

Other people suggested various ways to perform the zombie count in shell script. My whole point in posting the /proc/nzombies suggestion in the first place was that the kernel is uniquely placed to perform this count not only factors, but whole magnitudes, more efficiently. It's not just a case of moving code from userland to the kernel, it's doing it in the kernel because there I can solve this particular problem in a way I can nowhere else in the system.

Then, would it be fair to slow down exit() for a feature that will not be used by the majority of users? Again, if it's configurable, anyone can make their own choices. On the other hand, if these type of very minor pieces of functionality would start being configurable options, the number of CONFIG_xx options would probably soon go through the roof! However, this would probably not be the first function the kernel contains that is geared toward a particular type of user/usage.

Many people questioned why I would care, and why I couldn't just patch the zombie generating software in the first place. My answer there is that in my experience in implementing system monitoring for various types of Unix systems over the years have led me to making the zombie check a "default" thing to be monitored for. You just would not believe some of the (closed source) software one comes across. Changing or dumping the software at fault is usually not acceptable for business reasons.

One (interesting) comment was that the feature was not general enough, and should maybe be changed into a feature to obtain the entire process table quickly and efficiently from a userland program, thereby solving a bunch of other problems as well. I find that an interesting approach, and one that I'll ponder on for some time. Would this be "bloat" as well? Who knows.....?

Alan encouraged, "Remember also - if the nzombies patch is useful to you and it works, this is free software - make it available. The fact its not relevant for the main kernel stream doesnt mean its not worth having a patch out there - some day someone will be glad of it," and to the more general process-table feature, he went on, "Being able to yank the table fast is much more generic than the zombie thing so yes I think it would be far more useful." And Frank v Waveren added his support, with, "I think that'd be a great idea. So many programs try to get all this info, that having one big proc-file with it (Though this shouldn't replace /proc/pid, not even in the long run imo) would be a worthwhile idea."

 

7. C Compiler Saga Continues
26 Feb 2000 - 1 Mar 2000 (9 posts) Subject: "3Com support broken in 2.3.47"
Topics: Networking
People: Alan Cox, Linus Torvalds

This was first covered in Issue #1, Section #5  (8 Jan 1999: GCC or EGCS?) , where 'egcs' was still unreliable. 'egcs' seemed satisfactory in Issue #10, Section #4  (6 Mar 1999: 'gcc' Vs. 'egcs') , but by Issue #13, Section #10  (29 Mar 1999: 2.2.5 Announcement; Linus Goes On Vacation) it looked as though there were still problems after all. By Issue #14, Section #4  (4 Apr 1999: The 'gcc' Vs. 'egcs' Saga Continues) , it looked as though true compatibility might be far off. In Issue #19, Section #16  (16 May 1999: 2.3.1 Boot Failures May Be Linked To egcs) , 'egcs' was a common tool in two systems that failed to compile, and by Issue #22, Section #6  (27 May 1999: Linus Still Prefers gcc To egcs) Linus Torvalds was saying that the 'egcs' team was heading in the wrong direction, by forcing too great an adherence to standard C. In Issue #29, Section #3  (18 Jul 1999: gcc Vs. egcs: The Saga Continues) , it was pointed out that a number of distributions compiled their kernels with 'egcs'. There were optimistic predictions that 'gcc' 2.95 would be out soon and would solve all problems. Then in Issue #48, Section #6  (13 Dec 1999: Development Process Criticized; Alan Uses egcs 1.1.2) it was revealed that Alan Cox used 'egcs' 1.1.2 for his compiles.

This time, in the course of discussion, Alan Cox said, "we believe egcs 1.1.x and gcc 2.7.2 are both totally solid compilers for the kernel but gcc 2.95 is tripping kernel asm errors and stuff still."

 

8. Legacy And Modern Bloat In BSS Data Initialization
27 Feb 2000 - 29 Feb 2000 (19 posts) Subject: "[PATCH 2.3.48] initrd fix (Mike Galbraith)"
People: Alan Cox, Matthias Urlichs, Russell King, Daniel Phillips

Frank Bernard posted a patch, and Russell King noticed that some variables were being explicitly initialized to 0, even though they were in the BSS (Block Started by Symbol) data segment, which is zeroed to begin with. Unnecessarily initializing the variables would tend to make kernel binaries larger, he pointed out. Alan Cox explained, "Long long ago (before 1.0) the kernel didnt zero the BSS. Some legacy of that survives in old assignments." Daniel Phillips added that one benefit of explicitly initializing variables was to prevent compiler warnings; but Matthias Urlichs specified, "We're talking about variables outside of functions here. They're zeroed out by the BSS init code, and the C compiler knows that." Russell also replied to Alan, pointing out that he was referring to new explicit initializations he'd seen in the patch, not legacy ones.

 

9. ACPI Vs. APM
28 Feb 2000 - 2 Mar 2000 (23 posts) Subject: "APM_power_off"
Topics: Power Management: ACPI
People: Jeff Garzik, Daniel Egger, Andy Henroid, Jamie Lokier

Lars Vahlen was trying to get 'power off on shutdown' working on his Athlon 500/Microstar 6167 with 2.3.47, without success. Jeff Garzik said that APM was outdated and should not be used. Instead, he recommended, "Download and install acpid, and enable CONFIG_ACPI. Power-off on shutdown will work beautifully. Forget about APM, you shouldn't use such an old technology on such a nice new machine ;-)" Jamie Lokier was surprised by this advice, and pointed out that as recently as 2.3.40, acpid had seemed virtually useless. Jeff replied, "Hacking has definitely occurred since then. Power-off and idle (power save) work fine for me, both on my desktop K6 and my laptop K6. I haven't tested suspend... maybe you would be up to grabbing the latest kernel and acpid, and testing suspend and resume?"

Elsewhere, Daniel Egger perplexedly requested, "How can I use ACPI in real life at the moment? It seems like it's just a playground for developers but on the other side if it's enabled in the kernel APM automatically gets overridden, so laptop users would be rather silly to activate ACPI at the moment because the system will empty the battery really fast." But Andy Henroid replied:

Actually, if you are running acpid, you'll get a fair amount of power savings from ACPI C-states (low power when the system is idle, like an improved "hlt") So, no, you definitely will not be draining the battery any faster than if you were using APM.

The issue with ACPI support right now is that you don't get battery status or suspend capability. The former is fairly easy to do and is coming along soon in acpid. The latter requires significant changes to drivers and probably the kernel and is going to take a bit.

Daniel reported apparent success, but he didn't think he'd performed an accurate test. He asked how to identify broken ACPI implementations. Andy replied that he'd update the ACPI documentation to provide real tests, and explained:

You can verify that C-states work by running acpid and looking for "ACPI Cx works" in the kernel log. (Note, this is if your system has ACPI tables, otherwise you need to fill in C-state latency values for /proc/sys/acpi/p_lvlX_lat. Sorry, this should probably go in the documentation somewhere)

A working S5 is verified by running acpid and doing "shutdown -h now". Does your system power-off? If not, S5 is not supported (check syslog) or S5 is broken. (This will not work at all unless you have ACPI tables)

You can verify that ACPI events work by running acpid and pressing the power button. Does the system start to shutdown?

Daniel tried these tests without much success, and went back to APM. EOT.

 

10. Some Discussion Of Kernel Multitasking And Scalability
29 Feb 2000 - 6 Mar 2000 (9 posts) Subject: "Linux kernerl preemptive ??"
Topics: BitKeeper, SMP
People: Matti AarnioTuukka ToivonenLarry McVoy

Aneesh Kumar had several questions. He asked whether the kernel did pre-emptive multitasking, whether the kernel was re-entrant, and whether it supported more than one CPU. Matti Aarnio replied, "SMP support is quite good, but the kernel isn't PRE-EMPTIVE in sense that kernel thread X could be pre-empted by thread Y. Kernel threads can YIELD (e.g. while doing some wait) to other processing until whatever they are waiting for, they get it." Aneesh didn't quite understand this answer. If the kernel could yield to some other process while waiting for a resource, why was it not made pre-emptive to begin with? Also, did SMP have anything to do with pre-emption? Matti replied at length to the former question:

  • "SMP" = "Symmetric Multi Processing" (Or in case of Linux: "Symmetric Multi Penguin" -- a joke)
  • Pre-emption is a thing where executing thread is involuntarily stopped from doing whatever it was doing, and processor is given to another thread. Linux kernel DOES NOT DO THAT INTERNALLY. Well, there is a sort of hierarchy of pre-emptions:
    • user threads (+ kernel threads for user syscalls)
      • scheduling in between user threads
    • interrupt processing

  • The only pre-emption there is happens in between compute-bound user processes, when a process' timeslice has completed.
  • Kernel threads can voluntarily YIELD their processor to other tasks by means of waiting for some resource, or executing an explicit schedule() call.
  • As to why the kernel isn't made internally pre-emptive; doing such a thing is *difficult*, very difficult. System would need to be rewritten to allow internal execution stopped at any point, and having other processors/threads trample on incomplete datasets at any point... (Multiprocessor support requires sort of this behaviour anyway, and that is why there are various spinlocks all over the place..)

Tuukka Toivonen also replied to Aneesh's query about why the kernel was not written to be pre-emptive. Tuukka said, "That requires locking all data that is used, and _that_ means lower performance. It means, of course, also somebody to write the code." And to Aneesh's question about whether SMP had anything to do with pre-emption, Tuukka replied:

Well, yes in a way. While CPU#1 is running in kernel mode, it has acquired locks that prevent CPU#2 from running in kernel mode too. Or rather, that was the case with 2.0.0 kernel. This is called (very) coarse-granularity locking.

In time it is supposed that the locking is tuned to more fine-granularity locking. It means that some process running in kernel mode on CPU#1 has locked only parts of the kernel, and other CPUs can use all the unlocked parts freely, simultaneously. For example, with kernel 2.2.x it's much better already. In practice this means that the kernel becomes more scalable -- additional CPUs are used more effectively.

Now, one would suppose that fine-granularity locking is the goal. Many commercial operating systems do that, eg. IRIX. It is supposed to give the best scalability. However, this has a hit on performance especially on systems with just one or two CPUs (like most PCs these days).

Some people (mainly Larry McVoy) have proposed that this shouldn't be done. He has proposed another way how scalability should be achieved. But since you didn't ask that, I stop my story here.

He also gave an interesting pointer to a document on Concurrency Control (http://www.cs.unc.edu/~dewan/242/s99/notes/trans/trans.html) , by Prasun Dewan (http://www.cs.unc.edu/~dewan/) .

Alexey Zhuravlev asked for more information on Larry's alternative. Tuukka recommended reading Larry's PDF paper (http://www.bitmover.com/llnl/smp.pdf) , and replied:

The main idea seems to be to duplicate the kernel data structures in memory. For example, 32-way SMP system might have 8 copies of kernel data, each running as 4-way SMP system. This means that the kernel copy A accessing its own data structures locks only its own data -- kernel copies B, C, ... are not locked and can run at full speed.

This also means that the cache coherency is not necessary between different copies of kernel.

The kernel copies would communicate via shared memory blocks. For example, buffer cache might be shared.

As I have understood, Larry has mainly talked about this kind of operating system on dedicated hardware, like SGI Origin 2000. My own opinion (please correct me if I'm wrong) is that it might help on Intel-style (ie. shared bus) hardware too, but the problem with Intel style hardware is that the bus saturates easily. Unlike on Origin 2000, which doesn't have a shared bus but a kind of switch between nodes, of which each have several CPUs.

 

11. Spurious IRQ7
28 Feb 2000 - 2 Mar 2000 (13 posts) Subject: "2.3.49-1 -- Compilation error in traps.c in function `do_nmi'"
Topics: Sound
People: Ingo MolnarLinus TorvaldsMiles Lane

Miles Lane was getting compilation errors in 'traps.c' in 2.3.49-1, and Ingo Molnar posted a one-line patch to include the 'asm/hardirq.h' header. Peter Blomgren pointed out that the same thing was needed in 'io_apic.c', and posted a similar patch. Miles reported success with Ingo's patch, but then reported a different problem: he was seeing a "spurious 8259A interrupt: IRQ7" error. He posted some system information, and Ingo replied, "i believe IRQ7 is a tad overloaded here. Do you see any system instability when the spurious IRQ message appears? The parallel port and the sound card are both using IRQ7. Plus IRQ7 is the IRQ for the motherboard to report spurious interrupts (it's a bit more complex than this but that's the gist of it)." He added that he didn't know why this would be happening now, and not with earlier kernels.

Miles moved the mapping for the MPU-401 interrupt to IRQ 10 to ease the overload, and reported instability in the form of an xfree86 3.9.18 crash. 'dmesg' gave two new messages: "spurious 8259A interrupt: IRQ7" and "keyboard: Timeout - AT keyboard not present?" Linus Torvalds replied:

The spurious 8259A interrupt is probably just because there is a driver that had some timing whereby it made its own interrupt go away without it having ever been acknowledged by the CPU - so the 8259A had had time to raise it, but by the time the CPU got along to servicing it it wasn't there any more and the i8259 gives us the spurious 7 instead.

A spurious irq7 is not necessarily a sign of anything really bad happening..

The keyboard thing makes me more worried. My current suspicion would be that somehow an edge-triggered interrupt is just lost due to some magic timing issues, and it stays lost forever because we don't touch the keyboard controller unless it asks us to look at it.

So putting the two theories together: it's the _keyboard_ interrupt that the driver just made go away, and because it wasn't handled the machine is now dead forever as far as the keyboard is concerned.

When we read the keyboard status or data ports enough to make the controller think we handled the interrupt, but not enough to realize that there isn't anything more to read, then..

The simple solution to this may be just a timeout - having a timer that checks the keyboard every few seconds and does a "handle_kbd_event()" whether an interrupt came in or not.

 

12. New Linux Console Project
29 Feb 2000 (1 post) Subject: "[ANNOUNCE] linux console project"
People: James Simmons

Continuing from the discussion covered last issue in Issue #57, Section #13  (24 Feb 2000: Future Of Console Code) , James Simmons announced:

Due to demand, I have started the linux console project. The goal is to design a new multihead console system for linux. I hope to have it ready to be ported right into the 2.5.X kernels. I see a lot of people have worked on new and different ideas for a console system. I would like to see this project be the center where people can display their ideas and code. The hope is to merge everyone's efforts and develop a lean and yet powerful multihead console system.

The web site is http://sourceforge.net/mail/?group_id=3063

I also have a http://linuxconsole.sourceforge.net domain but haven't yet linked the two pages together. What I need is someone to work on the web pages. I also need developers of course.

There is a mailing list already. Just go to http://sourceforge.net/mail/?group_id=3063 and select the mailing list section. From there you can subscribe via the web. Have fun :)

There was no reply.

 

13. New Pipe Code
29 Feb 2000 - 6 Mar 2000 (14 posts) Subject: "[PATCH,BETA] new pipe code"
Topics: POSIX
People: Manfred SpraulRichard GuentherAlan Cox

Manfred Spraul reported, "I've rewritten the pipe code: the old code caused lots of reschedules under certain loads. The new code is not yet fully tested (esp. O_NONBLOCK is untested), but I'm really interested what you think about it." Richard Guenther requested, "Can you add a hook to tune the write buffer size per pipe? (with an fcntl or the like) It would be really useful to me to allow write-throttling without hacks like I do now (writing 64 times garbage for each written datum - I'm passing just pointers between threads in a FIFO manner, each pointer has lots of storage attached to it, so write throttling is essential to allow thread switching every few megs of data). Or do you know another way to throttle writes to a pipe?" But Manfred objected, "Unfortunately, Linux guarantees that it can send 4096 bytes in one atomic operation, and such an fcntl could violate that." But Alan Cox pointed out, "The requirement for atomicity comes from the POSIX and Unix API. Nothing in the API prohibits changing the size by calls not in the API. I think it's silly to add stuff like that however. Pipe is optimised for pure speed, AF_UNIX sockets can do the variable buffering already." Richard replied that he couldn't find anything about this in the manpages, and asked for some pointers to docs. He added, "If unix sockets allow blocking write after a specifiable amount of data written (and not yet read) this would be really cool." And Alan concluded, "You can sort of get the result you want with the SO_SNDBUF/SO_RCVBUF socket options. But those are armwaving general controls not precise ones. If you want to get message boundaries and precise control I think it ends up being a user space + datagram socket job."

Elsewhere, under the Subject: "[patch] updates for the pipe code", Manfred reported a race in his initial code: "if 2 threads read and write to a pipe concurrently, then wake-ups could get lost. I forgot to check PIPE_LEN after I reacquired PIPE_LOCK." He posted a new patch, and there was some discussion.

 

14. No WAP For Linux?
4 Mar 2000 (3 posts) Subject: "WAP for Linux"
Topics: Patents
People: Alan CoxLinus Torvalds

Chetan Sakhardande asked if a WAP (Wireless Application Protocol) suite existed or was under development for Linux. Alan Cox and Linus Torvalds gave the only two replies. Linus gave pointers to Kannel: Open Source WAP and SMS gateway (http://www.kannel.org) and WapIT Ltd As a Company (http://www.wapit.fi) , and Alan explained:

At least in the USA the patent requirements and royalties make it impractical to do WAP clients. Apache can be told the wap mime types but remember if you are operating the server in the USA you may need to pay about $10K/year to run your server.

Either way WAP is user space so outside of the kernel issues. There are one or two people working on simple WAP tools in europe, and its possible that Mozilla will support WAP although I guess the patents killed that

EOT.

 

We Hope You Enjoy Kernel Traffic
 

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License, version 2.0.