Kernel Traffic #11 For 27 Mar 1999

By Zack Brown


Those of you who tried to read Kernel Traffic over the last couple of days might have seen part or all of this message (crash.html). We're back now, hopefully for good, though still with hardware problems :-(. I'd like to thank the folks who sent us encouragement over the past few days. Thanks guys! And I'd especially like to thank my friend Will, who spent the whole day over here with me, trying everything we could think of.

Mailing List Stats For This Week

We looked at 935 posts in 4346K.

There were 369 different contributors. 155 posted more than once. 139 posted last week too.

The top posters of the week were:

1. Kernel Accounting

11 Mar 1999 - 17 Mar 1999 (56 posts) Archive Link: "[patch] kstat change to see how much Linux SMP really scale well"

Topics: SMP

People: Andi Kleen, Andrea Arcangeli, Oliver Xymoron, Vojtech Pavlik, Philip Gladstone, David S. Miller, Brian Gerst, Ingo Molnar, Linus Torvalds

This was one of the largest threads of the week, starring a lot of important folks (not Linus Torvalds though). My impression was that they started off just having fun trying to eke the most out of the kernel, killing time until the 2.3 series. Toward the end things got pretty serious, though.

Andrea Arcangeli started things off with an interesting objection. Apparently in a spinlock, while one thread is waiting for access to the locked code, the time it spends mindlessly looping is reported as useful activity, which can be misleading if you're trying to get a profile on how much useful activity really is taking place. So he posted a patch against 2.2.3 to discard lock-time. He figured it wasn't perfect, and could probably be confused by some kinds of cpu usage, but was at least no worse than what there was before.
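To make the accounting issue concrete, here is a minimal user-space sketch in Python. It is not Andrea's patch: the `CpuTicks` structure, the `spin` bucket, and the function names are all hypothetical, and the sketch simply assumes that each timer tick can be classified as user, kernel, or spinning on a lock.

```python
from dataclasses import dataclass


@dataclass
class CpuTicks:
    """Per-CPU tick counters (hypothetical layout, for illustration only)."""
    user: int = 0
    system: int = 0
    spin: int = 0  # ticks spent busy-waiting on a spinlock


def account_tick(stats, in_kernel, spinning):
    """Charge one timer tick to exactly one bucket."""
    if spinning:
        stats.spin += 1      # kept out of "system" so it is not counted as work
    elif in_kernel:
        stats.system += 1
    else:
        stats.user += 1


def tally(samples):
    """Build counters from an iterable of (in_kernel, spinning) samples."""
    stats = CpuTicks()
    for in_kernel, spinning in samples:
        account_tick(stats, in_kernel, spinning)
    return stats


def useful_fraction(stats):
    """Share of ticks that did real work, with spin time excluded."""
    total = stats.user + stats.system + stats.spin
    if total == 0:
        return 0.0
    return (stats.user + stats.system) / total
```

Discarding lock time, as described above, corresponds to ignoring the `spin` bucket when reporting. Keeping the bucket around is what makes separate lock-time reporting possible.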

A couple of days later, Andi Kleen reported that he had tried and liked Andrea's patch, and posted an extension of it to the list that accounted for lock time separately, for folks who want to do spin-lock profiling. But he was running into a problem: he had observed lost ticks, which the i386 SMP timer interrupt did not handle, and he felt this might have adverse effects on scheduling.

He said that ideally there would be detailed per-spinlock accounting, but that implementing it wouldn't be easy. The problem is how to refer to an individual spinlock: "the current spinlock declaration syntax makes it hard to name them (and for dynamic spinlocks embedded in other structures a different way would be needed anyways than to statically name the spinlocks - for them reading doesn't work well neither). This means for named spinlock profiling it is needed to change all spinlock users :(:(." But he added, "I believe for effective SMP tuning in 2.3 they are definitely needed though."

(David S. Miller suggested that some of this was already partially done in the debugging code of some ports. Andrea replied that since the whole issue is really for hackers only, it should be developed in his patch instead of the main kernel. Andi agreed, as long as they were only talking about 2.2.x; but for 2.3, he felt the patches should go into the Linus tree.)

Andrea replied to Andi's lock-time accounting patch, explaining that he had wanted to add that functionality himself, but felt it would break /proc/stat, which would in turn break a ton of user-space software. So he'd held back. But he liked the patch, and felt he could solve the "lost ticks" problem by always calculating the difference "between rdtsc inside the smp timer irq".
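The arithmetic behind the cycle-counter idea can be sketched as follows. This is an illustration only, not code from the thread: it assumes a 400 MHz CPU and HZ=100 (so 10 ms per tick), ignores counter wraparound, and the function names are made up.

```python
CPU_HZ = 400_000_000               # assumed clock rate, for illustration
HZ = 100                           # timer interrupts per second (10 ms tick)
CYCLES_PER_TICK = CPU_HZ // HZ     # 4,000,000 cycles between interrupts


def ticks_elapsed(prev_tsc, now_tsc, cycles_per_tick=CYCLES_PER_TICK):
    """Whole ticks between two cycle-counter readings, rounded to nearest
    so that ordinary interrupt jitter is not mistaken for a lost tick."""
    delta = now_tsc - prev_tsc
    return (delta + cycles_per_tick // 2) // cycles_per_tick


def lost_ticks(prev_tsc, now_tsc, cycles_per_tick=CYCLES_PER_TICK):
    """Ticks that passed without being handled: everything beyond the
    single tick the current interrupt accounts for."""
    return max(ticks_elapsed(prev_tsc, now_tsc, cycles_per_tick) - 1, 0)
```

With these assumed numbers, two readings 12,500,000 cycles apart round to three elapsed ticks, so two ticks would be reported as lost.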

Andi disagreed that his patch broke /proc/stat, since all it did was add fields. He said any program that couldn't handle extra fields in /proc/stat was broken anyway. But he had no objections to putting the added data in a new /proc file if that would be easier. He also liked Andrea's solution to the "lost ticks" problem, and added that he'd had the same idea in mind.
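Andi's point about extra fields can be illustrated with a small parser sketch. It assumes the `cpu` line of /proc/stat begins with the user, nice, system and idle counters and treats anything after them as optional; the function name and the returned dictionary layout are hypothetical.

```python
def parse_cpu_line(line):
    """Parse a /proc/stat 'cpu' line, tolerating extra trailing fields."""
    fields = line.split()
    if len(fields) < 5 or not fields[0].startswith("cpu"):
        raise ValueError("not a cpu line: %r" % line)
    user, nice, system, idle = (int(x) for x in fields[1:5])
    return {
        "name": fields[0],
        "user": user,
        "nice": nice,
        "system": system,
        "idle": idle,
        # Fields this reader does not know about are kept, not rejected.
        "extra": [int(x) for x in fields[5:]],
    }
```

A reader written this way produces the same four counters whether or not the kernel appends a lock-time field, which is why adding fields need not break it.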

Ingo Molnar also replied to Andi's lock-time accounting patch, regarding the "lost ticks" problem, which he asserted was not a problem at all. He said that if the timer interrupt was locked out for more than 10 msecs (i.e. long enough to lose a tick) on an SMP system, then there were bigger problems to worry about. Andrea came back with a report that people were actually starting to suffer from that, adding that he was working on a new patch. He added, "I think the main irq bottleneck right now is the scsi subsystem locking that according to me has to be rewritten using failsafe mechanizm (not spinning spinlocks)."

Andrea posted again 9 hours later (4 in the morning his time) with a patch to recover the lost time. He added, "PS. Time to sleep for me."

Ingo replied, saying it was better to prevent the delays in the first place rather than recover the time once it's been lost. He offered a patch (Linus-friendly -- it removed 13 lines of code) to simply report such lost time as a bug, since it shouldn't be happening.

At this point in the thread, Ingo and Andrea had a short, friendly, rather technical dispute over implementation. Essentially, Ingo felt Andrea was missing the point and that the problem should be fixed at the source, rather than having its effects compensated for; while Andrea felt that there was no fundamental disagreement between them, and they were really just approaching the problem from different angles.

Vojtech Pavlik came in around now, pointing out that Ingo's idea of fixing the delays was simply impossible for certain hardware like the MS SideWinder joystick and the Logitech joystick. Ingo was impressed with the examples, and Brian Gerst observed that the SideWinder digital protocol was uglier than sin.

Philip Gladstone also came in about the lost ticks, saying that printk() runs with interrupts disabled, and can cause lots of lost ticks. Ingo was surprised by this, and said the problem was starting to feel like a 2.3 issue, because of how complex it was getting. But Oliver Xymoron said he felt the problem could be fixed. He said, "we can dynamically decide to buffer based on string length, baud rate, and message priority, trying to always stay below a timer tick." He also added, "People running serial may in fact be mostly doing embedded development and might actually care to know that serial console is destroying their latency."
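Oliver's suggestion amounts to a small decision rule, which can be sketched like this. The policy below is an assumption made for illustration, not his proposed code: it assumes 10 bits on the wire per character (8N1 framing), a 10 ms tick, and that priority 0 is the most urgent level and should always be written immediately.

```python
TICK_SECONDS = 0.010   # 10 ms timer tick (HZ=100)
BITS_PER_CHAR = 10     # start bit + 8 data bits + stop bit (8N1)


def tx_seconds(msg_len, baud):
    """Time needed to push msg_len characters out of a serial port."""
    return msg_len * BITS_PER_CHAR / baud


def should_buffer(msg_len, baud, priority, urgent_priority=0):
    """Decide whether to defer a console message instead of writing it
    synchronously with interrupts disabled (hypothetical policy)."""
    if priority <= urgent_priority:
        return False   # urgent messages go out now, whatever the cost
    return tx_seconds(msg_len, baud) >= TICK_SECONDS
```

Under these assumptions an 80-character line at 9600 baud takes about 83 ms, several ticks' worth, so it would be buffered, while the same line at 115200 baud takes about 7 ms and could be written directly.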

For now, the thread seems over. I guess the various folks are re-examining the code.

2. System Accounting

12 Mar 1999 - 19 Mar 1999 (11 posts) Archive Link: "SAR for Linux?"

People: Larry McVoy, Martin Pool, Chuck Lever, Dominik Kubla, Stephen Tweedie

Matthew Brown announced his intention to start work on a Linux version of the 'sar' system performance monitor, but was having trouble finding the stats so he could monitor them. Greg Franks mournfully commented that disk data in particular had very poor accounting under Linux, and that he intended to get something going on it pretty soon; but Martin Pool saved him the effort, pointing out that Stephen Tweedie had already been working on something along those lines for 2.3 since late '98, and that patches were available.

Larry McVoy saved Matthew the effort too, pointing out that someone was already working on a system performance monitor. Chuck Lever also posted, saying he was busily porting Solaris' /usr/proc/bin tools to Linux, and had been posting relevant patches to linux-kernel.

Larry also went on to say he thought there was a mailing list somewhere. He and several other people offered to host it if there was interest, and Dominik Kubla said sadly that he was already hosting a (very quiet) mailing list, run by majordomo with all the usual commands.

3. Out Of Memory Errors With Plenty Of Unused Swap In 2.2.x

13 Mar 1999 - 17 Mar 1999 (25 posts) Archive Link: "vfork: out of memory, when there's plenty of swap free"

Topics: FS: NFS, Virtual Memory

People: Andrea Arcangeli, Linus Torvalds, Alan Cox, Pavel Machek, Rik van Riel

Pavel Machek was trying to crash his computer by using up resources -- and nearly succeeded! He ran a 'make -j' under 2.2.3 and started getting "vfork: out of memory" errors before swap was even half used up. Alan Cox confirmed the exploit and explained that the machine was running out of 8K blocks because 2.2 didn't defragment its virtual memory properly. He added that the same problem was causing sound skips and NFS hangs.
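The situation Alan describes, plenty of free memory but no free 8K block, can be shown with a toy model. This is only a sketch: it assumes 4K pages, so an 8K block is two physically adjacent pages, and it assumes blocks must be aligned to their own size, as in a buddy-style allocator. The page map and function names are hypothetical.

```python
PAGE_SIZE = 4096  # i386 page size; an 8K block is an "order 1" allocation


def free_bytes(free_map):
    """Total free memory, where free_map[i] is True if page i is free."""
    return sum(free_map) * PAGE_SIZE


def has_free_block(free_map, order):
    """True if some naturally aligned run of 2**order pages is all free."""
    size = 1 << order
    for start in range(0, len(free_map) - size + 1, size):
        if all(free_map[start:start + size]):
            return True
    return False
```

For example, a map in which every other page is free has half of memory available, yet `has_free_block(free_map, 1)` is false, so an 8K request would fail no matter how much swap is unused.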

Rik van Riel was alarmed by this news, and Andrea Arcangeli said, "It's not the lack of defragmentation but bad allocation of memory. Just to make an example look at the inode cache. It uses a _bad_ way to alloc memory. That's the _best_ way to generate _persistent_ fragmentation all over the place."

Linus Torvalds replied regarding the inode cache, saying the allocators weren't the problem, but that "inodes have very difficult lifetime behaviour (some very short-lived, some _extremely_ long-lived), and that makes it hard to allocate them well using _any_ allocator scheme." He suggested, "it might certainly be an option to allocate inodes in bigger chunks at a time. That would at least make the problem become less," to which Andrea replied, "it would make the problem to go away completly according to me." He went on, "we simply need to alloc 32pages at time instead of 1 page at time. This will probably improve allocating performances and will avoid us to leave bomb on the VM over the time. Low memory machines could complain because the minimal granularity of inode-nr will be a 32pagechunk, but I don't think this is an issue."

Linus was skeptical, replying that allocating inodes in bigger chunks would fix the problem for inodes but not for other things. He also added, "I don't think we want to go quite that aggressively up to 32 pages, it's just too likely to fail. Going to 2 or 4 is probably fine." He suggested Andrea try it and see how it worked.

Andrea did, it worked, and he posted his patch. (He also added that he hadn't meant it would solve the problem in other places, but had only been talking about inodes.)

4. Storing Kernel .config In The Kernel Itself

13 Mar 1999 - 18 Mar 1999 (37 posts) Archive Link: "[patch] v1.01 of /proc/.config (ready to eat)"

Topics: Compression

People: Tigran Aivazian, Alan Cox, Oliver Xymoron

Tigran Aivazian had been working on a patch to store the kernel config file in the kernel itself and make it accessible through /proc/.config; he posted his patch and there were some bug fixes and suggestions from Alan Cox. Oliver Xymoron also indicated that he'd done a similar patch just before 2.2.0; and Alan indicated that in his opinion Oliver's approach was the best so far. Tigran objected to the complexity of Oliver's approach, and tried to clarify things with a discussion of the possible implementations:

I can see the following alternatives:

  1. Appending stuff to zImage. This is bad because being not .config-specific forces a (user-space) reader to validate info. An app has to seek and find it first too.
  2. /proc/.config.gz This is bad because a reader must decompress it himself. That is *every* reader (unless he uses some shared library to do it transparently).
  3. decompress-on-the-fly /proc/.config (not the same as /proc/.config.gz!) This is the case where the kernel presents data in such a way that the user can lseek()/read() at any position without knowing that internally it is actually compressed. This (and not .config.gz) is what I meant by decompress-on-the-fly approach and believed it to be wrong because I have not seen a piece of code which can provide such elegant abstraction for a 10K buffer which itself (including data) would fit in 10K.
  4. plain text /proc/.config (as per my patches) This is bad because we unconditionally use up a couple of pages of physical memory. But we don't do much else other than provide user space with data correctly positioned for them to lseek() and get at it.

As we see, everything is bad, in which case, we choose the least evil, which is 4.

Alan preferred #2, and explained, "This is very good because it keeps the kernel space small and the standard gzip tool can handle it as well as the standard zlib library." He added, "Analogy: A database reader has to do its own ISAM in Unix, this is seen as good because it keeps ISAM file support out of the kernel."
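Alan's argument is that decompression is cheap to do in user space. A reader for a gzip-compressed config might look roughly like the sketch below, which uses Python's standard `gzip` module (built on zlib). The function name is hypothetical, and the sketch assumes the usual .config layout of `CONFIG_NAME=value` lines with `#` comments.

```python
import gzip


def parse_kernel_config(raw_gz):
    """Decompress gzip'd .config bytes and return {option: value}."""
    text = gzip.decompress(raw_gz).decode("ascii", errors="replace")
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and "# CONFIG_FOO is not set" comments
        key, sep, value = line.partition("=")
        if sep:
            options[key] = value
    return options
```

A caller would read the compressed bytes from wherever the kernel exposed them (for instance the /proc/.config.gz file of option 2, had it been adopted) and pass them in: `parse_kernel_config(open(path, "rb").read())`.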

Oliver updated his patch to go cleanly into 2.2.3; and also added a couple of other features such as a bin2hex replacement called bin2c, as well as "an updated version of my little "patch names" Makefile hack. When you build the kernel, it looks for files of the form and the resulting kernel will report itself as 2.2.x+foo. So, for instance, Alan could distribute his -ac series with files like patchdesc.ac2 and the resulting kernel would be easily distinguishable from the standard Linux kernel both at boot time and when using uname. Also extremely useful is that it puts modules in 2.2.x+foo as well so you can have a standard kernel and several patched kernels co-exist comfortably on the same machine. The patchdesc files also provide a convenient place to put changelog info, documentation, credits, etc." He added, "The only objection I've gotten to this so far is from Ingo, who claimed it would encourage forking."

Alan put in, "I don't think its a big issue. In terms of -ac certainly the ultimate aim is to have -ac contain nothing exciting enough to make anyone download it over the Linus tree." Oliver agreed, but in a different part of the thread added, "Anyway, unless someone with an inside line feels like advocating for either of these patches, I'm going to drop it until the next time it's brought up."

5. Hunt For A linux-kernel Mailbomber

17 Mar 1999 - 18 Mar 1999 (22 posts) Archive Link: "TICAL OwnZ JoO 6412x"

Topics: Mailing List Administration, Spam

People: Matti Aarnio, Gerhard Mack, Majdi Abbas

The list was bombed with many 80K messages consisting of random data, and other lists were bombed as well. The fellow responsible seems to have covered his tracks fairly well. Once the crisis had passed, Matti Aarnio posted this to the list (under the Subject: "ADMIN: Thank you for *not* responding to spam..."):

I have just cleaned out about 4000 messages from VGER's queue which were related to that moronic bandwidth wastage spam from TICAL.NET's shell account machine (it seems).

Any new bogon coming from that particular machine with exact same characteristics is blocked, but nothing prevents changeing those parameters a bit, and another such will be seen...

As a result, nearly all messages with size exceeding about 20 kB are now in "freezer", and if in the set there is some real thing, that will hopefully be released once I go over that area.

Oh yes, I "froze" quite a few "stop spamming linux-kernel!" responses with the entire original message in them. Lets not be stupid ourselves folks!

Gerhard Mack tried to hunt the guy down, and posted:

Administrative Contact, Technical Contact, Zone Contact:
Barnes, Ray (RB8415) DjCorrupt@HOTMAIL.COM
(904) 816-8461

Internic registration for that site is a joke, apperantly doesn't wish to be contacted.


12 ( 269.795 ms 258.174 ms 309.787 ms
13 ( 259.740 ms 248.603 ms 249.826 ms
14 ( 249.766 ms 268.404 ms 249.833 ms
15 ( 309.754 ms 308.447 ms 279.809 ms

We can all complain to their uplink.

Administrative Contact, Technical Contact, Zone Contact:
ACSI-ADS (AC323-ORG) hostmaster@ACSI.NET

Majdi Abbas added:


1740 4200 6467, (aggregated by 6467 from (
Origin IGP, valid, external, atomic-aggregate

Apex Global Information Services, Inc. (ASN-AGIS-NET)
3601 Pelham
Dearborn, MI 48124

Autonomous System Name: AGIS-NET
Autonomous System Number: 4200

AGIS DNS Administrator (ADA2-ORG-ARIN) dns-admin@AGIS.NET
(313) 730-5151
Fax- (313) 730-9886

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.