Kernel Traffic #9 For 11 Mar 1999

By Zack Brown

Introduction

Kernel Traffic has been reviewed: Positive Propaganda (http://www.honeylocust.com/positive/), Volume 4 Issue 5, has us sandwiched between a serious page about East Timor (http://www.freedom.tp/) and a light-hearted story called Linux Got Me Kicked Out Of Wal*mart (http://www.honeylocust.com/walmart/) by the same folks who do Positive Propaganda.

Someone wrote in, asking us to give a link to the archives for each thread we discuss here. We never mentioned this, but we've been doing that all along: just click on the title of each article, and you'll go directly to that thread's initial message in the archives at http://www.uwsg.indiana.edu/hypermail/linux/kernel/index.html. Unfortunately, the archives themselves are not well threaded (read: threading is completely broken). If anyone knows who we can talk to about that, please tell us (mailto:zbrown@tumblerings.org) or ask them to contact us. We'd like to help.

Mailing List Stats For This Week

We looked at 961 posts in 3642K.

There were 377 different contributors. 148 posted more than once. 136 posted last week too.

The top posters of the week were:

1. Panic Hunt

28 Feb 1999 - 4 Mar 1999 (77 posts) Archive Link: "Kernel panic: can't push onto full stack"

Topics: BSD: FreeBSD, FS, SMP, Security

People: Alan Cox, Alexander Viro, Alexey Kuznetsov, Michael Merhej, Andrea Arcangeli

Alexander Viro, Alexey Kuznetsov, and Andrea Arcangeli took up the vast majority of this 5-day, nearly 80-post thread. Michael Merhej (who was also a significant voice in the thread) posted an interesting hang he'd been experiencing on 3 similarly configured machines. He asked what the error ("Kernel panic: can't push onto full stack") meant, and Alan Cox replied, "It means something very bad happened in a situation where a lot of file handles were being passed across sockets." He added, "I've been pondering this one (you are the third report I've seen). It basically implies a bug in the mark/sweep garbage collector, or a race of some kind." Alexander Viro confirmed the bug was a race and posted a short patch (cced to Alexey Kuznetsov). Two hours later he posted a revised patch after help from Alexey, who found another problem in addition to the race.

Andrea Arcangeli asked for some explanation; he thought Alexander's first patch had looked good. Alexander and Alexey both replied, Alexander saying, "Sockets waiting for to be accepted don't have associated struct file yet. Anyway, the second variant eliminates the need to allocate the stack which is Good Thing(tm);" and Alexey saying, "we may have up to 2*max_files unix sockets: not-yet-accepted ones are not associated with files/inodes" and "we could limit number of unix sockets by max_files, but it is still not complete solution: max_files may be decreased," and "using list is always OK, however it also requires reviewing af_unix.c: f.e. pending socket destruction must be blocked while gc is running." To that last, Alexander added, "Only on the marking phase. And there we cannot block. *If* it may happen within the marking phase we were equally deep in it with the old variant."
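To make the difference between the two variants concrete, here is a minimal userspace sketch in Python. It is not the kernel code: the function names, the dictionary standing in for sockets and the references queued on them, and the capacity argument standing in for a stack sized from max_files are all assumptions made for illustration. The only point it tries to show is the one Alexey made, that a marking pass with a preallocated stack can overflow when there are more sockets than the stack was sized for, while a list-based worklist cannot.

```python
from collections import deque


def mark_fixed_stack(edges, roots, capacity):
    """Mark reachable nodes using a stack limited to `capacity` slots.

    `capacity` is a hypothetical stand-in for a stack sized from max_files.
    """
    marked = set()
    stack = []

    def push(node):
        if node in marked:
            return
        if len(stack) >= capacity:
            # Mirrors the wording of the panic discussed in the thread.
            raise OverflowError("can't push onto full stack")
        marked.add(node)
        stack.append(node)

    for root in roots:
        push(root)
    while stack:
        node = stack.pop()
        for child in edges.get(node, ()):
            push(child)
    return marked


def mark_worklist(edges, roots):
    """Mark reachable nodes using a list that grows as needed."""
    marked = set()
    work = deque()
    for root in roots:
        if root not in marked:
            marked.add(root)
            work.append(root)
    while work:
        node = work.popleft()
        for child in edges.get(node, ()):
            if child not in marked:
                marked.add(child)
                work.append(child)
    return marked


if __name__ == "__main__":
    # One listening socket with three not-yet-accepted connections queued.
    edges = {"listener": ["pending1", "pending2", "pending3"]}
    print(sorted(mark_worklist(edges, ["listener"])))
    try:
        mark_fixed_stack(edges, ["listener"], capacity=2)
    except OverflowError as err:
        print("fixed stack failed:", err)
```

The worklist version trades a hard size limit for memory that grows with the number of reachable sockets, which is roughly the trade the second patch is described as making.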

Michael Merhej initially thought his troubles were over, but then his machines started locking up again, and he exchanged some private email with Alexander, who then wrote to the list, "Michael sent me the lis of oopsen for the second patch. All of them are in the same place and it looks, uhm, interesting." He proceeded to identify the file and function (get_stat() in fs/proc/array.c) where the problem showed itself, but couldn't get any further than that. Andrea said he thought the problem was that "some (SMP) race in af_unix.c or garbage.c is corrupting memory."

Alexander posted jubilantly, "Methink I found the sucker we have one place where the unix_socket is freed without kernel_lock hold. It is in unix_destroy_timer(). I'm adding a semaphore (unix_gc_marking). unix_gc() holds it during the mark phase. unix_destroy_timer() tries to get it, if it fails it postpones removal even more, otherwise it holds the semaphore while doing sk_free(). I hate voodoo programming and it's pretty likely to be only a bandaid, but if it will reduce breakage we'll get at least some hints on the nature of sucker." He also added a lengthy patch. But Alexey cried out, "Stop, stop, stop! To the time, when socket is destroyed from timer, it is dead. It is detached from hash tables, its queues are already destroyed etc. It is hold only to tell peer, that endpoint is dead and to die in piece."
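The "try to get it, postpone if it fails" pattern Alexander describes can be sketched in userspace Python as follows. This is only an illustration of the locking idea, not his patch: threading.Lock stands in for the unix_gc_marking semaphore, and the destroy_timer function and the freed and postponed lists are hypothetical names. Note that Alexey objected to the proposal itself, so the sketch shows what was suggested rather than what was agreed.

```python
import threading

# Hypothetical stand-in for the unix_gc_marking semaphore from the post.
gc_marking = threading.Lock()


def destroy_timer(sock, freed, postponed):
    """Free `sock` unless the collector is in its mark phase.

    Returns True if the socket was freed, False if removal was postponed.
    """
    # Non-blocking attempt: a timer handler must not sleep waiting for the GC.
    if not gc_marking.acquire(blocking=False):
        postponed.append(sock)
        return False
    try:
        # Hold the lock while freeing so marking cannot start mid-free.
        freed.append(sock)
    finally:
        gc_marking.release()
    return True


if __name__ == "__main__":
    freed, postponed = [], []
    with gc_marking:  # simulate the collector being in its mark phase
        destroy_timer("sock-a", freed, postponed)
    destroy_timer("sock-a", freed, postponed)
    print("postponed:", postponed, "freed:", freed)
```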

Alexander asked, "Alexey, could you comment on test for tsk->dead in unix_accept()? Just what is tested there? It had been added in 2.1.124 and AFAICS there is no way to ->dead on a passed sock to be set. Was it just-in-case test or should it be unix_peer(tsk)->dead? IOW, if somebody did connect() and then close() before we accept() should we return this (hung) connection on accept()?"

Alexey replied:

Yes, it is crap. Apparently, I meaned unix_peer(tsk)->dead, but, luckily, did mistake.

Why luckily? 8) Because we have to pass even dead connection request to accept(), otherwise blocking accept() may hang after select().

I do not see any graceful solution now. I have to think.

Possible ways are:

Both of them are ugly. New ideas are required.

Alexander added two more possible solutions:

Notice that either way we are handling a nasty race to applications: if all you are going to do is to connect, write something and close the connection there is no way to know if the data we've passed will not be eaten by close() - it may happen before accept() on server end. I'm at loss here - looks like we *have* to do it to avoid DoS found by Andrea. Are there any other protocol families with non-blocking connect()?

But 25 minutes later, he wrote, "Darn. Screw #4 - it doesn't solve anything. Add #5: block on close() (maybe make it an setsockopt()-controllable)." He added, "BTW, from my reading of kern/uipc_userreq.c it looks like freebsd-hackers folks might be also interested in situation - I'm not on FreeBSD box, but from what I see in source they are also vulnerable..."

Andrea said Alexander's patch seemed to be working, and also that he'd "discovered a way to leak memory and cause the machine to stall completly in kernel mode for minutes as normal user. Waiting a bit more it will eat all memory and the machine will crash badly. It's a plain security issue." He posted a program to demonstrate the exploit, as well as a lengthy patch that grouped together all the unix domain socket fixes he was aware of (including Alexander's patches). He also put together 2.2.2_arca4.gz (ftp://e-mind.com/pub/linux/arca-tree/2.2.2_arca4.gz), which included unrelated fixes as well.

At this point they dove so deep they were lost from view. I could only snorkel around as they faded from sight. They stayed down a long time, but finally, Alexander came up and said:

Folks, I tried to summarize those threads and here's the result:

  1. garbage collector didn't notice that receive queues of still-not-accepted connections are worth scanning. That was fixed in 2.1.126. Fix broke the stack allocation code in GC. See panics observed by folks actively using AF_UNIX. That was fixed in my second patch.
  2. There is a DoS (memory exhaustion) that had been fixed by Andrea's patch. Fix is not ideal and leaves open a related DoS with datagram sockets. Proper fixing involves too serious rewrite to do it in 2.2. Proposed variant (reap the unaccepted connection on close) leaves a nasty race to userland ;-/
  3. BSD implementation suffers from the same problems (and from several problems we avoided).
  4. On one of the boxen that used to panic combination of my patch to GC and Andrea's patch to connect() seems to solve the problem. Another one still demonstrates memory corruption with SMP enabled and works fine with SMP disabled, even with unpatched 2.2.2. It may be something we've missed, it may be an unrelated bug and it may be a faulty RAM.
  5. Latest Andrea's patch includes both my fix to GC and his fix to connect(). Missing part: SOCK_DGRAM-related DoS.

2. Article On I/O Buffering And Caching

2 Mar 1999 - 4 Mar 1999 (19 posts) Archive Link: "OSDI paper - IO-Lite: A Unified I/O Buffering and Caching System"

Topics: BSD: FreeBSD

People: Alan Cox, Jim Gettys, Oliver Xymoron

Jim Gettys gave URLs (http://www.cs.rice.edu/~vivek/iol98/ and http://www.cs.rice.edu/~vivek/vivekmsee/) for a paper called IO-Lite: A Unified I/O Buffering and Caching System, by Vivek Pai, Peter Druschel, and Willy Zwaenepoel, published in the Proceedings of the 3rd Symposium on Operating Systems Design and Implementation (OSDI '99), New Orleans, Louisiana, February 22-25, 1999, pp. 15-28.

The paper describes a method of IO, already used in FreeBSD, which should result in significant performance gains.

Oliver Xymoron was intrigued, but felt Linux wouldn't see the gains FreeBSD saw. One subthread coming out of that was a technical discussion of performance gains and implementation details, while in another, Jan Vroonhof suggested that the whole thing could be done in user space. This seemed reasonable to Alan Cox, and others agreed. There followed a brief technical discussion of implementation.

3. Disabling Intel Serial Numbers; US History

2 Mar 1999 - 5 Mar 1999 (47 posts) Archive Link: "[patch] PIII/Katmai & FXSAVE support, disable serial-#, 2.2.2"

People: Richard B. Johnson, Ingo Molnar

Ingo Molnar posted a new version of his Katmai patch, including changes from various sources. He added that this wasn't the final version, but he wanted to make it available for development. There followed a big discussion about the new Intel P3 serial numbers. Linux will disable the serial numbers, so the main question was, "could someone re-enable them from user-space?" A number of people were worried about this, while a number of others felt that there was no privacy in today's world anyway, so a serial number on our CPUs wouldn't make much difference. The following story from Richard B. Johnson is a fun highlight:

When I was a young kid, everybody was worried about the Russians. The whole country was watching each other to see if neighbors had any "Communist characteristics".

I thought that this was amusing. I had access to a Teletype-500 at school that was not connected to anything. However, by mucking around, I could make it print a random assemblage of characters.

I took a printout and made a "thermal" copy (we didn't have Xerox then). I then made a copy of the copy.

Then I went to a store and bought a box of envelopes. I took one out of the center, was careful not to touch it, and threw the rest away.

I addressed the envelope to Radio Moscow, Moscow, USSR, using a new ball-point pen which I also threw away. Without touching anything, I inserted the "encrypted" message, added a stamp and mailed it from a mailbox in a city 30 miles from where I lived.

Within two weeks, the FBI visited me at school. Although I never admitted nor told anybody anything, they knew everything. They even knew how I got to the city 30 miles from where I lived.

Those were the days in which J. Edgar Hoover kept a file on practically every man, women, and child. These records were kept by humans. Now they use computers.

If you don't want somebody to know what you are typing on your computer, turn off the power.

4. Undelete

2 Mar 1999 - 5 Mar 1999 (42 posts) Archive Link: "EXT2_UNRM_FL"

Topics: FS

People: Theodore Y. Ts'o, Mike A. Harris, Richard Gooch

The quest for the perfect undelete. This discussion comes up every once in a while, on linux-kernel as everywhere else. On linux-kernel it can get a bit more interesting because the people talking about it actually try to find a good way to implement it. In this thread, a number of top developers hacked at it for a while.

Theodore Y. Ts'o said, "It's not too hard to translate a unlink to a rename call, and move the file to some directory; but handling (a) undeleting the inode, (b) security so that users can't see other people's deleted files, and (c) automatically deleting "deleted" files when the filesystem needs space, all starts making the problem a lot harder...."

Richard Gooch said he wrote a userspace tool years ago to accomplish all that. Mike A. Harris objected, "And what if someone deletes using mc, or some other filemanager, or their own code? It isn't transparent. I'd rather see it in the filesystem myself," but Richard replied, "Write a wrapper for unlink() and use LD_PRELOAD. No need to bloat the kernel."

Ted agreed, but added, "It might be worth it for the kernel to add a wakeup to the undelete daemon telling it that space is low and it should remove some of the deleted files, but I'd want to see how well a strategy of polling every minute works (or doesn't work) before deciding whether the extra kernel bloat was worth it." He explained, "Someone deletes a large number of files on the system. Sometime later, some other user starts unpacking a kernel tree, and needs space. It would be nice for a daemon to notice that free space was getting critical, and for it to start really deleted files that were in the "trashcan" in order to make space. After all, when the user deleted the files, they're effectively stating that they don't want them anymore, and the undelete option is just as a saving throw. If they then logout for three months, if the space is needed for something more worthy, there should be a way for that space to be reclaimed automatically." Richard Gooch replied, "OK. Yeah, regular polling by the daemon could be used to see if you run out of space. Yep. Let's first see if a polling daemon is effective before hacking the kernel. The cost of polling should be really cheap, so doing it once a second or more should be OK."
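The polling idea Ted and Richard settle on can be sketched in a few lines of userspace Python. This is a rough illustration under stated assumptions, not Richard's tool: the reap_trash function, its trash_dir and min_free parameters, and the policy of removing the oldest files first are all hypothetical choices, and free space is read through shutil.disk_usage unless a different probe is passed in.

```python
import os
import shutil


def reap_trash(trash_dir, min_free, free_bytes=None):
    """Delete the oldest files in `trash_dir` until free space reaches `min_free`.

    `free_bytes` is an optional callable returning the current free space in
    bytes; by default the filesystem holding `trash_dir` is queried.
    Returns the names of the files that were removed, oldest first.
    """
    if free_bytes is None:
        free_bytes = lambda: shutil.disk_usage(trash_dir).free

    # Oldest first, so recently deleted files keep their "saving throw" longest.
    entries = sorted(
        (e for e in os.scandir(trash_dir) if e.is_file(follow_symlinks=False)),
        key=lambda e: e.stat(follow_symlinks=False).st_mtime,
    )
    removed = []
    for entry in entries:
        if free_bytes() >= min_free:
            break
        os.remove(entry.path)
        removed.append(entry.name)
    return removed
```

A daemon would simply call this in a loop with a sleep between passes (once a minute in Ted's version, once a second or more in Richard's), which is the cheap experiment both wanted to try before adding a kernel wakeup.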

5. RAID And CONFIG_FILTER Troubles Under 2.2.2

3 Mar 1999 - 5 Mar 1999 (18 posts) Archive Link: "2.2.2 two major problems"

Topics: Disk Arrays: RAID, FS: NFS

People: Mike Black, Alan Cox, Mark Lord, Andrea Arcangeli

Mike Black posted the following raucous flame:

Linus (and all CD creators) -- please do NOT make 2.2.2 the CDROM release version!!!

  1. Had to rebuild a RAID array 3x8G raid5 set. mke2fs causes entire operating system to hang due to "out of memory" condition (I've got 128Meg in the machine). Looks like the attempted fixes in inode.c are not successful yet in flushing the buffers. Got it to finally work by reducing the RAID to 3x6G, and using 16384 for the bytes-per-inode -- this could've been a show stopper (my RAID had failed).
  2. 2.2.2 breaks CONFIG_FILTER (won't compile with this option) -- this means I can't upgrade one of my machines (DHCP server requires CONFIG_FILTER)

I'm not usually one to flame but here goes...

I've been using Linux for 3 years+ for the main reason that support has been outstanding. However, with the release of 2.2.2 it looks like either Linus is taking a vacation (well deserved I might add) or everybody fell off the face of the earth.

When can we hope to have 2.2.3 with these important fixes?? Even a pre-patch release would be welcome (which I generally avoid). The above two problems would be extremely embarassing for the Linux community to be on thousands of CDROMs.

I don't want to change over to the ac kernel series but I'm really disappointed to see that work has apparently stopped for the moment on the main Linux branch.

Alan Cox replied to this, saying that the second problem had been fixed a couple of weeks ago, and Mike could pick up the solution from the linux-kernel archives. To the first problem, he said, "Ok they've been working for pretty much everyone. Congratulations on finding a torture test."

Mark Lord interrupted, with:

Err.. perhaps not everyone, Alan.

Since adopting 2.2.2, I've had:

I've done all the Right Things and have all the Right Stuff in this box.

Still feels like a development kernel here, ever since the MM code got re-designed just before "release".

Romano Giannetti confirmed he also had problems, but hadn't said anything on the list because they weren't reproducible.

While this was going on, Mike Black posted some raucous praise under the Subject: 2.2.3-pre1+patch Success with RAID and DHCP (http://www.uwsg.indiana.edu/hypermail/linux/kernel/9903.0/0913.html):

Good job guys!!

2.2.3-pre1 fixes the CONFIG_FILTER option which allows DHCP to work.

I had reported the mke2fs on an 3x8G RAID array was running out of memory. I'm using raid0145-19990128-2.2.0.gz

Andrea Arcangeli posted a patch which I applied to 2.2.3-pre1 and I regenerated my RAID array with no problems at all. The inode buffers looked like they got flushed out just fine -- no swapping occurred, no out-of-memory conditions.

He also posted the lengthy patch that allowed mke2fs on the RAID set.

6. Trident To GPL A Sound Driver

3 Mar 1999 - 4 Mar 1999 (18 posts) Archive Link: "which pci sound card does not have dropouts under load"

Topics: PCI, Sound: ALSA, Sound: SoundBlaster, USB

People: Alan Cox, Brian Gerst, Aaron Tiensivu

Modems

Thomas Mertes asked what would be a good sound card to use under Linux, given that some companies don't release their specs. His Soundblaster 16 Vibra from Creative is not really supported, due to the specs problem.

Alan Cox wrote:

Trident are currently about to contribute a GPL'd driver to the ALSA project. That will be the first good vendor provided PCI sound driver for a 'current' sound card if so. If they have now contributed it, it is GPL and all the promises have been met then I would recommend you look at the ALSA list and the cards that new PCI driver supports.

http://alsa.jcu.cz

If you wait another year my guess is the PCI sound card will be mostly dead for consumer applications, killed by USB sound. USB sound is an open standard for most stuff, its freely downloadable and the USB project could do with more people working on it BTW 8)

There followed a brief discussion about problems with USB. Aaron Tiensivu pointed out that USB devices passed most of the work to the CPU. Brian Gerst mournfully compared this to the WinModem problem, though Alan added, "Its a combination. Basically you get what the USB kit can do and the rest is a mix of software and the USB chips (they do all the DMA and the like for you including fairly hard real time output)."

7. Buffer Overflow Attacks; Big Memory Machines

3 Mar 1999 - 7 Mar 1999 (25 posts) Archive Link: "Linux Buffer Overflow Security Exploits"

Topics: Big Memory Support, Microsoft, Security

People: Sarah Addams, Oliver Xymoron, Rogier Wolff, Alan Cox, Tuomas Heino, Linus Torvalds, Stephen Tweedie

Sarah Addams said, "Excuse my ignorance, but would someone explain to me why Linux and other Unices are vulnerable to buffer overflow exploits? I suspect it's because the code, data and stack for a given process is kept in a single memory segment, but I'm not at all sure about that. If however, I'm right, would Linux Alpha, running on a Harvard architecture 21164, be immune from this weakness?"

Oliver Xymoron responded:

No. The problem can be broken down into the following subproblems:

There's currently a good fix for the compiler problem known as Stackguard which is as close to a real fix as currently exists. Patches exist to disable execution on the stack, but exploits to bypass it have been shown.

In a later post, regarding the fourth item above, he commented, "I'm going to rescind my statement that this is less common - most overflows occur inside library functions operating on stack variables in an application. This is actually worse - unless the library was compiled with Stackguard, you can't protect against it, as the exploit occurs when the library function returns rather than when the caller does," and added the interesting tidbit, "Stacks growing downward is not completely arbitrary, either. It's very convenient to have a known starting point for both code and data. Since code likes to run upwards, it makes sense to have code start near the bottom. Since the optimal way to put two growing structures (code+heap and stack) in a linear array is to start them at opposite ends and have them grow inward, the stack gets placed at the top. Less of a concern when you have ridiculously large address spaces like the Alpha, but still not arbitrary."

In response to Sarah's original post, Rogier Wolff reminisced:

I had the chance of reading a part of a Microsoft Windows NT driver. I found a buffer-overflow within half an hour. I suspect that this will happen with ANY NT driver that you get to inspect. Scanning the source is a bit more comfortable than writing a program that continually tries to pass illegal arguments to system calls, because when you succeeed, you will crash your machine many times before you have an "exploit".

Exploiting that overflow will obliterate all security checks, as the driver executes in the kernel addressing space.

The controversy is that by publishing all security problems that you fix, you give the impression that there are a lot of them, while the parties that DON"T publicly admit the same problems, will give the impression of having less bugs.

Personally I prefer having the bugs known and fixed, over semi-known (*) and unfixed.

(*) To use an old example, read all the SunOS-alerts, and try and apply them to say AIX or HPUX. And the cracker community will pass the bugs that they find around silently without you noticing....

Alan Cox, also responding to Sarah's original question about why UNIX is vulnerable to buffer overrun exploits, said, "Because like basically all computers you don't have hardware type and size tags on all pointers. There are approaches to reduce the probability of that error but reading and checking code is the most productive. Logic errors tend to be as big a problem."

Sarah replied with, "Isn't it the case for Intel 386 and up processors, as is true for other modern processors, that memory segments can be marked execute, read and/or write by a process running at a sufficiently high privilege level. So if you write your kernel to take advantage of these features, you could guard against the case where a buffer overflow is used to sneak code into an otherwise secure system?" She supposed her original question could be boiled down to, "Does a Linux (and/or other Unix) process inhabit a single read/write/execute memory segment?" to which Alan responded, "No. The executable image is read only or copy-on-write. The stack however has to readable (obviously) and x86 has a problem with page level permissions for execute only (it hasnt got them)." He remarked, "Forget about segments, intel has been trying to de-invent them since the 386."

Tuomas Heino demanded, "And how do you use more than 4 gigabytes of RAM on these boring Intel boxes _without_ segmentation? ;) [... and don't say "ram drives" - unless you have a new meaning for that," and that formed the subject of the rest of the thread.

Later on, in a different thread altogether, Linus Torvalds made this comment:

Actually, while I've always looked at the 36-bit extensions with extreme distaste, Stephen Tweedie convinced me that we can really cleanly and fairly easily support it in a perfectly reasonable manner. It won't be 100% support, but it looks better than just using it as a ramdisk (we can basically use it for page caches and anonymous pages without having to get ugly about it - they have enough of an abstraction layer that it doesn't impact the rest of the system much at all)

So we may actually end up with a reasonable support for it. The page cache and anonymous memory is really what people want anyway, other memory uses are basically "fluff" compared to those two.

We'll see. The proof is in the code, and while Stephen made a very good case for something that I would accept, we'll just have to see how it works out. Right now the alpha/sparc64 approach is still the only reasonable choice with Linux.

8. Upgrading To 2.0.37 On Red Hat 4.2

4 Mar 1999 (2 posts) Archive Link: "kernel 2.0.37 and old distro"

People: Alan Cox

Carrer Yuri asked what he would have to upgrade (other than his binutils) on his Red Hat 4.2 system to use kernel 2.0.37. Alan Cox replied, "Nothing." He also said he was trying to make sure it stayed that way.

End Of Thread.

9. Intel Not Releasing Specs On EtherExpress PRO/100 Server Adapter

4 Mar 1999 - 7 Mar 1999 (15 posts) Archive Link: "Intel EtherExpress PRO/100 Server Adapter"

Topics: I2O, PCI

People: Alan Cox, Marek Habersack, Deepak Saxena, Stephen Williams

Marek Habersack posted looking for a driver or for a tip on who to ask. Alan Cox replied, "I think it is a matter of either Intel or Divine Intervention. What I don't know is if that card is an I2O device. If so then in time the I2O code will handle it - check with Intel"

Marek came back with, "The datasheet on the Intel pages (http://www.intel.com/network/products/pro100int.htm) doesn't explicitly say whether it's an I2O device. It comes with an on-board i960 CPU, but I suppose it doesn't mean it's an I2O device :-)))"

Alan offered, "If you are feeling really brave you can boot my very early I2O patches kernel and see if that thinks its I2O. Its just about able to get as far as talking to an I2O device, setting it up for some subsets of devices and asking for the vendor information and printing it," when Deepak Saxena poured cold water on both of them, with, "The Pro/100 Server Adapter is not an I2O device. The 960 is just used to run the firmware for the card and do buffer mnagement functions under NT and Netware BorderManager. The Pro100 driver won't work since the driver has to go through the firmware and doesn't talk directly to the 82558 MAC chips is my guess."

Marek pined, "But the card works under M$ Winblows with the standard EEPRO/100 driver, so I presume it should work with the standard Linux EEPRO100 driver," to which Deepak replied, "Oh...that's different than. Does the kernel support pci-pci bridges? It's possible that if the 960 firmware isn't initialized by the the Pro/100 server specific driver, it just acts as a dumb PCI bridge...just an educated guess."

Stephen Williams came in with, "I've been trying to get programming information out of Intel for months now. I have a development kit that we use for our own i960Rx boards and would work dandy with this board, I just need a few more bytes of information to write a board support interface," and added, "I have a case # and everything. If anybody can shake loose the information I need, I can get the ball rolling." He also replied to Deepak's "educated guess", with:

And a correct guess, sorta.

Actualy, the i960Rx has a few mode pins that tell it what to do at reset time. The options are:

The last 2 leave the bridge acting like a normal bridge. The first allows the processor to first take over and mark some devices and private before allowing the BIOS through the bridge.

10. /proc Docs; Performance Of SMP Kernels Running On UP Systems

4 Mar 1999 - 5 Mar 1999 (4 posts) Archive Link: "SMP on a UP system, and Hitchhikers guide to the proc file system?"

Topics: SMP

People: Christopher McCrory, Alan Cox

Christopher McCrory asked for docs on /proc and was pointed to /usr/src/linux/Documentation/proc.txt. He also asked whether an SMP kernel would run slower on a UP machine, to which Angus Mackay said yes, there would be a performance hit. He went on:

I did some thread creation latency benchmarking across 2.0/2.1,w&woSMP/winNT/BeOS on a UNI machine and 2.1 NON-SMP rocked them all (the only one that could compete was solaris x86 (data mesured using the tsc register)).

here are the results:

OS                   Average latency   Average min lifetime
tb.lin.2.1.su.out:          15429.88               43262.35
tb.lin.2.1smp.out:          24820.78               73181.09
tb.lin.su.out:              36952.43               47082.92
tb.be.out:                  97613.29              133678.29
tb.nt.cgw32.out:           150206.23              194500.16
tb.nt.out:                 146374.37              177591.14

(solaris was left out because I did it on a pII 400)

these are in clock ticks of an AMD k6-233. all the benchmark did was create a thread and wait for it to finnish, all the thread did was record the value of the tsc register.

this is not a good overall benchmark but you can see that the smp kernel can't schedule thread creation and deletion as well as NON-SMP.

Alan Cox added, "The SMP kernel wont run on some non SMP boxes, and is a bit bigger and slower, maybe 15% in real terms from experience. The question is whether 15% is worth the hassle of maintaing two kernel sets in your environment"
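For a sense of scale, Angus's two Linux 2.1 rows can be turned into ratios. The short Python sketch below does only that arithmetic; the figures are copied from his table, and reading tb.lin.2.1.su.out as the non-SMP run and tb.lin.2.1smp.out as the SMP run is an assumption based on the file names. Keep in mind that this is a thread-creation microbenchmark, which Angus himself calls not a good overall benchmark, so it is not directly comparable to the overall 15% Alan mentions.

```python
# Figures copied from the table above, in clock ticks of an AMD K6-233.
# Each entry is (average latency, average min lifetime).
results = {
    "tb.lin.2.1.su.out": (15429.88, 43262.35),  # assumed: 2.1 non-SMP kernel
    "tb.lin.2.1smp.out": (24820.78, 73181.09),  # assumed: 2.1 SMP kernel
}

up_latency, up_lifetime = results["tb.lin.2.1.su.out"]
smp_latency, smp_lifetime = results["tb.lin.2.1smp.out"]

# Ratio > 1 means the SMP kernel took more clock ticks than the UP kernel.
latency_ratio = smp_latency / up_latency
lifetime_ratio = smp_lifetime / up_lifetime

print(f"thread creation latency, SMP/UP: {latency_ratio:.2f}x")
print(f"minimum thread lifetime, SMP/UP: {lifetime_ratio:.2f}x")
```

On these numbers the SMP kernel needs roughly 1.6 to 1.7 times as many ticks for thread creation and teardown on this one uniprocessor machine.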

11. CPU Quality

4 Mar 1999 - 7 Mar 1999 (21 posts) Archive Link: "RH5.2 won't install on Cyrix"

People: George Bonser, Alan Cox, Bob Tracy, Denis Voitenko, Anthony Barbachan, Mike A. Harris, Rafael Reilova

Denis Voitenko's machine kept rebooting whenever he reached the "second level setup" message during installation. Alan Cox and George Bonser agreed that the problem was probably the CPU overheating. George had some workarounds:

My fix is to boot the install disk, get to the initial installation menus, then swap floppies with one containing the set6x86 package, I run that, then swap back and continue with the install. The key is running set6x86 as soon as possible in the install process before the CPU gets too hot.

Adding a thin coating of heatsink compound between chip and heatsink helps too. It is available at any electronics parts store and at such places as Tandy/Radio Shack.

Denis asked if there were a way to trick RedHat into thinking the CPU wasn't overheated. Rafael Reilova suggested he underclock the processor as much as possible, while Alan said, "If you managed to trick the chip into not noticing it had overheated and thus doing a thermal powerdown it would just damage itself."

Bob Tracy posted some advice:

AFTER you take care of getting some adequate cooling in the peecee case (heat sink w/thermal compound, ball-bearing CPU fan, case fan OTHER than the puny throwaway in the power supply, etc.), look up a nifty utility called 'set6x86' and pay particular attention to how you set suspend-on-halt to drastically reduce the power consumption of your Cyrix. For me, set6x86 made the difference between having a useful machine, and getting signal 11's everytime I tried to compile the kernel. Yeah, I'll be trying something besides Cyrix the next time around unless they are designing their newer CPUs with a bit more margin in the heat dissipation department. I'd hate to think what kind of problems I would be having with a "normal" Cyrix CPU: mine is a 6x86L, as in "low power".

On the plus side, my machine has been stable for well over a year after taking all the measures described above. Bottom line: the Cyrix will work fine for you, but you absolutely HAVE to pay attention to the heat issue or it will eat your lunch. I can envision environments where a Peltier cooler would be necessary... A year ago when I was seriously looking into such things, they could be had for around US $30.

Denis mused that maybe the best thing would be just getting generic Intel chips, adding "you get what you pay for, right?"

Anthony Barbachan replied, "their are plenty of both buggy and bad intel chips out there as well," and Mike A. Harris said, "Since when has Intel been known for providing chips that run cool? They never have to my knowledge."

Alan added:

A lot of problems are actually simply caused by dumb PC builders. They do stuff like stick a price tag on top of the CPU, on what _was_ a nice extremely flat heat contact surface.

I've had trouble with the Cyrix-MII and heat which got solved by using proper heatsink compound (I cannot believe vendors skip this - its virtually zero cost to them and it makes a lot of difference).

The K6 I've had no problems with, although I know friends who had dumb vendors doing the price tag on the CPU top trick who did have problems. Special bonus goes to the vendor who put an "if removed this voids warranty" serial number sticker on the CPU. Duhhhh.

If you look on Usenet you'll find people reporting "mpeg encoding mysteriously slows down on my new PIII after 15 minutes" - that appears to be PIII cooling problems they are having (the PIII seems to drop to 1/4 speed instead of just executing a shutdown cycle or dying randomly as the Cyrix does)

12. Philosophy Of Kernel Development

4 Mar 1999 - 5 Mar 1999 (12 posts) Archive Link: "*sigh* Please give me something..."

Topics: Development Philosophy

People: Zygo Blaxell

A very frustrated Jeremy Hansen was having daily crashes with 2.2.2ac7 with raid patches. He complained that 2.2.2 should never have been released as 'stable'. In reply, Zygo Blaxell summed up the philosophy of the even numbered releases: "Later 2.2.x kernels will of course work for more and more people, but the early 2.2.x kernels are really only a week or two different from the late 2.1.x kernels. "Normal" users who rely on hard stability should wait until a real Linux vendor sorts out the bugs and shrink-wraps the kernel into a nice well-tested package for them. Everyone else will be busily banging on the 2.2.x until then."

One of the strengths of free software is that it's possible to have stability as a goal rather than a guarantee. Commercial vendors have their reputation on the line, so they have to stand by their buggy products, while free software writers can persistently improve things.

13. Some Explanation Of I/O Caching

4 Mar 1999 - 5 Mar 1999 (3 posts) Archive Link: "90% mem for buff??? (2.2.2ac7 other not tested...)"

People: Kurt Garloff

An often repeated question:

Gregoire Favre noticed his buffers were huge, making it look as though all his memory was tied up. Kurt Garloff said, "this also puzzled me, when I started Linux hacking in 1994. You think, why does Linux need so much memory?" He explained, "You should look at it from the other side: There is hardware, CPU, RAM, etc. and Linux tries to make the best possible use of it. Now, provided no process needs your RAM, it would be just stupid to leave it unused. Linux uses it for buffering (caching) disk I/O and speeding up I/O operations this way. As soon as your processes need more memory, they will get what they ask for and the buffer will shrink, until it is only a few percent of total mem. Only then, Linux will start to page(swap) to disk."
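Kurt's point can be checked from user space. The following is a minimal Python sketch, not something posted in the thread: it parses text in the "Name: value kB" style of /proc/meminfo and separates memory that is neither free nor buffer/cache from memory the kernel is merely borrowing for disk caching. The field names MemTotal, MemFree, Buffers and Cached are assumed to be present, and the function names and sample figures are made up for illustration.

```python
def parse_meminfo(text):
    """Parse 'Name:   123 kB' lines into a dict mapping names to integer values."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip anything that is not in 'Name: <number> [unit]' form.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_summary(fields):
    """Split RAM into reclaimable buffer/cache and everything else that is in use."""
    reclaimable = fields["Buffers"] + fields["Cached"]
    in_use = fields["MemTotal"] - fields["MemFree"] - reclaimable
    return {
        "reclaimable_kb": reclaimable,
        "in_use_kb": in_use,
        "reclaimable_pct": 100.0 * reclaimable / fields["MemTotal"],
    }


# Made-up figures for a 128 MB machine, for illustration only.
SAMPLE = """MemTotal:       131072 kB
MemFree:          4096 kB
Buffers:         98304 kB
Cached:          16384 kB
"""

if __name__ == "__main__":
    print(memory_summary(parse_meminfo(SAMPLE)))
```

In the sample, most of the RAM looks "used", yet only a small part is unavailable to a process that asks for memory; the rest is cache that shrinks on demand, which is the behaviour Kurt describes.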

14. Scheduling On SMP Machines

4 Mar 1999 - 5 Mar 1999 (4 posts) Archive Link: "Scheduler oddness"

Topics: SMP, Scheduler

People: Mathew G Monroe

Rui Sousa noticed that when running 3 similar processes on his SMP machine, one of them seemed to get the lion's share of the CPU time. He noticed that running 4 similar processes got rid of the problem. Mathew G Monroe replied, "Yes, on SMP machines an advantage is given for processes to stay on the same CPU (preventing cache thrashing). As such basically with three processes, you get one on one CPU, and two processes on the second CPU. They will CPU jump a little, but it is discouraged by the scheduler to prevent worse overall performance. So you end up with one process getting a lot more CPU time."
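The arithmetic behind Mathew's answer can be sketched with a toy model. This is not the kernel's scheduler: it assumes processes never migrate at all (the real scheduler only discourages it) and that each CPU divides its time evenly among the processes parked on it. The function name cpu_shares is made up for this sketch.

```python
def cpu_shares(n_procs, n_cpus):
    """Toy model: each process stays on one CPU, which splits its time evenly.

    Returns, for each process, the fraction of one CPU it receives.
    """
    per_cpu = [0] * n_cpus
    home = []
    # Place processes round-robin, then never move them (the "sticky" assumption).
    for p in range(n_procs):
        cpu = p % n_cpus
        per_cpu[cpu] += 1
        home.append(cpu)
    return [1.0 / per_cpu[cpu] for cpu in home]


if __name__ == "__main__":
    print(cpu_shares(3, 2))  # one process has a CPU to itself
    print(cpu_shares(4, 2))  # two per CPU, so the shares even out
```

With three processes on two CPUs, one process gets a whole CPU while the other two share the second; with four, every process gets half a CPU, which matches what Rui observed.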

15. Socket Filtering Broken In 2.2.2

5 Mar 1999 - 6 Mar 1999 (2 posts) Archive Link: "linux-2.2.2 won't compile"

People: Tuomas Heino

Ingo Saitz was getting the error "dereferencing pointer to incomplete type" in /usr/src/linux/drivers/net/loopback.c, when doing 'make oldconfig' for 2.2.2; doing a 'make -ki' pointed him to /usr/src/linux/include/net/sock.h

Tuomas Heino replied, "It's a known problem: The socket filtering patches weren't properly merged in 2.2.2; disable socket filtering if you don't need it; if you need it then check out the mailing list archives for the million messages on the subject..."

16. Portable I/O Memory Access

4 Mar 1999 - 7 Mar 1999 (16 posts) Archive Link: "user space writel() etc. in 2.2.2"

Topics: PCI

People: Vassili Leonov

Vassili Leonov wrote, "I'm trying to redo our LML33 (http://linuxmedialabs.com/lml33doc.html) card driver and configuration scripts and have run into the following: in the io.h file functions to read/write IO memory (writel(),readl() etc.) are only defined with __KERNEL__ macro defined. Under 2.0.36 that was not the case." He added, "I don't like setting __KERNEL__ for user space applications, besides it produces a lot of errors when I try to compile," and asked, "Q: what is the right way to access IO memory under 2.2.2?"

There followed a discussion of the technical issues involved. Finally, Vassili summed up, "OK, finally I am enlightened (by Alan and others). The story is that there is no portable solution for IO memory access in current Linux. It is platform dependent in a non-trivial way."

He quoted a map posted by Alan:

ARM - no cache coherency
Alpha - offset PCI mapping. PCI I/O space is mapped in strange ways varies by platform
PMac - PCI bus is endian swapped in hardware
x86 - Sane (well its the reference bus)
Sparc64 - Basically sane - again PCI I/O space is mapped as memory
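One entry in Alan's map, byte order, is easy to illustrate in isolation. The Python sketch below touches no hardware and models no particular platform; it only shows that the same four bytes decode to different 32-bit values depending on the byte order assumed. This is one reason a single user-space routine cannot read device registers portably. The function name read_reg32 and the sample bytes are hypothetical.

```python
import struct


def read_reg32(raw, little_endian):
    """Decode a 32-bit register value from 4 raw bytes using the given byte order."""
    fmt = "<I" if little_endian else ">I"
    return struct.unpack(fmt, raw)[0]


if __name__ == "__main__":
    raw = bytes([0x78, 0x56, 0x34, 0x12])  # bytes as they sit in memory
    print(hex(read_reg32(raw, little_endian=True)))
    print(hex(read_reg32(raw, little_endian=False)))
```

The other entries in the map (cache coherency, offset mappings) cannot be papered over this simply, which is Vassili's point about the problem being platform dependent in a non-trivial way.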

17. HP Begins Closed Port To Merced (To Be GPLed Eventually)

5 Mar 1999 - 7 Mar 1999 (8 posts) Archive Link: "Linux/IA-64 (Merced)"

People: David Mosberger-Tang, Alan Cox, Larry McVoy, Gerard Roudier, Kai Henningsen, Andrea Arcangeli, David Mosberger

David Mosberger-Tang wrote:

Disclaimer: I happen to be an HP employee but I'm making the following comment as an individual and what I'm saying is my own opinion. I decided to send this mail because there seems to have been some confusion as to what state Linux for IA-64 (and Merced) is in.

First off, if you have a chance, I'd like to invite you to attend my talk on Linux/ia64 at Linux Expo '99 in Raleigh, NC (http://www.linuxexpo.org/). As a quick summary, let me say that HP Labs has been working on bringing Linux to IA-64 (and Merced) for over a year now. We view this project as a regular Linux project except that there is a bit of a constraint due to the fact that IA-64 is still under non-disclosure. Thus, at this point we can't invite everybody to join the development, but as usual in a free software project, everybody who can contribute is welcome to do so. Also, the result will obviously be available under the GPL as soon as IA-64 machines are generally available (and yes, we're very careful to do everything according to the letter of the GPL, so please don't let this degenerate into a legal issues discussion).

I realize the above may read a bit cryptic, but what I'm trying to say is that I really hope that the fact that companies are starting to make real money from Linux won't mean that Linux developers will stop working together.

Andrea Arcangeli's only comment was to quote the two lines, "we can't invite everybody to join the development" and "... make real money from Linux won't mean that Linux developers will stop working together," to which Alan Cox replied, "Nod. But Intel are unlikely to risk AMD cloning their new chip before it's out. Hopefully VA Research and others who have announced other ports (I assume they are separate) will decide to work together though." He added, "If the entire rumour mill is correct there are IBM and SGI porting projects too 8). But then Linux/3090 was a rumour mill item that hasn't occurred so who knows the real count."

Larry McVoy added, "There is one point which perhaps should be made more clear: if you are under NDA for IA-64 and you want to help with the IA-64 port, then you should contact David and coordinate with him. Just because you don't work at the same company doesn't mean you can't talk to each other about this port. I know of at least 4 companies who are all NDA'd on IA-64 and are all doing work on it and may or may not be talking to each other. If you are interested in contacts and have IA-64 NDA, let me know, I can put you in touch with the people inside the other companies."

Gerard Roudier gave the wry remark, "Times have changed. Seems Linux folk is now talking about NDAs without any flame," to which Kai Henningsen replied, "The important difference here, IMHO, is that the NDA is limited to the time period where the hardware isn't available." He added, "That's extremely different from the typical flame case where the hardware has been available for a while, but information is still available only under NDA and there's no hope the NDA will *ever* be lifted."

18. Toshiba Spec And Licensing Silliness

5 Mar 1999 - 7 Mar 1999 (7 posts) Archive Link: "Hard Drive Cache information"

Topics: Microsoft

People: Alan Olsen

Alan Olsen had the highlight of this thread, with, "Toshiba has started shipping their latest laptops in a plastic wrapper that states that by opening the wrapper you agree to the Microsoft Licence Agreement. (i.e. you cannot return the MS portions to get out of paying cash to Bill.) They seem to be sucking up to Microsoft in a big way."

The general consensus on the list was that Toshiba is not releasing specs, so boycott Toshiba till they get a clue. Sounds like a good idea to me.

19. FAT Speedup

6 Mar 1999 - 8 Mar 1999 (4 posts) Archive Link: "FAT speedup patch revisited for 2.2.1"

Topics: FS: FAT, FS: VFAT, Microsoft

People: Jukka Tapani Santala, Riley Williams

Jukka Tapani Santala posted this very interesting exposition:

Sometimes, operations on the old FAT based filesystems (VFAT, MSDOS etc.) on Linux appear almost draggingly slow. There's no magic bullet to solve this (Or if there is, let me know;) but after performing extensive profiling on the 2.1.x series of kernels I pinpointed some places of the code that are crucial during high filesystem-load times, and embarked on optimizing them. I have now revisited that patch, making sure it applies and works for 2.2.x series kernels, fixed things up and performed some further benchmarking. Now, the catch is that this patch is assuming:

It seems to me these two conditions are true in general; no exceptions have been brought to my attention since I've started posting these optimizations. Some parts of the code appear to already be more or less assuming the above. But be forewarned, if for some reason either one isn't true, applying and running this patch may lead to filesystem corruption. I'd like to know of any such instances. Now, with the general babble out of the way, a few notes on the choices made in this:

The major improvement is replacing cluster-size by "cluster-shift", the two's power for the effective cluster-size. Most of the time either multiplying or dividing by the cluster-size is required, and because cluster-size itself is a variable, the compilers have no way to automatically optimize these operations to shifts when needed. Division is so expensive an operation that the performance-boost from this is noticeable.

Even in the one tricky modulus-operation on the first chunk of the patch, the bit-operation is much preferable to taking real modulus. Further improvement might be gained by calculating and keeping around another bit-mask for the modulus, though, but I have not considered that worth increasing the memory-profile.

The second improvement involves the function used to translate standard FAT style filenames into *nix filesystem type filenames. In this process the 8.3 type filename pieces are read from the FAT record, put into lower case and concatenated into result string for display. Unfortunately, quick look at the code shows that this bit of code gets so deeply nested it ends up with pretty poor optimization. I tried several approaches to get all out from the processor on this code, and finally settled on the attached approach.

Here, to avoid running out of registers to use, the actual task has been divided up to a number of very simple, straightforward tasks executed after each another. Namely, we find the length of the filename portion; we copy it down. Next we find the length of the extension and copy it after, and finally we turn the whole result string to lower case. This appears to work pretty well in practice.

With some experimenting I finally settled on the current approach here, where the filename-portions are scanned in reverse order to find their length - this way we won't have to scan the whole string for last non-whitespace, and also most filenames are close to 8 characters, so loop-count will stay low.

Finally, in an earlier version of this patch, the case-conversion was done using a partial case-table. After some dry-profiling, I found out this approach is <10% faster even when the table can be counted to stay in cache all the time, so in this version I have reverted back to the approach using two compares and increment used elsewhere in the code, for a grand saving of 32 bytes.

So... if you're using FAT based filesystems on your computer, feel free to give this patch a try. If, for some reason, the message "fat_read_super: Cluster size was not power of two!" appears during bootup or mounting, do not use the patch, but let me know of the details instantly. Of course, this should not happen <g>. Any other problems, let me know as well. I'm hoping this patch can be included in the next developmental kernel version (2.3.x?)
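The "cluster-shift" idea rests on a simple identity: when the cluster size is a power of two, dividing by it is a right shift and taking the remainder is a bit-mask. The Python sketch below shows only that arithmetic. It is not the patch itself, which is C code in the kernel, and Python gains no speed from it. The names cluster_shift and split_offset are made up for illustration.

```python
def cluster_shift(cluster_size):
    """Return log2(cluster_size); raise ValueError if it is not a power of two."""
    # A power of two has exactly one bit set, so size & (size - 1) is zero.
    if cluster_size <= 0 or cluster_size & (cluster_size - 1):
        raise ValueError("cluster size is not a power of two")
    return cluster_size.bit_length() - 1


def split_offset(offset, shift):
    """Split a byte offset into (cluster index, offset inside that cluster)."""
    mask = (1 << shift) - 1  # computed on the fly rather than stored
    return offset >> shift, offset & mask


if __name__ == "__main__":
    shift = cluster_shift(4096)
    print(shift, split_offset(10000, shift), divmod(10000, 4096))
```

The power-of-two check mirrors the warning message Jukka mentions: if the cluster size is anything else, the shift and mask no longer agree with real division and modulus, so the optimization must not be used.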

Riley Williams found an interesting exception:

I can't comment on the first of those two assumptions as I've no specific information on it, other than that all of the implementations I'm aware of follow it. However, I can make the following observations regarding the second assumption:

  1. In MS-DOS 1.xx filenames were padded with spaces, but extensions were inconsistantly padded as follows:
    1. If the entire extension was unused, three spaces were used to fill the field.
    2. If at least one character was used, the field was padded with NUL (zero) bytes rather than spaces.
  2. According to MicroSoft's own documentation, from MS-DOS 2.00 the file system would always pad both fields with spaces where required.

I'll agree that systems running MS-DOS 1.xx are probably not in the majority, but that's the only variation from that specification that I am aware of...

Jukka replied:

Actually, a nul-termination doesn't matter in the extension bit, as this will just lead to few more nulls at the end of the filename, which get skipped. I'm not sure if it'd make difference with long filenames, but somehow I suspect there aren't many MS-DOS 1.xx VFAT-setups out there...

Anyway, there's another "problem" now: although I've rewritten the actual copy-over code in fat_readdirx() about as fast as it goes, the problem is that it seems to change the code placement and optimizations in subtle ways, causing the rest of the function, unchanged, to get dirt-poor performance, at least on -O2.

The whole fat_readdirx() function is actually an optimizer's nightmare, consisting of over 200 lines of deeply nested code. I've tried breaking it up into parts a few times, without success, as the different parts are dependent on so much state-information. Perhaps if I manage to optimize the rest of the code, too, at least that will change the code arrangement and possible optimization results.

Of course, there's a philosophical question: if you write a bit of source as fast as it can go, and then the compiler/optimizer messes it up making it slower, is it still better? Oh well. The bit-shift part of the patch does speed things up currently (On ALPHA it may not, I'm told, but how many people use *FAT on ALPHA?) but the changes to fs/fat/dir.c lead to slower overall performance, 'though the optimized part flies.
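The filename conversion discussed in this thread can be sketched in a few lines. What follows is an illustration in Python, not the kernel's fat_readdirx() code. It follows the steps Jukka lists: find each field's length by scanning backwards, copy the name, copy the extension, then lower-case the result using two compares and an add. Treating NUL bytes as padding alongside spaces is an assumption drawn from Riley's MS-DOS 1.xx observation. All function names are hypothetical, and only ASCII names are handled.

```python
def trimmed_length(field):
    """Length of a field without trailing padding, found by scanning from the end."""
    n = len(field)
    while n > 0 and field[n - 1] in (0x20, 0x00):  # space or NUL padding
        n -= 1
    return n


def to_lower(b):
    """Lower-case one ASCII byte with two compares and an add, no lookup table."""
    if 0x41 <= b <= 0x5A:  # 'A'..'Z'
        return b + 0x20
    return b


def fat_to_unix_name(entry):
    """Convert an 11-byte 8.3 directory name (8 name + 3 extension) to 'name.ext'."""
    if len(entry) != 11:
        raise ValueError("an 8.3 directory name is exactly 11 bytes")
    name, ext = entry[:8], entry[8:]
    out = bytearray(name[:trimmed_length(name)])
    ext_len = trimmed_length(ext)
    if ext_len:
        out += b"."
        out += ext[:ext_len]
    return bytes(to_lower(b) for b in out).decode("ascii")


if __name__ == "__main__":
    print(fat_to_unix_name(b"README  TXT"))
    print(fat_to_unix_name(b"MAKEFILE   "))
```

Scanning from the end keeps the loop short for names that nearly fill the 8-character field, which is the common case Jukka points to.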

20. Binary Module Flamewar

6 Mar 1999 - 9 Mar 1999 (37 posts) Archive Link: "Lets get this right (WAS RE:MOSIX and kernel mods)"

Topics: Binary-Only Modules, Clustering: Mosix

People: Larry McVoy, Richard Gooch

This was a straightforward flame war growing out of the MOSIX binary module issue. The most interesting thing to come out of it was that Larry McVoy and Richard Gooch, who were at each other's throats last week over the issues in Issue #8, Section #2 (24 Feb 1999: Real-Time Scheduling Flame War), found themselves on the same side this time, to their great surprise. At one point Larry said, "I'm really giving myself a bad rep here, because I have to keep agreeing with Richard :-) But hey, when he's right, he's right, what can I do :-)"

21. videodev/mm Fix

6 Mar 1999 - 7 Mar 1999 (10 posts) Archive Link: "BUG in videodev.c [PATCH]"

People: Richard Guenther, Linus Torvalds

Richard Guenther offered a patch and said, "A bug in videodev.c lets you mmap device memory and free it via close() - this leaves you with a nice "Blanko" mapping of some memory. For example the bttv driver is vulnerable as it frees its buffer in close(). (The drivers itself cannot fix it, since they dont get the vm_area_struct)"

Linus Torvalds replied, "This patch looks like extra code to get around some other driver bug, and I'd much rather just see the driver fixed properly." In a later post he added, "I see what the problem is now: it appears that the driver has a completely unrelated bug, which is that it doesn't save away the file in the vma, which is why the system gets confused: it doesn't know that the mapping still holds on to driver resources." He gave his own fix, but replied to himself a few minutes later with, "Duh. Looking at the original patch, this was what the patch did, I was just confused by the other stuff. Never mind, the patch looks fine (except all the confusion has just convinced me that I should do this in the mm layer so that it can never happen again, and so that drivers don't even have to care)," and replied to himself again half an hour later with, "done. Adding it in one place allowed me to remove it in 17 other places (and then not apply the videodev.c patch because now it is no longer needed)." He added, "One less cause of hard-to-find bugs."
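The idea behind the fix, that a mapping should hold its own reference to the file so that close() alone cannot release driver resources, can be shown with a toy reference-counting model. This is a deliberately simplified Python sketch and not kernel code. The classes DeviceFile and Mapping are hypothetical stand-ins for an open file and a vm_area_struct.

```python
class DeviceFile:
    """Toy open device file; the driver frees its buffer only on the final release."""

    def __init__(self):
        self.refs = 1  # the reference held by the open file descriptor
        self.buffer_freed = False

    def get(self):
        """Take an additional reference."""
        self.refs += 1

    def put(self):
        """Drop a reference; the driver's release hook runs when none remain."""
        self.refs -= 1
        if self.refs == 0:
            self.buffer_freed = True


class Mapping:
    """Toy memory mapping that pins the file it maps for as long as it exists."""

    def __init__(self, dev_file):
        self.file = dev_file
        dev_file.get()

    def unmap(self):
        """Tear down the mapping and drop its reference to the file."""
        self.file.put()
        self.file = None


if __name__ == "__main__":
    f = DeviceFile()
    m = Mapping(f)
    f.put()  # user calls close(); the mapping still pins the file
    print(f.buffer_freed)
    m.unmap()  # last reference gone; now the driver may free its buffer
    print(f.buffer_freed)
```

Doing the bookkeeping once in the mapping code, rather than in every driver, is the design choice Linus describes: drivers no longer have to remember it, so none can forget it.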

22. Gnome CDROM Bug Uncovered

6 Mar 1999 - 9 Mar 1999 (6 posts) Archive Link: "cdrom eject fails in 2.2.2-ac7"

People: Keith Bennett, Jens Axboe

Keith Bennett noticed that 'eject' no longer worked for him. Jens Axboe told him to upgrade to eject 2.0; he did so and started getting "device or resource busy" errors. Jens asked if Keith had a cd player in the background, and Keith said, "doh! okay, you win. it seems that gnome doesnt kill off its panel applets when you remove them from the panel. time to write a bug report to gnome (i'll just add it to the list)"


Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.