Kernel Traffic #117 For 7 May 2001

By Zack Brown

linux-kernel FAQ (http://www.tux.org/lkml/) | subscribe to linux-kernel (http://www.tux.org/lkml/#s3-1) | linux-kernel Archives (http://www.uwsg.indiana.edu/hypermail/linux/kernel/index.html) | kernelnotes.org (http://www.kernelnotes.org/) | LxR Kernel Source Browser (http://lxr.linux.no/) | All Kernels (http://www.memalpha.cx/Linux/Kernel/) | Kernel Ports (http://perso.wanadoo.es/xose/linux/linux_ports.html) | Kernel Docs (http://jungla.dit.upm.es/~jmseyas/linux/kernel/hackers-docs.html) | Gary's Encyclopedia: Linux Kernel (http://members.aa.net/~swear/pedia/kernel.html) | #kernelnewbies (http://kernelnewbies.org/)

Mailing List Stats For This Week

We looked at 1022 posts in 4035K.

There were 386 different contributors. 185 posted more than once. 158 posted last week too.

The top posters of the week were:

1. Floating-Point-Corruption In 2.2

19 Apr 2001 - 30 Apr 2001 (37 posts) Archive Link: "BUG: Global FPU corruption in 2.2"

People: Victor Zandy, Richard B. Johnson, Alan Cox, Christian Ehrhardt

Victor Zandy reported, "We have found that one of our programs can cause system-wide corruption of the x86 FPU under 2.2.16 and 2.2.17. That is, after we run this program, the FPU gives bad results to all subsequent processes." He could confirm this on dozens of 550MHz Xeon systems, and posted some code to reproduce the effect. Later, he added, "We have now tested 2.4.2 and 2.2.19. 2.2.19 has the same problem. 2.4.2 does not seem to be affected."

Richard B. Johnson posted a simple program to reinitialize the FPU, and remarked, "If it "fixes" it, there is no problem with the FPU, but with the 'C' runtime library which doesn't initialize the FPU to a known state before it uses it. It is possible for the kernel to work around the 'C' library problem by clearing the FPU after every fork(). The last time I checked (years ago), 'finit' was executed during the fork. Maybe it isn't anymore because it takes many machine-cycles to complete." He suggested that if the program did not fix the problem, the hardware was probably at fault. Victor replied that no, reinitializing the FPU had no effect, but added, "If it were a hardware problem, I would expect the problem to occur under 2.4.2 as well as 2.2.*, and I would be surprised that we can consistently produce the behavior across our 64 node cluster." Richard replied, "Then, if the FPU is fine, you have just proven that the storage where the FPU context is saved, gets overwritten. Further, once the initial write occurs, all subsequent fnsave/frestore operations also encounter the same spurious write. --OR some continuously-running floating-point has sneaked into the kernel."

There was no reply to this, but at one point David Konerding asked if anyone had any comments on the original bug report, and Alan Cox replied, "Complete mystification." He went on, "The processor state for the FPU is per task private and each task initializes its own FPU state. In terms of FPU state itself I don't currently see what there is that can be left behind."

Later, Victor added:

Someone else here traced the process flags of a FP-intensive program on a machine before and after it is put in the faulty FPU state. He periodically sampled /proc/pid/stat while the program was running.

He found that PF_USEDFPU was always set before the machine was broken. Afterward, he found that it was set about 70% of the time.
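The sampling Victor describes can be sketched roughly as follows. This is only an illustration, not the program that was actually used: the helper names are hypothetical, and the PF_USEDFPU value of 0x00100000 is an assumption about the 2.2-era headers that should be checked against include/linux/sched.h for the kernel in question. The flags value is taken to be the ninth field of /proc/pid/stat.

```python
import time

# Assumed bit value for PF_USEDFPU; verify against the kernel's own headers.
PF_USEDFPU = 0x00100000


def stat_flags(stat_line: str) -> int:
    """Return the task flags field from one line of /proc/<pid>/stat."""
    # The command name is parenthesized and may itself contain spaces or ')',
    # so split after the last ')' before counting fields.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is the state (field 3), so the flags (field 9) are rest[6].
    return int(rest[6])


def used_fpu(stat_line: str) -> bool:
    """True if the assumed PF_USEDFPU bit is set in the flags field."""
    return bool(stat_flags(stat_line) & PF_USEDFPU)


def sample_fraction(pid: int, samples: int = 100, interval: float = 0.1) -> float:
    """Poll /proc/<pid>/stat and return the fraction of samples with the bit set."""
    hits = 0
    for _ in range(samples):
        with open(f"/proc/{pid}/stat") as f:
            hits += used_fpu(f.read())
        time.sleep(interval)
    return hits / samples
```

A result near 1.0 from sample_fraction() for an FP-intensive process would correspond to the "always set" observation, and a noticeably lower value to the broken state.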

Christian Ehrhardt replied:

If I'm not mistaken this actually can cause GLOBAL FPU corruption. Here's why:

Assume for a moment that we lose either the PF_USEDFPU flag of one process. This not only means that the current process won't have its state saved, it also means that the next process won't have the TS bit set. This in turn means that this new process won't get PF_USEDFPU set and suddenly we have a second process with a corrupted FPU state.

Victor: Could you try to reproduce the system wide corruption if you add an explicit call to stts(); at the very end of __switch_to? This should prevent the FPU corruption from spreading.

NOTE: This is just to prove my theory, it is not and isn't meant to be a fix for the actual problem.

Victor reported, "After adding this call, I cannot reproduce the global corruption. There is still occasional local corruption of individual pi processes while pt is running." There was no reply.
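Christian's reasoning can be followed in a toy model. The sketch below is not kernel code and makes simplifying assumptions: FPU contents are plain strings, the used_fpu field stands in for PF_USEDFPU, the ts field stands in for the TS bit, and the always_stts option mimics the experimental stts() call at the end of __switch_to. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    saved_fpu: str          # FPU context saved with the task
    used_fpu: bool = False  # stands in for PF_USEDFPU


class Cpu:
    def __init__(self, always_stts: bool = False):
        self.regs = "garbage"   # live FPU registers
        self.ts = True          # stands in for the TS bit
        self.always_stts = always_stts
        self.current = None

    def switch_to(self, nxt: Task) -> None:
        prev = self.current
        # Lazy save: only a task flagged as an FPU user has its state saved.
        if prev is not None and prev.used_fpu:
            prev.saved_fpu = self.regs
            prev.used_fpu = False
            self.ts = True
        if self.always_stts:    # the experimental stts() on every switch
            self.ts = True
        self.current = nxt

    def use_fpu(self) -> str:
        """Current task touches the FPU; return the register contents it sees."""
        if self.ts:             # trap: restore this task's own state
            self.ts = False
            self.regs = self.current.saved_fpu
            self.current.used_fpu = True
        return self.regs


def run(always_stts: bool) -> dict:
    cpu = Cpu(always_stts)
    a, b, c = Task("A", "A-saved"), Task("B", "B-saved"), Task("C", "C-saved")
    cpu.switch_to(a)
    cpu.use_fpu()
    cpu.regs = "A-live"         # A computes something new in the FPU
    a.used_fpu = False          # the hypothesized bug: A loses its flag
    seen = {}
    for task in (b, c, a):
        cpu.switch_to(task)
        seen[task.name] = cpu.use_fpu()
    return seen
```

In this model run(False) has B and C both seeing A's live registers, which is the spreading corruption, while run(True) gives B and C their own state and leaves only A with stale contents. That is at least consistent with Victor's report of local corruption remaining after the stts() change.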

2. Architecture-Specific Source Tree Restructuring

22 Apr 2001 - 27 Apr 2001 (4 posts) Archive Link: "Architecture-specific include files"

Topics: User-Mode Linux

People: Matthew Wilcox, Jeff Dike, Jes Sorensen, Pavel Machek

Matthew Wilcox proposed:

Something which came up in one of the hallway discussions at the kernelsummit was that a lot of the architecture maintainers would find it more convenient if the arch-specific header files were moved from include/asm-$ARCH to arch/$ARCH/include. Since we use a symlink _anyway_, no global changes to include statements are necessary, we'd merely need to change Makefile from

symlinks:
	rm -f include/asm
	( cd include ; ln -sf asm-$(ARCH) asm)

to

symlinks:
	rm -f include/asm
	( cd include ; ln -sf ../arch/$(ARCH)/include asm)

Would anyone have a problem with this change? It'll make for a hell of a big patch from Linus, but it really will simplify the lives of the architecture maintainers.
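The claim that include statements need no changes rests on the symlink resolving to the new location. A small sketch can check that in a throwaway directory. The function name, the example.h header, and the default architecture are all hypothetical, and nothing here touches a real kernel tree.

```python
import os
import tempfile


def asm_symlink_resolves(arch: str = "i386") -> bool:
    """Build a scratch tree and check that include/asm/<header> reaches arch/<arch>/include."""
    with tempfile.TemporaryDirectory() as root:
        arch_include = os.path.join(root, "arch", arch, "include")
        os.makedirs(arch_include)
        os.makedirs(os.path.join(root, "include"))

        # A stand-in header living in the proposed location.
        header = os.path.join(arch_include, "example.h")
        with open(header, "w") as f:
            f.write("/* hypothetical arch header */\n")

        # Equivalent of: ( cd include ; ln -sf ../arch/$(ARCH)/include asm )
        link = os.path.join(root, "include", "asm")
        os.symlink(os.path.join("..", "arch", arch, "include"), link)

        # A path written as include/asm/example.h should land on the same file.
        via_link = os.path.join(link, "example.h")
        return os.path.realpath(via_link) == os.path.realpath(header)
```

Because the link target is relative, it is resolved from inside include/, which is why the target starts with "../".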

Jeff Dike replied:

UML already has an arch/um/include for private headers that the rest of the kernel is not allowed to see.

It would mean moving it, which is not a big deal.

There was no reply to this, but Jes Sorensen also replied to Matthew's initial post, saying, "I don't see what it saves, except for the fact you just have to run diff -urN once instead of twice when you want to send Linus a large diff. Or am I missing something?" And Pavel Machek replied, "Saving one diff urN is nice, plus you can distribute your architecture as tar file more easily, plus it is easier to put just your arch in cvs. I like it." End Of Thread.

3. Fast User-Space Web Server

27 Apr 2001 - 30 Apr 2001 (18 posts) Archive Link: "X15 alpha release: as fast as TUX but in user space"

People: Ingo Molnar, Fabio Riccardi

Fabio Riccardi announced the first release of X15, a user space web server which he claimed was as fast as the kernel-based TUX. He gave a link to the tarball (http://www.chromium.com/X15-Alpha-1.tgz), and various people tried it out. It did indeed seem to be as fast as or faster than TUX, but Ingo Molnar pointed out that X15 was not entirely RFC 2616 (http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2616.html) compliant, since it cached the date fields. He quoted section 14.18 (http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2616.html#sec-14.18), which read, "The Date general-header field represents the date and time at which the message was originated, [...] Origin servers MUST include a Date header field in all responses." He added, "i considered the caching of the Date field for TUX too, and avoided it exactly due to this issue, to not violate this 'MUST' item in the RFC. It can be reasonably expected from a web server to have a 1-second accurate Date: field. The header-caching in X15 gives it an edge against TUX, obviously, but IMO it's a questionable practice. if caching of headers was allowed then we could do the obvious trick of sendfile()ing complete web replies (first header, then body)."

Fabio replied that he'd disable header caching and see how that affected performance. Ingo replied that it might not make much of a speed difference, but added, "it will make the results more comparable with TUX." Fabio confirmed that there was only a very small performance hit, and gave a link to the new version (http://www.chromium.com/X15-Alpha-2.tgz). Various folks were impressed, and the thread petered out.
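To make the "1-second accurate Date: field" point concrete, here is one way a server could avoid reformatting the date on every reply while still never sending a stale second. It is an illustration only, not a description of how TUX or X15 works, and the function name and module-level cache are assumptions made for this sketch.

```python
from email.utils import formatdate

# Hypothetical one-entry cache: the last whole second formatted and its header line.
_cached_second = None
_cached_line = ""


def date_header(now: float) -> str:
    """Return a Date header line for the given Unix time, reformatting at most once per second."""
    global _cached_second, _cached_line
    second = int(now)
    if second != _cached_second:
        _cached_second = second
        # usegmt=True yields the RFC 1123 form ending in "GMT".
        _cached_line = "Date: " + formatdate(second, usegmt=True)
    return _cached_line
```

A caller would pass time.time() for each response. Replies sent within the same second share one formatted string, and the header changes as soon as the clock ticks over.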

4. Sound Corruption Under 2.4.4

28 Apr 2001 - 30 Apr 2001 (16 posts) Archive Link: "2.4.4 Sound corruption"

Topics: Digital Video Broadcasting, Disks: SCSI, FS: devfs, FS: ext2, SMP, Sound, USB

People: Lee Mitchell, Pierre Rousselet, Steven Walter, Mike A. Harris

Lee Mitchell reported, "Playing mp3's under 2.4.4 (SMP) results in bursts of noise overlayed on top of actual music being played." 2.4.3 had no problem. He posted his system information:

Motherboard Gigabyte GA-6BXD
CPU(s) 2 x 400 MHz PII
RAM 128MB
Soundcard Creative AWE64-Gold
Network Card 3Com 3c905-B
SCSI Card Adaptec 2940
Graphics Card Matrox G200 Millenium AGP
Video Capture Hauppauge WinTV Go (bttv)
USB Devices Phillips PCA646WC Webcam
Kernel 2.4.4 (SMP)
Debian 2.2
gcc version 2.95.2 20000220 (Debian GNU/Linux)

Mike A. Harris confirmed seeing the same problem when running xmms on his UP 2.4.2-2 Red Hat kernel, as well as on stock 2.4.4 compiled as either UP or SMP. However, the problem would occur only after half an hour to an hour, and he couldn't reproduce it on demand. It happened on both his 300MHz K6-III and his dual 1GHz Xeon Compaq Proliant ML530. Steven Walter also confirmed seeing very similar behavior, but not when writing directly to /dev/dsp; in other words, not when using xmms. For him, the esd program triggered the problem. He described his UP system:

PCChips M599LMR
1 x AMD-K6/2 500MHz
128MB RAM
C-Media
Kernel 2.4.4
Debian 2.2
gcc version 2.95.2 20000220 (Debian GNU/Linux)

Gregoire Favre also confirmed seeing a similar problem, though only with the output of his DVB-s card, and not with the esd program. He described his system:

UP
PIII
Asus p2b-ls
gcc version 2.96 20000731 (Linux-Mandrake 8.0 2.96-0.49mdk)
Mandrake 8.0
reiserfs
ext2(boot)
no patch on the kernel...

Lee Mitchell also confirmed the problem with esd, but didn't give any system details. Steven asked for some people not experiencing the problem to please step forward. Pierre Rousselet replied, "esd works for me with any 2.4.x including 2.4.4 Pentium III, BE6, ES1370, devfs, Xfree-4.0.3/GNOME esound-0.2.22. Timidity is fine as well." Steven noticed that Pierre's version of esd was newer than his, and said he'd upgrade and test again. There was no reply.

5. No ISO9660 Filesystem Maintainer

30 Apr 2001 (4 posts) Archive Link: "iso9660 maintainer?"

People: H. Peter Anvin, Alexander Viro, Andries Brouwer

H. Peter Anvin asked who was maintaining the ISO9660 filesystem, and Alexander Viro replied that no one was. He asked if H. Peter felt like volunteering, and H. Peter replied, "I was hoping to avoid it. I don't really have the cycles. However, I might be doing some enhancement work." Andries Brouwer also said that he wasn't the maintainer but had done some work in that area recently, and would be happy to look at any development problems that came up.

6. CANBus Driver

30 Apr 2001 - 1 May 2001 (3 posts) Archive Link: "CANBus driver."

People: Anders Peter Fugmann, David Woodhouse

Anders Peter Fugmann and some of his classmates had decided to write a driver for a CANbus ISA card (AROS: A-858D PCCAN -x ver. 1.12). He asked if any work had been done on such a thing already, and if a driver would be wanted. Mark Clayton replied privately that he was very interested in CAN support, and asked for more information on the card in question. Anders replied with a link to the Kvaser home page (http://www.kvaser.se), and added, "We will contact Kvaser to get the technical Specification. I cannot guarantee that we will be allowed to publish it, but if we are and you are interested, i can send you a link when and if we get it." Elsewhere, David Woodhouse also replied to Anders' first post, saying, "See the Linux Lab Project at http://www.llp.fu-berlin.de/. ISTR there were CAN drivers there at one point." End Of Thread (tm).

7. Tulip Driver Broken Or Fixed In 2.4.4

1 May 2001 (4 posts) Archive Link: "tulip driver broken in 2.4.4?"

Topics: Networking

People: Ronny Haryanto, Jeff Garzik

Ronny Haryanto tried 2.4.4 and found that the tulip driver would die on his LinkSys LNE100TX v4.1 card after 5 minutes. He could get it working again with 'ifdown' and 'ifup', but 5 minutes later it would die again. He reported no such problem on 2.2.18. Jeff Garzik asked if 2.4.3 worked, and Ronny replied that yes, it did. He added, "Too bad I can't use 2.4.3; I need 2.4.4 due to the VIA chipset bug." At this point Jacob Luna Lundberg reported that he'd been seeing a similar problem on his identical card, but only on kernels prior to 2.4.4; in fact, 2.4.4 was the first kernel not to break in that way, he said. The thread stopped there.

8. Major Version Numbers

1 May 2001 (3 posts) Archive Link: "Meaning of major kernel version number"

Topics: SMP

People: Erik Hensema, Oliver Neukum, George Anzinger, Linus Torvalds

Erik Hensema asked about the meaning of the Linux kernel's major version number. He wanted to know whether the next stable kernel would be 2.6 or 3.0, and why. He added, "I'm asking this question because I think there isn't going to be a kernel which is as different from the previous one as 2.0 compared to 1.2. As a little reminder: 2.0 brought us SMP, modules, multi-platform support (did 1.2 support Alpha? I don't remember), quota support, MD support, loop device, to name a few." Oliver Neukum replied that the major version number was entirely up to Linus Torvalds, or as Oliver put it, "Our great fearless leader will talk with the penguin beyond the sky." George Anzinger suggested that if user code would have to be relinked to work with the kernel, it was time for a new version number.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.