Kernel Traffic #46 For 13 Dec 1999

By Zack Brown

Mailing List Stats For This Week

We looked at 879 posts in 3501K.

There were 339 different contributors. 141 posted more than once. 152 posted last week too.

The top posters of the week were:

1. Google Bug Hunt

17 Nov 1999 - 30 Nov 1999 (10 posts) Archive Link: "Frequent oops in shrink_mmap"

People: David desJardins, Stephen C. Tweedie, Andrea Arcangeli, Rik van Riel

David desJardins of Google reported a problem on their cluster. He said:

Google runs on a cluster of 2000+ Linux/Intel machines. We have a problem on several of those machines, a symptom of which is kernel oops messages in try_to_free_buffers, called from shrink_mmap. (The machines which have these kernel oops messages tend to have poor performance in general, even at times when the oops aren't being generated.)

We are currently running 2.2.7 on these machines, but I see from searching linux-kernel traffic that the problem of kernel oops in shrink_mmap is a long-standing one which has been reported in many different kernel versions, up to at least 2.2.12, and has also been reported in Linux/Alpha. I've looked at all of the past archives, and I didn't see any definitive response to these problems.

The problem is probably related to heavy disk use and/or heavy network traffic and/or heavy parallel processing with threads. These are the circumstances under which we tend to see the problem, and also are circumstances that I see correlated with past reports of this problem.

I have 56 of these oops messages from 10 different machines, all at exactly the same location. I've attached a sample at the end of this note. Most of them look very similar (to each other, and to other oops messages previously posted to this list). But I would be happy to forward all of them to anyone who can use them. I could collect even more if that would be helpful.

It's possible that the problem is associated with or caused by a hardware problem. Out of thousands of machines, we could easily have 10 with some specific problem. But I don't really know what kind of a problem, or why it would manifest itself only in this particular way. Also, the fact that the same problem occurs in Linux/Alpha makes me suspect some sort of software flaw.

Rik van Riel started thinking about where the bug was most likely to be, but Stephen C. Tweedie pointed out that if the oopsing machines also had poor performance in general as David had said, the problem could very well be in the hardware. Analyzing David's code dump, he added, "the immediate problem is that there is a page in the page map which has a page_map->buffers pointer of 0x40000000. That's one bit away from a legal value of zero. That sort of single-bit error is usually a sign of hardware trouble. It's not guaranteed, but that's the best diagnosis just from looking at one dump." He added, "And yes, depending on your usage patterns it is *entirely* plausible for random one-bit memory corruptions to show up in the same place on a number of different machines, if that happens to be where we stress the kernel most. I've seen similar random memory problems on a different machine which repeatably oopsed somewhere else in the buffer cache, but it was definitely a memory fault and only the load pattern caused it to repeat in the same part of the code."

David replied, "I looked at 56 of these oops messages in try_to_free_buffers, from 10 machines. 50 messages (4 machines) have %eax=80000000, and 6 messages (6 machines) have %eax=40000000. Is this consistent with the single-bit memory error, or not?" Andrea Arcangeli and Stephen replied that it was, and Stephen added:

If there is a design flaw on your motherboards, for example, such that the timing of the outside signals on the bus is borderline, then this is exactly what you would see. In cases like these it is always hard to determine absolutely for sure whether it is hardware or software, but the fact that you see this on only a fixed subset of otherwise identical machines argues strongly for hardware.

There is also the fact that the corruptions are coming from a statically allocated area of memory which is simply never, ever used for kmalloc allocations, and the vast, vast, vast majority of software-triggered memory corruptions are due to accesses to memory which has been allocated dynamically and then freed early. That just doesn't fit the pattern (and I can see no way a corruption to the buffer ring links in a normal buffer head could get propagated back to the static page array without triggering an oops elsewhere).

My bet would still be on hardware.

The thread ended there.
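
The single-bit reasoning in this thread can be checked mechanically. The following is a minimal Python sketch, not anything posted to the list: it reports the bit positions in which an observed value differs from an expected one. The helper name differing_bits is hypothetical; the two values are the ones David reported, compared against the legal value of zero.

```python
def differing_bits(observed, expected=0):
    """Return the positions of the bits in which the two values differ."""
    diff = observed ^ expected
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


# The two values reported in the thread, compared with the legal value 0.
for value in (0x40000000, 0x80000000):
    bits = differing_bits(value)
    print(f"{value:#010x}: {len(bits)} bit(s) differ, at position(s) {bits}")
```

A result with exactly one position is what Stephen describes as a single-bit error. It is only consistent with a hardware fault and does not prove one.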

2. Permissions On Doc Files

23 Nov 1999 - 3 Dec 1999 (7 posts) Archive Link: "unreadable doc files in kernel tarball"

People: Peter Samuelson, Mike Coleman

Joey Hess noticed that the permissions of linux/Documentation/kernel-docs.txt and linux/Documentation/proc.txt in the source tree were not world readable. Mike Coleman suggested that this be changed, so all users of a system could read the kernel docs. But Peter Samuelson replied that this would have no effect on the user actually installing the sources; and other users would not be able to compile the kernel from another user's archive in any case. Mike replied that since he only let root write to /usr/src, he had to install the sources as root; at the same time this would prevent any of his other accounts from reading those files. Peter suggested that Mike use 'chmod' as part of his installation process, and the thread ended.

3. SMP Kernel On Single Processor Dell PowerEdge 1300

25 Nov 1999 - 30 Nov 1999 (10 posts) Archive Link: "SMP kernel with single processor ?"

Topics: Disks: SCSI, PCI, SMP

People: Victor Khimenko, Brad Larden, Alan Cox

Brad Larden had a Dell PowerEdge 1300 with a dual-processor motherboard, fitted with only a single processor and a dummy in the extra space. He asked if there were any problems using an SMP kernel on the machine. Victor Khimenko replied, "I'm not know about Dell PowerEdge 1300 but I have Dell PowerEdge 2300 with similar configuration and you MUST use SMP kernel there (even when there are in fact just one processor unit installed). Reason ? APIC ... All external PCI devices on Dell PowerEdge 2300 (I almost sure for PowerEdge 1300 this is true as well) are using APIC-extended interrupts (Linux marks them as IRQ 20, IRQ 21, etc). So with UP kernel you simple will be unable to use your hardware." But he added that in some rare circumstances, SMP kernels could cause a problem, and in any case would be slightly slower than a UP kernel. But he reiterated that with the hardware under discussion, there was simply no choice. As a final piece of advice, he offered, "Check your /proc/interrupts file. If there are APIC interrupts > 15IRQ (last "normal" interrupt) listed as used for something then you have the same situation as I am and SHOULD use SMP kernel."
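
Victor's check of /proc/interrupts can be scripted. The Python sketch below is an illustration only: the function name irqs_above_15 is hypothetical, and the sample text is made up to resemble the file's layout rather than taken from any machine in the thread.

```python
def irqs_above_15(interrupts_text):
    """Map each numeric IRQ above 15 to the rest of its line."""
    found = {}
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        # Skip the CPU header and non-numeric rows such as NMI or ERR.
        if not sep or not label.isdigit():
            continue
        irq = int(label)
        if irq > 15:
            found[irq] = rest.strip()
    return found


# Hypothetical sample, not taken from the thread.
SAMPLE = """\
           CPU0
  0:     123456    IO-APIC-edge  timer
  1:        789    IO-APIC-edge  keyboard
 20:       4321   IO-APIC-level  eth0
 21:       8765   IO-APIC-level  aic7xxx
NMI:          0
"""

for irq, description in sorted(irqs_above_15(SAMPLE).items()):
    print(irq, description)
```

On a real system the same function could be given the contents of /proc/interrupts, for example open("/proc/interrupts").read(). Any entries it returns correspond to the situation Victor describes.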

Alan Cox replied that as far as he knew, UP kernels should be able to use the hardware. He added that Intel designed the hardware to be able to boot DOS. But Victor replied that DOS, Windows 95 and Windows NT would hang on the Dell PowerEdge 2300 when trying to access the EtherExpress card. Booting would go fine, he said, as long as you did not try to use external devices.

Later, Brad reported, "I am now running a UP kernel on the Dell poweredge 1300. No problems, no crashes, it talks to the u-beaut UW-whatever SCSI, talks to the Intel network card and all."

4. Filesystem Corruption Hunt And Fix In Stable And Unstable Kernels

25 Nov 1999 - 1 Dec 1999 (9 posts) Archive Link: "[patch] FS (ext2) corruption generated by BLKFLSBUF/invalidate_buffers (2.2.x and 2.3.x)"

Topics: FS: ext2, Ioctls

People: Andrea Arcangeli, Jason T Collins, Alan Cox

Andrea Arcangeli posted a patch, also available at ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14pre7/buffer-races-2.2.14pre7.gz, and said:

Jason and Samuel told me they could reliable reproduce fs corruption by writing to a filesystem while the ioctl(BLKFLSBUF) is running in parallel on the blockdevice where the fs is mounted.

I understood what was going on: BLKFLSBUF is just an interface for invalidate_buffers and invalidate_buffers() trashes away _all_ the buffers beloging to such a blockdevice. _Dirty_ buffers included.

As hdparm is using such ioctl too so a simple:

cp /dev/zero /tmp
hdparm -t /dev/<blockdevice_where_tmp_is_placed>

can just corrupt the hd badly.

Actually the corruption is hided pretty well by both hdparm and the kernel doing a sync on the device before starting invalidate_buffers.

The only two cases where we want invalidate_buffers() to flush away _also_ dirty buffers are:

In all other cases where the kernel want to invalidate the buffers, it should first sync the device (no need to wait I/O completation, so an sync_buffers(dev, 0) is enough if no filesystem is mounted or sync_dev(dev) if the filesystem is mounted) and then calling an invalidate_buffers that won't trash dirty blocks. So if new dirty blocks are generated under invalidate_buffers (legitimate in cases like BLKFLSBUF), they won't be dropped.

In general we should _never_ drop dirty buffers. If we need to drop dirty buffers (excluding the ramdisk case) it means we have some kind of fs corruption going on.

This is my fix against 2.2.14pre7 (probably it won't apply to previous kernel for unrelated but trivially fixable rejects).

The fix also includes my longstanding race fixes in set_blocksize/invalidate_buffers and I also noticed that sync_dev is generating dirty data via the qutoa code _after_ the last sync_buffers (and so sync_dev right now can return with dirty data still present if quota is running).

Please Jason could you confirm this my patch fixes the problem for your reproducible fs corruption? I only given it a try on a floppy and worked fine here.
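
The rule Andrea states, that invalidation should leave dirty buffers alone, can be illustrated with a toy model. The Python sketch below is not kernel code and does not reflect the contents of the patch; the class ToyBufferCache, its methods and the drop_dirty flag are all hypothetical names chosen for the illustration. It shows a write that arrives between the sync and the invalidation being lost when dirty buffers are dropped, and surviving when they are skipped.

```python
class ToyBufferCache:
    """Toy model of a buffer cache; not the kernel's data structures."""

    def __init__(self):
        self.buffers = {}  # block number -> (data, dirty flag)
        self.disk = {}     # block number -> data that reached the disk

    def write(self, block, data):
        self.buffers[block] = (data, True)

    def sync(self):
        # Write every dirty buffer to disk and mark it clean.
        for block, (data, dirty) in list(self.buffers.items()):
            if dirty:
                self.disk[block] = data
                self.buffers[block] = (data, False)

    def invalidate(self, drop_dirty):
        # Drop cached buffers; keep dirty ones unless drop_dirty is set.
        self.buffers = {
            block: (data, dirty)
            for block, (data, dirty) in self.buffers.items()
            if dirty and not drop_dirty
        }

    def read(self, block):
        if block in self.buffers:
            return self.buffers[block][0]
        return self.disk.get(block)


def lost_write(drop_dirty):
    """Return True if a write racing with the invalidation is lost."""
    cache = ToyBufferCache()
    cache.write(1, "old")
    cache.sync()                  # sync first, before invalidating
    cache.write(1, "new")         # a write that races in after the sync
    cache.invalidate(drop_dirty)
    cache.sync()
    return cache.read(1) != "new"


print("dropping dirty buffers loses the write:", lost_write(drop_dirty=True))
print("skipping dirty buffers loses the write:", lost_write(drop_dirty=False))
```

The model has no concurrency at all. The racing write is simply placed between the two calls, which is enough to show why the sync before the invalidation hides the problem most of the time without removing it.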

Jason T Collins reported that Andrea's patch fixed some of the corruption, but not all of it. Instead of getting multifarious incomprehensible warnings, the corruption now seemed restricted to a single repeatable exploit. While running 2.2.14pre9 with Andrea's buffer-races-2.2.14pre7-2 patch applied, Jason started two separate processes, each of which would do a "create file", "delete file", and "BLKFLSBUF" on two files in the same directory. This would cause corruption after a few seconds, as manifested by a string of errors of the form, "EXT2-fs warning (device sd(8,1)): ext2_free_blocks: bit already cleared for block 257682". This would not trip the error flag for ext2, so he had to run fsck by hand each time to fix the disk.

Andrea went back and found an error in his previous patch. He posted an incremental correction, again with a pointer (ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14pre9/buffer-races-3.gz) to it on the web.

Jason replied with complete success, saying, "Yippee! That does it! I tried with 2 and 4 parallel processes each for a half hour. Everything ran fine under the kernel with this patch where it generated corruption in seconds before." Alan Cox replied to this, asking if Andrea wanted him to roll the patch into 2.2.14pre10; Andrea replied in the affirmative, and the thread was over.

5. IRQ Timeouts And VESA Framebuffer

26 Nov 1999 - 2 Dec 1999 (29 posts) Archive Link: "fbcon + scrolling = irq timeouts?"

Topics: Framebuffer

People: Pavel Machek, Alan Cox

Ian Ehrenwald noticed some Interrupt Request (IRQ) timeouts while playing MP3s and scrolling large text files in a VESA framebuffer console. He couldn't find any hardware conflicts on his system, and asked for advice. Alan Cox replied that VESA framebuffer used BIOS functions, which could lock interrupts for a long time. Pavel Machek corrected him, saying that "even if you don't use any bios functions (you do not normally use any vesa bios functions at runtime!) you spend just too much time in kernel space and mp3 player fails to meet that deadline." He half-humorously suggested a few schedule() calls at strategic places in fbcon. Alan approved of that approach, and an implementation discussion followed.

6. Dangerous Fixes To FAT And HPFS

30 Nov 1999 - 3 Dec 1999 (10 posts) Archive Link: "[CFT][PATCH] block_write_*_buffer rewrite"

Topics: FS: FAT

People: Alexander Viro, Ingo Molnar, Martin Dalecki, David S. Miller

Alexander Viro posted an extensive patch to fs/buffer.c, to fix problems with write-beyond-EOF on FAT and HPFS, and cleaned the code up as well. He asked for testers, and added, "WARNING: IT'S ON THE CRITICAL PATH AND IF IT'S BROKEN IT WILL EAT YOUR DATA. SILENTLY. EXTREME DANGER. USE ONLY ON SCRATCH BOXEN."

No one found any major flaws, but Martin Dalecki had some cosmetic suggestions, and Ingo Molnar pointed out that Alexander had removed the optimization that would overlap the issuing/finishing of read requests with copying memory, on 1k filesystems. On smaller boxes, he said, this would make a difference to overall system performance. He also pointed out a deadlock that David S. Miller had noticed in the code some time before: writing a file into a freshly mmap()-ed memory area, where the mmap-ed page is the same as the written page. This would instantly freeze the process. (There was a bit of confusion on the list over this point, because Ingo had originally said that the deadlock was caused by a read rather than a write. He quickly corrected himself, and the discussion continued.)

Ingo clarified, "the problem is generic_file_write() locking the page while generating a page fault (to the same page)." Alexander replied, "Oh, fsck... It's even more interesting for mutual deadlocks. And then there is a wonderful situation with writing into mmaped area with offset 1 byte... Hrrrrrmmm... I wonder how well other kernels handle that. The most obvious way would be to force page-in before grabbing the destination page and to get_page() on (one or two) pages containing the source. Messy."
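
The deadlock Ingo describes, and the "force page-in before grabbing the destination page" idea from Alexander's reply, can be modelled in a few lines. The Python sketch below is a toy under stated assumptions, not kernel code: ToyPage, fault_in and toy_write are hypothetical names, a threading.Lock stands in for the page lock, and a non-blocking acquire is used so that the sketch reports the deadlock instead of hanging. It ignores the possibility of the source page being evicted again between the page-in and the lock, which is what the get_page() in Alexander's reply addresses.

```python
import threading


class ToyPage:
    """Toy stand-in for a page-cache page; not kernel code."""

    def __init__(self):
        self.lock = threading.Lock()  # non-reentrant, like a page lock
        self.resident = False


def fault_in(page):
    """Simulate a page fault: reading the page in needs the page lock."""
    if page.resident:
        return True
    # A real fault would block here; the non-blocking attempt lets the
    # sketch return False where the real code would deadlock.
    if not page.lock.acquire(blocking=False):
        return False
    page.resident = True
    page.lock.release()
    return True


def toy_write(dest, source, prefault):
    """Copy from source into dest; return False if it would deadlock."""
    if prefault:
        fault_in(source)         # page the source in before locking dest
    with dest.lock:
        return fault_in(source)  # the copy touches the source page


page = ToyPage()
print("write without prefault succeeds:", toy_write(page, page, prefault=False))
page = ToyPage()
print("write with prefault succeeds:", toy_write(page, page, prefault=True))
```

When the source and destination are the same page and the page is not yet resident, the first call fails because the fault needs a lock the writer already holds. Paging the source in first means no fault is taken while the lock is held.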

Less than a day later, Alexander replied again to Ingo's same post, with a proposed fix. There was a bit of implementation discussion, but nothing conclusive.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.