Kernel Traffic #12 For 1 Apr 1999

By Zack Brown

Introduction

A bit of a delay again this week... our primary wetware neural processing unit and all its peripherals were infected with a virus that seems to be sweeping the nation. Dubbed "flu" in the mainstream press, the attacking program seems immune to even the most powerful anti-virus technology. Its effects appear to include memory corruption and a general shakiness of the entire system, leading to frequent crashes.

Mailing List Stats For This Week

We looked at 1043 posts in 3795K.

There were 440 different contributors. 193 posted more than once. 143 posted last week too.

The top posters of the week were:

1. ext2 Bug Hunt

19 Mar 1999 - 22 Mar 1999 (13 posts) Archive Link: "[patch] fix for buffer hash leakage"

Topics: Debugging, FS: ext2

People: Andrea Arcangeli, Linus Torvalds, Stephen C. Tweedie, Chuck Lever

Andrea Arcangeli found a bug in bforget() (http://lxr.linux.no/source/fs/buffer.c?v=2.2.2#L856) . Drunk and exhausted at 4AM his time, he posted a patch, saying to Chuck Lever, "bforget gets called by ext2 at truncate time. So right now when you truc a file with tons of clean buffers that are mapping the file blocks, you'll make such buffers unfindable anymore. So most of the buffers were made unuseful for caching purposes and you had always to wait to discard them (try_to_free_buffer()) and read again from disk. This is the reason of your stall and performance drop across bench-passes."

Chuck confirmed Andrea's interpretation, and said he was testing a patch of his own. Andrea took a look and replied, "Agreed, my detection of the bug was ok, but _your_ one is the _right_ fix." He posted his own version, which Stephen C. Tweedie liked. Linus Torvalds said:

The patch looks fine, but I wonder..

I don't think we should move a buffer to the unused list without making damn sure that it doesn't have any IO active on it.

I'll apply it anyway, because it _is_ obviously cleaner (the old code was meant to handle the case where the buffer might still be shared with some MM mapping - something that can't happen any more), and I /think/ it's actually harmless even if we have some old active write to the buffer, but it still makes me worry.

But he replied to himself half an hour later, with:

Having thought about it for five minutes I convinced myself that the patch as Andrea had it was potentially horribly deadly, and could result in total disk corruption. Bad.

Just to give you an idea of why the old one was bad:

It's _extremely_ unlikely, because we put the buffer at the end of the free list and it needs just about pessimal timings to happen anyway, but the unlikely bugs are the worst ones to fix later.

However, it is certainly true that the old bforget() was a piece of crap too. Here's my current one that looks "obviously correct" - can people who saw the performance problems with the original version please test this one out?

Can anybody see any downsides with this patch? Note that the likelihood of the if-statement triggering is pretty low, so I don't think performance should be any different from Andreas patch, but the above looks safe and "feels" right to me. I certainly know that I would have been extremely nervous to release a real kernel with the previous patch posted.

He posted a patch (http://lxr.linux.no/source/fs/buffer.c?v=2.2.4#L850) , which eventually made it into 2.2.4.
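
The patch itself is not reproduced in this issue. As a rough user-space model of the kind of guard being discussed, the sketch below recycles a buffer only when the caller holds the sole reference and no I/O is pending, and otherwise falls back to an ordinary release. It assumes the rarely-triggered if-statement checks exactly those two conditions; the `ToyBuffer` class and its fields are hypothetical and are not the kernel's buffer head.

```python
from dataclasses import dataclass


@dataclass
class ToyBuffer:
    """Toy stand-in for a cached disk buffer; all field names are illustrative."""
    count: int = 1              # number of users currently holding the buffer
    locked: bool = False        # True while I/O is in flight
    hashed: bool = True         # True while the buffer can be found by lookup
    on_free_list: bool = False  # True once the buffer may be reused


def release(buf):
    """Ordinary release: drop one reference and leave the buffer cached."""
    if buf.count > 0:
        buf.count -= 1


def forget(buf):
    """Discard a buffer only when nobody else can still be using it.

    Returns True if the buffer was recycled, False if we only released it.
    """
    if buf.count != 1 or buf.locked:
        # Another holder exists or I/O is active: do not recycle.
        release(buf)
        return False
    buf.count = 0
    buf.hashed = False        # no longer findable, so no stale contents are served
    buf.on_free_list = True   # available for reuse
    return True
```

In this model a locked or shared buffer stays in the cache and is merely released, which is the conservative behaviour when I/O may still be active.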

Andrea liked Linus' approach, and Chuck confirmed it solved the problem as well as his own patch, but Stephen pointed out that Linus' exploit against Andrea's patch was dependent upon bforget() not being called properly. He said, "I did have a check before giving the patch the OK to make sure that it was being called safely everywhere, and ext2/truncate.c does in fact absolutely guarantee that the buffer has been waited on and count==1 before calling bforget," but he acknowledged, "however, as it stands, the routine is definitely a loaded gun just waiting for an opportunity to do serious damage."

Responding to the fact that ext2/truncate.c guarantees that the buffer's been waited on, Linus replied,

Looking at the ext2 code, that's obviously the case.

However, looking some more, it's also obviously the case that the ext2 code is being really really bad at doing this, and has a ton of assumptions about not only how bforget() used to work, but also about how find_buffer() works.

In particular, as far as I can tell, ext2fs will _busywait_ for the buffer becoming unlocked if it was locked. Uhhuh. Shades of Windows NT.

And I can't even blame anybody: back when that code was written, brelse() used to be rather inefficient and do a "wait_on_buffer()" that caused us to re-schedule if the buffer was locked. Since then, the buffer handling has been cleaned up quite a bit, and now the loop is just a busy loop in kernel space.

But it's certainly an example of a filesystem that is being rather incestuous with the VFS layer, abusing undocumented knowledge of just exactly how brelse() and bforget() used to work (and in this case about exactly what the limitations were).

Anyway, it's probably not a serious problem, because even when it does busy-wait, it eventually does the right thing so it's not actually a correctness bug, and the busy-wait should be so rare as to not really be a performance issue either.

I should probably just remove the re-try logic, and change the busy loop to do a "wait_on_bh()" instead (because we do need to wait on the buffer, otherwise we'd mark the thing free without having finished all the IO on the disk, which has similar race conditions - except now the races are on the disk platter instead of in kernel data structures).
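
The difference between busy-waiting and sleeping on a buffer can be illustrated outside the kernel. The following is a minimal sketch using Python threads, not kernel code: `LockableBuffer` and `wait_until_unlocked` are hypothetical names, and a condition variable stands in for the kernel's wait queue.

```python
import threading


class LockableBuffer:
    """Toy buffer whose 'locked' flag models I/O in progress."""

    def __init__(self):
        self._cond = threading.Condition()
        self.locked = False

    def lock(self):
        with self._cond:
            self.locked = True

    def unlock(self):
        with self._cond:
            self.locked = False
            self._cond.notify_all()  # wake every thread sleeping on this buffer

    def wait_until_unlocked(self):
        """Sleep until the buffer is unlocked instead of polling it in a loop."""
        with self._cond:
            while self.locked:
                self._cond.wait()  # releases the condition's lock while asleep
```

A caller that needs the I/O to finish before freeing the block would call `wait_until_unlocked()` once and then proceed, rather than retrying in a loop that burns CPU time.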

2. FAT Fixes

20 Mar 1999 - 23 Mar 1999 (5 posts) Archive Link: "Odd code in iput() (since 2.1.60). What for?"

Topics: FS: FAT

People: Stephen C. Tweedie, Alexander Viro

A confused Alexander Viro couldn't figure out why 2.1.60 had introduced a check of I_DIRTY in iput() in inode.c; it didn't seem to do anything, and he wanted to know what was what before he made a bunch of FAT filesystem fixes that would require some slight modification of that file.

Stephen C. Tweedie replied that the code in question maintained a least-recently-used ordering for recently released inodes. He added, "however, right now there's nothing I can see in inode.c which actually relies on that ordering: whenever we do a free_inodes(), we dump all the inodes that we can. In the future, having a sane LRU ordering on the in-use list may be valuable."

Alexander, relieved, said:

Oh, OK. That makes sense. I missed the reordering side-effect. Thanks ;-)

In FAT fixes (I'll post them for testing tomorrow or on Wednesday) I actually needed a clear way to keep references to inode that wouldn't affect i_count and related behaviour. I did it - it looks so:

  1. Whenever we are scheduling the inode for freeing (free_inodes, invalidate_inodes and two places in iput) we set I_FREEING in ->i_state.
  2. I've added struct inode *igrab(struct inode*) that grabs the spinlock, checks for I_FREEING, if it is set - releases the spinlock and returns NULL, otherwise increases i_count, releases the spinlock, does wait_on_inode() and returns the inode.
  3. If fs wants to manage private references to struct inode it can use igrab()/iput() to get/release a normal reference. It has to supply ->clear_inode() method and forget all private references to inode when it is called. Some simple internal locking may be needed within the filesystem, indeed. igrab() returning NULL should be processed as the case when the reference was already invalidated by foo_clear_inode() call.

Changes to fs/inode.c are minimal (+10 lines) and do not affect existing filesystems. If fs wants to keep an internal hash indexed *not* by i_ino it can do it easily (e.g. FAT - we can keep i_ino constant over the whole life of in-core inode and forget about bothering with 'busy' directory slots. There go 20-odd races in FAT-derived filesystem + tons of ugly code working around other 20-odd races). In case of FAT all code needed to implement hash indexed by directory entry position + glue for clear interaction with icache took about 40 lines. *And* allowed to remove much more cruft.
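
The three steps Alexander lists can be modelled in user space. This is a sketch under assumptions, not his patch: the flag value, the `Inode` and `PrivateHash` classes, and the `mark_freeing` helper are hypothetical, and a Python lock stands in for the inode spinlock.

```python
import threading

I_FREEING = 0x1  # illustrative flag value, not taken from kernel headers

inode_lock = threading.Lock()  # stands in for the inode spinlock


class Inode:
    def __init__(self, ino):
        self.ino = ino
        self.count = 0  # normal reference count
        self.state = 0  # state flags


def mark_freeing(inode):
    """Step 1: flag the inode as scheduled for freeing."""
    with inode_lock:
        inode.state |= I_FREEING


def igrab(inode):
    """Step 2: take a normal reference, or return None if the inode is going away."""
    with inode_lock:
        if inode.state & I_FREEING:
            return None
        inode.count += 1
    # The routine described above also waits on the inode here;
    # this model has nothing to wait for.
    return inode


class PrivateHash:
    """Step 3: filesystem-private index keyed by directory-entry position."""

    def __init__(self):
        self._lock = threading.Lock()  # the "simple internal locking"
        self._by_pos = {}

    def insert(self, pos, inode):
        with self._lock:
            self._by_pos[pos] = inode

    def lookup(self, pos):
        with self._lock:
            inode = self._by_pos.get(pos)
        if inode is None:
            return None
        # None from igrab() is treated as "reference already invalidated".
        return igrab(inode)

    def clear_inode(self, inode):
        """Forget every private reference to the inode being freed."""
        with self._lock:
            stale = [p for p, i in self._by_pos.items() if i is inode]
            for pos in stale:
                del self._by_pos[pos]
```

The private references held in the hash never touch `count`; only a successful `igrab()` does, so the inode can still be freed while it sits in the filesystem's own index.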

3. Struggles With WinView

21 Mar 1999 - 22 Mar 1999 (3 posts) Archive Link: "Getting sound out of a Leadtek WinView 601"

People: Ben Pfaff

Jon Tombs had been pounding at his WinView card, trying to figure out its API. He had gotten to the point where sound was working, but volume control seemed to have no rhyme or reason to it. Ben Pfaff pointed out that modifying the BOCHS machine emulator (http://www.bochs.com/) to log all I/O to and from the card registers, and then running Windows 98 under the modified version, was a great way to reverse engineer those sorts of things. He didn't post any BOCHS patches though...
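
Since no patches were posted, the following is only a sketch of the general idea, written as a standalone Python model rather than as emulator code: a wrapper sits between the guest and a device and records every register access in a chosen port range. `FakeCard`, `PortLogger`, and the port numbers used in the example are all hypothetical.

```python
class FakeCard:
    """Hypothetical device with a handful of 8-bit registers."""

    def __init__(self):
        self.regs = {}

    def outb(self, port, value):
        self.regs[port] = value & 0xFF

    def inb(self, port):
        return self.regs.get(port, 0xFF)


class PortLogger:
    """Forward port I/O to a device, logging accesses inside one port range."""

    def __init__(self, device, base, length):
        self.device = device
        self.ports = range(base, base + length)
        self.log = []  # list of (direction, port, value) tuples

    def outb(self, port, value):
        if port in self.ports:
            self.log.append(("out", port, value & 0xFF))
        self.device.outb(port, value)

    def inb(self, port):
        value = self.device.inb(port)
        if port in self.ports:
            self.log.append(("in", port, value))
        return value
```

Replaying the recorded sequence against the real hardware, or simply reading it alongside the actions taken in the guest, is what makes the register protocol visible.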

4. Hotswapping CPUs And RAM Chips

22 Mar 1999 - 23 Mar 1999 (13 posts) Archive Link: "CPU Management for Linux?"

Topics: Hot-Plugging

People: Alex Buell

Martin Neumann had the bizarre idea of adding and removing CPUs during a system's normal operations -- only to discover that such insanity is already in the TODO list for 2.3, as is hotplugging RAM and other stuff! As Alex Buell put it, we may soon see something like:

% mount -t cpu /dev/cpu0
% umount -t cpu /dev/cpu1

Linux: Already on the Moon, and heading for deep space. Ain't it grand?

5. NFS Development Process

22 Mar 1999 (2 posts) Archive Link: "[patch] VFS inode.i_generation patch"

Topics: FS: NFS

People: Steven N. Hirsch, Trond Myklebust

Steven N. Hirsch asked several developers, "Would it be possible for you guys to coordinate changes and patches with Alan? I'm certainly willing to pound on the NFS client/server system (Lord-knows I'm an expert at evincing misbehavior), but it's starting to get _real_ confusing with all these patches flying around. Linus has implied that he's not super interested in applying anything that isn't relatively cohesive and well-explained, and Alan has been quite diligent about merging and pre-qualifying. So, we have the framework of a reasonable process."

Trond Myklebust (one of the folks named by Steven) found a fly in that ointment, saying:

There is a small problem with passing everything through Alan: as you know Linus' and Alan's trees are very different with respect to the NFS code because Alan accepted to test out the 8k patches. This means that

This is why I generally try to pass patches that for Alan directly to him, and only put out fixes against Linus' tree on linux-kernel. Since Alan has been very quick to put out new releases, I think that has been an acceptable practice.

Note: b) is in part due to the extensive NFS client structure changes, and in part due to a few trivial fixes to the sunrpc client. The latter are what affect fs/lockd and fs/nfsd, and are probably the main cause of problems with the knfsd patches (apart from HJ's fix to fs/nfs/dir.c:nfs_rename which is unnecessary on Alan's tree). Most trouble could therefore perhaps be avoided if we start the merge of the sunrpc fixes in question to Linus' tree.

6. Hard Links Without Write Permissions

23 Mar 1999 - 24 Mar 1999 (12 posts) Archive Link: "[RFC] Rights for hardlinks"

People: Jan Kara, Alexander Viro

Jan Kara felt that hard links should only be allowed on files that a user has write access to, so they can be deleted and not use up quotas. Jan consulted the Single UNIX Specification (http://www.opengroup.org/onlinepubs/7908799/index.html) and found that such a restriction was allowed. Alexander Viro came down hard on the idea, saying it would be more trouble than it was worth. He also pointed out that it was a regular topic of flame wars on comp.unix.*.
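
As a user-space illustration of the rule Jan proposed (not the kernel change under discussion), a wrapper could refuse to create a link unless the caller has write access to the target. The function name `link_if_writable` is hypothetical, and the check is inherently racy because the permissions can change between the test and the link.

```python
import os


def link_if_writable(src, dst):
    """Create a hard link to src only when the caller could write to src."""
    if not os.access(src, os.W_OK):
        raise PermissionError(f"no write access to {src}; refusing to link")
    os.link(src, dst)
```

Note that a privileged user normally passes the write-access test regardless of the file mode, so the restriction only constrains ordinary users.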

7. Diamond Supra 56 Patch For 2.2.4

23 Mar 1999 (5 posts) Archive Link: "modem not hanging up since 2.2.2"

Topics: Modems

People: Alan Cox, Richard B. Johnson, Linus Torvalds

brent verner reported that his Diamond Supra 56 modem had stopped hanging up under 2.2.2, 2.2.3, and 2.2.3ac-2. Richard B. Johnson posted an old patch of his that solved the problem, and Alan Cox said he'd sent a variant of the patch to Linus Torvalds for 2.2.4; Richard was cool with this, and the thread was over.

8. Quantum HD 'Optimization' Corrupts Data

23 Mar 1999 - 27 Mar 1999 (9 posts) Archive Link: "[OFFTOPIC] optimized disks drives from quantum"

Topics: Disks

People: Paul Barton-Davis, Oscar Levi, Matthew Jacob

Real-Time

Paul Barton-Davis sprung this shocker on the list:

Has anyone else seen Quantum's comments on their "a/v" drives ?

"Disclaimer: The Quantum AV drives have been optimized for use in digital video and audio environments. These optimizations make Quantum's A/V drives unsuitable for use in applications where data integrity is an absolute requirement. For applications where Quantum drives will not be used exclusively for digital AV, please consider Quantum's non-optimized line of hard disk drives".

Gives a new meaning to the word "optimized", eh ? "We made 'em really fast, but oh, sometime they lose bits".

Matthew Jacob replied that for video, a lost frame here and there was irrelevant, but Oscar Levi countered, "Oh my no. I work in this industry. Much of the work is using uncompressed source frames. A lost frame can be a serious problem when editing and when converting formats. I had thought that this a/v thing disappeared about six years ago when the internal drive controller performance increased sufficiently to permit the device to recal while transferring at the rated throughput. I'd speculate that these drives are meant for the real-time capture/DVD conversion market where the drive is only a buffer. Once the video is compressed, I'd hesitate to keep it on one of these 'a/v' drives."

9. 2.2.4 Announcement; Linus Vacation; Start Of Kernel Newsflash Web Page

23 Mar 1999 - 24 Mar 1999 (26 posts) Archive Link: "Linux-2.2.4.."

Topics: Kernel Release Announcement

People: Jiann-Ming Su, Rik van Riel, Richard Gooch, Linus Torvalds

Linus Torvalds announced 2.2.4 on March 23rd. He also said he'd be going on a two-week vacation, and that folks should test out the kernel before he left. Some folks complained about slow mirrors, and Rik van Riel said he always kept the latest kernel (and only the latest) on his site (ftp://ftp.nl.linux.org) .

Jiann-Ming Su had a compile problem, and Richard Gooch said:

Sigh. We keep getting these kinds of questions long after the fix has been posted N times. I've finally pulled my finger out and gotten around to doing something I've been planning:

There is a new page: http://www.atnf.csiro.au/~rgooch/linux/docs/kernel-newsflash.html

which contains urgent fixes for the latest official kernel. There is also a link to this page from the kernel-list FAQ.

Please, people, if a new kernel falls over in a heap, read this page first. Help conserve electrons.

Linus: feel free to refer to this page in subsequent announcements. It may serve as a pre-emptive reminder. Well, I can hope, anyway...

10. Fix For An Obscure DoS In 2.2.4

24 Mar 1999 - 27 Mar 1999 (16 posts) Archive Link: "Linux-2.2.4 testpatch.."

Topics: FS, Security, Virtual Memory

People: Chuck Lever, Linus Torvalds, Theodore Y. Ts'o, Andrea Arcangeli

Linus Torvalds posted a one-line patch for a bug that he figured could only bite someone under "pathological" usage conditions, but he wanted to fix it anyway. The bug could result in memory being unavailable to the machine, and in extreme cases, in a locked machine. Andrea Arcangeli posted his own patch, which he felt had a better approach.

Meanwhile Chuck Lever had this response to Linus' patch:

when i first considered this, i agreed with your reasoning, and thought that it would be a good change. however, after trying it under load, i discovered that leaving b_count as 1 for free buffers actually *helps* performance, and doesn't appear to cause the memory shortage problems you feared. i'm genuinely curious to know, btw, what pathological conditions do you think might cause a catastrophic memory shortage?

setting b_count to 0 for free buffers allows try_to_free_buffers() to free more pages-- this is what you wanted it to do. but it's actually bad behavior in terms of performance, since the pages it frees are quite arbitrary in relation to which buffers are most used. in other words, setting b_count to 0 when freeing buffers exacerbates the fact that the Linux buffer cache has no buffer reclamation policy right now. leaving b_count set to 1 reduces the likelihood that a page will be stolen, thus preserving the contents of the buffer cache.

under load, vmstat shows twice as many "bi" when b_count is set to 0 for free buffers; application throughput drops measurably more as offered load increases than when it is running on a kernel that leaves b_count alone. one can also see the buffer cache size fluctuating significantly when the file working set and the page working set don't all fit in memory. this, i argue, is pathological to performance, since it means pages are flowing in and out of the buffer cache quickly, and are not staying in long enough to be of use.

in the big picture, shrinking either the page cache or the buffer cache will result in lower hit rates and worse (i.e. closer to disk speed) performance. you really want to be careful to throw out the oldest or least used page/buffer, because to do otherwise wastes disk bandwidth, and that hurts *both* file system *and* VM performance.

setting b_count to zero for free buffers might be the right thing to do, but it makes more urgent the need to fix the reclamation problem. successfully stealing pages from the buffer cache right now is very hard on file system performance. perhaps adding some logic to regularly supply the free list with the oldest buffers, and allowing try_to_free_buffers() only to free pages containing buffers on the free list, might be a good solution. perhaps calling try_to_free_buffers() or wakeup_bdflush() should automatically reclaim old buffers, thereby slowing down cache growth and reducing the need to steal pages from it. or what would happen if the buffer cache, and not shrink_mmap(), could choose which page gets freed/stolen?

the best solution is to figure out how to allow page stealing while not disturbing the LRU queues in the buffer cache.

Linus replied, "Performance optimization is 15% brains, 85% black magic. I could easily imagine that in many conditions, the extra locked-down free buffers improve performance by making a pool of quick-allocation free buffers available, even at the expense of other things."

Regarding the pathological usage that might trigger the bug, he went on:

I said they were unlikely..

Basically, the only real case I can imagine where this actually could result in serious problems is:

Regarding Chuck's best solution, Linus answered, "Indeed. Leaving the b_count at an elevated number may give you some of that advantage, but it's definitely not something I want to count on for good behaviour.."

Theodore Y. Ts'o replied, "I think the point Chuck was trying to make was not that preserving buffers was good because of making a pool of quick-allocation free buffers, but because doing so preserved the *contents* of those buffers ---- this is because try_to_free_buffers() doesn't take into account whether the buffers on any particular pages are often referenced. So if the buffers contain the disk blocks corresponding to some commonly used executable that just doesn't happen to be in use at the moment, try_to_free_buffers will evict the page, thus forcing the disk blocks to be read in from disk the next time they are needed." He added, "While I agree with you that leaving b_count as 1 is a really horrible way of fixing this problem, the real solution which is quite urgently needed is to put some more smarts into try_to_free_buffers...."
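
The reclamation policy Chuck and Ted are asking for, evicting the least recently used buffer rather than an arbitrary one, can be shown with a toy cache. This is a minimal sketch and not a model of the 2.2 buffer cache: `LRUBufferCache` is a hypothetical class, block contents are placeholder strings, and `disk_reads` simply counts misses.

```python
from collections import OrderedDict


class LRUBufferCache:
    """Toy block cache that always evicts the least recently used buffer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffers = OrderedDict()  # block number -> data, oldest first
        self.disk_reads = 0           # misses that had to go to "disk"

    def read(self, block):
        if block in self.buffers:
            self.buffers.move_to_end(block)  # mark as most recently used
            return self.buffers[block]
        self.disk_reads += 1
        data = f"block-{block}"              # placeholder for a real disk read
        self.buffers[block] = data
        if len(self.buffers) > self.capacity:
            self.buffers.popitem(last=False)  # drop the least recently used entry
        return data
```

With this ordering, a block that is read often stays cached even while other blocks pass through, which is the property an arbitrary choice of victim does not provide.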

11. Routing Problem For linux-kernel

26 Mar 1999 - 29 Mar 1999 (3 posts) Archive Link: "Is only me ?"

Topics: Mailing List Administration

People: Matti Aarnio

Riccardo Facchetti noticed an odd silence on linux-kernel, and asked about it. Matti Aarnio replied:

There is a strange routing problem in NORDUnet network, which prevents nic.funet.fi from reaching a part of the world -- I take that to mean that the same part of the world won't reach nic.funet.fi either ...

The list delivery will resume once the routing bug is fixed, as a matter of hours, I hope.

Mind you, I realized what was going on only a couple hours ago, and the situation has been on for 36 hours. (Email queue timeouts threaten at 72 hours, I hope it won't get to that..)

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.