Kernel Traffic #47 For 20�Dec�1999

By Zack Brown

Table Of Contents

Introduction

I've been working on a sister-page for Kernel Traffic for awhile, called the Kernel Journal (http://www.linuxcare.com/developers/kernel) , and now it's pretty much ready. It keeps tracks of kernel-related project announcements on the linux-kernel mailing list.

Mailing List Stats For This Week

We looked at 1388 posts in 5397K.

There were 415 different contributors. 193 posted more than once. 116 posted last week too.

The top posters of the week were:

1. spin_unlock() Optimization On Intel

20�Nov�1999�-�7�Dec�1999 (143 posts) Archive Link: "spin_unlock optimization(i386)"

Topics: BSD: FreeBSD, SMP

People: Linus Torvalds,�Jeff V. Merkey,�Erich Boleyn,�Manfred Spraul,�Peter Samuelson,�Ingo Molnar

Manfred Spraul thought he'd found a way to shave spin_unlock() down from about 22 ticks for the "lock; btrl $0,%0" asm code, to 1 tick for a simple "movl $0,%0" instruction, a huge gain. Later, he reported that Ingo Molnar noticed a 4% speed-up in a benchmark test, making the optimization very valuable. Ingo also added that the same optimization cropped up in the FreeBSD mailing list a few days previously. But Linus Torvalds poured cold water on the whole thing, saying:

It does NOT WORK!

Let the FreBSD people use it, and let them get faster timings. They will crash, eventually.

The window may be small, but if you do this, then suddenly spinlocks aren't reliable any more.

The issue is not writes being issued in-order (although all the Intel CPU books warn you NOT to assume that in-order write behaviour - I bet it won't be the case in the long run).

The issue is that you _have_ to have a serializing instruction in order to make sure that the processor doesn't re-order things around the unlock.

For example, with a simple write, the CPU can legally delay a read that happened inside the critical region (maybe it missed a cache line), and get a stale value for any of the reads that _should_ have been serialized by the spinlock.

Note that I actually thought this was a legal optimization, and for a while I had this in the kernel. It crashed. In random ways.

Note that the fact that it does not crash now is quite possibly because of either

I might be proven wrong, but I don't think I am.

Note that another thing is that yes, "btcl" may be the worst possible thing to use for this, and you might test whether a simpler "xor+xchgl" might be better - it's still serializing because it is locked, but it should be the normal 12 cycles that Intel always seems to waste on serializing instructions rather than 22 cycles.

Elsewhere, he gave a potential (though unlikely) exploit:

As a completely made-up example (which will probably never show the problem in real life, but is instructive as an example), imaging running the following test in a loop on multiple CPU's:

int test_locking(void)
{

static int a; /* protected by spinlock */
int b;

spin_lock()
a = 1;
mb();
a = 0;
mb();
b = a;
spin_unlock();
return b;
}

Now, OBVIOUSLY the above always has to return 0, right? All accesses to "a" are inside the spinlock, and we always set it to zero before we read it into "b" and return it. So if we EVER returned anything else, the spinlock would obviously be completely broken, wouldn't you say?

And yes, the above CAN return 1 with the proposed optimization. I doubt you can make it do so in real life, but hey, add another access to another variable in the same cache line that is accessed through another spinlock (to get cache-line ping-pong and timing effects), and I suspect you can make it happen even with a simple example like the above.

The reason it can return 1 quite legally is that your new "spin_unlock()" isnot serializing any more, so there is very little effective ordering between the two actions

b = a;
spin_unlock();

as they access completely different data (ie no data dependencies in sight). So what you could end up doing is equivalent to

CPU#1 CPU#2
b = a; /* cache miss, we'll delay this.. */
spin_unlock();
spin_lock();
a = 1;
/* cache miss satisfied, the "a" line is bouncing back and forth */
b gets the value 1
a = 0;

and it returns "1", which is wrong for any working spinlock.

Unlikely? Yes, definitely. Something we are willing to live with as a potential bug in any real kernel? Definitely not.

Manfred objected that according to the Pentium Processor Family Developers Manual, Vol3, Chapter 19.2 Memory Access Ordering, "to optimize performance, the Pentium processor allows memory reads to be reordered ahead of buffered writes in most situations. Internally, CPU reads (cache hits) can be reordered around buffered writes. Memory reordering does not occur at the pins, reads (cache miss) and writes appear in-order." He concluded from this that the second CPU would never see the spin_unlock() before the "b=a" line. Linus agreed that on a Pentium, Manfred was right. However, he quoted in turn from the Pentium Pro manual, "The only enhancement in the PentiumPro processor is the added support for speculative reads and store-buffer forwarding." He explained:

A Pentium is a in-order machine, without any of the interesting speculation wrt reads etc. So on a Pentium you'll never see the problem.

But a Pentium is also very uninteresting from a SMP standpoint these days. It's just too weak with too little per-CPU cache etc..

This is why the PPro has the MTRR's - exactly to let the core do speculation (a Pentium doesn't need MTRR's, as it won't re-order anything external to the CPU anyway, and in fact won't even re-order things internally).

Jeff V. Merkey added:

What Linus says here is correct for PPro and above. Using a mov instruction to unlock does work fine on a 486 or Pentium SMP system, but as of the PPro, this was no longer the case, though the window is so infintesimally small, most kernels don't hit it (Netware 4/5 uses this method but it's spinlocks understand this and the code is writtne to handle it. The most obvious aberrant behavior was that cache inconsistencies would occur randomly. PPro uses lock to signal that the piplines are no longer invalid and the buffers should be blown out.

I have seen the behavior Linus describes on a hardware analyzer, BUT ONLY ON SYSTEMS THAT WERE PPRO AND ABOVE. I guess the BSD people must still be on older Pentium hardware and that's why they don't know this can bite in some cases.

Erich Boleyn, an Architect in an IA32 development group at Intel, also replied to Linus, pointing out a possible misconception in his proposed exploit. Regarding the code Linus posted, Erich replied:

It will always return 0. You don't need "spin_unlock()" to be serializing.

The only thing you need is to make sure there is a store in "spin_unlock()", and that is kind of true by the fact that you're changing something to be observable on other processors.

The reason for this is that stores can only possibly be observed when all prior instructions have retired (i.e. the store is not sent outside of the processor until it is committed state, and the earlier instructions are already committed by that time), so the any loads, stores, etc absolutely have to have completed first, cache-miss or not.

He went on:

Since the instructions for the store in the spin_unlock have to have been externally observed for spin_lock to be aquired (presuming a correctly functioning spinlock, of course), then the earlier instructions to set "b" to the value of "a" have to have completed first.

In general, IA32 is Processor Ordered for cacheable accesses. Speculation doesn't affect this. Also, stores are not observed speculatively on other processors.

There was a long clarification discussion, resulting in a complete turnaround by Linus:

Everybody has convinced me that yes, the Intel ordering rules _are_ strong enough that all of this really is legal, and that's what I wanted. I've gotten sane explanations for why serialization (as opposed to just the simple locked access) is required for the lock() side but not the unlock() side, and that lack of symmetry was what bothered me the most.

Oliver made a strong case that the lack of symmetry can be adequately explained by just simply the lack of symmetry wrt speculation of reads vs writes. I feel comfortable again.

Thanks, guys, we'll be that much faster due to this..

Erich then argued that serialization was not required for the lock() side either, but after a long and interesting discussion he apparently was unable to win people over.

(

In fact, as Peter Samuelson pointed out to me after KT publication (and many thanks to him for it):

"You report that Linus was convinced to do the spinlock optimization on Intel, but apparently someone has since changed his mind back. See <asm-i386/spinlock.h> from 2.3.30pre5 and above:

/*
* Sadly, some early PPro chips require the locked access,
* otherwise we could just always simply do
*
* #define spin_unlock_string \
* "movb $0,%0"
*
* Which is noticeably faster.
*/
#define spin_unlock_string \
"lock ; btrl $0,%0""

-- Ed: [23 Dec 1999 00:00:00 -0800]

2. Buggy Toshiba Satellite Keyboard Hardware; Parenthood

1�Dec�1999�-�11�Dec�1999 (38 posts) Archive Link: "Toshiba Satellite 2595XDVD"

People: Andrei Pitis,�Pavel Machek,�Linus Torvalds,�Andries Brouwer

Andrei Pitis noticed that the Toshiba Satellite 2595XDVD had a buggy keyboard controller, which would occassionally ignore the repeat delay, and send a lot of key-down interrupts very quickly. He posted a short patch against handle_scancode() in drivers/char/keyboard.c, to enforce a 200 ms delay on all repeated scancodes. Eric Dumazet (on the Satellite 4080XCDT running Windows 98) and Pavel Machek (on the Satellite 4030CDT) confirmed the repeating problem. Pavel suggested that instead of silently correcting the problem, a warning would be nice, and posted a patch.

Elsewhere, under the Subject: Toshiba kbd patch (http://kernelnotes.org/lnxlists/linux-kernel/lk_9912_01/msg01220.html) , Andrei posted his latest version of the patch, enforcing a 250 ms delay. Linus Torvalds pointed out that this would cause missed keystrokes in the case of someone intentionally typing the same key more than four times per second. He asked if anyone could assure him that this didn't matter.

A lot of people replied that yes, in fact, it did matter. Even slow typists were capable of intentionally pressing the same key multiple times per second. Andrei insisted that no one would want to press the same key multiple times in a row, but others disagreed, and Andries Brouwer pointed out that the delay would slow intentional auto-repeats down to a crawl. Andrei posted a new patch, adding, "It's funny, I started by writing a quick'n'dirty patch for my own use and I will probably end up writing kbd software repeat (is it really a demand for such a feature? imagine that we would be able to set up repeat rates of up to 100 cps!!" Pavel Machek came in with a dose of practicality, saying, "Wait with soft repeat for vojtech's input patches and for 2.5. Or look at other architectures, they need soft repeat. Implement minimal fix now."

There was also a cute exchange of recipes for soothing crying newborns, between Linus and Andrei. In his previously quoted post, Andrei continued, "Anyway, I am happy doing this, between two consecutive cries (she has stomach aches, any known remedies for a 1 month baby?) of my first newborn (Andra, 12 nov) late in the night (in my TZ - EET Bucharest/Romania). It would be neat if a child's crying repeat delay/rate would be customizable :-)))" Linus replied, "Try sugarwater. Two-three tablespoons sugar to one deciliter of purified water, if I remember correctly. Spoonfeed. No guarantees." Andrei replied, "Thanks for the advices to all of you who replied :-) I will try some (of course, after verifying them with a doctor, now it's the first time I regret not going to a medical school :-))..."

3. ext2/ext3 Compatibility

1�Dec�1999�-�13�Dec�1999 (46 posts) Archive Link: "Oops with ext3 journaling"

Topics: FS: ext2, FS: ext3, SMP

People: Theodore Y. Ts'o,�Stephen C. Tweedie,�Matti Aarnio,�Brion Vibber,�Byron Stanoszek,�Jens Axboe,�Pavel Machek

Brion Vibber was running Linux 2.2.13 SMP with Jens Axboe's CD-ROM patch and ext3-0.0.2c (with the KDB patch applied but not activated). After a week and a half of use, he got an oops during a big copy, and posted it to the list. It turned out that the problem was his own fault (his copy command overwrote the journal.dat file), but in the course of discussion, Pavel Machek expressed a fear that if Stephen C. Tweedie eventually made the journal.dat file hidden, it would make it much more difficult to switch between ext2 and ext3. But Theodore Y. Ts'o assuaged his fears, with, "Never fear, there will be an very easy way to switch back and forth between ext2 and ext3. A single mount command, or at most a single tune2fs command, should be all that it takes, no matter how the journal is stored." And Stephen confirmed this sentiment, saying:

Absolutely. I am 100% committed to this. Apart from anything else, this is the mechanism which will allow for incompatible revisions of the ext3 journal format: you will always be able to mount a journaled ext3 partition as ext2, and then remount as the new ext3, if you want to upgrade ext3 partitions to a new, incompatible format (which should not happen after final release, but there will be at least one such incompatible format revision required in the next month or so).

Any such format changes will be limited to the journal format: there will be no journaling changes which prevent backing the filesystem revision back down to ext2. Ever.

Elsewhere, under the Subject: unsigned short for nlink_t (http://kernelnotes.org/lnxlists/linux-kernel/lk_9912_01/msg00960.html) , he reiterated, "the on-disk format in ext3 does not change, with the exception of the addition of the journal inode. You can migrate between ext2 and ext3 without a reformat."

Elsewhere, under the Subject: Ext3 Filesystem (http://kernelnotes.org/lnxlists/linux-kernel/lk_9912_01/msg01121.html) , Byron Stanoszek was surprised to find that aside from journal.dat, the two filesystems ext2 and ext3 were identical. He felt that instead of journal.dat, ext3 should just use one of the reserved bits in the ext2 header to label it as a journalling filesystem. A number of people took his suggestion to mean that the journalling functionality should be inserted into the existing ext2 code. Matti Aarnio replied, "Sure yes, it *may* become folded back into EXT2, but while the development is under way, it is better to be called something different so that you may choose which partitions you want to endanger with unknown behaviour of the code.. (Be the name 'ext3', or 'ext2j', or whatever..)" Stephen also replied to Byron, with "The filesystem internals have to be substantially changed for journaling, but the on-disk filesystem *is* just marked by a flag in the ext2 superblock which indicates that there is a journal present. You can thus migrate between ext2 and ext3 on disk easily, but you cannot use the same kernel code to mount ext2 and ext3." Brion Vibber also replied to Byron, adding, "giving it a separate name just makes it safer since the stable ext2 code can be run simultaneously on other filesystems that don't need the journaling while the experimental ext3 eats your test partitions if something goes wrong."

Byron also added in his original post, "I've read recently about the problems with root seeing journal.dat, being able to modify it, etc." Stephen replied that the final release would have the journal in a hidden inode. Byron also asked, in its current visible form, what if root removed journal.dat? Would the system rebuild the log or would it try to restore the filesystem by replaying the empty journal log? Stephen replied, "Right now the world falls apart. Eventually you won't be able to: the journal will be in a hidden inode."

4. Yamaha Spits On The Open Source World

8�Dec�1999�-�9�Dec�1999 (15 posts) Archive Link: "Yamaha Sound Drivers"

Topics: PCI, Sound: ALSA

People: Dan Hollis,�David Whysong,�Vojtech Pavlik,�Alan Cox,�Jeff Garzik,�Alan Olsen

Alan Olsen asked about the status of sound drivers for the newer Yamaha sound chipsets. Dan Hollis replied, "Yamaha refuses to release required documentation for the PCI YMF chips. And no, the supplied PDF files on the yamahayst pages is not enough." David Whysong added that the chipset was "blacklisted by ALSA due to lack of programming information." He added, "I have been in contact with someone at Yamaha about this issue, unfortunately specs are not available without an NDA for now." Jeff Garzik said an NDA would be acceptable as long as source code could be released, but David replied, "I asked but the NDA would not allow source to be released. I'm still hoping that something can be worked out, but it doesn't look too likely unless Yamaha changes their policy." Vojtech Pavlik suggested that "Perhaps an Open Source friendly NDA could be arranged? I've done this a couple times now - I can't give the docs I received to anyone, but I can distribute the driver I made based on them, under GNU GPL. For example the Logitech joystick driver." David replied, "I'm trying to arrange something like this... I'll let you know what happens."

Elsewhere, Alan Cox replied to Alan Olsen's original post, reiterating that the drivers were "basically dead. Even the SB emulation mode seems to need some kind of undocumented setup." Jeff replied that the docs looks sufficient to him, but Dan replied:

I think Alan had a go at it and couldnt get it to work using the docs. Basically there is some undocumented setup required for the ymf744b to work at all. Maybe someone can reverse engineer the DOS init program and figure it out.

Basically this isnt going to work until Yamaha wakes up. Makes you wonder what theyre hiding.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.