Kernel Traffic #38 For 11 Oct 1999

By Zack Brown

Introduction

Well, a lot of people had problems with the formatting of last week's issue. Actually it was just a <pre> tag around a very long line. Special thanks go to Szabolcs Szakacsits, who identified this problem. Thanks, Szaka! And thanks to everyone else who emailed me about it too. Your help is very much appreciated.

Special thanks also to Peter Samuelson, who (yet again) pointed out some significant errors. Thanks, Peter!

Mailing List Stats For This Week

We looked at 1131 posts in 4675K.

There were 441 different contributors. 190 posted more than once. 152 posted last week too.

The top posters of the week were:

1. Reproducible 2.2.12 SMP Crashes Hunted

13 Sep 1999 - 4 Oct 1999 (40 posts) Archive Link: "IDE + SMP Lockup (no OOPS) in 2.2.12, 2.2.10"

Topics: Disk Arrays: RAID, Disks: IDE, Disks: SCSI, Modems, Networking, PCI, SMP

People: Tom Livingston, Mario Mikocevic, Andre Hedrick, Alan Cox, Benjamin LaHaise

Tom Livingston was experiencing regular crashes on the latest stable kernels compiled with SMP, including 2.2.12; he described his system:

Abit BP-6 dual 366 celerons, 128MB PC100 RAM, redhat 6.0, 12 ide disks attached (to primary PIIX4 and three PCI PDC20246's... note, I am NOT using the onboard HPT-366), 3com 3c905b 100mbit ethernet

I have two other very similar systems, both Abit BP-6 dual celeron 366, 128MB RAM, 3c905b's, redhat 6.0, 2.2.12 but using Tekram 390F controllers and UW SCSI disks + raid, and they are both 100% stable.

Benjamin LaHaise confirmed the problem, saying that his PDC20246 with drives on either channel would reproducibly lock up under SMP.

Stephan van Hienen also confirmed the problem. He was running in UP mode because SMP would crash his system immediately. He posted the details of his system:

Asus P2B-D
+ 2 * Promise UDMA/66 controllers
+ 1 * Adaptec 2940UW
+ 1 * 3com 3c905b
and dual p2-450 / 512mb ram
+ 1 * 9.1u2w ibm scsi hd
+ 5 * 25gb ibm ide hd (1 on the onboard controller
and 4 on the promise controllers)

Mario Mikocevic replied, confirming the problem. His system:

Dual P2 350
mixed IDE + aic7xxx
2 x 4.5GiB IBM SCSI
1 x 2GiB IDE

He explained, "Hangs (without _any_ output even SysRq) started when I added another 64MiB DIMM to already existing 64MiB one. I thought, bad RAM. Replaced it, the same. I've spent whole day changing all possible combinations of 64MiB DIMMs and every time with 64+64 it wouldn't run into 10min of uptime. Sometimes it hang on fsck, sometimes later. The very same 64MiB DIMMs when used alone worked fine. Day after (today :) I aquired 128MiB DIMM piece. Hangs all around place. _never_ in the same place of booting ! Ofcourse kernel is 2.2.12, compiled for SMP on RH5.2 box with all updates, running as web and dns server."

Alan Cox also replied to Stephan, asking if this was happening with a kernel built with straight gcc 2.7.2; if so, he was very interested in debugging it. Stephan replied that he was using compiler version 2.91.66 (egcs). Andre Hedrick gave a recipe for tracking down the problem:

I know about this, but not a clue to resolve the issue. I have brought up the issue with Promise and they are working with me. There is a duplicate issue with M$'s NT, I hope to get the method of resolution to this problem and not "miniport" code.

First test to verify the problem, build another kernel that is UP. Include raid and rag the hell out of the box. This should not fail, based upon other test reports.

If the box throttles in UP, the driver core is RAID UP Stable.

Second, disable RAID under SMP and push a sequencal access on the last two channels.

What is the irq routing table for this board? Did you remove all the Promise BIOS chips execpt for the one that registers hde/f/g/h ?

Have you patched with "ide.2.2.12.19990921.patch.bz2"? New Promise OEM support.

You are required to set these for more than two cards.
CONFIG_BLK_DEV_PDC202XX=y
PDC202XX_FORCE_BURST_BIT=y

Tom (the original poster) replied that yes, UP mode with RAID was fully stable. He disabled RAID and enabled SMP and got the hard lockup again. Regarding Andre's IRQ routing question, Tom tried many configurations but was unable to reproduce the lockup with any configuration of IRQ overlays. He added that he had not been using the ide.2.2.12.19990921.patch.bz2 patch because it hadn't been released when he was doing his tests; and all his tests had
CONFIG_BLK_DEV_PDC202XX=y
PDC202XX_FORCE_BURST_BIT=y
properly set.

Tom was just getting warmed up. He described his gruesome dissections:

Ok, so what have I been doing since I first reported this?

The box that I reported these lockups against is basically a production box. It's an 80 gig RAID array, so you know we must be using it for something ;) To make things worse, it is colocated at an ISP. After I reported the problems, I pretty much had to set it to UP mode and let it run.

However, I have two other abit BP6 based development systems, and my intent is to try to duplicate these results with one of these. They are local to me, and can be fucked with without worries.

They are both Tekram based SCSI system, however, so some modifications were needed. I have temporarily claimed two 14G IBM 7200 udma33 drives for testing, and I ordered two PDC20246's to put in to round out the picture.

So far, I haven't been able to get this configuration to crash. But I fear it's for lack of volume running through the controllers. I hooked up one IBM drive each to the two PDC controllers, and ran it with equal load on the hard disks, but no crashes.

I am going to break apart the rest of my computers here in an attempt to borrow three more IDE drives (two maxtors, and a quantum). I'm going to hook these up so that I have a disk as hda, and one disk on each of the promise channels. I'll then test with the vanilla kernels and with the ide.2.2.12.19990921.patch.bz2 patch and see if I can get it to crash. My guess is that 5 drives, working most of the channels... will be able to produce the same results.

I'll be doing these tests in the next few days, so I will relay my results.

A few days later, he reported his progress:

As an update, I have succeeded in getting both of the boxes to crash very repeatedly. The trick turns out to be running drives on both interfaces on a PDC20246. With SMP enabled and concurrent access to both disks, the machine locks up almost immediately. It doesn't have to have more than one PDC20246, or more than two ide drives attached.

Though the normal kernel doesn't produce an oops, I am able to get an NMI OOPS with 2.2.12 + ikd + NMI oopser. However, I must obtain a null modem cable to capture the OOPS before I can decode it an forward the results. Hopefully that will be tomorrow.

kernels that crash this way: (all smp) 2.2.12 vanilla, 2.2.12 + ide, 2.2.13pre12, 2.2.10

kernels that don't crash: (any UP), anything pre 2.2.6 (I will also test 2.2.7-9 for my full report)

I write prematurely just in case you are working on tracking this down. Since I've done it on three machines now, it might be reproducible on your hardware Andre. I think SMP + PDC20246 with two drives attached will cause it reliably.

Like I said, I will write as soon as possible with more kernel tests and the decoded OOPS that ikd generates.

The thread seemed dead for about a week thereafter, until Stephan asked what the status of the bug was, and Alan replied, "Having read the IDE code in detail chasing some other bugs I think there are a list of things that need fixing before we get into the "why does it still not work" case"

Andre also replied to Stephan, saying:

I just now have a way to produce the crash as of tonight. What I need to know is what parallel races are we referring/hunting and against which kernels?

Alan just pointed out one location last night. It is now 10:00pm here and I have about 4 hours of brain left. I will check back in about 15 minutes. I am off to grab AC's latest and see what "Andrea A." has to offer about finding the races.

Tom and Alan had a brief back-and-forth at this point, resulting in Alan uncovering the nature of the bug. Tom posted oopsen from his tests, and said:

I have seen one race. At this point it appears to me that any time you have concurrent access to two ide channels sharing the same interrupt while running SMP. As you see below, I have duplicated with the hpt-366 controller as well.

I have tried:

2.2.6 (with or without ide patch): works in SMP mode

2.2.7 (same): works in SMP (as I remember, haven't tested recently)

2.2.[89] : haven't tested

2.2.10 and later: crashes in SMP mode

2.2.12 +

2.3.18: crashes in SMP mode

It would seem that is actually a generic ide/smp bug, and not one that is promise controller specific. I was able to cause the same behavior tonight with the onboard hpt-366 controller.

kernel: 2.2.12 + ide.2.2.12.19990925.patch.gz + 2.2.12-ikd1.gz

Tested normal crashing setup, one drive on each channel of pdc20246. Got normal ikd NMI oopser oops. Looked like the other one I reported. oops is attached as text file.

Then I moved the drives, one per channel to the onboard hpt-366. I had previously commented out the two lines in ide-pci.c at line 630 that say:

if (dev2 && hpt363_shared_irq)
   return;

so that I could enable the second channel for the test.

I did my standard crashme (which I have simplified to 'dd if=/dev/hdi of=/dev/null & dd if=/dev/hdk of=/dev/null' you only need one block each to trigger the crash) and got what looks like to me the same lockup as the pdc20246 with the hpt-366. This oops is also attached.

I retested this configuration in UP mode and found it completely stable even with the 2nd channel disabling removed. My abit bp-6 bios is the original LP revision, I have never flashed. Board is a "newer" board, bought about 8/1... it has plastic cpu handles as opposed to metal, like my first one bought in early june. I cannot see any silk screening indicating revision on the board.

I was thinking that this might have caused your impression that there is a buggy chipset/revision out there that needs this 2nd channel disabled on the hpt-366. Or is it two bugs? One is the motherboard, and if that's OK you still have the multichannel + smp bug?

This was all Alan needed. He analyzed the oopsen and said:

Ok whats happened is this

                CPU0                            CPU1
          take hwgroup spinlock
                                        take an irq
          disable_irq
          [wait for IRQ completion]
                                        try to take hwgroup spinlock

Based on this, Tom posted a patch that stopped the crashes, though he acknowledged that there could very well be problems with it that he couldn't see. Alan replied, "That proves the diagnosis which is good. I was trying to work out if it was safe or not but the code needs a bit of cleaning up before I could be sure." At this point the discussion skewed off into other topics.
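
As an illustration only, the interleaving in Alan's diagram can be replayed in a small deterministic model. The sketch below is not kernel code and is not Tom's patch; every name in it is hypothetical. It assumes that the interrupt handler on CPU1 needs the same lock CPU0 is holding, and that CPU0's wait cannot finish until the handler returns. The second mode, in which CPU0 lets go of the lock before waiting, is just one conceivable way to break the cycle.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the two-CPU interleaving. All names are hypothetical. */
enum { NOBODY = -1, CPU0 = 0, CPU1 = 1 };

struct model {
    int  lock_owner;       /* who holds the "hwgroup" lock */
    bool handler_running;  /* CPU1 is inside the interrupt handler */
    bool handler_done;     /* the handler finished its work */
    bool cpu0_done;        /* CPU0's wait for IRQ completion returned */
};

/* CPU1: the handler can only finish once the lock is free. */
static bool step_cpu1(struct model *m)
{
    if (!m->handler_running || m->lock_owner != NOBODY)
        return false;               /* idle, or spinning on the lock */
    /* take the lock, do the work, drop the lock, leave the handler */
    m->handler_running = false;
    m->handler_done = true;
    return true;
}

/* CPU0: the wait only returns once no handler is running. */
static bool step_cpu0(struct model *m)
{
    if (m->cpu0_done || m->handler_running)
        return false;               /* finished, or still waiting */
    m->cpu0_done = true;
    if (m->lock_owner == CPU0)
        m->lock_owner = NOBODY;     /* done with the critical section */
    return true;
}

/* Returns true if neither CPU can make progress (a deadlock). */
bool interleaving_deadlocks(bool drop_lock_before_wait)
{
    /* Start where the diagram ends: CPU0 holds the lock, CPU1 took an IRQ. */
    struct model m = { CPU0, true, false, false };

    if (drop_lock_before_wait)
        m.lock_owner = NOBODY;

    for (;;) {
        bool a = step_cpu0(&m);
        bool b = step_cpu1(&m);
        if (m.cpu0_done && m.handler_done) {
            assert(m.lock_owner == NOBODY);  /* a finished run leaves the lock free */
            return false;
        }
        if (!a && !b)
            return true;            /* both CPUs are stuck */
    }
}
```

Under these assumptions, interleaving_deadlocks(false) reports a stall, with CPU0 waiting on the handler and the handler waiting on CPU0's lock, while interleaving_deadlocks(true) runs to completion.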

2. ext3 Filesystem Status; ACLs

16 Sep 1999 - 28 Sep 1999 (77 posts) Archive Link: "Ext3 filesystem info?"

Topics: Access Control Lists, Backward Compatibility, FS: Coda, FS: NTFS, FS: ext2, FS: ext3, POSIX

People: Theodore Y. Ts'o, Stephen C. Tweedie, Oliver Xymoron, Stefan Monnier, Benjamin Scott, Miquel van Smoorenburg, Jesse Pollard, Alex Buell

Someone asked about the status of the ext3 filesystem. Miquel van Smoorenburg gave a pointer to Stephen C. Tweedie's FTP directory (ftp://ftp.linux.org.uk/pub/linux/sct/fs/jfs/), and Theodore Y. Ts'o explained:

There is a 0.01 release which came out 1-2 weeks ago. It was against 2.2.2, and has bugs which have since been fixed. I imagine that Stephen will be releasing a new version fairly shortly.

So the good news is that most of the code has been written but we still need to do some bugfixing (and bug finding) before it will be completely production ready.

Later, he added, "Note that the 0.0.1 release has a number of caveats; it only works on Linux 2.2.2, and if used as your root filesystem, it will turn /dev into a socket. :-)" and concluded, "you may want to wish for Stephen's next version, which will fix a number of bugs and will port things to 2.2.12."

Stephen gave some more details:

0.0.2 over the weekend is the plan. I've fixed all of the known bugs except for one, but that one requires a bit of a reworking of how I track committed buffers which are still needed by the transaction code. Once that is done I expect to have a version which is pretty stable against the 2.2.2 kernel. I already have a 2.2.12 version of the 0.0.1 code done so merging the 2.2.2 fixes into the 2.2.12 stream should not be hard.

Currently the released journaling code journals everything, including data. That is deliberate: journaling metadata only turns out to be _lots_ more complex (hint---what happens if you delete a block of metadata and reallocate it as data or vice-versa), so although all of the necessary support is there in the journaling layers, the released ext3 code does not do metadata-only journaling yet.

Once people have had a chance to thrash out any bugs left in the 0.0.2 code, I'll enable the metadata-only journaling and we can all watch the pretty fireworks as filesystems all over the planet explode in a curious and interesting manner...
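
One reading of Stephen's hint about metadata-only journaling can be shown with a toy model. The sketch below is not ext3 code; the structures, names, and block values are all made up for illustration. It assumes that a journal is a list of "write this value to this block" records replayed after a crash, and that in metadata-only mode file data is written in place without touching the journal.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NBLOCKS     8
#define TOY_JOURNAL_MAX 16

/* Hypothetical toy disk: each block holds a single int. */
struct toy_disk {
    int block[TOY_NBLOCKS];
};

/* Hypothetical toy journal: an ordered list of (block, value) records. */
struct toy_journal {
    int blockno[TOY_JOURNAL_MAX];
    int value[TOY_JOURNAL_MAX];
    int count;
};

void toy_journal_log(struct toy_journal *j, int blockno, int value)
{
    assert(j->count < TOY_JOURNAL_MAX);
    assert(blockno >= 0 && blockno < TOY_NBLOCKS);
    j->blockno[j->count] = blockno;
    j->value[j->count] = value;
    j->count++;
}

/* Crash recovery: rewrite every journaled block, oldest record first. */
void toy_journal_replay(const struct toy_journal *j, struct toy_disk *d)
{
    for (int i = 0; i < j->count; i++)
        d->block[j->blockno[i]] = j->value[i];
}

/* Returns true if recovery destroys the file data written to a reused block. */
bool toy_replay_clobbers_data(bool journal_data)
{
    struct toy_disk d = { {0} };
    struct toy_journal j = { {0}, {0}, 0 };
    const int blk = 5, old_metadata = 111, file_data = 222;

    /* 1. The block holds metadata, so the update goes through the journal. */
    toy_journal_log(&j, blk, old_metadata);

    /* 2. The block is freed and reallocated as file data. */
    if (journal_data)
        toy_journal_log(&j, blk, file_data);  /* journal everything */
    else
        d.block[blk] = file_data;             /* metadata-only: bypass journal */

    /* 3. Crash, then recovery replays the journal. */
    toy_journal_replay(&j, &d);
    return d.block[blk] != file_data;
}
```

In this model toy_replay_clobbers_data(false) is true, because the stale metadata record overwrites the new file data, while toy_replay_clobbers_data(true) is false, because the later data record wins during replay. That is at least consistent with journaling everything being the simpler starting point.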

Elsewhere, Alex Buell asked if there would be tools to convert from ext2 to ext3, and Ted replied, "No real tools are needed; it's just a matter of mount -t ext3 with an appropriate mount option to ask it to create the journal inode. After you unmount an ext3 volume with journaling, you'll be able to mount it using ext2. Some future extensions (like B-tree directories) change this in the future, but if you're only using journalling, it's completely backwards compatible."

Stephen was more circumspect, and replied to Alex:

That isn't finalised. I _might_ make the ext3 journal use a reserved inode, but right now there isn't actually a difference between ext2 and ext3. You just create a journal file on the ext2 filesystem and tell the kernel where to find it. Then when you mount it as ext3, everything is automatically journaled to that file.

If you uncleanly shutdown, then the ext3 filesystem will have a compatibility flag set in the superblock to prevent you from remounting it as ext2 without doing a recovery step first. Once that is done, it becomes mountable as ext2 again.
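
The compatibility-flag behaviour Stephen describes can be sketched as a single bit in a toy superblock. This is an assumption-laden illustration, not the ext3 on-disk format: the flag name, its value, and the choice to set it at mount time and clear it on clean unmount are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit; the real on-disk layout is not given in the thread. */
enum { TOY_NEEDS_RECOVERY = 0x1 };

struct toy_superblock {
    unsigned flags;
};

/* Assumed: mounting with journaling marks the journal as live. */
void toy_mount_ext3(struct toy_superblock *sb)
{
    sb->flags |= TOY_NEEDS_RECOVERY;
}

/* Assumed: a clean unmount leaves nothing to replay, so the mark is dropped. */
void toy_clean_unmount(struct toy_superblock *sb)
{
    assert(sb->flags & TOY_NEEDS_RECOVERY);  /* only valid while mounted */
    sb->flags &= ~(unsigned)TOY_NEEDS_RECOVERY;
}

/* The recovery step: journal replay would happen here, then the mark is dropped. */
void toy_recover(struct toy_superblock *sb)
{
    sb->flags &= ~(unsigned)TOY_NEEDS_RECOVERY;
}

/* ext2 refuses the volume while the recovery mark is present. */
bool toy_can_mount_ext2(const struct toy_superblock *sb)
{
    return (sb->flags & TOY_NEEDS_RECOVERY) == 0;
}
```

An unclean shutdown is modelled simply by never calling toy_clean_unmount(): the flag stays set, toy_can_mount_ext2() returns false, and only toy_recover() makes the volume mountable as ext2 again.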

Elsewhere, someone changed the subject, asking if there was any chance of seeing Access Control Lists (ACLs) in the near future. Stephen gave a link to the Group ACL For ext2 In Live (http://aerobee.informatik.uni-bremen.de/acl_eng.html) page, and there followed a big implementation discussion. Jeff Haumont asked if the CODA project would have the same ACL commands as ext2. Jesse Pollard opined that ACL code should move into the Virtual Filesystem (VFS) layer, leaving the question of support up to individual filesystem implementations. This, he felt, would allow a common set of user utilities and user-to-OS interface. But Theodore Y. Ts'o pointed out:

The problem is that different, already established filesystems: AFS, Coda, NTFS, etc., all have different ACL semantics. For example, AFS only has an ACL on a per-directory basis. I'm not sure about Coda, but it may be the same as AFS. NTFS uses 128 bit UUID's in its ACL's to name users and groups. The POSIX acl interface uses uid_t and gid_t for user and group id's.

So it would be *nice* to do this, but there's quite a lot of design work to make the interfaces similar enough that a single interface could be used at both the UI and system call level. I won't say that it's impossible, but it's definitely non-trivial.

Oliver Xymoron added, "Not to mention that auditing a system using ACLs for security is *much* more difficult. And that such a system would break all mature UNIX tools' notions of what is secure. ACLs tend to be an answer for people who are asking the wrong question anyway."

Stefan Monnier replied, "No, they simply ask a different question. ACLs should normally not be used for the part that one might want to audit, but they'll be used for all kinds of cases where one wants to share some info with some persons. How can you give read access to some persons, and write access to some others without ACLs?" Jesse Pollard also felt there was no other way to grant fine-grained permissions without ACLs. Benjamin Scott agreed with Oliver that, "As far as existing tools and auditing techniques go, yes, ACLs are outside their domain," but added that, "There are a number of things that ACLs let you do, that standard UNIX security does not. I would even argue that ACLs can lead to better security, as it allows you to specify combinations which would be difficult or impossible with traditional UNIX mechanisms."

3. A Use For Winmodems

16 Sep 1999 - 28 Sep 1999 (82 posts) Archive Link: "Turning lucent winmodem into soundcard"

Topics: Modems, Sound

People: Jamie Lokier, Pavel Machek, H. Peter Anvin

Pavel Machek got Lucent winmodems to do something useful. With his driver they can be used as soundcards. H. Peter Anvin asked if it could handle telephone hangups, ring detects, and so on; if so, he suggested it might make a usable answering machine. Jamie Lokier replied, "It currently handles detecting dial tone, busy tone, and a few other things, using the on board DSP. Dialing and hangup are done properly. It is not quite ready to be an answering machine, but nearly. Full tone detection, ring detection and so on is coming soon." There followed some implementation discussion.

4. Some History And Explanation Of Kernel Configuration

27 Sep 1999 - 28 Sep 1999 (29 posts) Archive Link: "bug in 2.3.18ac9 net/Config.in"

Topics: Kernel Build System, Microsoft

People: Jes Sorensen, Alan Cox, Michael Elizabeth Chastain, Eric Youngdale

Someone found an "if" without a "then" in a kernel configuration script. Jes Sorensen replied, "the questions is really why nobody fixes menuconfig instead. It seems that for every little irrelevant change either menuconfig or xconfig breaks for whatever stupid thing. If the situation is not improved we really should remove them from the kernel tree, they seem to cause more grief than good." Alan Cox replied, "There is a spec. The spec says net/Config.in is wrong. End of debate."

Michael Elizabeth Chastain also replied to Jes, giving this fascinating explanation of the past, present, and potential future of kernel configuration:

I've got insomnia so you get a long answer.

First, on a literal level, leaving out a "then" keyword is not a "little irrelevant change". Documentation/kbuild/config-language.txt is the specification for Config Language. Based on legacy "bash" syntax, the Config Language syntax specifies that an "if" statement must have a "then" keyword.

So I think the real question is "how come Menuconfig doesn't print out informative syntax errors?" Because if it did, then the people who write Config.in scripts with syntax errors would get error reports before submitting their patches to Alan or Linus.

The problem goes back a long, long ways to the use of ad hoc scripts for kernel configuration.

First there was some old 'Configure' shell script, which I've never seen. Then sometime in 1993, or even earlier, Raymond Chen of Microsoft wrote the current 'Configure' script.

Next, Eric Youngdale wrote a very kludgy Config.in->tk translator, which works OK most of the time. So you have a choice: you can run the reliable but awkward text-mode script written by a Microsoft engineer, or you can run the flashy, crashy windowing version from somebody else. (Ironic, isn't it?)

After that, William Roadcap (not from Microsoft) wrote Menuconfig, which is a super kludge of a language interpreter, but wow does it have a nice curses interfaces. Mr. Roadcap disappeared, I took responsibility, I fixed a lot of bugs in Menuconfig, and then I semi-disappeared too.

These ad hoc script interpreters simply do not scale up to the size of the development community. Some poor guy trying to add something to the network drivers omits a required keyword and instead of a nice syntax error, dozens of people get a mysterious message. As you noticed, this happens several times a year.

Here is the second Microsoft-ish irony:

<flame-bait>

Linux developers have high standards for the work they produce, but when it comes to the tools they use to do their *own* work, they value features and familiarity and speed more than correctness.

</flame-bait>

Don't believe me? Read scripts/Menuconfig, figure out what the hell it's doing (including the magic awk script in the middle), and then ask yourself if Linus would knowingly accept a driver or a filesystem that behaves so randomly in the presence of errors. But the kbuild procedure is full of this kind of shit: CONFIG_MODVERSIONS, "make modules_install", three different configuration parsers, configuration parameters tucked into the Makefiles.

I believe the right fix is a real grammar, written in bison, with real syntax checking and real error messages. I have one in progress with a generic back-end and a curses front-end. I've actually configured and built kernels with it. Jim Bray is working on a gtk front-end for it.

The code's been available for eight months now and I've gotten some interest in it. Menuconfig is not well-loved among its captive audience (more Microsoft parallels).

My new mconfig runs in about 5% of the CPU time of Menuconfig. (I developed it on a 486, a *slow* 486, so I made it fast, *very* fast). But forget about the speed. The nicest part of mconfig is that when I ran it on 2.3.18ac9 tonight, it said:

    net/Config.in: 70: parse error

What I really need now is a block of hacking time, about 2-4 weeks long, to bring it to a usable level. I expect to use my winter vacation to do this.

Then I am going to try to put Menuconfig out of business. Also Configure and Xconfig, too. Not by taking them away from people -- but by offering something that they prefer to run so that the old ones die from lack of users.

Work in progress:

    ftp://ftp.shout.net/pub/users/mec/experimental/mconfig-0.15.tar.gz

I actually *don't* want much feedback on this yet. I'm just putting it up so people can see the direction I'm going.

There were mixed reactions to this post. Some folks felt Michael was wrong to try to get rid of the older configuration methods, and that those methods ("make menuconfig" and "make xconfig") should be fixed instead.
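
To make the missing-"then" complaint concrete, here is a deliberately crude checker. It is not mconfig and not a parser for the real Config Language: it assumes that "then" must appear on the same line as its "if", which is narrower than what the specification may allow, and the function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Does s[0..n) contain `word` delimited by whitespace, ';', or the line ends? */
static int has_word(const char *s, size_t n, const char *word)
{
    size_t w = strlen(word);

    for (size_t i = 0; i + w <= n; i++) {
        int left_ok  = (i == 0) || isspace((unsigned char)s[i - 1])
                                || s[i - 1] == ';';
        int right_ok = (i + w == n) || isspace((unsigned char)s[i + w])
                                    || s[i + w] == ';';
        if (left_ok && right_ok && memcmp(s + i, word, w) == 0)
            return 1;
    }
    return 0;
}

/*
 * Returns the 1-based number of the first line that starts with "if"
 * but carries no "then", or 0 if every "if" line has one.
 */
int first_if_without_then(const char *text)
{
    int lineno = 1;

    assert(text != NULL);
    while (*text) {
        const char *end = strchr(text, '\n');
        size_t n = end ? (size_t)(end - text) : strlen(text);
        const char *p = text;

        /* skip leading indentation */
        while (n > 0 && (*p == ' ' || *p == '\t')) {
            p++;
            n--;
        }
        /* "if" must be a whole word, so "ifdef" and similar are ignored */
        int starts_with_if = n >= 2 && p[0] == 'i' && p[1] == 'f' &&
                             (n == 2 || isspace((unsigned char)p[2]));
        if (starts_with_if && !has_word(p, n, "then"))
            return lineno;
        if (!end)
            break;
        text = end + 1;
        lineno++;
    }
    return 0;
}
```

A caller could turn a nonzero result into a message in the "file: line: parse error" style that Michael quotes. Even a check this simple would flag an "if" line written without its "then" before the script ever reached the interpreters.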

5. Version Control System Flamewar

27 Sep 1999 - 29 Sep 1999 (38 posts) Archive Link: "The Linux Kernel Project Management System (INITIAL PROPOSAL)"

Topics: Version Control

People: Larry McVoy, Jordan Mendelson

Jordan Mendelson opened a big can of worms when he posted a long proposal suggesting the Open Source Aegis version-control/bug-tracking system (http://www.canb.auug.org.au/~millerp/aegis/aegis.html). Later, he gave a link to a March 1999 proposal on linux-kernel (http://kernelnotes.org/lnxlists/linux-kernel/lk_9903_04/msg00650.html) by Aegis' author Peter Miller, who also participated in the discussion this time.

The battle was joined when Larry McVoy started pushing his commercial BitKeeper (http://www.bitkeeper.com/) project. An ugly flamewar quickly ensued.

Larry's main points seem to be:

The main points of BitKeeper detractors seem to be:

6. Some Discussion Of Windows 2000 Spinlocks

28 Sep 1999 - 2 Oct 1999 (38 posts) Archive Link: "possible spinlock optimizations"

Topics: Assembly, SMP

People: Ingo Molnar

Someone gave a pointer to an article on Windows 2000 spinlocks (http://www.numega.com/drivercentral/resources/spinlocks.shtml), which claimed that their new kind of spinlocks had better SMP memory bus characteristics than standard spinlocks. Ingo Molnar replied:

they are fixing the symptoms. Windows NT apparently has problems with high contention spinlocks. So instead of reducing the number of spinlocks and reducing lock collisions, they decided to optimize the 'contention case'. This is a step in the wrong direction, as it makes the 'no contention' case slower. No amount of trickery is going to avoid the spinning CPU to waste precious cycles on pure spinning! The 'no contention case' in Linux is highly optimized, it's only 2 inlined assembly instructions to aquire, and 1 inlined assembly instruction to release. Windows NT is digging itself into a deeper and deeper architectural hole me thinks ...

Linux is doing another trick here to reduce bus traffic if lock contention happens: once we go into the slow path we do not do interlocked atomic instructions to poll the state of the spinlock flag, but normal memory access instructions - this also ensures that we generate cross-cache bus traffic only when the spinlock is released. This way we basically get the kind of 'good' (nonintrusive) bus traffic what queued spinlocks are supposed to do primarily - without the overhead of queued spinlocks.

(btw. NT doesnt do the kind of off-line spinlock slow-path thing Linux does, no wonder they see high spinlock overhead. NT calls functions to aquire/release spinlocks, yuck!)

NT designers i believe also have made a mistake with the 'processor-local area' thing which they currently implement through a special segment and %fs. Not only is it slower on x86 (%fs access is a bigger opcode and doesnt optimize as well within the CPU), but they'll also have to waste a whole register on that in IA64 ... Apparently that David Cutler guy has already left the building? ;)

in the semaphore case the picture is completely different - semaphores (mutexes) naturally want to have the kind of queueing characteristics described above. Linux 2.3 does this and more (we do wake-one on semaphores). Linux semaphores are still very lightweight in the no contention case: 2 inlined assembly instructions to aquire and 2 inlined assembly instructions to release.

anyway, Linux tries to have the highest quality SMP core architecture possible physically - if you can poke holes into it, feel free!
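
The slow-path polling Ingo describes resembles what is often called a test-and-test-and-set lock. The sketch below is a portable C11 illustration of that general idea, not the Linux implementation (which, per Ingo, is inlined assembly); the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: a single flag, true while the lock is held. */
typedef struct {
    atomic_bool locked;
} ttas_lock;

void ttas_init(ttas_lock *l)
{
    atomic_init(&l->locked, false);
}

/* One interlocked exchange: succeeds only if the flag was previously clear. */
bool ttas_try_acquire(ttas_lock *l)
{
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

void ttas_acquire(ttas_lock *l)
{
    /* Fast path: in the uncontended case the first exchange wins. */
    while (!ttas_try_acquire(l)) {
        /*
         * Slow path: poll with ordinary loads instead of interlocked
         * operations, and only retry the exchange once the flag
         * looks clear.
         */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
    }
}

void ttas_release(ttas_lock *l)
{
    /* Releasing a lock that is not held is a caller bug. */
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed));
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

The design point is the inner loop: a waiting CPU only reads the flag until the holder's release store changes it, and the interlocked exchange is retried only after that.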

7. Mailing-list Problems

30 Sep 1999 (2 posts) Archive Link: "mailing-list-problems"

People: Matti Aarnio

Daniel Wirth wasn't getting any mail from linux-kernel for almost 24 hours, and asked if anything was wrong. Matti Aarnio replied, "Backlog at VGER towards UK+DE exploder, which has got stuck. Circumvention is now in effect, and aside of messages which are stuck inside the sick system, the flow has now resumed to normal (although perhaps not quite in chronological order..)" EOT.

 

 

 

 

 

 


Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.