Kernel Traffic #66 For 8 May 2000

By Zack Brown

Mailing List Stats For This Week

We looked at 1116 posts in 4974K.

There were 397 different contributors. 182 posted more than once. 136 posted last week too.

1. 'movb' Instruction On Intel

18 Apr 2000 - 29 Apr 2000 (97 posts) Archive Link: "namei() query"

Topics: SMP

People: Linus Torvalds, Oliver Xymoron, Andy Glew, Manfred Spraul, Jamie Lokier

(Many thanks to Kenneth Topp, who pointed out that this problem was covered earlier, in Issue #47, Section #1 (20 Nov 1999: spin_unlock() Optimization On Intel). Thanks, Kenneth! -- Ed: [09 May 2000 07:34:00 -0800])

In the course of discussion, Linus Torvalds explained:

I have conflicting reports about the safety of "movb" from Intel. According to some people in there, "movb" is always safe, and there should not be any need for any config option at all.

However, at the same time my original contact at intel was Andy Glew, who probably knows more about the ia32 core than anybody else I know. And Andy says that yes "movb" is legal, but that some very early P6 steppings may be buggy. And Andy is God.

I'd hate to have a kernel that works 99% of the time but then has occasional problems on some very rare machines that are really hard to track down. But I'd _almost_ like to just make the movb the default and have a CONFIG_BROKEN_P6_ORDERING option for the very very special case.

Jamie Lokier replied that he remembered the discussion where the 'movb' issues were hashed out, but that he (and others) still didn't understand exactly what the problems were. Oliver Xymoron speculated that the whole problem might not even exist. He said, "There are very few things that could cause the movb to be a problem. For instance, it can't be in the cache coherency protocol as the unlock can be as lazy as it likes and still be safe. My only guess is that somehow the movb can get scheduled ahead of reads or writes inside the critical section. If that's the case, then the whole coherency scheme is broken, no? We'd need to rethink quite a number of things we've presumed safe." He added that before working on the problem, he'd like to see confirmation that there was actually a bug at all.

Some folks offered to test out any sample code, and Oliver replied to himself almost exactly a day later, with a pointer to Intel's Pentium Pro errata page. He included some code by Manfred Spraul that tested for the bug, and invited folks with Pentium Pros to try it out. If it locked up their systems, he said, that meant there was a problem. Jakob Østergaard confirmed the lock-up, but it turned out that Manfred's code was not actually a proper test. Before realizing this, Linus was happy to finally put the issue to rest without a shadow of a doubt; but Oliver replied, explaining the confusion, and adding that some new code he'd posted would be a much better test. In fact, if anyone locked up their Pentium Pros with it, he said, it would indicate even more serious problems. But the only tests reported came back negative. No machines locked up.

The question remained open until, under the Subject: Re: Linux spin_unlock debate (fwd), Oliver quoted some private email between Andy Glew and himself, in which Andy had said:

Use MOVB already!!!! Tell Linus I told you so!!!! (I've passed this on to about a dozen LINUX people. Send me Linus' email address, and I'll tell him myself. Forward this on, whatever.)

I know why this misunderstanding happened:

Back in 90/91, Intel had a "Memory Ordering Task Force", whose recommendation was that Intel adopt a weakly ordered memory consistency model. To make this work, you have to be able to identify lock releases. Lacking a special "RELEASE LOCK" instruction - on the i486 you could apply the LOCK prefix to an ordinary write, not just a read-modify-write - this task force said that locks should be released by a locked read-modify-write. Like, an XCHG that wrote zero.

That's still official Intel policy.

However, the P6 processor decided to go for "speculative processor consistency" - we snoop the bus appropriately, and are *not* weakly consistent. Indeed, it would be close to impossible to build a weakly consistent P6 system, without having to require special external hardware.

So, on P6 (and Willamette, I believe) it is correct to use an ordinary write to release the lock.

Now, there *was* a bug on early P6 silicon in this regard. But it would not bite you so long as you used a locked read-modify-write to acquire the lock.

So: if you are willing to write code that should work correctly on every x86 multiprocessor to date (except for some of the earliest i386 and i486 MPs that were weakly ordered) you can use MOVB to release the lock.

Now, I'm in absolutely no position to say that, but I think that I can reasonably say that, if Intel ever builds a MP-capable system that is weakly ordered, there will be a CPUID bit indicating it. In fact, I think it's time that we defined what it is - one of the bits that currently reads as 0 - so that you can write forward looking code. I'll see if I can get that defined for you. For now, check that CPUID indicates a P6 or earlier, or whatever Willamette turns out to be.

Caveat: although Intel's "glueless MP" works as I have described, it is always possible for a system vendor to break the consistency model, e.g. by reordering stores in external inter-bus bridges. So, don't abandon the conservative code - just don't make it the default, since 90% of the x86 MP systems are processor consistent.

Caveat #2: IA64 Merced systems are not, IIRC, processor consistent. (Fortunately, they do define an efficient lock release.) I'm CC'ing Gil Neiger on this - he took over the role of "memory model maven" when I left Intel, knows what the IA64 memory model is, and will undoubtedly correct any errors in my email.

BOTTOM LINE: use MOVB already!!!!
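As a rough illustration of the CPUID check Andy suggests, the sketch below decodes the base family field from a processor signature (the EAX value returned by CPUID leaf 1) and applies a conservative policy: families up to 6 (P6) are accepted, anything newer is treated as unknown. The helper names and the policy are assumptions made for this example, not kernel code, and the signature is passed in as a plain integer so the sketch does not depend on running on x86.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Base family field: bits 11..8 of EAX as returned by CPUID leaf 1. */
static unsigned cpu_base_family(uint32_t cpuid1_eax)
{
    return (cpuid1_eax >> 8) & 0xfu;
}

/*
 * Hypothetical policy helper: accept a plain-store unlock only on
 * families the quoted email vouches for (P6, family 6, or earlier).
 * Newer families are rejected until they are known to be safe.
 */
static bool plain_store_unlock_ok(uint32_t cpuid1_eax)
{
    unsigned family = cpu_base_family(cpuid1_eax);

    assert(family <= 0xfu);   /* the field is only four bits wide */
    return family <= 6;
}
```

A caller would run CPUID once at start-up and pick the unlock path from the result; note that Andy's own exception (some early weakly ordered i386 and i486 MP systems) cannot be detected this way.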

Linus replied:

Heh. Ok, the next kernel version will definitely use the movb based spinlock unlocks.

In fact, that allows us to do the "lock decb"-based spinlock, so I'll get rid of the slow bit-op-based version entirely.

Thanks for following up on all this.
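The scheme Linus describes can be sketched in userspace with C11 atomics instead of the kernel's inline assembly. This is only a model under stated assumptions: the type and function names are hypothetical, the atomic decrement stands in for "lock decb", and the release store stands in for the plain "movb" (on x86 a release store needs no LOCK prefix).

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical byte spinlock: 1 means free, 0 or below means held. */
typedef struct {
    atomic_schar slock;
} toy_spinlock_t;

#define TOY_SPINLOCK_INIT { 1 }

static void toy_spin_lock(toy_spinlock_t *l)
{
    for (;;) {
        /* Locked read-modify-write: the role "lock decb" plays. */
        if (atomic_fetch_sub_explicit(&l->slock, 1,
                                      memory_order_acquire) == 1)
            return;     /* the byte went 1 -> 0, so we own the lock */

        /* Contended: spin on plain reads until the lock looks free. */
        while (atomic_load_explicit(&l->slock,
                                    memory_order_relaxed) <= 0)
            ;
    }
}

static void toy_spin_unlock(toy_spinlock_t *l)
{
    /* Debug check: unlocking a lock that is not held is a bug. */
    assert(atomic_load_explicit(&l->slock, memory_order_relaxed) <= 0);

    /* Ordinary store with release ordering: the role "movb" plays. */
    atomic_store_explicit(&l->slock, 1, memory_order_release);
}
```

The acquire side must stay a locked read-modify-write; per Andy's email, that is also what kept the early P6 erratum from biting.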

2. Cleaning Up Unnecessary Kernel Locks

21 Apr 2000 - 25 Apr 2000 (32 posts) Archive Link: "[PATCH] f_op->poll() without lock_kernel()"

Topics: Networking, SMP

People: Linus Torvalds, Alan Cox, Manfred Spraul, David S. Miller

In the course of a previous subthread of the Subject: namei() query, while folks were tracking down some performance issues, there was some question of whether 'schedule()' required a full kernel lock. Linus Torvalds finally pointed out:

What schedule() does is to just re-instate a previously held lock (yes, you obviously know that, I'm just stating it clearly). Which means that it's not really schedule() itself that needs the lock, and that you should not look at schedule() as the "offender".

The real offender, of course, is the caller of schedule(). So the "cost" is really just attributed to the wrong function.

In the same post, he went on, "It's almost certainly going to be "poll()" that is the big one contributing to schedule(), judging by your other numbers. In fact, poll() looks to be badly written wrt the big kernel lock." Elsewhere but close by, Manfred Spraul suggested that 'poll()' might not even need the kernel lock. Linus replied:

I suspect that pretty much every poll() function is already SMP-safe. The networking versions of poll() in fact drop the lock explicitly, and those are just about the most complex poll() implementations anywhere.

So the right solution may in fact be to remove the kernel lock altogether, and also make sure that the networking code stops playing with it for their poll() implementation. I doubt you'll find anything that breaks, simply because the way poll() is done under Linux is rather thread-safe (ie the double-calling on failure gets rid of all the hard races). And most poll() implementations just check a flag and add themselves to the wait queues, all of which is perfectly SMP-safe already.

All this took place in the previous thread. Now in this new one, Manfred posted a patch to take the kernel lock out of 'poll()'. Alan Cox said this wasn't a good idea if they were trying to stabilize for 2.4; but David S. Miller replied that on the contrary, now was the time to find out if the change would break anything. At this point, Linus said:

I decided to take a look at the actual functions, and having looked at them I definitely think that poll() should run without the kernel lock.

Getting rid of the kernel lock makes the networking poll() cleaner, and of the 25 other poll functions I looked at, every single one seemed safe. I'm told the ISDN one is SMP-unsafe, but that the ISDN code is not SMP-safe for read() or write() either. I didn't look at that one.

The sound drivers, for example, already did all the locking they needed, simply because they needed the locking against their own interrupts anyway. And they obviously didn't use the kernel lock for that.. Some of the bad ones used the global irq lock, which also works but isn't as nice as having your own spinlock.

Most of the other things just didn't need any locking at all (just test a flag and add the poll-queue entry).

The only questionable entry I found was usb_device_poll() which I fixed up with a lock_kernel()/unlock_kernel() pair.

I probably missed one or two. I added comments to the ones I went through, and I definitely think it is worth doing at this point.
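The pattern Linus describes (add the poll-queue entry, then test a flag) can be modelled in a few lines. Everything below is a toy under stated assumptions: the structures, the `toy_` names and the event bit are hypothetical stand-ins, not the kernel's real `poll_wait()` interface. The point it tries to show is the ordering: the caller is registered on the wait queue before the flag is tested, so an event arriving in between still produces a wakeup, and the caller simply polls again.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_POLLIN      0x1u   /* hypothetical "data readable" bit */
#define TOY_MAX_WAITERS 8

struct toy_wait_queue { const char *name; };

/* Records which wait queues the polling caller wants to sleep on. */
struct toy_poll_table {
    struct toy_wait_queue *queues[TOY_MAX_WAITERS];
    size_t nr;
};

struct toy_device {
    struct toy_wait_queue read_wait;
    volatile int data_ready;   /* set by an imagined interrupt handler */
};

/* Register interest; a NULL table means "just report current state". */
static void toy_poll_wait(struct toy_wait_queue *wq,
                          struct toy_poll_table *pt)
{
    if (pt == NULL)
        return;
    assert(pt->nr < TOY_MAX_WAITERS);
    pt->queues[pt->nr++] = wq;
}

static unsigned int toy_device_poll(struct toy_device *dev,
                                    struct toy_poll_table *pt)
{
    unsigned int mask = 0;

    /* Register first, test second: no global lock is needed here. */
    toy_poll_wait(&dev->read_wait, pt);
    if (dev->data_ready)
        mask |= TOY_POLLIN;
    return mask;
}
```

A driver whose state needs more than a single flag test would still need its own lock, which is the situation Linus describes for the sound drivers.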

A number of people pitched in to find all the problems, and the discussion continued for several more days.

3. 'kswapd' Instability; Debugging Deadlocks

22 Apr 2000 - 28 Apr 2000 (23 posts) Archive Link: "[PATCH] 2.3.99-pre6-3+ VM rebalancing"

Topics: Virtual Memory

People: Andrew Stubbs, Rik van Riel, Shane Nay, Linus Torvalds

There were a lot of scattered reports of hangs and high CPU usage related to 'kswapd' this week. Andrew Stubbs even called it the "kswapd of death."

It may have all started with Rik van Riel's patch, when he announced:

the following patch makes VM in 2.3.99-pre6+ behave more nice than in previous versions. It does that by:

It has done some amazing things in test situations on my machine, but I have no idea what it'll do to kswapd cpu usage on >GB machines. I think that the extra freedom in allocation will offset the slightly more expensive freeing code almost all of the time.

Reports of 'kswapd' using huge amounts of CPU cropped up in many threads, but eventually, Shane Nay summarized his experiences:

I have been having extremely serious problems with kswapd. I noted that changes had occurred after Rik's patch to 2.3.99pre3, I grabbed that and my kswapd problem is gone. As a case in point, while running 2.3.99pre6 I was getting 2 gig/hr backing up to my tape drive, and my system was fully unusable in the process. With 2.3.99pre3 my system was really usable during the process and transferred 5gig/hr. I removed all my swap partitions and the problem still persisted under 2.3.pre6. So I'm pretty sure that Rik's patch was the cause of the problem... but why? He just got rid of the silly while, right? Didn't someone mention that basically that just kept the process spinning around for the rest of its time slice? Maybe it really did something during that time that prevented a context switch problem? (That doesn't make any sense to me..., but it feels right if that makes any sense at all... which it doesn't I know. i.e. obviously didn't do anything..., but maybe somebody knows what I mean because I'm having trouble putting it to words)

Anyway, just wanted to throw out a "me too", and bring up the history of it. Rik if you want to send me a patch to test, I'd be happy to help you out in testing, and might mess around with the code later this week. (Oh, BTW Rik, I don't mean to "blame" you, just wanted to bring up the history of it) Has anyone tried copying Rik's changes into a 2.3.99pre3 based kernel to isolate that one change and tried? Rik if you forward me your original patch against 2.3.99pre3 I'll do that (or send me a URL). Maybe the problem crept in from another change. I noted however, that pre6 seemed a lot faster until kswapd started going crazy..., was there another change to the caching code that may have "interacted" with Rik's patch? (Caching seems to be much much more aggressive)

One other note. This is ALL related to disk i/o. But not swapping to disk. It seems that the caching of disk files is causing the problem... over aggressive maybe? Because basically anything I do that's disk intensive, it fills the cache, and then it's spending loads of CPU time swapping stuff in and out of *memory* not virtual memory. (Like I noted, the problem persists even if you turn off all swap partitions)

One great tidbit came out of the experience. Linus Torvalds described a way to debug deadlocks:

Note that if you have an EIP, debugging these kinds of things is usually quite easy. You should not be discouraged at all by the fact that it is "somewhere in stext_lock" - with the EIP it is very easy to figure out exactly which lock it is, and which caller to the lock routine it is that failed.

For example, if I knew that I had a lock-up, and the EIP I got was 0xc024b5f9 on my machine, I'd do:

gdb vmlinux
(gdb) x/5i 0xc024b5f9
0xc024b5f9 <stext_lock+1833>: jle 0xc024b5f0 <stext_lock+1824>
0xc024b5fb <stext_lock+1835>: jmp 0xc0119164 <schedule+296>
0xc024b600 <stext_lock+1840>: cmpb $0x0,0xc02c46c0
0xc024b607 <stext_lock+1847>: repz nop
0xc024b609 <stext_lock+1849>: jle 0xc024b600 <stext_lock+1840>

which tells me that yes, it seems to be in the stext_lock region, but more than that it also tells me that the lock stuff will exit to 0xc0119164, or in the middle of schedule. So then just disassemble that area:

(gdb) x/5i 0xc0119164
0xc0119164 <schedule+296>: lock decb 0xc02c46c0
0xc011916b <schedule+303>: js 0xc024b5f0 <stext_lock+1824>
0xc0119171 <schedule+309>: mov 0xffffffc8(%ebp),%ebx
0xc0119174 <schedule+312>: cmpl $0x2,0x28(%ebx)
0xc0119178 <schedule+316>: je 0xc0119b00 <schedule+2756>

which tells us that it's a spinlock at address 0xc02c46c0, and the out-of-line code for the contention case starts at 0xc024b5f0 (which was roughly where we were); the whole sequence was

(gdb) x/4i 0xc024b5f0
0xc024b5f0 <stext_lock+1824>: cmpb $0x0,0xc02c46c0
0xc024b5f7 <stext_lock+1831>: repz nop
0xc024b5f9 <stext_lock+1833>: jle 0xc024b5f0 <stext_lock+1824>
0xc024b5fb <stext_lock+1835>: jmp 0xc0119164 <schedule+296>

which includes the EIP that we were found looping at.

More than that, you can then look at the spinlock (this only works for static spinlocks, but 99% of all spinlocks are of that kind):

(gdb) x/x 0xc02c46c0
0xc02c46c0 <runqueue_lock>: 0x00000001

which shows us that the spinlock in question was the runqueue_lock in this made up example. So this told us that somebody got stuck in schedule() waiting for the runqueue lock, and we know which lock it is that has problems. We do NOT know how that lock came to be locked forever, but by this time we have much better information... It is often useful to look at where the other CPU seems to be spinning at this point, because that will often show what lock _that_ CPU is waiting for, and that in turn often gives the deadlock sequence at which point you go "DUH!" and fix it.

Now, this gets a bit more complex if you have semaphore trouble, because when a semaphore blocks forever you will just find the machine idle with processes blocked in "D" state, and it looks worse as a debugging issue because you have so little to go on. But semaphores can very easily be turned into "debugging semaphores" with this trivial change to __down() in arch/i386/kernel/semaphore.c:

- schedule();
+ if (!schedule_timeout(20*HZ)) BUG();

which is not actually 100% correct in the general case (having a semaphore that sleeps for more than 20 seconds is possible in theory, but in 99.9% of all cases it is indicative of a kernel bug and a deadlock on the semaphore).

Now you'll get a nice Oops when the lockup happens (or rather, 20 seconds after the lockup happened), with full stack-trace etc. Again, this way you can see exactly which semaphore and where it was that it blocked on.

(Btw - careful here. You want to make sure you only check the first oops. Quite often you can get secondary oopses due to killing a process in the middle of a critical region, so it's usually the first oops that tells you the most. But sometimes the secondary oopses can give you more deadlock information - like who was the other process involved in the deadlock if it wasn't simply a recursive one)..

Thus endeth this lesson on debugging deadlocks. I've done it often enough..

PS. If the deadlock occurs with interrupts disabled, you won't get the EIP with the "alt+scroll-lock" method, so they used to be quite horrible to debug. These days those are the trivial cases, because the automatic irq deadlock detector will kick in and give you a nice oops when they happen without you having to do anything extra.
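The "debugging semaphore" trick carries over to userspace code. The sketch below is an analogue under stated assumptions, not the kernel change itself: it uses POSIX `sem_timedwait()` in place of `schedule_timeout()`, `abort()` in place of `BUG()`, and the names `debug_down_timeout` and `debug_down` are made up for the example. As with Linus's version, a legitimate wait longer than the timeout would be reported as a deadlock.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Wait on sem; return 0 on success, -1 if timeout_sec elapses first. */
static int debug_down_timeout(sem_t *sem, time_t timeout_sec)
{
    struct timespec deadline;

    assert(sem != NULL);
    /* sem_timedwait() takes an absolute CLOCK_REALTIME deadline. */
    if (clock_gettime(CLOCK_REALTIME, &deadline) != 0)
        return -1;
    deadline.tv_sec += timeout_sec;

    while (sem_timedwait(sem, &deadline) != 0) {
        if (errno == EINTR)
            continue;      /* interrupted by a signal: retry */
        return -1;         /* ETIMEDOUT, or another error */
    }
    return 0;
}

/* Drop-in "down" that dies loudly instead of blocking forever. */
static void debug_down(sem_t *sem)
{
    if (debug_down_timeout(sem, 20) != 0) {
        fprintf(stderr, "semaphore %p: blocked for 20 seconds, "
                        "probable deadlock\n", (void *)sem);
        abort();           /* stand-in for BUG(): leaves a core dump */
    }
}
```

The core dump plays the role of the oops: a backtrace of the aborting thread shows which semaphore was involved and where the wait started. On older C libraries this may need to be linked with the threads library.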

4. 'modutils' Cleanup

23 Apr 2000 - 28 Apr 2000 (8 posts) Archive Link: "RE: Announce: modutils 2.3.11 is available - the debugger's helper"

Topics: Backward Compatibility

People: Keith Owens, Jeff Garzik, Jamie Lokier

Continuing the discussion from Issue #65, Section #15 (21 Apr 2000: Organization Of Kernel Modules), Jamie Lokier suggested naming the '/lib/modules/' directory tree after the kernel source's 'driver' and 'net' trees, for consistency. Keith Owens replied:

That would work. Define /lib/modules/.../kernel with the same internal structure as the kernel. modutils scans all directories under /lib/modules/..., kernel first, then any defined path entries then any directory not already scanned.

It preserves the existing scan order, kernel before third party modules. Putting everything under kernel lets "make modules_install" erase all old kernel modules before installing a new set of kernel modules, leaving third party modules alone. It keeps modules separate for those people who dislike a single flat directory. And most importantly it removes the version skew between the list of module directories in the kernel and modutils.

Unless anybody has a violent objection to this, I will add the .../kernel directory to modutils 2.3.12 then do a kernel patch to "make modules_install" under .../kernel and remove old modules. It will all be backwards compatible.

Jeff Garzik concluded, "Sounds ok; this sort of scanning definitely needs to occur, to prevent the current problem of having an unknown directory get skipped in the scan. If it exists under /lib/modules/..., modprobe should be able to find it :)"
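The scan order Keith proposes (kernel first, then configured path entries, then any directory not already scanned) can be sketched as a small ordering function. This is an illustration only: the function names are hypothetical, directories are represented as plain strings rather than read from disk, and skipping configured paths that do not exist is an assumption of the example, not something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return nonzero if name already appears in list[0..n). */
static int already_listed(const char *const *list, size_t n,
                          const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(list[i], name) == 0)
            return 1;
    return 0;
}

/*
 * Fill out[] with the order in which the subdirectories in present[]
 * would be scanned: "kernel" first, then configured path entries,
 * then everything else. out[] needs room for n_present entries.
 * Returns the number of entries written.
 */
static size_t build_scan_order(const char *const *present, size_t n_present,
                               const char *const *paths, size_t n_paths,
                               const char **out)
{
    size_t n = 0;

    assert(out != NULL);
    if (already_listed(present, n_present, "kernel"))
        out[n++] = "kernel";
    for (size_t i = 0; i < n_paths; i++)
        if (already_listed(present, n_present, paths[i]) &&
            !already_listed(out, n, paths[i]))
            out[n++] = paths[i];
    for (size_t i = 0; i < n_present; i++)
        if (!already_listed(out, n, present[i]))
            out[n++] = present[i];
    return n;
}
```

The last loop is what addresses Jeff's concern: a directory that no configuration mentions is still scanned, just after the known ones.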

5. To Do Before 2.4: Saga Continues

24 Apr 2000 - 30 Apr 2000 (68 posts) Archive Link: "Linux Jobs as of 2.3.99pre6-5"

Topics: Compression, Disk Arrays: RAID, Disks: IDE, Disks: SCSI, FS: Coda, FS: FAT, FS: NFS, FS: NTFS, FS: UMSDOS, I2O, Modems, Networking, PCI, Power Management: ACPI, SMP, Samba, Security, USB, Virtual Memory, VisWS

People: Alan Cox, Andrea Arcangeli, David Ford, Linus Torvalds, Randy Dunlap, Tim Waugh

Alan Cox's list of things to do before 2.4 could come out was first covered in Issue #52, Section #1 (4 Jan 2000: ToDo Before 2.4), then again in Issue #54, Section #1 (28 Jan 2000: ToDo Before 2.4: Saga Continues), Issue #56, Section #3 (10 Feb 2000: To Do For 2.4: Saga Continues), Issue #60, Section #3 (10 Mar 2000: Alan's Task List For 2.4: Saga Continues), Issue #62, Section #8 (28 Mar 2000: Things To Do Before 2.4: Saga Continues), Issue #63, Section #6 (3 Apr 2000: 2.4 Jobs List: Saga Continues), and most recently in Issue #64, Section #10 (16 Apr 2000: Things To Do Before 2.4: Saga Continues). This week Alan posted:

  1. Fixed
    1. Tulip hang on rmmod (fixed in .51 ?)
    2. Incredibly slow loopback tcp bug (believed fixed about 2.3.48)
    3. COMX series WAN now merged
    4. VM needs rebalancing or we have a bad leak
    5. SHM works chroot
    6. SHM back compatibility
    7. Intel i960 problems with I2O
    8. Symbol clashes and other mess from _three_ copies of zlib!
    9. PCI buffer overruns
    10. Shared memory changes change the API breaking applications (eg gimp)
    11. Finish softnet driver port over and cleanups
    12. via rhine oopses under load ?
    13. SCSI generic driver crashes controllers (need to pass PCI_DIR_UNKNOWN..)
    14. UMSDOS fixups resync (not quite done)
    15. Make NTFS sort of work
    16. Any user can crash FAT fs code with ftruncate
    17. AFFS fixups
    18. Directory race fix for UFS
    19. Security holes in execve()
    20. Lan Media WAN update for 2.3
    21. Get the Emu10K merged

  2. In Progress
    1. Merge the network fixes (DaveM)
    2. Merge 2.2.15 changes (Alan)
    3. Get RAID 0.90 in (Ingo)
    4. Finish I2O merge
    5. Still some SHM bug reports

  3. Fix Exists In -AC Tree
    1. Signals leak kernel memory (security)
    2. S/390 Merge
    3. 1.07 AMI MegaRAID
    4. Fix Space.c duplicate string/write to constants
    5. Merge the RIO driver (probably do post 2.4.0 as it is large)

  4. Fix Exists But Isnt Merged
    1. msync fails on NFS
    2. Semaphore races
    3. Semaphore memory leak
    4. Exploitable leak in file locking
    5. Mark SGI VisWS obsolete
    6. 64bit lockf support
    7. TTY and N_HDLC layer called poll_wait twice per fd and corrupt memory
    8. ATM layer calls poll_wait twice per fd and corrupts memory
    9. Random calls poll_wait twice per fd and corrupts memory
    10. PCI sound calls poll_wait twice per fd and corrupts memory
    11. sbus audio calls poll_wait twice per fd and corrupts memory
    12. Support MP table above 1Gig
    13. Finish sorting out VM balancing (Rik Van Riel)

  5. To Do
    1. Restore O_SYNC functionality
    2. Fix eth= command line
    3. Trace numerous random crashes in the inode cache
    4. VM kswapd has some serious problems
    5. vmalloc(GFP_DMA) is needed for DMA drivers
    6. put_user is broken for i386 machines (security)
    7. Fix module remove race bug (mostly done - Al Viro)
    8. Test other file systems on write
    9. access_process_mm oops/lockup if task->mm changes
    10. Audit all char and block drivers to ensure they are safe with the 2.3 locking - a lot of them are not especially on the open() path.
    11. Stick lock_kernel() calls around driver with issues too hard to fix nicely for 2.4 itself
    12. IDE fails on some VIA boards (eg the i-opener)
    13. PCMCIA/Cardbus hangs, IRQ problems, Keyboard/mouse problem (may be fixed ?)
    14. Use PCI DMA by default in IDE is unsafe (must not do so on via VPx x<3)
    15. Use PCI DMA 'lost interrupt' problem with some hw [which ?]
    16. Crashes on boot on some Compaqs ?
    17. pci_set_master forces a 64 latency on low latency setting devices. Some boards require all cards have latency <= 32
    18. usbfs hangs on mount sometimes
    19. Loopback fs hangs
    20. Problems with ip autoconfig according to Zaitcev
    21. SMP affinity code creates multiple dirs with the same name
    22. TLB flush should use highest priority
    23. Set SMP affinity mask to actual cpu online mask (needed for some boards)
    24. pci_socket crash on unload
    25. truncate_inode_pages does unsafe page cache operations
    26. heavy swapping corrupts ptes
    27. Linux sends a 1K buffer with SCSI inquiries. The ANSI-SCSI limit is 255.
    28. Linux uses TEST_UNIT_READY to check for device presence on a PUN/LUN. The INQUIRY is the only valid test allowed by the spec.

  6. To Do But Non Showstopper
    1. Make syncppp use new ppp code
    2. Finish 64bit vfs merges (lockf64 and friends missing)
    3. NCR5380 isnt smp safe
    4. DMFE is not SMP safe
    5. ACPI hangs on boot for some systems
    6. Go through as 2.4pre kicks in and figure what we should mark obsolete for the final 2.4
    7. Per Process rtsigio limit
    8. Fix SPX socket code
    9. Boot hangs on a range of Dell docking stations (Latitude)
    10. HFS is still broken
    11. iget abuse in knfsd
    12. Paride seems to need fixes for the block changes yet
    13. Some people report 2.3.x serial problems
    14. AIC7xxx doesnt work non PCI ?
    15. USB hangs on APM suspend on some machines
    16. PCMCIA crashes on unloading pci_socket
    17. DEFXX driver appears broken
    18. ISAPnP IRQ handling failing on SB1000 + resource handling bug
    19. TB Multisound driver hasnt been updated for new isa I/O totally.

  7. Compatibility Errors
  8. Probably Post 2.4
    1. per super block write_super needs an async flag
    2. address_space needs a VM pressure/flush callback
    3. per file_op rw_kiovec
    4. enhanced disk statistics

  9. Drivers In 2.2 not 2.4
  10. To Check
    1. Truncate races (Debian apt shows it nicely) [done ? - all but Coda]
    2. Elevator and block handling queue change errors are all sorted
    3. Check O_APPEND atomicity bug fixing is complete
    4. Make sure all drivers return 1 from their __setup functions (Done ?)
    5. Protection on isize (sct) [Al Viro mostly done]
    6. Mikulas claims we need to fix the getblk/mark_buffer_uptodate thing for 2.3.x as well
    7. Network block device seems broken by block device changes
    8. Fbcon races
    9. Fix all remaining PCI code to use new resources and enable_Device
    10. VFS?VM - mmap/write deadlock (demo code seems to show lock is there)
    11. rw semaphores on page faults (mmap_sem)
    12. kiobuf separate lock functions/bounce/page_address fixes
    13. Fix routing by fwmark
    14. Some FB drivers check the A000 area and find it busy then bomb out
    15. rw semaphores on inodes to fix read/truncate races ? [Probably fixed]
    16. Not all device drivers are safe now the write inode lock isnt taken on write
    17. File locking needs checking for races
    18. Multiwrite IDE breaks on a disk error [minor issue at best]
    19. ACPI/APM suspend issue
    20. NFS bugs are fixed
    21. BusLogic crashes when you cat /proc/scsi/BusLogic/0
    22. Floppy last block cache flush error
    23. NFS causes dup kmem_create on reload
    24. Chase reports of SMB not working

For item 5.18 (usbfs hangs on mount sometimes), Randy Dunlap asked for more details, and Alan replied that this had been reported for 2.3.51, so was probably out of date. He asked anyone still experiencing the problem to tell him, so he wouldn't delete it from the list.

For item 6.12 (Paride seems to need fixes for the block changes yet), Tim Waugh said he thought this was fixed, though he didn't have the hardware to verify it. Andrea Arcangeli replied, "I don't have hardware either but people with hardware tested the fix and I merged the fix with Linus a few weeks ago. There have been no further changes since that time so, yes, it should be fixed now."

David Ford commented on a number of items in the list. For item 1.1 (Tulip hang on rmmod (fixed in .51 ?)), which was supposed to be fixed, he reported that it was definitely not fixed, and "modprobe/rmmod w/ a sleep inbetween causes an oops on the second cycle." He also confirmed that item 5.19 (Loopback fs hangs) was still a problem. For item 6.16 (PCMCIA crashes on unloading pci_socket) he amended the description, saying, "crash is not just pcmcia system, it kills the whole machine with oopses until it's dead dead dead." Alan replied, "Yeah it seems to vary by box. Its a known bug and a known explanation it just lacks a known fix 8)"

Finally, he confirmed that item 5.13 (PCMCIA/Cardbus hangs, IRQ problems, Keyboard/mouse problem (may be fixed ?)) was still a problem. He reported, "sadly not fixed. hangs, oopses, irq, keyboard, mouse, all still faulty in pre6-3." Alan asked if it was okay in pre6-5, and David did some testing. After a few posts and replies to himself, David reported:

In summary: things are actually worse. Every pcmcia card I have causes something bad even a generic modem, a Hitachi cellular capable 28.8/14.4 xjack modem. (came as an unknown bonus in my ambicom box...gotta love Fry's, sometimes you get more than you paid for).

Linus Torvalds asked, "Can you send the bootup messages that are relevant to the PCMCIA stuff? In particular, the interrupt routing info is interesting, and you might want to turn on debugging in arch/i386/kernel/pci-i386.h to get more of those messages..." but if there was any reply, it was made in private.

6. Scheduler Problems And Patches

25 Apr 2000 - 29 Apr 2000 (11 posts) Archive Link: "SCHED_RR is broken + patch"

Topics: Real-Time, SMP

People: Christian Ehrhardt, Borislav Deianov, Dimitris Michailidis, Artur Skawina, George Anzinger

Christian Ehrhardt posted some test code, and reported, "I think I found a Bug in the linux scheduler: A running SCHED_RR process is not preempted when its time slice is exhausted but the process is still running. The following program demonstrates the problem: In theory the child should be preempted when the time slice is exhausted and the parent should be allowed to run. Unfortunately this doesn't work." Borislav Deianov added ominously:

Yes, this and other problems with the scheduler have been known for quite a while and are still present in 2.3.99:

  1. SCHED_RR threads don't work
  2. sched_yield doesn't work for RT threads
  3. sched_yield doesn't always yield for SCHED_OTHER threads
  4. wrong wake up order for RT threads
  5. sched_rr_get_interval returns constant bogus value
  6. counter for SCHED_FIFO threads is never reset
  7. several wrong and misleading comments in the code

Alan, _please_ add the first item (at least) to the 2.4 jobs list. It's a documented feature that just plain doesn't work and I think it's a bloody shame, especially since a fix exists (by Artur Skawina).

There were two brief subthreads, consisting of several threadlets, coming off of this post (for the purposes of KT, a subthread is a chain of individual posts, while a threadlet is a single topic discussed in a thread or subthread. So there may be many threadlets in a single subthread). Dimitris Michailidis felt that item 3 of Borislav's list was not actually a problem, because "Processes should yield only to higher or equal priority processes. A SCHED_OTHER thread executing sched_yield doesn't have to yield just because there may be other SCHED_OTHER processes available." Artur Skawina replied, "depends on the priority model chosen. (ie if all SCHED_OTHER threads run with equal static priority then having a sched_yield() that defers to any SCHED_OTHER thread eligible to run makes sense)" Dimitris agreed with this, but the threadlet dropped off there.

In the same subthread, Dimitris replied to item 6 of Borislav's list, "It should not expire in the first place. SCHED_FIFO processes do not have slices." Borislav agreed, but Artur pointed out, "it's not about expiring; it's about entering the scheduler on every clock tick, because of the way the counter is handled." Dimitris in turn explained, "And that, in turn, is because the counter for SCHED_FIFO processes is allowed to reach 0 and then it stays there causing a trip to the scheduler at every tick. Ideally, counter should be ignored for SCHED_FIFO, but that would require special case code in the timer interrupt. In my scheduler patch I set SCHED_FIFO counters to LONG_MAX, which effectively gives them an infinite slice." Artur had the last word of the threadlet, with:

last time i looked at the timer interrupt overhead it was so big that the extra check+branch wouldn't make any significant difference. the extra scheduler invocations did show up though.

In general, for real time the only thing that counts is worst case; losing a bit of performance isn't a problem, having some latency peaks might be. Esp. ones that occur very rarely; ie won't happen during testing, but show up several weeks later... No, it probably doesn't matter in this case, but still, i'm not sure postponing the bug by a few weeks is an improvement :)
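The counter behaviour Dimitris and Artur discuss can be reproduced with a toy model of the timer tick. The structure, the `toy_` names and the tick routine below are simplifications assumed for illustration, not the kernel's scheduler code. The model shows that a SCHED_FIFO counter stuck at 0 asks for the scheduler on every tick, while one set to LONG_MAX, as in Dimitris's patch, effectively never does, without any policy check in the tick path.

```c
#include <assert.h>
#include <limits.h>

enum toy_policy { TOY_SCHED_OTHER, TOY_SCHED_FIFO, TOY_SCHED_RR };

struct toy_task {
    enum toy_policy policy;
    long counter;        /* ticks left in the current time slice */
    int need_resched;    /* set when the scheduler should be entered */
};

/* Simplified timer tick: no special case for any scheduling policy. */
static void toy_timer_tick(struct toy_task *t)
{
    if (--t->counter <= 0) {
        t->counter = 0;
        t->need_resched = 1;
    }
}

/*
 * Run the given number of ticks and count how many of them requested
 * a trip to the scheduler. The flag is cleared after each trip, as if
 * the scheduler ran and picked the same task again.
 */
static long toy_count_resched(struct toy_task *t, long ticks)
{
    long trips = 0;

    assert(ticks >= 0);
    for (long i = 0; i < ticks; i++) {
        toy_timer_tick(t);
        if (t->need_resched) {
            trips++;
            t->need_resched = 0;
        }
    }
    return trips;
}
```

With a starting counter of 3, ten ticks produce eight scheduler trips in this model (tick 3 and every tick after it); with LONG_MAX they produce none. Artur's objection still applies: at LONG_MAX the counter does eventually reach 0, only very much later.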

In the other subthread, George Anzinger posted a patch and replied to Borislav's list:

Have a look at this patch for the 2.2.14 sched.c. It fixes the first 4 of the above items and also fixes the "misses 'need_resched' if set after selection of the new task but before the switch". For this latter I also test the flag just before exit and go round again if set. On my system this makes a real difference, probably due to missing the need_resched flag in idle or in the interrupt return path.

This patch reduces the loop for UP systems but adds a test for SMP systems. Also, the SCHED_YIELD survives a counter reset.

Note prev_goodness() is no longer used, but not patched out.

For the SCHED_FIFO counter not reset, I notice later versions of the scheduler are not counting down the counter if it is negative to avoid counting down the idle tasks. Given this, I suggest that sched_setscheduler just set the counter negative for SCHED_FIFO tasks.

Later, he reported, "My patch does have an interesting flaw, however. If the only non-idle task in the run list is the one yielding, the patch will go into a loop resetting the counters until some other task enters the list (which can happen as the interrupts are turned on for part of the loop). Not what one wants from yield. The fix, of course, is to detect the second counter reset and just making next=prev at that point." There was a bit more discussion, and a new patch from Christian Ehrhardt addressing other problems.
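The flaw George described, and the fix he proposed, can be sketched in a toy selection loop. This is not his patch: 'ToyTask', 'toy_pick_next()' and the simplified recharge rule (the counter is simply set back to the priority) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified run-list entry.
struct ToyTask {
    long counter;   // remaining time slice
    long priority;  // value the counter is recharged to (simplified rule)
    bool yielded;   // set by a hypothetical sched_yield()
};

// Pick the next task to run from run[0..n-1]; `prev` is the current task.
// A yielding `prev` is skipped, and its yield flag survives a counter reset.
// If a second counter reset would be needed, give up and keep `prev`,
// instead of looping until some other task enters the list.
inline std::size_t toy_pick_next(ToyTask *run, std::size_t n, std::size_t prev)
{
    assert(prev < n);
    int resets = 0;

    for (;;) {
        std::size_t next = prev;
        long best = 0;

        for (std::size_t i = 0; i < n; ++i) {
            if (i == prev && run[i].yielded)
                continue;               // the yielder defers to everyone else
            if (run[i].counter > best) {
                best = run[i].counter;
                next = i;
            }
        }

        if (best == 0) {
            if (++resets < 2) {
                // First time through: recharge all counters and try again.
                for (std::size_t i = 0; i < n; ++i)
                    run[i].counter = run[i].priority;
                continue;
            }
            next = prev;                // second reset detected: next = prev
        }

        run[prev].yielded = false;      // the yield request has been honored
        return next;
    }
}
```

With a run list holding only the yielding task, the loop recharges the counters once, finds nobody else to run, and returns the yielder itself rather than spinning. With a second task whose slice is exhausted, the single recharge is enough for that task to be selected.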

7. C++ And The Kernel

27 Apr 2000 - 1 May 2000 (39 posts) Archive Link: "[PATCH] C++ breaks on linux/ioport.h"

Topics: Assembly

People: Paul Barton-Davis, Alexander Viro, Jamie Lokier, Jes Sorensen, Manfred Spraul, Victor Khimenko, Alan Cox, Richard B. Johnson, Andries Brouwer, Bill Huey, Dominik Kubla

This debate was first covered in Issue #1, Section #4 (8 Jan 1999: C Or C++ For Kernel Code?); then again in Issue #6, Section #4 (13 Feb 1999: modutil Ownership Dispute; C Vs. C++); and most recently in Issue #37, Section #2 (17 Sep 1999: Including Kernel Headers In C++ Programs). This week, Stepan Kasal wanted to include kernel headers in his C++ program, but found that files like 'ioport.h' used the C++ reserved keyword 'new' as a variable name, which caused errors. He posted a patch to change the occurrences of that variable name, but Jes Sorensen replied that kernel headers simply should not be included in C++ programs (more specifically, he added later, one should not write kernel modules in C++). Over the course of discussion, Andries Brouwer added that really, no user-space program should directly include kernel headers, ever. Instead they should simply copy the code they need from the headers into their own source. That way user binaries would continue to run regardless of any future kernel changes. At some point, Paul Barton-Davis argued:

The issue is using the keyword "new" for a declaration's argument. It would be nice if this was avoided, just like "delete", "catch", "throw" and the rest of the C++ keyword superset.

If the relevant declaration is inside __KERNEL__ and so are some constants that user-space needs, the programmer is out of luck if these keywords are used in the declarations.

Alexander Viro replied:

everything under __KERNEL__ belongs to the kernel and is subject to change without notice. If you really need something from that area - you've either

  1. found a bug or
  2. missed a portable way to achieve your goal.

Userland should never, ever define __KERNEL__, unless it's consent with breakage upon the upgrade from 2.2.18-pre9 to 2.2.18-pre10-pre1.

Elsewhere, Dominik Kubla felt that including C headers in C++ programs should be feasible, with something like:

extern "C" {
#include <linux/header1.h>
#include <linux/header2.h>
}

Manfred Spraul pointed out that this wouldn't help if the headers contained C++ reserved words such as 'new' and 'virtual'; but Jamie Lokier amended Dominik's code to be:

extern "C" {
#define new c_new
#define virtual c_virtual
#include <linux/header1.h>
#include <linux/header2.h>
#undef new
#undef virtual
}

Manfred pointed out that this would only work if the user's code didn't dereference variables with those names, and Jamie said, "Just leave out the #undefs. You're not actually using new & virtual in your code, are you? :-)"
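Manfred's caveat is easy to see in a self-contained sketch. No kernel header is included here; 'toy_pair' stands in for a hypothetical C declaration that uses 'new' as a member name, and 'toy_get_new()' and 'toy_selfcheck()' are likewise invented for illustration. Note also that, as far as we know, the C++ standard does not sanction macros named after keywords, so this relies on compilers such as GCC tolerating it.

```cpp
#include <cassert>  // standard headers go first, before any keyword is renamed

extern "C" {
#define new c_new   /* rename the keyword while the C declarations are parsed */

/* Stand-in for what a C header might declare (hypothetical). */
struct toy_pair {
    int old;
    int new;        /* the compiler actually sees "int c_new;" */
};

#undef new          /* 'new' is an ordinary C++ keyword again from here on */
}

// After the #undef, C++ code has to use the renamed member, c_new.
// Writing p.new here would be a syntax error.
inline int toy_get_new(const toy_pair &p)
{
    return p.c_new;
}

inline void toy_selfcheck()
{
    toy_pair p = { 1, 2 };
    assert(toy_get_new(p) == 2);
}
```

The renaming is invisible to C code, which still sees a member called 'new', but any C++ code that touches the member has to know about 'c_new'. Leaving out the #undefs, as Jamie joked, would let C++ code keep writing 'new' for the member, at the cost of losing the real operator.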

Earlier, Manfred had argued that kernel code should not use C++ reserved words. Jes argued in reply, "Making it easy to write C++ driver modules means somebody will do it. Once somebody has done it, somebody else will try to use some of the brain dead C++ features such as exceptions - therefore totally preventing people from writing any bits of the kernel in C++ is a good thing." Manfred replied sardonically, "What about stopping to distribute the Linux source code? Somebody will write broken drivers, therefore totally preventing people from writing any bits of the kernel is a good thing." Jes replied that opening the kernel up for C++ would introduce terrible problems for maintainability and usability. Bill Huey asserted that these would not be more significant problems than the assembler linkage layer in 'egcs'; but Victor Khimenko disagreed, with "Hmm. This is not true. Lots of nice features of C++ need support from run-time systems and using lots of space on stack. NOT what you want in kernel where even gcc run-time was stripped out (kernel is NOT linked with libgcc and it's done deliberately), where you have only 4-5KiB on stack to play with and where heap is bonus (kmalloc, vmalloc and other things exist but should not be used if not REALLY necessary). I'm not sure if C++ is needed even in userspace but in kernel space it's not appropriate." Alan Cox added, "C++ is just a syntax wrapper over C with some bad exception handling included. It doesn't even do a good job on object orientation compared to stuff like smalltalk. People do object oriented programming in cobol. It's a programmer feature not a language feature." [...] "You also need to provide an interrupt safe, thread safe and smp safe lockless exception handling mechanism for the C++ exceptions. You can't really avoid that as you will want to do memory allocation and a memory allocation can and must fail gracefully." Finally Richard B. Johnson concluded the thread, with:

It's interesting to observe the advocates of a specific computer language. Most often a discussion about a particular compiler or similar tool results from the failure of persons to understand the basic nature of any computer language.

Computers don't communicate very well, even with other computers. When humans try to communicate with them, they have to use certain tools. These tools may range from devices such as keyboards to software tools such as compilers and editors.

Since computers don't understand the human languages very well, although there is continual work in this area, humans have to learn computers' languages. Since the computers' internal languages do not interface well with human languages, we have created tools to translate. These are called assemblers, compilers, and interpreters.

Every one of these tools, and probably those to be created in the future, pose major communications and control limitations. It becomes necessary, for programmers using these tools, to provide work-arounds for the limitations of these tools.

An expert in a particular computer language is really an expert in the work-arounds necessary to use this language to perform useful work. An ideal computer language would do exactly what it was told simply from reading a specification. In the absence of a specification, it would ask enough questions to produce such a specification, then it would generate the code necessary to perform the specified functions.

So a programmer becomes a captive of the tools used to communicate with the computer. With experience, the programmer starts to identify with its captors and starts to believe that the language mastered, is in fact, the only true language. Once captured, the programmer becomes an advocate. Psychology teaches the name of this effect as the "Stockholm Syndrome". It was first recognized during the detention of World-Games competitors in Stockholm, Sweden.

You see this problem mostly with programmers who have mastered only one language. If you have been around computers since 4-bit nibbles on paper-tape, you have long ago abandoned the notion that there is only one true language. But the first language you truly mastered still seems to have been the best. The Stockholm Syndrome affects us all to some extent.

I advise to not get trapped into the notion of the "correct" tool for a particular use. Just because you have become expert in C++, don't presume that it is the "correct" language for the kernel.

Even C has its shortcomings which have to be handled with assembly language extensions. A Master Carpenter has many tools and is expert with most of them. If you only know how to use a hammer, every problem begins to look like a nail. Stay away from that trap. It bytes (sic).

8. 2.0.x Development Continues

30 Apr 2000 (1 post) Archive Link: "[Announcement] pre-patch-2.0.39-4"

Topics: CREDITS File, MAINTAINERS File, Networking

People: David Weinehall, Matthew Grant, Jan Kara, Andrea Arcangeli, Michal Jaegermann, Jean Tourrilhes, Andries Brouwer, Andre Hedrick

David Weinehall announced:

Sorry to all of you for taking so long for getting this one out. This one addresses some more things in v2.0.38 that I considered relatively safe to change. There are several persons that have reported the ability to resource-starve or crash the kernel by various methods. However, most of these things would demand quite significant changes to the memory-management to be addressed, and I will not carry out any such. There is a reason that there are a v2.2 kernel series and soon also a v2.4 kernel series. I hope you all can understand this point of view.

Most of v2.0.39 seems to be ready for release. Now is a perfectly good time to yell "Snarfel!" if you consider this opinion of mine to be totally beyond the edges of sanity.

What I'm most curious about to know is whether the Large-disk fixes works properly for everyone (of course, any problems with the other fixes should be reported too...) The only thing I am considering to adding now is Andre Hedrick's ide-patch.





There was no reply.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.