Kernel Traffic #62 For 10 Apr 2000

By Zack Brown

Introduction

Seth M LaForge felt that Issue #61, Section #11 (21 Mar 2000: Video CD Under Linux) was misleading. As he put it, "in the discussion on VCD you include a list of links from Quang Nguy. However, the links all point to DVD info - DVD is a very different beast from VCD. VCDs are regular CD-ROMs with MPEG-1 encoded movies on them. They're very popular in Asia (usually illegal copies of movies, available for about $2 each on the street), but never caught on (I'm sure pressure from the movie industry has a lot to do with this) in the West. You might want to either point out that DVD and VCD are different or remove the list of links, to avoid confusing readers." That's a good point, Seth. Thanks!

Many thanks go to Antony West, for catching an old bug in the XML of Issue #32 (kt19990830_32.html), which caused the mailing list stats to render incorrectly; and for reminding me about some runaway <pre> tags in various other issues. Thanks, Antony! Those back-issue bugs can languish for years.

Mailing List Stats For This Week

We looked at 1680 posts in 7272K.

There were 482 different contributors. 236 posted more than once. 184 posted last week too.

 

1. Driver Return Values
21 Mar 2000 - 27 Mar 2000 (4 posts) Archive Link: "__setup return value"
People: Tim Waugh, Alan Cox, Russell King

Tim Waugh asked, "When is a driver supposed to return 0 from a __setup function? When it can't parse the options? Or when there's a possibility that the option is intended for another driver?" When no one replied for almost a week, he asked, "Is it worth me making a patch to change the behaviour of those drivers that return 0 on error (rather than when the option could be used by another driver) to return 1 instead?" Alan Cox replied, "I've done some of them but not all. So yes." And seven hours later, Russell King said that Alan had sent him the ARM-specific parts of a patch Tim had presumably sent to Alan. Russell affirmed that it would be going into the Linus tree, and thanked Tim heartily.
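The convention being discussed can be modelled outside the kernel. The sketch below is a user-space Python model, not kernel code: `register_setup`, `dispatch`, and the `lp=` and `ether=` handlers are hypothetical names invented for illustration. It assumes the convention implied by Tim's question, that a handler returns 1 once it has claimed an option (even when the value is bad) and returns 0 only to let the option be passed along.

```python
# User-space model of __setup-style option dispatch (not kernel code).
handlers = []  # (prefix, function) pairs, in registration order


def register_setup(prefix, func):
    """Hypothetical stand-in for the kernel's __setup() registration."""
    handlers.append((prefix, func))


def lp_setup(value):
    # The option is clearly ours, so claim it even when the value is bad.
    if not value.isdigit():
        print("lp: ignoring bad value %r" % value)
    return 1


def ether_setup(value):
    # Return 0 only when the option may be meant for someone else.
    return 1 if value.startswith("eth") else 0


def dispatch(options):
    """Offer each option to matching handlers; return the unclaimed ones."""
    unclaimed = []
    for opt in options:
        claimed = False
        for prefix, func in handlers:
            if opt.startswith(prefix) and func(opt[len(prefix):]):
                claimed = True
                break
        if not claimed:
            unclaimed.append(opt)
    return unclaimed


register_setup("lp=", lp_setup)
register_setup("ether=", ether_setup)
leftover = dispatch(["lp=bogus", "ether=eth0", "ether=tr0", "root=/dev/hda1"])
print(leftover)
```

In this model a handler that returned 0 for a malformed value would leave its own option in the unclaimed list, which is the behaviour Tim's patch sets out to change.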

 

2. Real Data Corruption Under ext2 In The Stable And Unstable Kernels, And A Fix
21 Mar 2000 - 27 Mar 2000 (4 posts) Archive Link: "ext2fs bug : files are disapeared, unable to delete, two files' contents are switched etc."
Topics: FS: ext2
People: Theodore Y. Ts'o

Soohoon Lee reported serious data corruption with ext2, in cases of low memory and frequent file creation and deletion operations. He explained that under those conditions, files would spontaneously disappear or be undeletable, the contents of two files would spontaneously switch, and fsck would find inconsistencies in properly unmounted filesystems. He posted a one-line patch to fs/ext2/namei.c to fix it; and Theodore Y. Ts'o confirmed that this was indeed a problem, and not just some local glitch. In response to Soohoon's patch, Ted replied, "This apparently solves the problem for Linux 2.2, but I'd prefer a cleaner patch for Linux 2.3. Enclosed find the patch, which I will be sending on to Linus. I haven't had a chance to backport this patch to Linux 2.2 yet, but it should be relatively simple." He posted his patch, which was much longer than Soohoon's, and there was some talk about other changes to make to fs/ext2/namei.c, if they were going to be changing it at all.

 

3. GCC 2.95.2 Bug; Workaround; And Fix
24 Mar 2000 - 28 Mar 2000 (9 posts) Archive Link: "[2.3.99-pre3] via-rhine.o died again!"
Topics: Assembly
People: Urban Widmark, Jean-Luc Pedneault

After a bit of private discussion between Jean-Luc Pedneault, Justin Guyett, Urban Widmark, and other unknown assailants, Urban posted a patch for a problem Jean-Luc had been having with via-rhine under 2.3.99-pre3. But he acknowledged being mystified over "why a value that was just written is no longer there, except if you add an if or printk then the value written "stays" written." Jean-Luc replied, "Like you said, it seems that by doing the "if" instruction, the old value doesn't get wiped. The inside loop doesn't ever get run!! It's weird, because it may suggest that the system runs too fast or something.. like if the value wasn't written in memory." He added, "It could be a bug in GCC 2.95.2, a bug that does an optimization wrongly. I'm using this compiler. I haven't tested with egcs-1.1.2 yet (and I don't have GCC 2.7.2.3 compiled at all)."

A couple of hours later, he went on to say, "I'd like to point out that egcs-1.1.2 executed the code fine even with #if 0 (ie. without compiling the additional code). GCC 2.95.2's optimizations breaks the code, at least that's what I think." Urban agreed, but rather than attempt to understand the inscrutable assembly code generated, he posted another patch which tried to do the same thing in a different way. He also suggested generating a smaller, non-kernel test-case to share with the GCC developers, and offered to do it himself if no one else stepped forward. Stephane Casset replied with confirmation that Urban's patch seemed to work.

In the end, Urban concluded the thread by pointing out that the latest GCC snapshot appeared not to have the problem anymore, and he also gave a link to Code Sourcery (http://www.codesourcery.com/gcc-compile.html), a really cool page that would compile any code on the latest snapshot.

 

4. POSIX Threads; Philosophy Of Kernel Development
24 Mar 2000 - 30 Mar 2000 (97 posts) Archive Link: "Slow pthread_create() under high load"
Topics: POSIX, SMP
People: Ulrich Drepper, Alvin Starr, Richard Gooch, Linus Torvalds, Alan Cox

An interesting threadlet came out of this discussion of Linux's implementation of POSIX threads. Ulrich Drepper led into it by saying, "some additional functionality required to implement the correct POSIX threads behaviour is missing." Alvin Starr let himself in for it when he replied, "I am sure that if the next version of the thread library required a set of kernel patches to run effectively then those patches would end up in the kernel source tree within a version or so." Richard Gooch replied, "Mate, where have you been? The day Linus lets user-space dictate what goes into the kernel is the day hell freezes over. If you want a patch to go into the kernel, you need to convince him it's a good idea. Adding a dependency in user-space, expecting it to "force his hand", will not help. It will probably just piss him off. Or make him laugh."

A bit later, Linus Torvalds made a case about PID sharing and gave some example code. He went on to say, about threading:

Note that the reason the kernel is not POSIX-compliant is:

  • the POSIX standard is technically stupid. It's much better to use a cleaner fundamental threading model and build on top of that.
  • things like the above are just so much better and more easily done in user space anyway.

The reason LinuxThreads has a hard time becoming POSIX-compliant is that I refuse to apply stupid patches, and a lot of the patches sent to me have been frankly stupid. They've often implemented pthreads functionality without any actual thought of how it _could_ be done more cleanly with a user/kernel split.

A post or so down the line, Linus added:

Note that when I started doing clone(), I basically said: "this is how I think threads should be done". I added a few example flags to show the concept, without really having a firm plan on what the final situation would be. Some of those flags got expanded upon (CLONE_PARENT is only the latest addition), while some ended up not being very useful at all (CLONE_PID is basically useless - the only use for it is to start up the original idle threads under SMP, and that code is so specialized anyway that it could basically do the CLONE_PID logic by hand).

There are bound to be more issues. I've seen patches floating around that expand it, and especially in signal handling SOMETHING has to be done. I don't think the "share all signal queues" is the right answer: I suspect the right answer to the signal handling issue is to have a "private" queue (the regular one) along with a separate method of handling "shared" queues and a way to attach to a shared signal queue.

Shared signals are potentially useful outside pure threading models too, and I'm looking for something more generic. I suspect that what I'm looking for is more like a message list, along with some thin compatibility code to make it easy for pthreads emulation that looks like signals..

That's kind of my gripe in general - I think there is a bigger picture than just plain pthreads. Like clone(), let's do this right.

Some earlier discussion of clone() took place in Issue #1, Section #3 (8 Jan 1999: Porting The vfork() Syscall); a LinuxThreads announcement appeared in Issue #30, Section #10 (30 Jul 1999: CLONE_PPID Support In LinuxThreads); a little bit of clone() history appeared in Issue #32, Section #14 (17 Aug 1999: Threads In Linux); some general discussion of threading took place in Issue #34, Section #22 (1 Sep 1999: Some Explanation Of Threading); then in Issue #35, Section #13 (7 Sep 1999: CLONE_PID Problems), Alan Cox said CLONE_PID would be going away in 2.3.x, but according to Linus' statements above, this has not happened yet. A vfork() flamewar (with some good technical goodies) took place in Issue #45, Section #1 (1 Nov 1999: vfork() Discussion And Flame Fest).

Discussions of development philosophy occur throughout KT, but some specific articles are Issue #4, Section #1 (27 Jan 1999: Philosophy Of The Stable Series), Issue #5, Section #10 (3 Feb 1999: Philosophy Of Binary-Only Modules), Issue #9, Section #12 (4 Mar 1999: Philosophy Of Kernel Development), Issue #18, Section #5 (2 May 1999: Philosophy Of Open Source; Maintainer Conflict), and Issue #60, Section #8 (15 Mar 2000: Philosophy Of Having Debugging Code In The Kernel).

 

5. Keyboard Repeat Rate
24 Mar 2000 - 1 Apr 2000 (17 posts) Archive Link: "Keyboard rate question.."
People: Andrew Morton, Russell King, Mike A. Harris

Mike A. Harris had a problem with his keyboard repeat rate slowing down when switching a shared keyboard between two machines. In the course of discussion, it came out that if there was any loss of power to the keyboard, this would be the inevitable result. However, at some point Andrew Morton mentioned:

Russell King says he has a patch which does autorepeat in s/w. This is most definitely the best way. I did this in an OS many years ago - took one look at the XT keyboard specs and said "nope".

It's very easy to do. Just a little state machine which squirts out the most recent 'make' code and stops doing that when it sees a 'break'. It also gives you infinite control over the autorepeat speed, although topping out at HZ seems reasonable.

Russell replied, "I'll be sorting it out later today when I do my next set of patches for Linus et al."
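The "little state machine" Andrew describes can be sketched in a few lines. This is an illustrative Python model only, not Russell's patch: the class name `SoftAutorepeat`, the tick-driven interface, and the delay and period parameters are all assumptions. Because `tick()` emits at most one code per call, the repeat rate in this model tops out at one per timer tick, in the spirit of the "topping out at HZ" remark.

```python
class SoftAutorepeat:
    """Repeats the most recent 'make' code until its 'break' arrives."""

    def __init__(self, delay_ticks, period_ticks):
        self.delay = delay_ticks      # ticks before the first repeat
        self.period = period_ticks    # ticks between later repeats (>= 1)
        self.held = None              # most recent make code, or None
        self.countdown = 0

    def make(self, code):
        """Key pressed: remember it and report the initial press once."""
        self.held = code
        self.countdown = self.delay
        return code

    def release(self, code):
        """Key released ('break'): stop repeating if it was the held key."""
        if code == self.held:
            self.held = None

    def tick(self):
        """Call once per timer tick; returns a repeated code or None."""
        if self.held is None:
            return None
        self.countdown -= 1
        if self.countdown > 0:
            return None
        self.countdown = self.period
        return self.held


kbd = SoftAutorepeat(delay_ticks=3, period_ticks=2)
events = [kbd.make(30)]
events += [kbd.tick() for _ in range(7)]
kbd.release(30)
events += [kbd.tick() for _ in range(2)]
print(events)
```

Since the repeat timing lives entirely in software here, a keyboard that loses power and forgets its hardware repeat setting would not change the observed rate.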

 

6. kswapd Speedups
26 Mar 2000 - 27 Mar 2000 (11 posts) Archive Link: "[PATCH] Re: kswapd"
Topics: Virtual Memory
People: Rik van Riel, Kanoj Sarcar, Linus Torvalds, Christoph Rohland, Mark Hahn

Rik van Riel posted a patch against 2.3.99pre3, to take some code out from the interior of a while loop in kswapd; and added, "I wonder who sent the brown-paper-bag patch with the superfluous while loop to Linus ..." Kanoj Sarcar, caught red-handed, replied, "That would be me ..." He asked what Rik's patch actually fixed, aside from the while loop, which appeared cosmetic. Rik replied that ditching the loop was actually a serious fix by itself. Without his patch, he reported that kswapd used between 50% and 70% of the CPU in a particular workload. With his patch, it used between 3% and 5% of the CPU. Kanoj pointed out that the while loop had been in the kernel since way back in 2.3.43, and Linus Torvalds, replying elsewhere, said, "you're definitely right that this is not a new bug introduced by you, Kanoj - this seems to be just a thinko that has been there for a long long time. And I suspect I may have been the original perpetrator of the crime." Kanoj let out a facetious sigh of relief that he hadn't introduced the bug, and they discussed some of the ins and outs of Rik's patch. It turned out that, as Linus described the situation prior to Rik's patch:

What happens is that kswapd is only woken up when needed, so most of the time it is sleeping. It's only when it is woken up and when it has done its work when the loop turns into a CPU-burner, but it can easily mean that kswapd will just spend CPU time for no good reason until its time-slice is exhausted.

So think of the bug as "kswapd will waste the final part of its timeslice doing nothing useful".
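Linus' description can be turned into a toy calculation. The Python sketch below is only a model of the behaviour he describes, and does not reproduce the kswapd code or either patch; the function name, the "one page per tick" cost, and the numbers are all made up for illustration.

```python
def kswapd_wakeup(pages_to_free, timeslice, sleep_when_done):
    """Return (ticks of real work, ticks wasted) for one wakeup."""
    useful = wasted = 0
    for _ in range(timeslice):
        if pages_to_free > 0:
            pages_to_free -= 1   # one tick of real work
            useful += 1
        elif sleep_when_done:
            break                # nothing left to do: go back to sleep
        else:
            wasted += 1          # keep looping with nothing left to do
    return useful, wasted


print(kswapd_wakeup(5, 100, sleep_when_done=False))
print(kswapd_wakeup(5, 100, sleep_when_done=True))
```

With these made-up numbers the looping variant does the same five ticks of work but burns the remaining 95 ticks of its timeslice, which is the shape of the waste Linus describes.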

Elsewhere, under the Subject: [RFT] balancing patch (http://kernelnotes.org/lnxlists/linux-kernel/lk_0003_04/msg01025.html), Kanoj posted a separate patch to ease kswapd's CPU usage. Mark Hahn didn't notice any improvement, and Christoph Rohland noticed various Out Of Memory breakages when the system started to use swap. However, Rik reported:

I'm now testing Kanoj' balancing patch together with my kswapd infinite-loop-removal patch. The system seems to work quite well, I haven't seen any big strangeness in the VM load (the variance in the amount of free memory is a bit bigger, naturally, but that's to be expected) and interactive performance from the console seems unaffected.

It would be nice if a few more people tested the combination of 2.3.99-pre3 with Kanoj' balancing patch and my infinite-loop-removal patch ... (because YMMV)

 

7. Mount Code Cleanup
26 Mar 2000 - 29 Mar 2000 (4 posts) Archive Link: "[Announce][CFT] loopback mounts and stuff"
Topics: FS: ext2, FS: procfs
People: Alexander Viro

Alexander Viro called for testers, and announced:

Folks, there is a cleanup of mount-related stuff underway. Right now the patch seems to be usable for testing.

It doesn't include pieces in nfsd and autofs*, so these simply won't compile, but it should <knocking the wood> work with everything else.

It allows

  1. to mount the same filesystem several times. Yup, ext2 included. No cache coherency problems - all instances share the dentry tree.
  2. to work with shmfs without mounting it.
  3. to do explicit loopback mount. As in

    # mount -t bind /usr/X11R6 /mnt
    # ls /mnt
    bin doc include lib man share
    # cd /mnt
    # ls ..
    bin dev home lost+found mnt proc sbin usr
    boot etc lib misc opt root tmp var

    (IOW, unlike the usage of symlinks it gives correct behaviour on ..)

  4. to mark filesystem type as 'single'. One superblock will be created when you initialize the driver, all later mounts of that type will be aliases to that one. IOW, we can start storing procfs immediately in dcache - just a normal tree. Moreover, kernel can access that tree even if it's not mounted by user. That's how the shmfs stuff is done. Oh, and all instances will share the device number, so you can create a thousand chroot jails, mount devpts on each and spend 1 (one) anonymous device. Ditto for procfs, etc.

Patch (against 2.3.99-pre3) lives on ftp://ftp.math.psu.edu/pub/viro/mount-patch-4.

Folks, give it a try. There may be bugs. I think that it's cleaner than the old code, but don't let it play with critical data. Bug reports are more than welcome, indeed. Another thing that is _very_ welcome is a discussion of the export rules in situation when filesystems on server may be mounted several times (as well as the current implementation - I'ld really like to hear comments on it, in particular on the exp_parent(). It's either buggy and doesn't do what it is supposed to do or contains seriously superfluous code). General comments on the patch are also welcome, indeed.
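The remark in item 3 about symlinks and '..' is easy to check without root. The Python sketch below builds a throwaway tree (the directory names are made up), points a symlink called mnt at usr/X11R6, and lists "mnt/.." through it. It assumes a system with symlink support, and it shows only the symlink half of the comparison, since trying the bind mount itself needs root and Alexander's patch.

```python
import os
import tempfile


def parent_listing_through_symlink():
    """Return (what 'ls mnt/..' shows, what the parent of mnt contains)."""
    with tempfile.TemporaryDirectory() as root:
        root = os.path.realpath(root)
        os.makedirs(os.path.join(root, "usr", "X11R6", "bin"))
        os.makedirs(os.path.join(root, "usr", "lib"))
        os.mkdir(os.path.join(root, "etc"))
        # mnt is a symlink to usr/X11R6, not a mount point.
        os.symlink(os.path.join(root, "usr", "X11R6"),
                   os.path.join(root, "mnt"))
        # The kernel follows the symlink first, so '..' is usr, not root.
        through_link = sorted(os.listdir(os.path.join(root, "mnt", "..")))
        real_parent = sorted(os.listdir(root))
        return through_link, real_parent


through_link, real_parent = parent_listing_through_symlink()
print(through_link)
print(real_parent)
```

Through the symlink, '..' lands in the parent of the link's target rather than in the directory that holds mnt, which is the behaviour the transcript in item 3 shows the loopback mount avoiding.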

 

8. Things To Do Before 2.4: Saga Continues
28 Mar 2000 - 30 Mar 2000 (21 posts) Archive Link: "The 2.3.x Job List (Updated)"
Topics: Compression, Disk Arrays: RAID, Disks: IDE, Disks: SCSI, FS: Coda, FS: FAT, FS: NFS, FS: NTFS, FS: UMSDOS, I2O, Networking, Power Management: ACPI, SMP, Security, USB, Virtual Memory, VisWS
People: Alan Cox, Wakko Warner

Alan Cox posted his latest list of things to do before 2.4 could come out:

  1. Fixed
    1. Tulip hang on rmmod (fixed in .51 ?)
    2. Incredibly slow loopback tcp bug (believed fixed about 2.3.48)
    3. COMX series WAN now merged
    4. VM needs rebalancing or we have a bad leak
    5. SHM works chroot
    6. SHM back compatibility
    7. Intel i960 problems with I2O
  2. In Progress
    1. Merge the network fixes (DaveM)
    2. Merge 2.2.15 changes (Alan)
    3. Get RAID 0.90 in (Ingo)
  3. Fix Exists But Isnt Merged
    1. Signals leak kernel memory (security)
    2. msync fails on NFS
    3. Semaphore races
    4. Semaphore memory leak
    5. Exploitable leak in file locking
    6. Symbol clashes and other mess from _three_ copies of zlib!
    7. Shared memory changes change the API breaking applications (eg gimp)
    8. Merge the RIO driver (probably do post 2.4.0 as it is large)
    9. S/390 Merge (merged in AC tree)
    10. via rhine oopses under load ?
    11. 1.07 AMI MegaRAID
    12. PCI buffer overruns
    13. SCSI generic driver crashes controllers (need to pass PCI_DIR_UNKNOWN..)
    14. Finish softnet driver port over and cleanups
  4. To Do
    1. Restore O_SYNC functionality
    2. Fix eth= command line
    3. Trace numerous random crashes in the inode cache
    4. Fix Space.c duplicate string/write to constants
    5. VM kswapd has some serious problems
    6. vmalloc(GFP_DMA) is needed for DMA drivers
    7. put_user appears to be broken for i386 machines
    8. Fix module remove race bug (mostly done - Al Viro)
    9. Test other file systems on write
    10. Directory race fix for UFS
    11. Audit all char and block drivers to ensure they are safe with the 2.3 locking - a lot of them are not especially on the open() path.
    12. Stick lock_kernel() calls around driver with issues too hard to fix nicely for 2.4 itself
    13. PCMCIA/Cardbus hangs, IRQ problems, Keyboard/mouse problem (may be fixed ?)
    14. Use PCI DMA by default in IDE is unsafe (must not do so on via VPx x<3)
    15. Use PCI DMA 'lost interrupt' problem with some hw [which ?]
    16. Crashes on boot on some Compaqs ?
    17. pci_set_master forces a 64 latency on low latency setting devices. Some boards require all cards have latency <= 32
    18. usbfs hangs on mount sometimes
    19. Loopback fs hangs
    20. Problems with ip autoconfig according to Zaitcev
    21. Still some SHM bug reports
    22. Any user can crash FAT fs code with ftruncate
  5. To Do But Non Showstopper
    1. Make syncppp use new ppp code
    2. Finish 64bit vfs merges (lockf64 and friends missing)
    3. NCR5380 isnt smp safe
    4. DMFE is not SMP safe
    5. ACPI hangs on boot for some systems
    6. Get the Emu10K merged
    7. Finish I2O merge
    8. Go through as 2.4pre kicks in and figure what we should mark obsolete for the final 2.4
    9. Per Process rtsigio limit
    10. Fix SPX socket code
    11. Boot hangs on a range of Dell docking stations (Latitude)
    12. Port SGI VisWS to 2.3.x or mark obsolete
    13. HFS is still broken
    14. iget abuse in knfsd
    15. Mark NTFS as obsolete
    16. Paride seems to need fixes for the block changes yet
    17. PIII FXSAVE/FXRESTORE support
    18. Some people report 2.3.x serial problems
    19. AIC7xxx doesnt work non PCI ?
    20. USB hangs on APM suspend on some machines
    21. PCMCIA crashes on unloading pci_socket
    22. DEFXX driver appears broken
    23. ISAPnP IRQ handling failing on SB1000 + resource handling bug
  6. Compatibility Errors
  7. Probably Post 2.4
    1. per super block write_super needs an async flag
    2. address_space needs a VM pressure/flush callback
    3. per file_op rw_kiovec
    4. enhanced disk statistics
    5. AFFS fixups
    6. UMSDOS fixups resync
  8. Drivers In 2.2 not 2.4
    1. Lan Media WAN
  9. To Check
    1. Truncate races (Debian apt shows it nicely) [done ? - all but Coda]
    2. Elevator and block handling queue change errors are all sorted
    3. Check O_APPEND atomicity bug fixing is complete
    4. Make sure all drivers return 1 from their __setup functions
    5. Protection on isize (sct) [Al Viro mostly done]
    6. Mikulas claims we need to fix the getblk/mark_buffer_uptodate thing for 2.3.x as well
    7. Network block device seems broken by block device changes
    8. Fbcon races
    9. Fix all remaining PCI code to use new resources and enable_Device
    10. VFS?VM - mmap/write deadlock
    11. rw semaphores on page faults (mmap_sem)
    12. kiobuf separate lock functions/bounce/page_address fixes
    13. Fix routing by fwmark
    14. Some FB drivers check the A000 area and find it busy then bomb out
    15. rw semaphores on inodes to fix read/truncate races ? [Probably fixed]
    16. Not all device drivers are safe now the write inode lock isnt taken on write
    17. File locking needs checking for races
    18. Multiwrite IDE breaks on a disk error
    19. AFFS doesn't work on current page cache
    20. ACPI/APM suspend issue

Wakko Warner replied to item 4.13 (PCMCIA/Cardbus hangs, IRQ problems, Keyboard/mouse problem (may be fixed ?)), with, "Fixed for me. Since yenta doesn't probe irq12, it doesn't cause me any lockups."

There were some other scattered comments as well, but nothing conclusive.

 

9. Problems With kernel.org Mirrors
28 Mar 2000 - 29 Mar 2000 (6 posts) Archive Link: "Linux 2.3.99pre3-ac1"
Topics: Kernel Release Announcement
People: Alan Cox, H. Peter Anvin, James H. Cloos Jr., Arjan van de Ven

Alan Cox announced a patch against 2.3.99pre3, so people could keep up with him in their debugging expeditions, but Arjan van de Ven noticed that it wasn't on any of the kernel.org mirrors. Alan replied, "It does appear not to be mirroring right. I've put a copy on ftp.linux.org.uk:/pub/linux/alan/ (ftp://ftp.linux.org.uk/pub/linux/alan/)." H. Peter Anvin offered, "Please report broken kernel.org mirrors **including IP address** to ftpadmin@kernel.org as soon as you can tell, please." James H. Cloos Jr. also explained:

The notify from ftpadmin to lka-change didn't go out until Tue, 28 Mar 2000 15:43:24 -0800 or about 70 minutes after Alan's note (quoted above), or about six hours after Alan's initial announcement....

I'm sure most of us rsync only once or twice a day from cron(8), plus whenever lka-change mail arrives, hence the delay.

 

10. New Scheduler Code; Locking Issues
28 Mar 2000 - 29 Mar 2000 (5 posts) Archive Link: "locking problems"
Topics: SMP
People: Rik van Riel, Jun Sun, Andrew Morton

Rik van Riel posted a patch against 2.3.99, to implement a low overhead fair process scheduler. As he explained in the code comments, "It works by handing out CPU time like we do at the normal recalculation. The catch is that we move the list head (where the for_each_task() loop starts) to _after_ the first task where we ran out of quota. This means that if a user has too many runnable processes, his tasks will get extra CPU time here in turns." But he reported to the list:

Unfortunately it hangs on taking locks in the recalculation code :(

I'm somewhat amazed by why it hangs and interested in any explanations...

Jun Sun gave his tentative opinion, "Interrupt handlers sometimes call kernel functions that would require a lock on tasklist_lock. If that interrupt happens during the time you hold write lock on tasklist_lock, a deadlock would happen." He suggested using write_lock_irq() to fix it, and added, "BTW, I really think interrupt handlers acquiring the same locks which can be acquired by processes is a *BIG* problem in Linux." Andrew Morton asked why Jun thought this, and Jun replied, "Linus told me so. I believe him. :-)" On a more technical note, he added, "I did sniff around the source code and spotted a couple of places where locks COULD be acquired by ISRs, but I never did a RUN-TIME check to catch this situation," and went on to say:

I believe the problem here is that Linux does not have a CLEAR notion and separation of task-context code and interrupt-context code.

Imagine if a kernel function needs to read task list, then it must acquire a read lock on tasklist_lock. However, the function might be called from both process and ISR, then we will have the ISR acquiring lock problem.

I don't know if this has been a problem to Linux in the past. I am relatively new to Linux kernel.

There was no reply to this, but Rik, replying to Jun's suggested write_lock_irq(), posted a new patch, and said:

Indeed, that was the problem. I was lucky to get a few good traces by the NMI oopser that identified this problem. Now things are fixed.

The new patch is attached, for adventurous users. I'm testing it on my SMP system now.

That was it.
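For readers who want to see why disabling interrupts matters, here is a toy Python model of the deadlock Jun diagnosed. It is not kernel code: the `Cpu` class and its methods are hypothetical and merely borrow the names of the kernel primitives mentioned in the thread, and a real handler would spin forever where this model raises an exception.

```python
class Deadlock(Exception):
    pass


class Cpu:
    """One CPU with a single non-recursive lock and one interrupt line."""

    def __init__(self):
        self.lock_held = False
        self.irqs_enabled = True
        self.pending_irq = False
        self.log = []

    def raise_irq(self):
        if self.irqs_enabled:
            self.handler()
        else:
            self.pending_irq = True      # delivered once IRQs come back on

    def handler(self):
        # The handler needs the same lock as the code it interrupted.
        if self.lock_held:
            raise Deadlock("handler waits on a lock its own CPU holds")
        self.log.append("irq handled")

    def write_lock(self):
        self.lock_held = True

    def write_unlock(self):
        self.lock_held = False

    def write_lock_irq(self):
        self.irqs_enabled = False        # no handler can run on this CPU now
        self.lock_held = True

    def write_unlock_irq(self):
        self.lock_held = False
        self.irqs_enabled = True
        if self.pending_irq:
            self.pending_irq = False
            self.handler()


def recalculate(cpu, irq_safe):
    """Hold the lock while an interrupt arrives; report what happens."""
    if irq_safe:
        lock, unlock = cpu.write_lock_irq, cpu.write_unlock_irq
    else:
        lock, unlock = cpu.write_lock, cpu.write_unlock
    try:
        lock()
        cpu.raise_irq()                  # interrupt fires mid-critical-section
        unlock()
        return "ok"
    except Deadlock:
        return "deadlock"


print(recalculate(Cpu(), irq_safe=False))
print(recalculate(Cpu(), irq_safe=True))
```

In the interrupt-safe variant the handler still runs, but only after the lock has been dropped, so it never waits on a lock held by the code it interrupted.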

 

11. Network Load Balancing
28 Mar 2000 - 30 Mar 2000 (12 posts) Archive Link: "iproute and 2.3 question"
Topics: Networking
People: Guus Sliepen, Andi Kleen, Alexey Kuznetsov, George Bonser

George Bonser noticed in the ip-route cref document, a description of the 'equalize' modifier: "allow packet by packet randomization on multipath routes. Without this modifier route will be frozen to one selected nexthop, so that load splitting will occur only on per-flow base." The same document said that the kernel had to be patched to make use of this feature. George asked if this was still the case, or if 'equalize' had been integrated into the main kernel sources. Andi Kleen and Guus Sliepen (author of the patch) replied that it had not been integrated. Guus gave a pointer to the patch (ftp://sliepen.warande.net/pub/eql/), and explained:

It works indeed by throwing out cache entries every time they have been used. This works for up to at least 20 Mbit/s of traffic, but not for 400 Mbit/s (both cases have been tried, the former works fine, the latter does not). You can try it if you like.

Route based load balancing does have certain advantages over bonding devices. But it's really hard to implement in a clean way in current kernels. I'd rather see the complete networking code being a module, and those who want to use different routing/firewalling/scheduling schemes can load (or even create their own) different modules. The current code is (to my eye) just a messy bunch of hooks and checks.

He added that he'd propose this for 2.5 when the time came. There was no reply to this, but Andi's reply to the original post was that equalization on the routing cache layer was too slow or not fine-grained enough for inclusion in the main sources. Someone asked for more explanation, and Andi replied:

Linux has a routing cache that caches routing table lookups. This is called the destination cache. A destination cache entry is tied to a specific destination, which means only a single neighbour on a multipath route. To use multipath routing for load balancing requires dropping the destination entry after every use, so that another neighbour in the multipath could be looked up (the destination cache knows nothing about multipaths, that is all encapsulated in the FIB or routing table)

Dropping them all the time does not work well and is slow. It is also not fine-grained enough (because the decision occurs too early) to get an even load balancing.

Multipath routing is only useful for failover when a device is down in Linux.

For load balancing you can use the existing eql, teql and bonding devices, which work at a lower layer and avoid these problems.

Alexey Kuznetsov was critical of this explanation, and said that multipath routing worked "perfectly when you need to split load on servers talking to enough large number of clients. Any http server is good example." He added that Andi's suggestion of the existing eql, teql and bonding devices, would introduce "even worse problem of strong tcp reordering. Actually, experiments show that load balancing works only in the situations, when congestion window is bounded by 3 packets. If it is not made artificially, it occurs automatically on each connection after some amount of excessive retransmissions. Total single TCP connection throughput is never better in this case. Actually, it hints to the thought that "true load balancing" has to involve tracking connections and avoiding reordering TCP packets." There was no reply to this, but there was a bit of implementation discussion elsewhere, along the lines of Andi's explanations.
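The interaction Andi and Guus describe, between the destination cache and multipath routes, can be illustrated with a small Python model. It is not the kernel's routing code: the `Router` class and the gateway names are invented, and round-robin nexthop selection is substituted for the randomization mentioned in the ip-route document so that the output is deterministic.

```python
import itertools
from collections import Counter


class Router:
    def __init__(self, nexthops, equalize):
        self.pick = itertools.cycle(nexthops)  # stand-in for the FIB lookup
        self.equalize = equalize
        self.cache = {}                        # destination -> nexthop

    def route(self, dst):
        if dst not in self.cache:
            self.cache[dst] = next(self.pick)  # slow path: multipath lookup
        hop = self.cache[dst]
        if self.equalize:
            del self.cache[dst]                # drop the entry after each use
        return hop


def spread(equalize, packets=6):
    """Count which nexthop each packet to a single destination takes."""
    router = Router(["gw-a", "gw-b"], equalize)
    return Counter(router.route("10.0.0.1") for _ in range(packets))


print(spread(equalize=False))
print(spread(equalize=True))
```

Without equalize, every packet to the destination sticks to the first cached nexthop; with it, the load spreads, at the price of a slow-path lookup for every packet, which is the cost behind Guus' throughput figures and Andi's objection.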

 

12. AFFS Support And Discussion
29 Mar 2000 - 3 Apr 2000 (37 posts) Archive Link: "AFFS progress."
Topics: FS: FAT, FS: ext2
People: Dave Jones, Alexander Viro, Matthias Andree, Nicholai Benalal, Rask Ingemann Lambertsen

Dave Jones posted several patches to update AFFS, although he proclaimed loudly that this code was possibly dangerous and should not be used without extensive backups. At one point, he added, "I'm coding without tools to test this, so I need help from the people who are going to be using this.. I'd also appreciate feedback from other filesys/vfs guys about anything in this patch that just 'doesn't look right' I've gone from knowing nothing about fs/VFS to this diff in four days, and now my head hurts. I wouldn't be at all surprised if I've done _something_ wrong somewhere." To Dave's discussion of ongoing work and problems, Alexander Viro replied:

real problems with AFFS are different. I'll bring the pre-patch I've done back in September from backups tomorrow and then you'll get more detailed description, but right now I can recall the following:

  1. AFFS handles links horribly. It has pseudo-inodes for all links and they point to the real one. Unfortunately, that "real" inode _must_ belong to some directory. Which means that if you create a link to file and remove the original link you are in for pain. You can't just remove the original entry from its hash chain. So the bloody thing finds some other link, moves the name into original one, inserts the original into hash chain of the other and kills other. It means that unlink() in one directory may reshuffle another. And you've got _no_ protection by i_sem on another directory - any attempt to get it will lead to easy deadlocks. Consequence: _easy_ races.
  2. You have to account for situations when link() and unlink() race with each other. Again, not done in the current code.
  3. Links on directories easily kill VFS. Don't.
  4. Since some operations (e.g. rename) involve a _lot_ of hash chains walking and pointers switching - beware of the failure modes when you abort in the middle of modification. It may easily leave you with fucked up filesystem.

It's a lot of crap to fix and I gave up on that when I got more pressing things to do. I can pass you the patch along with notes. I can remove the swearwords - you will reinsert them as soon as you'll play with this beast yourself.

The bottom line: AFFS design is a festering pile of dung and attempts to make it look like UNIX filesystem only made it uglier. Judging by dejanews search, AmigaOS itself doesn't handle it well. Hell knows what had stopped them from replacing it with decent filesystem - with the thing outside of kernel it wasn't that hard to do... Damnit, FAT is not so braindead compared to that abortion.
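Alexander's first point, that unlink() in one directory may reshuffle another, can be sketched with a toy data structure. This Python model is an interpretation of his description and nothing more: plain dicts stand in for the on-disk hash chains, the `Entry` class and function names are invented, and no locking is modelled at all (which is exactly where he says the races come from).

```python
class Entry:
    def __init__(self, name, directory, target=None):
        self.name = name
        self.directory = directory   # the dict this entry is hashed into
        self.target = target         # None for the real entry, else the real one
        self.links = []              # on a real entry: its link pseudo-entries


def create(directory, name):
    entry = Entry(name, directory)
    directory[name] = entry
    return entry


def link(real, directory, name):
    alias = Entry(name, directory, target=real)
    directory[name] = alias
    real.links.append(alias)
    return alias


def unlink(directory, name):
    entry = directory.pop(name)
    if entry.target is not None:         # removing a link: the easy case
        entry.target.links.remove(entry)
    elif entry.links:                    # removing the original: the hard case
        other = entry.links.pop()        # find some other link ...
        del other.directory[other.name]  # ... kill it ...
        entry.name, entry.directory = other.name, other.directory
        entry.directory[entry.name] = entry  # ... and rehash the real entry there


dir_a, dir_b = {}, {}
real = create(dir_a, "file")
link(real, dir_b, "alias")
before = dir_b["alias"]
unlink(dir_a, "file")
print(sorted(dir_a), sorted(dir_b), dir_b["alias"] is real)
```

The unlink() call names only dir_a, yet it replaces the object stored in dir_b, so anything concurrently walking dir_b would need protection that, per Alexander, cannot be taken without risking deadlock.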

Matthias Andree added his sentiments, "These problems you mention have been persisting in AmigaOS for years ever since links had been introduced with AmigaOS 2, PLUS, AmigaOS shell commands have had bugs so that only the GNU ported fileutils along with the FileSystem-based (as opposed to dos.library based) ixemul.library could get you rid of directory symlinks; this is partly because links had originally not been present in AmigaOS 1.x, partly because they fucked things up really bad." He went on:

There are some commercial (AFS/PFS - some versions of these are also fucked beyond repair) and at least two freeware (SFS, Berkeley FFS Amiga port) filesystems, I never checked them for completeness or stability, though. Amiga users seem to be quite satisfied with SFS which is claimed to be journalling.

The Dircache (DCFS) option of OFS/FFS (DOS\4 and DOS\5) are also claimed to be fucked somewhat and slow things down for disks with low random access times such as hard disks.

If you consider seriously revamping the AFFS support, I strongly urge to get hold of Ralph Babel's "The Amiga Guru Book" which contains carefully collected information on DOS internals, while you will still need the Amiga Developer's CD for information on Dircache.

I've given up almost all my AmigaOS activities for the sake of Unix, AmigaOS has given me too much grief with all its troubles, it had just shoot one filesystem beyond recovery (into crash-on-access state) once again.

And since that stuff is so severely fucked, I suggest marking write support EXPERIMENTAL and add a kernel compile-time option for that.

Dave also replied to Alexander's harsh critique, adding, "Which is probably why someone came along and wrote PFS (Professional File System) for it, and made a whole load of money from it." He asked, "Anyone know if any tech-documentation on this exists? That may be an interesting filesystem to hack on when I'm done with AFFS.." Nicholai Benalal didn't know of anything for PFS, but he added, "Smart File System (SFS) on the other hand is a free (as in beer) filesystem with good technical documentation. It has gained ground in the amiga community lately." He gave a pointer to an SFS page (http://www.xs4all.nl/~hjohn/SFS).

But Nicholai Benalal came somewhat to AmigaOS' defense in response to Alexander's stark depiction, saying, "AFFS works allright under AmigaOS. It has limitations but generally it's ok. Still, a lot of people use other filesystems for AmigaOS but there are no Linux drivers for these. So the best way to transfer files between the Amiga side and Linux is still the buggy Linux affs driver :-)" A lot of people pointed out that the 'tar' command might be a decent alternative. Matthias added that the lack of Linux drivers for the other filesystems might have something to do with the lack of available sources for those primarily commercial filesystems. He also disagreed with Nicholai's assertion that AFFS worked passably under AmigaOS. He said, "It blows itself off its very feet when crashing at the wrong time and leaving a system behind that's corrupted beyond repair. This is happening all over the place every now and then."

Rask Ingemann Lambertsen replied, "AFFS is as good (or bad) as ext2 in this regard." But Matthias and Alexander both objected to this. Alexander said, "It is not. You need much more atomic operations to get from one valid state of filesystem to another. And I mean _much_ more - as minimum twice. Data structure is designed by complete loonie - just look at it and write down the worst-case set of modifications to be done upon rename(). Pay attention to the size of critical part - _some_ steps can be performed without corrupting the structure if you fail after them, but there is a nice lump that should be taken together or not at all. Now, do the same for FFS (_real_ one, designed by sane people). Or its descendants - ext2, ufs... Compare the results." And Matthias also said, "AFFS is much worse. It needs reordering hash chains, touching several and so on even on a single rename. If your machine crashes before all blocks have been written, you're in trouble, since tools that handle this do not come with AmigaOS. I've NEVER lost so much data with ext2 because of corruption as I recently lost with affs (on AmigaOS). AFFS is fine only as long as you don't touch anything."
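
The crash-safety argument above (a rename that takes several separate writes, some of which must happen together or not at all) can be sketched with a toy simulation. Everything here is an assumption made for illustration: the three-step breakdown, the `rename_steps`, `is_consistent` and `crash_after` names, and the dictionary layout are hypothetical, and the real AFFS rename involves more steps than this.

```python
# Toy simulation (NOT the real AFFS layout): a rename split into separate
# "block writes", with a crash injected after the first k of them.

def rename_steps(dirs, header, old_dir, new_dir, new_name):
    """Return one callable per simulated block write, in order."""
    def unlink_old():
        dirs[old_dir].remove(header["id"])  # switch pointers in the old chain
    def rewrite_header():
        header["name"] = new_name           # update the file header itself
        header["parent"] = new_dir
    def link_new():
        dirs[new_dir].append(header["id"])  # insert into the new chain
    return [unlink_old, rewrite_header, link_new]

def is_consistent(dirs, header):
    """Valid only if exactly the header's parent directory reaches the file."""
    holders = [name for name, chain in dirs.items() if header["id"] in chain]
    return holders == [header["parent"]]

def crash_after(k):
    """Run only the first k writes of a rename, then report consistency."""
    dirs = {"src": [1], "dst": []}
    header = {"id": 1, "name": "old", "parent": "src"}
    steps = rename_steps(dirs, header, "src", "dst", "new")
    for step in steps[:k]:
        step()
    return is_consistent(dirs, header)

results = [crash_after(k) for k in range(4)]
print(results)  # [True, False, False, True]
```

In this model only the untouched state and the fully completed state are valid, and a crash after one or two writes leaves the file reachable from no directory. The more writes sit between two valid states, the wider that window becomes, which is the comparison with FFS and ext2 that Alexander asks the reader to work through.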

 

13. Intel eepro100 Driver To Be GPL-Compatible?
30 Mar 2000 - 31 Mar 2000 (14 posts) Archive Link: "[PATCH] eepro100.c"
People: Alan Cox, Dragan Stancevic

Dragan Stancevic posted a patch to the non-Intel eepro100 driver, to provide more detailed information about installed devices, and in the course of discussion, Alan Cox mentioned, "I've had some discussion with intel about fixing the licensing for the eepro100 driver they released so that we can merge the two (they have support for more boards, ucode for interrupt mitigation, errata workarounds and portability and locking flaws." He added, "I have a positive answer I dont quite understand 8) from the Intel lawyers." Dragan replied, "I was not aware that intel is reconsidering their license... Did they give you any time frames to when the intel driver might be release under a more compatible license?" And Alan concluded, "Basically once I finish talking to the lawyer. I've just not had time and I'll be busy next week too."

 

14. This Year's April Fool's Joke
1 Apr 2000 - 2 Apr 2000 (21 posts) Archive Link: "Linux 2000(tm)(r)"
Topics: Microsoft
People: Michael Talbot-Wilson, Linus Torvalds

Someone purporting to be Linus Torvalds announced that he'd partnered with Microsoft and would be selling Linux from now on. Everyone rolled their eyes, but the interesting part is how quickly the identity of the poster was established. Within a day his name and various personal details were known and published. It looked as though a serious manhunt would soon be under way, until Michael Talbot-Wilson said, "Hey, guys. Enough, huh? He intentionally made it clear enough that it was a joke. Fun for a couple of minutes, right?"

 

15. New Networking HOWTOs And LVM HOWTO
2 Apr 2000 (1 post) Archive Link: "[DOCUMENTATION] 3 2.4 HOWTOs. Traffic Shaping, iproute2 and LVM"
Topics: Disk Arrays: LVM, Version Control
People: Bert Hubert

Continuing from last week in Issue #61, Section #15 (22 Mar 2000: iproute2 And netfilter HOWTO), Bert Hubert announced:

3 new HOWTOs:

  • Linux 2.4 Advanced Routing & Traffic Shaping HOWTO http://www.ds9a.nl/2.4Routing

    Do interesting stuff with netfilter, ip, tc and other tools. Already quite long and considered useful by a lot of people. Cooperative project with 4 authors working together via CVS

  • Linux 2.4 Networking http://www.ds9a.nl/2.4Networking

    iproute2 HOWTO which also tries to impart understanding of basic Linux networking. Still in its very early stages and desperately needs more authors.

  • Linux Logical Volume Manager HOWTO http://www.ds9a.nl/lvm-howto

    A very hands on HOWTO about LVM. In its early stages as well but already quite useful - progressing rapidly.

 

We Hope You Enjoy Kernel Traffic
 

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License, version 2.0.