Table Of Contents
|1.||24 Feb 1999 - 25 Feb 1999||(14 posts)||sysvinit Causing Spontaneous 2.2 Reboots|
|2.||24 Feb 1999 - 3 Mar 1999||(62 posts)||Real-Time Scheduling Flame War|
|3.||24 Feb 1999 - 2 Mar 1999||(17 posts)||Compensating For Bugs In Other OSes|
|4.||24 Feb 1999 - 27 Feb 1999||(27 posts)||Tripping Bugs In Userland Programs|
|5.||24 Feb 1999 - 26 Feb 1999||(4 posts)||Scheduler Resolution|
|6.||25 Feb 1999 - 27 Feb 1999||(12 posts)||Tulip Driver Troubles|
|7.||25 Feb 1999 - 28 Feb 1999||(11 posts)||BIOS Writers Make Linux-Oriented Fixes|
|8.||25 Feb 1999 - 2 Mar 1999||(8 posts)||Timestamp Tuning Code|
|9.||25 Feb 1999 - 26 Feb 1999||(5 posts)||Saving State Information|
|10.||25 Feb 1999 - 3 Mar 1999||(16 posts)||Beowulf Performance Optimization|
|11.||25 Feb 1999 - 27 Feb 1999||(8 posts)||Fixes To 2.0 But Not 2.2; Working Around Solaris Bugs|
|12.||26 Feb 1999 - 27 Feb 1999||(15 posts)||RealPlayer Problems Under Linux|
|13.||27 Feb 1999||(9 posts)||Possible GPL Violation By Mosix|
We have a number of new Kernel Traffic mailing lists (lists.html) now (thanks Mark!). We already had Announce, which is a moderated list just for announcements of new issues; Contrib is for folks who want to participate in writing Kernel Traffic: we don't know quite how that list will shape up, why not join and help shape it; and Forum is a place for KT readers to discuss the issues raised in current issues of KT. If the Forum list turns out to be popular, we may implement a Slashdot-esque discussion area for each article in each issue, or some such thing like that. Let us know (mailto:firstname.lastname@example.org) what you'd like to see in KT.
Mailing List Stats For This Week
We looked at 949 posts in 3549K.
There were 409 different contributors. 159 posted more than once. 148 posted last week too.
The top posters of the week were:
1. sysvinit Causing Spontaneous 2.2 Reboots
24 Feb 1999 - 25 Feb 1999 (14 posts) Archive Link: "Kernel 2.2.1 and sysvinit 2.76 possible bug"
People: Kevin L. Mitchell, Alan Cox, B. James Phillippe, Miquel van Smoorenburg, Oliver Xymoron, Richard B. Johnson, Craig Milo Rogers, H. Peter Anvin
This thread started off because Dan Srebnick had trouble with sysvinit spontaneously rebooting the machine periodically after being upgraded. Kevin L. Mitchell pointed out that "2.2 introduces a "feature" which causes the kernel to reboot if init dies for any reason," and that became the subject of the thread.
The init question is very interesting. Kev added that he felt the spontaneous reboot was a bug and should be fixed. Alan Cox disagreed, saying, "Its not a bug. If init dies the machine is in deep crap. Init should not die. If init dies something very bad has occured. Simply carrying on with no pid 1 to reap processes wont work very well." But a number of people felt there must be a better solution than just rebooting. As H. Peter Anvin said, a panic might be reasonable, and could be configured to mimic the reboot feature. Miquel van Smoorenburg (the author of init) agreed, and added that a spontaneous reboot makes it hard to diagnose what happened.
B. James Phillippe was also opposed to spontaneous reboots, and compared the situation to MS Windows (just in jest). He added, "Seriously, I agree that losing init is going to cause Big Grief, and is an unrecoverable situation. But perhaps the user wishes to preserve some flexibility as to how to proceed in this dire situation (ie. an opportunity to carry out some other task perhaps, before cycling)."
Alan and Oliver Xymoron seemed to agree that rebooting was really the only option. Alan said, "If there is no init your machine is dead. How can the kernel reparent a task to a task that doesnt exist for example." B. James replied that a panic would at least allow some chance of resyncing the filesystem.
Craig Milo Rogers had the wild suggestion of a 'standby init', or a queue of standby inits. When the main init died, another would take its place and either do debugging, print error messages, shut down the system, or just do whatever the sysadmin felt was justified.
Richard B. Johnson stated that inits that die are just broken and need to be fixed either at the source or at compile-time. Alan agreed and the thread was over.
2. Real-Time Scheduling Flame War
24 Feb 1999 - 3 Mar 1999 (62 posts) Archive Link: "2.2.2: 2 thumbs up from lm"
Topics: Real-Time, Scheduler, Virtual Memory
People: Rik van Riel, Richard Gooch, Larry McVoy, Ingo Molnar
Larry McVoy started off with some straightforward praise of Linux. He ran his machine under very heavy load, and still found it useful, though sluggish.
Rik van Riel said, "The good news is that we haven't even reached the limit yet with scheduler and VM optimizations. With a bit of luck, some newer version of the kernel will blow away 2.2.2 in the near future..." Richard Gooch agreed, and added that he was preparing some such changes himself. Ingo Molnar asked for some elaboration, and Richard said it came out of his Real Time queue work. Specifically, he said, "my basic idea is to pack all the per-task information necessary for the scheduler to walk the run queue (and no more) into a new structure. Call it struct task_sched. This will be the framework for the run queue. All the other cruft is left in struct task and a forwards pointer (and possibly a backwards pointer as well) takes you from the struct task_sched to the struct task."
At this point things started to get ugly. Larry had some harsh words to say, and from then on there were essentially three factions: the Larry faction, the Richard faction, and the faction that tried to bring the discussion back to the relevant technical issues. Basically, Larry felt Richard was deciding to gratuitously tinker with fundamental parts of the code (the scheduler) that would affect everyone, without much hope of reward (even Richard admitted he wasn't sure his ideas would lead to any gain). Meanwhile, Richard felt that it couldn't hurt to try, and might possibly improve things; and that in any case, he was willing to do it.
Apparently this is not the first time the issue has come up. There seem to be fully formed camps on all sides of the issue; and a number of people were trying to head off the expected flame war.
By KT press time, the contention seems to have subsided, and the rest of the discussion seems focused on technical issues.
3. Compensating For Bugs In Other OSes
24 Feb 1999 - 2 Mar 1999 (17 posts) Archive Link: "very poor TCP performance with 2.2.2"
Topics: FS, Microsoft, Networking
People: Mark G. Adams, David S. Miller, David C Niemi, Pete Wyckoff, Alan Cox
Matt Ranney noticed that his web servers were really slow under 2.2.2. He posted some tcpdumps for the 2.2 client (http://www.ranney.com/~mjr/pornucopia_tcpdump.txt) with 2.0.36 server (http://www.ranney.com/~mjr/lisa_tcpdump.txt), and 2.0.36 client (http://www.ranney.com/~mjr/lisa_tcpdump2.txt) with 2.2 server (http://www.ranney.com/~mjr/pornucopia_tcpdump2.txt). Mark G. Adams confirmed that he had noticed a similar problem, in 2.2.2ac3, where transfers of large files went 12 times slower than under 2.0.36. He posted tcpdumps for 2.2.2ac3 and 2.0.36 (ftp://www.livepage.com/pub/LinuxKernelBug/), and described his setup: "the files were being served up by a light-weight multi-threaded web server we wrote/use internally. The client was running NT4 and using a program that cycled repeatedly through requests for the particular pages. All pages were already loaded in the server's cache before running the benchmarks so that the filesystem wouldn't come into play."
Pete Wyckoff was most active in debugging the dumps. Most of the discussion took place in a single day, with Pete posting analysis and asking for particular experiments from Matt and Mark. At a certain point several other folks broke in with comments: David S. Miller analyzed the dumps and thought the problem was a bug in NT. An excellent tidbit came in here: David added, "A side note, all the bitching about the bad interactions we have with Solaris boxes wrt. FIN's, look what NT does here, the same like 2.2.x by attaching the FIN to the final data frame," to which David C Niemi replied, "The difference is that the Linux developers can be convinced to CARE whether their kernel interacts badly with Solaris. I'd expect Microsoft to be PLEASED they break Solaris and it's Solaris' fault ;^)" and added, "Now that I've reproduced the Solaris TCP FIN bug under HTTP, I'm convinced it explains the many HTTP sessions here at work that hang at 100% going to NT sites as well. A very compelling reason to apply the patch for bug #4083814 on Solaris firewalls and proxies. But reacting gracefully to bugs in other OSes is still a plus for Linux when it isn't too ugly to implement."
Meanwhile, Pete said, "speculation: 2.0 should not have sent out two outstanding segments at the beginning of slow start, but NT sees them and must ACK immediately," to which Alan Cox replied, "Correct. Although the latest bloodbath in the tcp working groups has been approving using two segments as the initial cwnd so thats one to change 8)"
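For readers wondering why the initial congestion window matters, here is a rough Python model of slow start. It is not code from the thread: it assumes the window simply doubles every round trip, with no loss and no delayed ACKs, and the function name `round_trips_to_send` is made up for illustration.

```python
def round_trips_to_send(total_segments, initial_cwnd):
    """Count round trips needed to send total_segments under idealized slow start.

    Assumption: the congestion window doubles once per round trip and
    nothing is ever lost or delayed.
    """
    cwnd = initial_cwnd
    sent = 0
    rounds = 0
    while sent < total_segments:
        sent += cwnd   # send a full window this round trip
        cwnd *= 2      # slow start: window doubles after the ACKs arrive
        rounds += 1
    return rounds


if __name__ == "__main__":
    for start in (1, 2):
        print(start, round_trips_to_send(30, start))
```

In this toy model a 30-segment transfer takes one round trip less when it starts with two segments instead of one. The model deliberately leaves out the delayed-ACK behavior Pete was speculating about.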
The discussion seems over.
4. Tripping Bugs In Userland Programs
24 Feb 1999 - 27 Feb 1999 (27 posts) Archive Link: "PROBLEM: Sending mail-attachment > 45k with Netscape via sendmail hangs"
People: Alan Cox, Pete Wyckoff, David S. Miller
Marc Jauvin posted this problem. Alan Cox said, "Netscape has a pile of known netscape problems that 2.2.2 trips a lot more ofen. strace the netscape process (strace -p [netscapepid]) if its sitting failing to write to something -ie you keep seeing -EAGAIN blah -EAGAIN blah, its a netscape problem."
There was also some debugging, virtually all of which took place in a single day. Marc disappeared, but Matthias Moeller took his place with confirmation of the problem, and was active throughout the thread. Alan also played a prominent role in the thread, having a 9-post, day long staircase with Matthias, in which Alan analyzed tcpdumps and strace output, asking for details and giving explanations. They didn't seem to reach any really solid solutions at the end of it though.
Pete Wyckoff joined in, trying to isolate the relevant code. Thinking 2.2.1 was unaffected, he went to the 2.2.2 patch and located what he felt were the relevant 20 or so lines. He offered this summary and conclusion:
Just to summarize the data:
|Proto||Rcv||Snd||Local Address||Foreign Address||State|
Feels like a race condition. Any ideas?
Matthias poured water on Pete's (and Alan Cox's and David S. Miller's subsequent) patch analysis, though. Apparently he got the same problem with 2.2.1, though 2.2.2 was worse.
The thread seems finished. It's a strange thing: often a thread will suddenly go belly up for no apparent reason. Maybe folks take it to private email, or maybe they start a new thread under a different Subject:. Or maybe the problem is solved and I just don't get it (Could happen).
5. Scheduler Resolution
24 Feb 1999 - 26 Feb 1999 (4 posts) Archive Link: "RFD: use of `xtime' in 2.2.1 and nanoseconds"
Topics: Assembly, Scheduler
People: Ulrich Windl, Colin Plumb, David S. Miller, H. Peter Anvin, Andrea Arcangeli
A 4-post, 2-day staircase between Ulrich Windl and Andrea Arcangeli. Ulrich posted a patch intended to help move process scheduling from 10 microsecond granularity to nanosecond granularity. His patch addressed places where the kernel directly accessed 'xtime'.
Andrea said that in his 2.2.2_arca2 patch (ftp://e-mind.com/pub/linux/arca-tree/2.2.2_arca2.gz), he used a get_xtime() function that avoided races, instead of directly accessing 'xtime'. His position was that any direct access of 'xtime' would cause problems.
Ulrich had a problem with Andrea's patch declaring a lock in the wrong file; but Andrea defended his choice.
Ulrich started a new thread, which lasted a single day and in which Andrea did not participate, with the Subject: RFD: nanoseconds, rdtsc and SMP (http://www.uwsg.indiana.edu/hypermail/linux/kernel/9902.3/0158.html). He said:
With microseconds resolution there were many CPU cycles per microsecond, but for nanoseconds there are multiple nanoseconds per CPU cycle (at least this year).
Therefore variations when reading the cycle counter result in time jitter at least, eventually maybe even in time running virtually backwards occasionally.
The question is: "what assumptions can be made?" On the i386 architecture, will all cycle counters start at the same moment, and will they be bound to the same oscillator? If not one has to calibrate each CPU, and remember the cycle counter of each CPU during timer interrupt. When getting the time one must find the cycle counter of the own CPU and subtract that counter at the last interrupt to get the difference. Other architectures maybe even worse.
Colin Plumb suggested to synchronize the cycle counters on i386 architecture, assuming they'll remain in sync. This would make the time code much easier, but break things terrible, if the counters drift apart.
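The per-CPU bookkeeping Ulrich describes can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not kernel code: the names `CpuClock`, `cycles_per_usec`, `tsc_at_last_tick` and `time_now_usec` are invented here, and the sketch assumes each CPU's counter rate has already been calibrated.

```python
from dataclasses import dataclass


@dataclass
class CpuClock:
    cycles_per_usec: float  # hypothetical per-CPU calibration result
    tsc_at_last_tick: int   # this CPU's cycle counter at the last timer interrupt


def time_now_usec(xtime_usec, cpu, tsc_now):
    """Time at the last tick plus the cycles this CPU has counted since then."""
    elapsed_cycles = tsc_now - cpu.tsc_at_last_tick
    return xtime_usec + elapsed_cycles / cpu.cycles_per_usec


if __name__ == "__main__":
    # Two 400 MHz CPUs whose counters started at different moments.
    cpu0 = CpuClock(cycles_per_usec=400.0, tsc_at_last_tick=1_000_000)
    cpu1 = CpuClock(cycles_per_usec=400.0, tsc_at_last_tick=5_000_000)
    # 2000 cycles after the tick, both CPUs should report the same time.
    print(time_now_usec(1_000_000, cpu0, 1_002_000))
    print(time_now_usec(1_000_000, cpu1, 5_002_000))
```

Because each CPU subtracts its own snapshot, the counters do not need to start at the same moment. They do still need a known rate, which is the calibration step Ulrich mentions.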
Colin Plumb was quick to point out that this was actually someone else's suggestion, but that in any case he agreed. He added:
Currently they are always synchronous on x86, but there is sometimes an offset. Since the counters can be written, that can be fixed at boot time and forgotten about. (Until someone starts doing evil things with APM on multiprocessors.)
The problem is that some architectures might, and the Alpha currently *does*, use asynchronous clocks. This makes things much hairier, because the phases drift as the system runs, and you need to know the processor you're running on to interpret the cycle counter.
On an x86, I've thought of just exporting the conversion factors (from cycle counter to real time) to userland through shared memory and letting the library implement gettimeofday() without a system call. (There would, of course, be an exported "valid" flag without which the library would do an old-fashioned system call.)
This doesn't work, however, unless you can determine the cycle counter and the processor number atomically with respect to process scheduling. (The Alpha only provides a 32-bit cycle counter; the upper 32 bits of the register are software controllable, which permits a fix for it, but other systems may have problems.)
My ambition is to phase-lock the counters almost exactly, to within 1 tick (3 ns or less). This requires symmetric round-trip message timing, like what NTP does over the net, but over the bus. That's the only way to correct for the ~50-clock one-way multiprocessor synchronization overhead.
This doesn't take a lot of wall-clock time to do once at boot time, but it requires a lot of blocking on spin locks, which is extremely painful on a running system. I'm not sure how to accurately measure the interprocessor phases on a running system.
(Measuring phases without bidirectional communication is *very* difficult. I'm working on algorithms for it, but they're a pain and I'm not sure about their stability.)
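The "symmetric round-trip" timing Colin compares to NTP boils down to a small calculation, sketched below in Python. The function name and the four timestamp names are illustrative assumptions, and the sketch assumes the message takes the same time in both directions.

```python
def estimate_offset_and_delay(t0, t1, t2, t3):
    """Estimate how far clock B is ahead of clock A from one request/reply pair.

    t0: A's clock when A sends the request
    t1: B's clock when B receives it
    t2: B's clock when B sends the reply
    t3: A's clock when A receives the reply
    Assumes the one-way latency is the same in both directions.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    round_trip = (t3 - t0) - (t2 - t1)
    return offset, round_trip


if __name__ == "__main__":
    # B runs 1000 ticks ahead of A; each one-way trip costs 50 ticks;
    # B takes 10 ticks to turn the message around.
    print(estimate_offset_and_delay(t0=0, t1=1050, t2=1060, t3=110))
```

If the two directions are not equally slow, the estimate is off by half the difference, which is one reason the one-way case Colin mentions is so much harder.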
He started a new thread in turn, with Subject: How to read xtime (http://www.uwsg.indiana.edu/hypermail/linux/kernel/9902.3/0343.html), which apparently stemmed directly from Ulrich's and Andrea's discussion. He said that locks were slow and to be avoided, and wrapping xtime seemed unnecessary to him. He posted some sample code to illustrate what he meant, and followed with this explanation:
In the first half of the second, the danger is that you'll get an old seconds field, and read 1.00 seconds when it's supposed to be 2.00 because the update from 1.99 hasn't finished. Thus, we use xtime_sec1, which is written before xtime_usec and read after it, so there is no chance of a problem.
In the second half of a second, the danger is that you'll get a new seconds field, and read 2.99 seconds when it's really 1.99 being updated to 2.0. In this case, we use xtime_sec2, which is written after xtime_usec and read before it, so again there is no chance of a problem.
As long as interrupt handling doesn't make get_xtime() take too long, there should be no problem. Will interrupt handlers ever take longer than that often enough that slightly wonky results will matter?
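Colin's sample code is not reproduced here, but the scheme he explains can be checked with a small Python model. The sketch below is an assumption-laden reading of his description: the writer stores xtime_sec1, then xtime_usec, then xtime_sec2; the reader loads them in the opposite order and picks a seconds copy based on which half of the second it saw. The helper names are invented for illustration.

```python
from itertools import combinations

HALF_SECOND = 500_000  # microseconds


def run_interleaving(writer_slots, old, new):
    """Run 3 writer steps and 3 reader steps in one fixed order.

    writer_slots says which of the 6 time slots belong to the writer.
    Returns the (seconds, microseconds) pair the reader ends up reporting.
    """
    old_sec, old_usec = old
    new_sec, new_usec = new
    shared = {"sec1": old_sec, "usec": old_usec, "sec2": old_sec}
    writer_steps = [("sec1", new_sec), ("usec", new_usec), ("sec2", new_sec)]
    reader_steps = ["sec2", "usec", "sec1"]  # reverse of the write order
    seen = {}
    w = r = 0
    for slot in range(6):
        if slot in writer_slots:
            field, value = writer_steps[w]
            shared[field] = value
            w += 1
        else:
            field = reader_steps[r]
            seen[field] = shared[field]
            r += 1
    # First half of the second: trust sec1. Second half: trust sec2.
    sec = seen["sec1"] if seen["usec"] < HALF_SECOND else seen["sec2"]
    return (sec, seen["usec"])


def all_results(old, new):
    """Collect the reader's answer over every possible interleaving."""
    return {run_interleaving(set(slots), old, new)
            for slots in combinations(range(6), 3)}


if __name__ == "__main__":
    # Update from 1.99 s to 2.00 s while a reader is in flight.
    print(all_results(old=(1, 990_000), new=(2, 0)))
```

Across all 20 interleavings the reader only ever reports 1.99 or 2.00, never 1.00 or 2.99. The model treats each load and store as one indivisible step and does not cover Colin's caveat about a reader being interrupted for a long time.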
David S. Miller replied:
Perhaps I am missing something, but I believe this is dealt with on Sparc already in a more clever and efficient way (the idea is actually by Van Jacobson to the best of my knowledge, his is the earliest implementation of this "trick" that I am aware of)). The algorithm, in pseudo assembly, is:
|retry:||ldd||[xtime], %reg1||/* load all 8 bytes of xtime */|
|ld||[timer_reg], %reg2||/* snapshot timer counter */|
|ldd||[xtime], %reg3||/* load second copy of xtime */|
|cmp||%reg1, %reg3||/* Did it change in between? */|
|bne||retry||/* Yep, try once more */|
No locks, no interrupt disabling, etc. I suppose using this technique would be extremely troublesome if the architecture in question has no method to load an aligned doublet of time_t's at once. But I believe ix86 actually can.
But he added, "Perhaps this solves a different problem than the one you are tackling..."
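The retry loop David describes can be modeled in Python to show what the comparison buys you. This is a toy model, not the Sparc code: the `Clock` class, the `one_shot_tick` hook and the 10 ms tick are assumptions, and the tuple assignment stands in for the single 8-byte load that makes the trick work.

```python
class Clock:
    """Toy clock: xtime is replaced as one unit, like an atomic 8-byte store."""

    def __init__(self):
        self.xtime = (100, 990_000)  # (seconds, microseconds)
        self.timer = 9_000           # hardware counter since the last tick

    def tick(self):
        sec, usec = self.xtime
        usec += 10_000               # assumed 100 Hz tick
        if usec >= 1_000_000:
            sec, usec = sec + 1, usec - 1_000_000
        self.xtime = (sec, usec)
        self.timer = 0


def read_time(clock, after_first_load=lambda: None):
    """Load xtime, snapshot the timer, load xtime again; retry if it changed."""
    retries = 0
    while True:
        first = clock.xtime
        after_first_load()           # stand-in for a timer interrupt landing here
        timer = clock.timer
        second = clock.xtime
        if first == second:
            return first, timer, retries
        retries += 1


def one_shot_tick(clock):
    """Return a hook that fires a single tick the first time it is called."""
    fired = []

    def hook():
        if not fired:
            fired.append(True)
            clock.tick()
    return hook


if __name__ == "__main__":
    clock = Clock()
    print(read_time(clock, one_shot_tick(clock)))
```

Without the second load, the reader in this run would have paired the old xtime with the freshly reset timer. With it, the mismatch is noticed and the loop goes around once more.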
H. Peter Anvin posted an improved algorithm:
if you have a reasonable limit on how long you may be interrupted (1/2 of the major increment) you can do this without loops:
|cmp %eax,500000||; 1/2 of the major increment|
|1:||; Result in %edx:%eax|
If the minor counter is less than halfway to rollover, it is safe to assume the value read immediately *after* is stable; if it's more than halfway, that the value read immediately *before* is stable. One advantage with not having a loop is that you don't bias the output.
Andrea replied, "Maybe we aren't talking about the same thing, but my point is that if you don't use any lock and you don't __cli() the other cpu, you could _always_ have %ecx == %edx even if the other CPU has updated only usec (because it's been interrupted a bit before having the time to increase tv_sec)."
Andrea also replied to David's pseudocode, with:
I am not sure we are talking about the same thing, but time_t is a 32bit on i386. While xtime is a 64 bit (it has both time_t and the suseconds_t). If I remeber well only movq (MMX) can move 64 bit at once on i386. But I don't think the 64 bit thing is the issue.
I think the issue is timer_reg. What is timer_reg? To avoid a xtime_lock we must make sure that the other CPU is not in the middle of the xtime updating (so that only half of the xtime struct is been updated).
Van Jacobson is not enough to assure this (I am assuming that timer_reg is something like jiffies on Linux).
6. Tulip Driver Troubles
25 Feb 1999 - 27 Feb 1999 (12 posts) Archive Link: "2.2.1 and Tulip weirdness"
People: David Ford
There were a few related threads about tulip troubles. David Ford's advice was to upgrade to the latest tulip driver because the stock kernel's tulip.c was broken. He also suggested upgrading net tools.
7. BIOS Writers Make Linux-Oriented Fixes
25 Feb 1999 - 28 Feb 1999 (11 posts) Archive Link: "[2.2.2] APM poweroff problem"
People: Nils Philippsen, Stephen Rothwell, Peter Hofmann
This thread was a sign of the times.
Nils Philippsen had a problem where poweroff with 'halt -p' would give "myriads of hex numbers in square brackets after the "System halted."/"Power down" message." There was some debugging discussion, trying to isolate the problem, until Stephen Rothwell said, "This is a known bug with some BIOS implementations. They assume that we will switch back to real mode before we ask them to power off. Under Linux we do not do this." He suggested complaining to Gigabyte (the manufacturer of Nils' GA6BX-E mainboard and BIOS).
Peter Hofmann stole the show, though, with "I had the same problem with my EPOX EP-51-MVP3E-M board. This was fixed by the latest BIOS upgrade that I downloaded from the EPOX web site. The README to the BIOS upgrade explicitly stated that it was supposed to fix the "Linux power-off problem"."
Stephen was overjoyed, and Nils followed up with, "After I contacted Gigabyte I immediately got a reply which contained a new BIOS (v2.6) for my board (GA6BX/E). With this the power down works perfectly. It's nice when hardware vendors wake up and are swift keeping us happy :-) Nice work, Gigabyte."
8. Timestamp Tuning Code
25 Feb 1999 - 2 Mar 1999 (8 posts) Archive Link: "Kernel 2.2.1 hangs on boot"
People: Alan Cox
Knut Neumann had this problem. Alan Cox posted a short patch. LJP reported success, and Alan replied, "Excellent that proves the hypothesis - the timestamp tuning code in 2.2.* is indeed broken."
9. Saving State Information
25 Feb 1999 - 26 Feb 1999 (5 posts) Archive Link: "When to save/restore_flags() vs cli/sti()"
People: Alan Cox, B. James Phillippe, Jes Sorensen, Doug Ledford
B. James Phillippe asked this question. The ubiquitous Alan Cox replied that cli/sti() should only be used "when you know interrupts were previously enabled and also know nobody will ever call the function now or in the future with interrupts disabled. In general I think "don't do it" is the answer." He added, "The net code used to use both according to need but even by 1.2 it was mostly using save/restore - it was just causing too many bugs being clever"
B. James was relieved that he had not been too anal by using only save_flags()/restore_flags(). For the benefit of the reading public, Doug Ledford added that save_flags()/restore_flags() is only useful in the following context:
|save_flags(flags);||/* save the current cpu flags */|
|cli();||/* actually turn interrupts off */|
|... do stuff|
|restore_flags(flags);||/* restore the old cpu state */|
The point being that "since you save the state and then turn interrupts off, you code is safe, and whether it was called with interrupts already off or still on doesn't matter because the state will get set back to exactly what it was before you executed your protected code."
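The difference between the two styles is easy to see in a toy Python model of a CPU's interrupt flag. This is not kernel code: the `Cpu` class just mimics the names save_flags(), cli(), sti() and restore_flags() from the discussion, and the two `critical_*` functions are invented for illustration.

```python
class Cpu:
    """Toy CPU with a single interrupt-enable flag."""

    def __init__(self, interrupts_enabled=True):
        self.interrupts_enabled = interrupts_enabled

    def save_flags(self):
        return self.interrupts_enabled

    def restore_flags(self, flags):
        self.interrupts_enabled = flags

    def cli(self):
        self.interrupts_enabled = False

    def sti(self):
        self.interrupts_enabled = True


def critical_with_sti(cpu):
    cpu.cli()
    # ... touch shared data ...
    cpu.sti()  # always turns interrupts back on, whatever the caller wanted


def critical_with_restore(cpu):
    flags = cpu.save_flags()
    cpu.cli()
    # ... touch shared data ...
    cpu.restore_flags(flags)  # puts back exactly what the caller had


if __name__ == "__main__":
    cpu = Cpu(interrupts_enabled=False)  # caller already disabled interrupts
    critical_with_restore(cpu)
    print("after restore_flags:", cpu.interrupts_enabled)
    critical_with_sti(cpu)
    print("after sti:", cpu.interrupts_enabled)
```

When the caller arrives with interrupts already off, the save/restore version leaves them off, while the cli/sti version silently turns them back on. That is the class of bug Alan was warning about.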
Jes Sorensen said:
Ok lets take it one step further then ;-)
On SMP cli(), save_flags() etc. are really expensive as they need to synchronize across all processors. Therefore using save_flag();cli() ... restore_flags(); in time critical code is a very bad idea.
In the past, before we had spin locks, we used to do it as a quick way to avoid anybody else fiddling with a data structure while we were messing with it, however this is now in a way coming back to bite us.
So for time critical code it is a very good idea to look at spin_lock_irqsave()/spin_unlock_irqrestore() - they will disable interrupts on the local processor take a spin lock thus not blocking the other processors until they try to get beyond the lock in question. For UP machines they will simply turn into the old save_flags();cli();/restore_flags(); set.
10. Beowulf Performance Optimization
25 Feb 1999 - 3 Mar 1999 (16 posts) Archive Link: "Linux 2.2.2 TCP delays every 41st small packet by 10-20 ms"
Topics: Clustering: Beowulf, Networking
People: Josip Loncaric, Donald Becker, Andrea Arcangeli
Josip Loncaric had a Beowulf cluster of Pentium II nodes. He wrote:
I tested Linux TCP streaming using a modified netpipe-2.3 code which collected timestamps from the Pentium II tick counter. This has the CPU clock frequency resolution (400 MHz in our case). Our systems use NetGear FA310TX cards (some with DEC chips, most with Lite-On chips) and the latest testing version 0.90Q of tulip.c driver.
The conclusion is this: in Linux TCP, every N-th small packet is delayed by 1-2 "jiffies" (defined by the 100 Hz system clock). For Linux kernel 2.0.36, N=35; for kernels 2.2.1 and 2.2.2, N=41. "Small" means smaller than K bytes, where K is about 509 in kernel 2.0.36 and about 93 and 125 in kernels 2.2.1 and 2.2.2, respectively.
What does this mean? Well, if MPI is streaming small messages (e.g. 16 bytes each) via TCP in the latest Linux kernel 2.2.2, the first 40 messages will be spaced about 17 microseconds apart. Every 41st packet will be delayed by 10,000 or 20,000 microseconds. For some of our MPI-based codes, these delayed packets are very, very bad news.
Interestingly enough, larger packets do not suffer from this problem. Also, spacing small packets by 100 microseconds at the sending side does not change the result. There is some reason to suspect that ACK logic is to blame, where the send/receive/ack process stalls until something times out 1-2 jiffies later. Whatever the cause, this problem is still present in the latest Linux kernel 2.2.2.
I suspect that this accounts for very uneven MPI performance in mpich-1.1.2. Some of our codes stalled completely. This can happen both with mpich-1.1.2 and with lam-6.1 using the -c2c flag. The code runs fine using lam-6.1 without the -c2c flag, but slower.
Finally, I found it rather curious that a stalled MPI code would sometimes resume running if we sent a single "ping" to all hosts.
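To put Josip's numbers in perspective, here is a back-of-the-envelope calculation in Python. It assumes, as a simplification, that in every run of 41 messages 40 gaps are 17 microseconds and one gap is the full stall; the function name `mean_gap_usec` is made up for this sketch.

```python
def mean_gap_usec(period, fast_gap_usec, stall_usec):
    """Average spacing when one gap in every `period` is a long stall."""
    return ((period - 1) * fast_gap_usec + stall_usec) / period


if __name__ == "__main__":
    for stall in (10_000, 20_000):  # 1 or 2 jiffies at 100 Hz
        mean = mean_gap_usec(41, 17, stall)
        print(f"stall {stall} us: mean gap {mean:.1f} us, "
              f"{mean / 17:.0f}x the undisturbed spacing")
```

Under that assumption the average spacing comes out at roughly 260 to 500 microseconds instead of 17, so a single stalled packet in 41 dominates the throughput of a stream of small messages.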
Andrea Arcangeli posted a patch, and Josip said, "Your patch improved things by a factor of 20," and included an image:
This plots intermessage time in clock ticks (400 MHz clock), sorted from largest to smallest. Red dots are the unmodified Linux kernel 2.2.2; green dots are after Andrea's patch. Horizontal scales (message number after sorting by delay) are not directly comparable because a lot more messages could be sent after Andrea's patch, so the green curve is shifted to the right in this log-log plot.
Atypically large delays now cluster around 1 millisecond instead of 20 milliseconds. In both cases, a vast majority of the messages are spaced by less than 100 microseconds. Even after the patch, five delays of 200 milliseconds out of the 420,602 measured were seen, but this might be due to some unavoidable system activity during the test.
Donald Becker requested, "Could you please try the following? Bump the RX_RING_SIZE and TX_RING_SIZE in the driver up to 128 or 256 entries each. With enough memory this should have trivial free memory impact and remove any queue-limit noise in the measurments," and added, "This should have no significant average performance impact. If it does, there is something unexpected going on."
Josip replied, "Increasing both ring sizes to 256 had no significant impact for most messages, but the number of messages spaced by 200 milliseconds shot up to 30. Prior to this test, the machines were rebooted and I believe that they were completely dedicated to this test, so I have no explanation for the 200 ms delays that were seen. I'll repeat the test using a crossover cable, but this made no difference in previous tests. With larger ring buffers, 70 messages were spaced by more than 1 ms, and Andrea's patch did not help much in those cases. BTW, this test measured 414,374 intermessage delays." Several days later, he added, "Andrea's patch speeds up acknowledgments and reduces the 10-20 ms delays to about 1 ms (i.e. 1000 microseconds), but the root cause of the delays is not yet clear. Under ideal circumstances, 1-byte messages can arrive about 3-4 microseconds apart, so a delay of 1000 microseconds is still a significant interruption in the flow."
The thread may be ongoing.
11. Fixes To 2.0 But Not 2.2; Working Around Solaris Bugs
25 Feb 1999 - 27 Feb 1999 (8 posts) Archive Link: "[patch] workaround for solaris 2.5.1 and 2.6 FIN bug (ID 4083814)"
Topics: FS: sysfs, Networking
People: Andrea Arcangeli, Philip Gladstone, David S. Miller
Andrea Arcangeli said:
Solaris has a stupid bug in its TCP stack that cause it to hang the connection between linux-2.2.x and Solaris (with linux as server) once the connection closes.
The reason is that linux-2.2.x is smart enough to save some good packet on the network by setting the FIN flag in the latest packet of data if there is pending data in write queue. But if such last packet gets reordered by the network it seems that Solaris ignores the FIN flag and as first it will think that the last octect is full of data while it's the FIN advance (if I am thinking right I think this can cause some troubles), and as second Solaris will not send back to Linux the FIN to allow Linux to go to FINWAIT-2 from FINWAIT-1. So the connection will hang and the end of the data has no way to be transferred correctly.
Sun just released a fix for Solaris but since I think there are still many buggy Solaris 2.6 and 2.5.1 TCP stacks on the Internet, I taken the time to implement a workaround for Linux.
The workaround is implemented as a sysctl. As default there is no workaround. But if you:
echo 1 >//proc/sys/net/ipv4/tcp_solaris_fin_bug_4083814_workaround
you'll enable the buggy-solaris-compatibilty-mode on the Linux TCP stack.
Note that this make a difference only when Linux is the server. So large FTP sites are likely to want to apply my patch and enable the workaround in the meantime the bogus Solaris tcp stack will be fixed properly everywhere.
This is probably useful also in Unversity like mine where there are many old slowww Sun machines that are not likely to be patched until it will be discovered the next root exploit ;).
Once the workaround is enabled you risk to generate 1 more packet on the network every time you close a TCP socket.
Andrea then had a staircase with Philip Gladstone, who uncorked this shocker: "This bug was fixed in the 2.0.3x series about a year ago. The approach there didn't have any adverse side effects -- since the bug only happens when the FIN/Data segment arrives out of order, and the data will get ack'ed by Solaris, the approach was (on the retransmit from the linux end) to *not* retransmit the ack'ed data. This leaves a packet with just a FIN in it. Arguably that was what should have been happening anyway. Once I understood what was happening, I seem to recall that the fix was pretty trivial," and added, "I wonder how many other bugs have been fixed in 2.0.36ff and not yet fixed in 2.2.x?"
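Philip's description of the 2.0.3x approach can be sketched as a small Python function. This is only a reading of his summary, not the actual 2.0 code: the `Segment` type and `trim_for_retransmit` are invented names, and sequence-number wraparound is ignored.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int     # sequence number of the first data byte
    data: bytes
    fin: bool


def trim_for_retransmit(seg, acked_up_to):
    """Drop bytes the peer has already ACKed, keeping the FIN flag.

    Assumption: sequence numbers do not wrap in this toy model.
    """
    already_acked = max(0, min(len(seg.data), acked_up_to - seg.seq))
    return Segment(seg.seq + already_acked, seg.data[already_acked:], seg.fin)


if __name__ == "__main__":
    last = Segment(seq=1000, data=b"x" * 100, fin=True)
    # The peer ACKed all 100 data bytes but has not acknowledged the FIN.
    print(trim_for_retransmit(last, acked_up_to=1100))
```

When all of the data has been acknowledged, the retransmitted segment carries no payload at all, only the FIN, which matches Philip's "a packet with just a FIN in it".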
David S. Miller said he would put some sort of fix in 2.2.3, but disgruntledly added, "What really ticks me off is that the response from people about Andrea's patch was "INSTALL THIS ON ALL YOUR FTP HTTP SERVERS" without one mention of "people with solaris systems should install the appropriate fixes", this makes me ill, and this leads us to have an internet with tcp implementations full of bandaids." He ended with, "I still think this is all a crock."
12. RealPlayer Problems Under Linux
26 Feb 1999 - 27 Feb 1999 (15 posts) Archive Link: "RealPlayer"
People: Mark Orr
Dan Egli asked when RealPlayer would be fixed for Linux (and pointed out that apparently the latest kernels fix a bug Real Player depended on). Hetz replied, saying he had spoken with one of their support people, who had told him there would soon be a new G2 Player port. Mark Orr posted the URL of a workaround (http://onramp.i2k.com/~jeffd/rpopen/) that would let Real Player 5 work under the new kernels.
13. Possible GPL Violation By Mosix
27 Feb 1999 (9 posts) Archive Link: "[OFFTOPIC] Potential GPL violation of Linux kernel by MOSIX?"
Topics: Clustering: Mosix, Licencing, Microsoft, Sound: OSS
People: Alan Cox, Tim Smith, Mike A. Harris, Richard M. Stallman
An unknown person posted about the recent Slashdot discussion of how MOSIX might be violating the GPL by distributing binary modules that require kernel modifications.
Alan Cox replied, "I consider it a violation of the GPL. Its not like OSS sound where the module interface is simply used (that is viewing the _existing_ exported symbol set as an API) they actually hack all the code up to call their modules in ways it was never intended to and then to cripple the resulting code so it only works across a group of 6 machines." He added, "Linus has sent them a polite explanation of his viewpoint. We shall now see what happens the friendly way. Hopefully there will be no "after that"."
Tim Smith objected, "I don't think your emphasis on _existing_ interfaces in the OSS case is legally relevant, because GPL allows kernel development to fork. If someone wants to fork off and go in a different direction, with new exported symbols and new interfaces for modules, that's their right, as long as they release their kernel sources and GPL their kernel changes. That the reason for the fork is to better support some proprietary kernel module doesn't seem to be legally relevant."
But Alan came back with:
Its not GPL thats quite the issue here, its as Mike says a bit more complicated. The GPL itself is quite clear and the answer is "no".
The two fun questions are
There's a fun question that comes before those:
0. Is any additional permission even needed for binary modules?
Considered from a copyright point of view, I don't see any difference between kernel modules and applications. From a copyright point of view, in both cases, you've got some blob of code that makes use of exported services of some other blob of code. Copyright law is not going to care that one runs in user mode and one runs in kernel mode, or that they run in separate address spaces. It will just care whether or not one blob contains copyrighted material from the other blob.
Just as I don't need Microsoft's permission to write Windows drivers and distribute them under any license I wish, I don't think I actually need anyone's permission to write Linux modules and distribute them under any terms I wish, assuming, in both the Microsoft and Linux cases, that I'm careful not to use any code from system header files that might lead to Microsoft or Linux code in my object files.
Things are going to get really interesting if/when component-based software becomes the norm. Over in the Windows world, MS seems to be turning everything into collections of OLE2 components. Perhaps the same will happen with CORBA and Linux. It is not clear that GPL works well in a componentized world--it seems to be somewhat of a relic from a time when applications were monolithic and statically linked, running under proprietary kernels.
At this point they both agreed that the discussion should be taken to a different list.
Mike A. Harris started another thread, with Subject: MOSIX and kernel mods. (http://www.uwsg.indiana.edu/hypermail/linux/kernel/9902.3/0588.html) He said:
My current understanding is that MOSIX is a NON-GPL binary only module, and that the modifications to the kernel that they make, and that are needed for proper MOSIX operation *ARE* GPL'd.
This opens up some "Gates" IMHO. So, if someone wants to hack a feature into the kernel, and it requires kernel source modification, they can just GPL their kernel source mods, and then put whatever they like into binary modules?
IMHO, this SUCKS. I say, *NO* thanks. Then Microsoft can put a big 10Mb GPL'd mod in the kernel that provides what is needed for their external embrace and extend addon binary modules, and we are FUCKED.
I'm getting sick of seeing stuff like this, and loopholes found in our licensing mechanisms. I love free software, but every day that passes by, I'm feeling more and more in agreement with Richard M. Stallman. Perhaps RMS is much more visionary than he gets credit for.
Richard M. Stallman wrote:
I'm not surprised that non-free kernel modules have resulted in a big loophole--because the idea is a loophole in the first place. They enable a kernel to support more hardware, at the cost of not being entirely free anymore.
The whole package becomes "more useful" by implicitly altering the goal (a free operating system). Now, if we want to achieve that goal, we have a big effort on our hands.
Linus, if he wants to, could begin closing the loophole, by attaching more restrictive statements about non-free modules to future kernel releases. There are many different ways this could be done, to close the loophole either more or less.
Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.