Kernel Traffic #76 For 17 Jul 2000

By Zack Brown

Introduction

For almost the first time, folks emailed me to recommend covering particular threads, in this case the low-latency discussions. This is very helpful. If you follow linux-kernel yourself, and see a discussion you think should be covered in KT, try to tell me about it as soon as you see that it's valuable. I can't guarantee to cover it, but I can promise to give it a more thorough reading than I might have otherwise.

Mailing List Stats For This Week

We looked at 1202 posts in 4968K.

There were 393 different contributors. 170 posted more than once. 140 posted last week too.


1. Lucent Violates The GPL
21 Jun 2000 - 9 Jun 2000 (48 posts) Archive Link: "Symbols match 2.2 <-> 2.4"
Topics: Modems
People: Kurt Garloff, Reto Baettig, Theodore Y. Ts'o

Someone asked for some kernel information, in order to write a module that would keep a binary-only WinModem driver working across kernel versions. Kurt Garloff issued the standard disclaimer, saying, "Linux was never designed to provide any binary compatibility for modules. NEVER. Modules are likely to break if you upgrade from one stable kernel (2.2.15, e.g.) to the next one (2.2.16). Even between kernel compiles with different settings, some modules may break. In pratice you may be successful with one module for all 2.2.1x kernels by sheer luck." Reto Baettig replied, "please tell that the guys at lucent. They distributed an "inofficial" linux driver (binary only) for that f** lt winmodem." Kurt replied:

It can't be that hard to recompile the module for 2.4.0 even for L*cent engineers. There might be some slight changes necessary at the places where the module interfaces the kernel, but that should be a work of a couple of minutes. As they stole code from serial driver, probably only the changes there need to be redone ...

(And I wonder, whether they have a written statement by Ted that allows them to use the serial code with a license different from GPL?)

Theodore Y. Ts'o revealed:

No, they don't have a such a statement from me, and I've already let them know that they're in violation of the GPL. Lucent was *supposed* to have split the propietary code into its own .o file, and kept released the Linux glue code as a .c. This solves the GPL problem, since the user is doing the linking....

I think they should have released a version that you should be able to compile for 2.4.0; I'm surprised that hasn't happened yet.

 

2. Closing The File Descriptors Of Arbitrary Processes
23 Jun 2000 - 4 Jul 2000 (45 posts) Archive Link: "closefd: closes a file of any process"
Topics: FS: Coda, FS: NFS, POSIX
People: Tigran Aivazian, Werner Almesberger, Tim Waugh

Ulisses Alonso Camaró posted a URL to a FreshMeat announcement of closefd (http://freshmeat.net/appindex/2000/06/22/961682188.html), a module to close the file descriptor of any process. Tigran Aivazian had some serious objections to Ulisses' code:

closing a file descriptor of another process context is a serious task - one could probably write a 200-page book about it and still not know how to do it quite yet.

Anyway, of course, your simple module has serious problems:

  1. POSIX (and probably non-POSIX) locks subsystem needs rewrite before one can release locks from non-current context.
  2. you bravely access p->files without any locking - that is broken
  3. you need not emulate put_unused_fd() for non-current context - this is the *very* reason why I wrote __put_unused_fd() to allow passing p->files as a parameter (of course caller needs to lock files->file_lock) which you do not do, btw.

I spent about 3 weeks working on these (and related) issues and had to write a separate filesystem to overcome the various problems. It should be available late July.

You could write another module called munmap.o and another fchroot and another fchdir and another catch_rights_dgram_in_flight.o or if you see what I mean :)

Elsewhere, Tim Waugh said 'ptrace()' could also accomplish Ulisses' task, although Tigran replied that while this was literally true, it stuck too close to the letter of the task, closing only a file descriptor of a given process. The more interesting and global problem, he felt, was to forcibly 'umount' a given filesystem. For this broader task, he said, the 'ptrace()' method wouldn't work.

There followed a medium/long technical discussion, which Werner Almesberger eventually summarized (quoted in full):

I think the discussion circles around three problems, which are similar in appearance, but quite different in the possible solutions. Let me attempt a classification:

A) hide resource unavailability from a process

This is the "tell process to go elsewhere while ejecting the floppy" case. May be generalized (with some difficulty) to more general fd replacement. I think there are four possible cases:

1) process/fd is idle, can be removed/changed just by changing pointers

2) fd is busy (system call in progress) but we can just wait for the operation to terminate (maybe prod it a little)

3) fd is not busy, but process is doing something that may make it undesirable to mess with its fds (e.g. it's in the middle of exit(2) or close(2))

4) fd is in the middle of an operation that won't come back anytime soon

I'm afraid there aren't many cases of 2), and the prodding may get a little difficult for, say, NFS, so whenever the fd is busy, we probably must assume 4). 3) may simply mean that whenever the process is in the kernel, nothing should happen to it. 3) of course also covers 4), but one may special-case the latter. (After all, we may need this function just because we're in case 4). 1) is too good to be true and of little hack value.

For 4), I can imagine the following two approaches:

a) add a layer that allows a clean split between the process and the driver side of an fd, e.g. by handing all requests to some proxy process. May be slow.

b) hackish, but probably quite efficient: by splicing the thread of execution at the process/driver boundary: driver operation returns immediately, while the still-running driver operation gets a new return address that makes the result evaporate. I think it would be interesting to see if somebody could come up with a clean solution for such a beast ;-)

In both cases, it seems natural to hook the code that maintains the separation into VFS, as a new file system. This would limit things to regular files and directories, but that's probably okay at the beginning. Also, in either case, any access to "current" from the driver side would drastically complicate things.

B) de-access a resource (fd) to allow removal of device

There are again several flavours of this:

1) process is killable and nobody cares, so kill it

2) process should not be killed, so we have to try to close the fd (and hope it doesn't get angry), or replace it with something else

3) process stubbornly refuses to die (e.g. it's blocked in an access to the reource we want to get rid of)

4) like 3), but the block is because of something else (e.g. we want to umount the partition with the log file of the backup process that is currently hung in the tape driver)

For 2), any solution from A) should be suitable. 1) is again trivial. That leaves us with 3) and 4), which are essentially the same. The goal is simply to make that system call return.

Since we're likely in the middle of a driver and may have accumulated an unknown amount of state (e.g. locks, driver flags, etc.), this probably can't be solved from outside the driver. Since we should be in an uninterruptible sleep at this time (anything else would be a serious bug that needs to be fixed no matter what is done about the issues discussed here), the obvious solution seems to be to teach drivers to allow interruption of uninterruptible sleeps. Maybe with a special new signal and, sigh, a new task state.

C) counter denial of service attacks based on keeping devices busy

This is a combination of B with counter-measures to keep hostile code from thwarting the initial counter-measure. Some scheduler or kill tricks may be the most efficient way to deal with all such kinds of arms races. (E.g. write a kill(2) that accepts a list of "good" processes, and atomically delivers the signal to all others.)

Of course, A) and B) could be combined. Also, some instrumentation could be added to A) to allow the process doing the replacement to retrieve the exact state of an fd.

A) seems to be the hardest thing to do, particularly if performance is an issue. If performance is a non-issue, one could use some kind of demon that looks like NFS, Coda, or such, and spawns a thread for each IO operation. If we want to revoke an fd, the demon could simply open the substitute, continue working on that, and discard the result of any pending operation as soon as it comes back (if ever). 100% user space.

Solving B) would solve a large class of user-visible problems. Since there may not be all that many TASK_UNINTERRUPTIBLE sleeps that cause problems in real life, it should be feasible just to put the infrastructure in place, and to fix things when a problem is found. I.e. no major code review required.

I'll leave C) to another thread ;-)

 

3. Developers Argue Over Kernel Coding Standards
26 Jun 2000 - 5 Jul 2000 (33 posts) Archive Link: "Re: ANSI C clarifications, with citations (was Re: [patch-2.4"
Topics: BSD
People: Albert D. Cahalan, Tigran Aivazian, Simon Richter, Larry McVoy

In the course of discussion, Albert D. Cahalan said, "Linux will NEVER work with a strict ANSI C compiler." He went on:

I have the damn 1999 ISO C draft. I know full well that a "legal" compiler can put 42 chars of padding after everything and, just for fun, make every type be 101 bits wide.

Moderately portable code assumes a sane compiler and sane hardware. You may assume:

  1. The "char" type is 8 bits. It might be unsigned though. :-/
  2. sizeof(short) == 2
  3. sizeof(int) == 4 (for real Linux, not the 8088 hack)
  4. sizeof(long) == sizeof(void *)
  5. (sizeof(long) == 4) | (sizeof(long) == 8)
  6. sizeof(long long) == 8 (good for 10 years at least)
  7. You can freely cast between any two pointer types
  8. You can freely cast between long and any pointer type
  9. (long)(int*)(long)foo == (long)(char*)(int*)(long)foo
  10. signed integers are represented in two's complement form
  11. integers wrap around instead of causing faults
  12. assuming "good" struct layout, padding only occurs at the end
  13. ... that padding won't happen if you supply a multiple of 16 bytes

One can define "good" struct layout as being an order that puts items with the largest natural alignments first. For example, an array of 6 shorts has a natural alignment of 4 bytes. I suppose you could define natural alignment as gcd(16,sizeof(foo)).

Every day I work with a system that violates rule 9. From experience, I assure you that it is a mess to deal with. Casting gets really nasty. There is no hope of porting Linux to this system.
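Several of the assumptions in Albert's list can be checked mechanically. The following is a minimal sketch, not code from the thread: the function names check_assumptions() and natural_alignment() are hypothetical, the bit numbers simply mirror the numbering of the list above, and only the assumptions that are easy to test portably are covered. The alignment helper implements the gcd(16,sizeof(foo)) rule of thumb exactly as Albert states it.

```c
#include <limits.h>
#include <stddef.h>

/* Greatest common divisor (Euclid), used for the gcd(16, size) rule. */
static size_t gcd(size_t a, size_t b)
{
    while (b != 0) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* "Natural alignment" as loosely defined in the quoted message. */
size_t natural_alignment(size_t size)
{
    return gcd(16, size);
}

/*
 * Returns a bitmask of violated assumptions; bit N is set when
 * assumption N from the list does not hold on this platform.
 * Zero means every checked assumption holds.
 */
unsigned check_assumptions(void)
{
    unsigned failed = 0;

    if (CHAR_BIT != 8)                            failed |= 1u << 1;
    if (sizeof(short) != 2)                       failed |= 1u << 2;
    if (sizeof(int) != 4)                         failed |= 1u << 3;
    if (sizeof(long) != sizeof(void *))           failed |= 1u << 4;
    if (sizeof(long) != 4 && sizeof(long) != 8)   failed |= 1u << 5;
    if (sizeof(long long) != 8)                   failed |= 1u << 6;
    /* In two's complement, -1 has all bits set, so (-1 & 3) is 3. */
    if ((-1 & 3) != 3)                            failed |= 1u << 10;

    return failed;
}
```

For Albert's own example, an array of 6 shorts occupies 12 bytes under assumption 2, and natural_alignment(12) yields 4, matching the figure he gives. A platform where long and pointers differ in size would set bit 4 in the result of check_assumptions().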

Larry McVoy came down pretty hard on Albert, disagreeing with most of the assumptions, and including "The Ten Commandments For C Programmers" (http://allserv.rug.ac.be/~vfack/files/10ccprog.html) by Henry Spencer as an alternative. Albert replied, "Note that the kernel makes EVERY ONE of the above assumptions. The kernel even assumes a stricter version of #13, and it assumes that you have big-endian or little-endian hardware." To Henry Spencer's commandments, Albert replied, "That was before the 1989 ANSI C standard. That was before the world decided that all workstation and server CPUs would address 8-bit units of memory and have registers in nice power-of-two sizes. That was long long before the 1999 ISO C standard." Close by in the thread, Tigran Aivazian also remarked, "Albert's collection of valid assumptions in one message was actually very useful - I even printed them out." [...] "I think it is one of Linux's few remaining problems that assumptions are not readily spelled out - so one has to study the code in every detail to extract the assumptions."

Simon Richter was not so happy with Albert's list, replying to Tigran, "Yes, but these assumptions are one of Linux's remaining problems. Code which is based on them will not be portable to other arches. But since noone cares about them anyway, we might as well declare Linux i386 only, it would at least stop the BSD people laughing because our "stable" kernel doesn't compile on all "supported" archs." Elsewhere there was more discussion, with folks arguing back and forth about the relative correctness of one or the other of Albert's and Larry's assumptions and commandments.

 

4. Some Explanation Of Variable Initialization
28 Jun 2000 - 5 Jul 2000 (11 posts) Archive Link: "[PATCH] rivafb bugfixes"
Topics: Framebuffer
People: Bill Suetholz, Jeff Garzik

Ferenc Bakonyi posted a patch to fix various things with the framebuffer. Jeff Garzik liked the patch, but objected to Ferenc initializing certain variables to zero, since the kernel would initialize them to zero automatically itself. If it didn't, he said, that would be a bug that should be identified. This explanation of the kernel's behavior didn't make much sense to Bill Suetholz, who objected, "What do you mean don't initialize variables??? If you don't, then on some OS's you will get random garbage in the variable. Granted Linux currently seems to initialize your memory to zeroes, but do you really want to depend on that?" Jeff explained, "Yes. Otherwise you bloat the kernel image size with zeroed data."
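Jeff's point can be illustrated with a small sketch, assuming the variables in question have static storage duration (file-scope or "static" variables), which is the case the C language guarantees to be zeroed. The names below are hypothetical and do not come from the rivafb patch. The assumption about image size is that an object with no initializer is placed in the zero-filled .bss section, which takes no room in the image, while older GCC releases placed an explicit "= 0" in .data, where the zeros are stored byte for byte.

```c
#include <stddef.h>

static int implicit_zero;       /* no initializer: guaranteed zero by C */
static int explicit_zero = 0;   /* same value; may cost image space on
                                   toolchains that emit it into .data */
static char scratch[4096];      /* 4 KiB of zeros, still no initializer */

/* Returns 1 if every static object above reads back as zero. */
int statics_are_zero(void)
{
    size_t i;

    if (implicit_zero != 0 || explicit_zero != 0)
        return 0;
    for (i = 0; i < sizeof scratch; i++)
        if (scratch[i] != 0)
            return 0;
    return 1;
}
```

Bill's worry about "random garbage" does apply to automatic (stack) variables, which C leaves uninitialized; the guarantee shown here covers only static storage.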

 

5. Killing Processes When Out Of Memory
29 Jun 2000 - 4 Jul 2000 (6 posts) Archive Link: "VM in 2.2.17pre9 : a success ?"
People: Willy Tarreau, Rik van Riel

In the course of discussion, Rik van Riel announced a patch to intelligently kill processes on OOM conditions. Willy Tarreau volunteered to test it, and Rik posted a URL to the patch (http://www.surriel.com/patches/). Willy reported:

I've just tested your patch on 2.2.17pre9 + andrea's GFP-race-fix-3. It seems smart about which process to kill, and I can no longer hang my system with mmap. I even tried several simultaneous mmap + malloc (hang guaranted without the patch). I saw cool messages about mmap and malloc being killed, preserving other processes such as syslogd and init.

Although I'm not for killing hungry processes, I think in this case this is better than letting all the system die (or at least hang for hours). But I don't know what this will lead to for daemons which use lots of mmaps.

There was no reply.

 

6. Petitioners Request Real-Time Latency In The Kernel
28 Jun 2000 - 9 Jul 2000 (259 posts) Archive Link: "a joint letter on low latency and Linux"
Topics: Real-Time: RTLinux, Sound: MIDI
People: Paul Barton-Davis, Ingo Molnar, Larry McVoy, Linus Torvalds, Victor Yodaiken, Steve VanDevender, Benno Senoner, Richard Gooch

Paul Barton-Davis posted a petition to Linus Torvalds, asking that "real time" features be added to the kernel. He posted:

we are a group of programmers designing, writing and extending audio+MIDI (and in some cases, video) applications for Linux. As you could tell from the list of signees below, we represent the developers of many significant and substantial audio+MIDI applications for Linux, as well as members/employees of several well known music and audio research institutions and companies.

(If you want to get an overview of the current state for audio+MIDI under Linux, there is no better source than Dave Phillip's magnificently comprehensive web site http://www.bright.net/~dlphilp/linuxsound/).

One member of our group, Benno Senoner, did a lot of work last year in investigating and documenting problems involved in using Linux for real-time audio applications. Others, such as Juhana Sadeharju, had long noted problems using Linux for hard disk recording of audio data. Partly as a result of Benno's work, Ingo Molnar did a fantastic job of coming up with a patch for the 2.2 series that dramatically improved the latencies that could be obtained from Linux.

How good? Well, good enough that members of our community who were convinced that RTLinux was needed to do a professional job with audio changed their minds. Good enough that we were close to (or sometimes better than) BeOS, an OS that has had a lot of excellent press in the audio world as a replacement for the known-to-be-problematic Windows and MacOS systems. Good enough that in some cases, Linux is as good as dedicated hardware solutions, offering latencies in the realm of 2-3ms for our applications.

There was a lot of excitement that 2.4 might include a version of the low latency patches. The excitement came from the possibility that the next release of the various distributions of Linux would represent a set of "desktops" that were ready for excellent, "real-time" audio and MIDI applications. CPU and disk performance has improved to the point where we are on the threshold of a revolution in the way that sound synthesis and processing is done, and many of us want to ride Linux into the heart of that revolution.

However, it turns out, as best we can gather, that you were not happy with the basic structure of some or all of Ingo's low latency work. Our impression is that you want to see more careful code design that avoids interrupt disabling or holding high level locks for too long, rather than using preemption points. So, as far as we can tell right now, 2.4 will represent more of the same as far as low latency limitations, and for us, more of the same means performance much worse than Windows or MacOS present.

How much worse ? Linux currently offers worst-cases latencies that are *10-15* times worse than Windows or MacOS. Developers for those platforms still complain about their performance with respect to latency - imagine their response to the situation with Linux as it stands today!

We understand that we could try to maintain a version of Ingo's low latency patches in parallel to the current kernel. But this is not a good situation for us. We would like to persuade several companies that produce applications and API's for audio+MIDI work to make their code, designs and programs available for Linux. We would like to be able to produce our own "real-time" applications and not have to tell users (who will likely know nothing about Linux, or computers in general) that they need to patch their kernel before using them. Neither of these goals are realistic given the current state of low latency support in the emerging 2.4.

We would like to know:

  • what are your general feelings on modifying Linux to support the kind of applications we are concerned with ?
  • what kinds of compromises, if any, you might accept in order to get good low latency performance into the kernel sooner, rather than later ?
  • what design goals you have in mind when you talk about doing low latency "right", rather than "wrong, as in Ingo's approach" ?
  • to what extent are you willing to enter into a debate about the merits of a preemption-point-based approach to lowering kernel latency ?

Above all, we are all interested in having fun with Linux and audio/music/MIDI. We would invite you to come tour a professional studio, watch the cool and wonderful stuff that can be done right now (with lots of annoying problems) on machines that runs Windows and MacOS, and get a sense of the extent to which general purpose computers are about to replace a whole bunch of expensive stuff sitting around in a typical studio. We want to see penguins there, and soon!

Several of the signees below have noted that lowering latency improves the quality of games, most kinds of multimedia and many non-hard-real-time automated systems. Incorporating a low-latency patch could be seen as extremely helpful in providing competitive performance in areas outside of audio as well.

Thank you for your consideration, for Linux and your benign dictatorship.

At the bottom of the letter, he listed 77 signees of the petition. A very long discussion followed.

Events Surrounding The Letter

First, there was some confusion surrounding the historical events themselves. Ingo Molnar spoke up to correct the idea that Linus had been "unhappy" with Ingo's patch. He said:

*I* was unhappy with the structure of that patch to begin with. The patch is ugly and unacceptable (read: a kludge) for inclusion into the mainstream kernel, period. I also said that i'll send a similar patch for 2.4 as well, once the 2.4 codebase stabilizes. (right now we still have a high flux of fixes coming in - but i'll soon port the patch to 2.4)

so please, do not make this appear as some 'fault' of Linus. Linus is rightfully (and thankfully) watching the quality of the mainstream kernel, and ugly patches are simply not accepted, regardless of the usefulness of a given patch. In fact it's my fault of not submitting those patches in a saner way. I'll fix this in the coming weeks.

Paul replied to this, first of all apologizing for not sending the letter to Ingo before Linus. Apparently he'd intended to as a courtesy, but it had slipped his mind during his final preparations. He also addressed Ingo's point about Ingo disapproving of the patch himself. Paul said he remembered Ingo trying several times to persuade Linus to include the patch, until Linus had flatly refused. But to this recollection of Paul's, Ingo replied that Paul had confused two entirely different patches. Ingo said the patch Linus had rejected had used a completely different approach than the "low latency" patch Paul had described in the petition.

There was no reply to this, but Larry McVoy also replied to Paul, with his own take on the sincerity of the folks involved. He remarked, "I have a little background on this, I was asked to support this letter by using LMbench to "show" that there were no problems with the Ingo patches and declined. I felt that what I was asked to do was misleading and I didn't agree with what was in the letter."

Paul took umbrage at this, and replied, "One clarification here. Benno Senoner asked Larry about this, and it was not part of the process of gathering support for the letter. I was actually unhappy that Benno had asked Larry to do this. Not because I dislike Larry, but I considered it to be inappropriate at that time. So be it."

RTLinux Is Put Forward

In his same post, Larry went on to give a more technical response to the petitioners. He said:

Paul and the others in favor of the low latency fixes are fond of pointing out that they must be good because this is how IRIX and BeOS solve the problem. That's never a good reason to do anything - fortunately for all of us, Linus evaluates things on their technical merits. And "IRIX does it" has no technical merit whatsoever.

I and others have pointed out that if you need hard real time then what you do is use RT Linux and you can get exactly what you need. I've heard two arguments against this:

  1. ``Other operating systems offer "soft realtime" and we don't want to port our code from that to the RT Linux model.'' Translation: ``other operating systems have made poor design decisions, let's try and pressure Linus into doing the same thing with Linux.'' That's a poor approach to the problem. Just saying that NT does it does not make it right. Linux has a large, quickly growing, market share. In my opinion, it is better to do the right thing, wait until Linux is the market leader, and then laugh at the screwed up systems that made the wrong choices. Screwing up Linux so it has the same problems that other systems have in the name of portablility is not the Linux way.
  2. ``If we take the RT Linux path then Linus will never add the low latency features we want.'' I love this argument. It implies knowledge that there is a better way, that if you use the better way then there is no need to do what Linus doesn't really want to do. Exactly.

In fairness to the low latency application people, I think that the RT Linux folks need to provide skeleton "apps" that show how to solve problems in the RT Linux space. Whether that happens or not, the RT Linux approach is really brilliant and it is the right approach to the problem space, and I'm more than a little sick of hearing people avoid using it. Get with the program, folks, use the best tool for the job. No matter what Ingo does, it is a _fact_ that you can not get both good performance for a multi user time sharing operating system and real time applications.

Please understand that it might be that some of Ingo's work is a good thing and that it will go into the kernel. I am not sure about that one way or the other. What I am sure about is that putting features into the kernel for the benefit of real time applications is a slippery slope that just leads to bad change on top of bad change. It effects the scheduler, the I/O paths, the interrupt handling, and probably a bunch of other places I'm forgetting. The changes are at direct odds with time sharing performance and the proponents of the changes will argue each one in isolation, rather than looking at the effects of all of them.

My advice, heeded or not, is to just say no. We have an excellent answer for realtime, it's called RT Linux. Go use it - it's better than any other answer.

Linus Torvalds replied to this:

Just to clear up my opinion a bit:

  • I'm definitely not against low latency. I think latency is hugely important, mostly more so than throughput (within reason, of course: all this is very much a balancing issue).
  • I'm not even against well-defined and well-thought-out "scheduling points". They are hackish, but if there is one or two of them to cover some specific behaviour then that's ok. Not a big deal.
  • I _am_ against patches whose only purpose in life is to run some arbitrary benchmark, and try to make that benchmark look good.

Most of the low-latency patches I've seen have been of the third kind, I'm afraid.

For example, I would probably accept a patch that adds a simple

if (current->need_resched) schedule();

to the case of generic_file_write/generic_file_read. Why? Because that case is a real-world case where it's obvious that with huge caches a normal user can quite simply cause bad latencies, and this is not an issue that needs all that much discussion - it's an obvious hack, but is't also equally obvious why it's there, and what the point of it is. We've had a few of these before: I think the /dev/null driver already does exactly this.

In fact, because we should look at the "UP threading code" for 2.5.x anyway (ie using the spinlocks on UP to generate a fairly well-threaded UP kernel without any source modifications), the one-liner if-statement should probably be a #define with a simple-to-grep-for name so that it can be easily removed.

Why just "probably"? A few reasons:

  • I suspect that especially if we generate such a macro, people will start sprinkling it around at various random places. And I do not want such a simple hack to spread out any more than necessary. One or two places are fine. So is four. Numerous places in the networking code, device drivers and filesystems doing it is _not_. At that point it has gone from a very specific and slightly tasteless hack to an ugly architecture.

    This could probably be avoided by giving it a clear comment and a discouraging name.

  • I do think the "copy_to/from_user()" case is the cleaner one, but I'd hate to do the test in-line, and I'd hate to do it all over the place. And if we do it in copy_to/from_user(), we don't do it _anywhere_ else: again the difference is one between a slightly tasteless hack, and a ugly rule of life.

    This is the more complex case, and requires more careful code in <asm/uaccess.h>. The "generic" arbitrarily-sized functions should probably be moved out-of-line into arch/xxx/lib/uaccess.c, because that test is no longer worth doing inline. Together with benchmark runs.

See? "slightly tasteless" and "strictly localized" are ok. Anything more hackish than that is not.
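Linus's idea of hiding the one-liner behind a #define with an easy-to-grep, discouraging name can be sketched in user space. This is not kernel code and not the patch he had in mind: need_resched, schedule(), request_resched(), schedule_count(), copy_in_chunks() and the macro name TASTELESS_RESCHED_POINT are all hypothetical stand-ins, chosen only to show the shape of a single, strictly localized scheduling point inside a long copy loop.

```c
#include <stddef.h>
#include <string.h>

/* Stand-ins for current->need_resched and the kernel's schedule(). */
static volatile int need_resched;
static unsigned long schedule_calls;

static void schedule(void)
{
    need_resched = 0;       /* the pending request has been honored */
    schedule_calls++;
}

/* Pretend timer tick: ask the running task to yield soon. */
void request_resched(void) { need_resched = 1; }

/* How many times the scheduling point actually yielded. */
unsigned long schedule_count(void) { return schedule_calls; }

/* One greppable name, so the hack is easy to find and remove. */
#define TASTELESS_RESCHED_POINT() \
    do { if (need_resched) schedule(); } while (0)

/* Copies len bytes in chunks, yielding between chunks only on request. */
size_t copy_in_chunks(char *dst, const char *src, size_t len, size_t chunk)
{
    size_t done = 0;

    if (chunk == 0)
        return 0;           /* avoid looping forever on a zero chunk */
    while (done < len) {
        size_t n = (len - done < chunk) ? len - done : chunk;

        memcpy(dst + done, src + done, n);
        done += n;
        TASTELESS_RESCHED_POINT();
    }
    return done;
}
```

If request_resched() is called once before a copy, the loop yields exactly once, after the first chunk, and then runs to completion without further checks firing. Keeping the test in one named macro is what makes it practical to limit the hack to the "one or two places" Linus describes.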

This non-rejection of RTLinux may have been taken by some as an additional endorsement, because much of the thread focused on that alternative. For instance, elsewhere in the thread Victor Yodaiken also extolled RTLinux, putting in, "RTLinux now works on Linux all Linux versions up to 2.4.test and PowerPC,Alpha, as well as x86 (Mips soon)." He added, "The RTLinux patch was always small. The complexity is in what it does. Despite my complaints, Ingo and Linus, with a small contribution from yours truly, have really simplified the 2.3 irq interface and made it easier to track in RTLinux."

There were also variations on the RTLinux theme: Richard Gooch proposed the idea of having two schedulers, a real-time scheduler and the normal Linux scheduler. As he envisioned it, different apps would use the scheduler appropriate to their needs. This idea did not find much support however.

Folks like Steve VanDevender characterized the main debate as being between two fundamentally opposed ideas. They argued that Linux was a time-sharing operating system, designed to distribute resources fairly among multiple users, while real-time systems needed to preemptively take control of all needed resources, whether other processes wanted them or not. To Steve and others, this issue simply could not be fully resolved.

Low Latency In The Main Kernel

Surprisingly, Linus later came out against this very wing of discussion that had seemed to support his position. At one point, he said:

I personally would rather see that nobody ever needed RTlinux at all. I think hard realtime is a waste of time, myself, and only to be used for the case where the CPU speed is not overwhelmingly fast enough (and these days, for most problems the CPU _is_ so overwhelmingly "fast enough" that hard realtime should be a non-issue).

I definitely agree with low-latency requirements even in a standard Linux. I just disagree violently with doing them with horrible cludges instead of working on doing it right.

In a separate post, he went on:

_if_ there are truly hard guarantee requirements, RTLinux is the way to go.

The number of problems that really need RT-linux is practically zero for normal uses. This is, btw, why I've never applied the RTLinux patches to the standard kernel tree. My personal opinion is that the RTLinux patches should _not_ be available by default, so that only people who really need the functionality start using it.

Having non-hard-RT users start using the RTLinux functionality would be a disaster, in my opinion. The programming-interfaces are much more cumbersome, and the ways of making the system lock up hard are many and varied.

If you do a computer-controlled radiation-dose machine for treating patients, and the latency guarantees have to be in the sub-100ms range, THEN you should use RTLinux.

If you're doing just audio that needs approximately 1% fo the CPU resources, and you have to use hard-realtime, the system needs work. Using RTLinux is a way of saying "oh, we can't fix this properly".

NOTE: I'm fully aware of the fact that Linux needs improvment in this area. I've tried to explain that my beef with the low-latency patches has never been that I don't believe it is a worthy goal. It's just that I also firmly believe that there are right ways of doing this. Without ugly patches that add random stuff to random places.

In light of these posts, Larry also clarified his own position:

I guess the problem I have is that I don't see a clean way to make sure that no high latency events ever creep into the kernel. I'm 100% in agreement with the idea that all code paths through the kernel should be short and sweet, but that isn't always the case. All it takes is one misbehaving driver that hangs onto the CPU too long and you missed your deadline.

I'm not a fan of realtime either but I hate half assed realtime creeping into a time sharing kernel - everything I know personally or have read about says that this is a bad idea.

Given all that, if you want to take a Linux box and use it to drive a pile of devices with hard guarentees (think factory floor, CNC devices, mixers, lots of stuff that is currently done in ASICs), then you need a better answer than Linux gives, even with the Ingo patches.

He added in the same post, "I'd rather support RTLinux than see lots of kludges being slipped into the generic kernel."

Here Linus agreed that badly written drivers would be a problem, but he pointed out that "the approach that the patches so far have taken is to just add scheduling points all over the map." And regarding RTLinux he reiterated, "In many cases I just think that RTLinux is a worse fix than the disease. I think RTLinux is perfect for those things that truly need latency guarantees: no OS at _all_ in the way. But using it for "normal" stuff like just streaming audio and video is overkill. They don't have microsecond latency requirements."

Up to this point in the discussion, the proponents of low-latency had argued against the RTLinux faction on several grounds. Firstly, they felt that it would not answer all their technical needs, i.e. they claimed it was simply not a good solution for various reasons. Secondly, because RTLinux was not part of the mainstream kernel and was not found in any major distribution, they argued that the audio/video applications they wanted to develop would be unable to run on most systems. Those in favor of RTLinux had argued that RTLinux would get into more distributions if the software apps were made available; and they also argued that RTLinux was not as technically deficient as the petitioners had thought. There had been a lot of discussion back and forth on these points, but now Benno Senoner (one of the proponents of the real-time petition) replied to Linus with relief. He felt that Linus' remarks were right in line with what he and others wanted to hear, and asked if Linus would apply properly done low latency patches to 2.4, or if it would have to wait until 2.5; Linus replied:

I can apply the obvious stuff today. But it would need to be a clean patch, and I suspect that the only part that I would consider truly obvious would be the user copy part. And adding a test for "need_resched" does imply not inlining them any more, it's already border-line, and the re-scheduling makes it obviously so (by the time you can potentially call "schedule()", the compiler has to save/restore all the call-clobbered registers anyway, so 90% of the advantages of inlining have been destroyed, making the disadvantages like icache footprint etc clear).
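
Linus' point about the user copy can be made concrete with a small user-space sketch in C. This is not kernel code: copy_with_resched(), COPY_CHUNK, and the need_resched pointer are hypothetical stand-ins, and sched_yield() merely plays the role of the kernel's schedule(). The copy proceeds in chunks and tests the flag between chunks; the function is kept out of line because, as Linus notes, a possible call into the scheduler already gives up most of what inlining would have bought.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk size; not taken from any kernel source. */
#define COPY_CHUNK 4096

/*
 * Copy len bytes in chunks and offer the CPU to other tasks between
 * chunks whenever *need_resched is set. Returns the number of times
 * the copy yielded. The flag is a stand-in for the kernel's per-task
 * "need_resched" test, and sched_yield() is a stand-in for schedule().
 */
size_t copy_with_resched(void *dst, const void *src, size_t len,
                         volatile int *need_resched)
{
    size_t done = 0;
    size_t yields = 0;

    assert(dst != NULL && src != NULL && need_resched != NULL);

    while (done < len) {
        size_t n = len - done;
        if (n > COPY_CHUNK)
            n = COPY_CHUNK;

        memcpy((char *)dst + done, (const char *)src + done, n);
        done += n;

        /* The scheduling point: one cheap test per chunk. */
        if (*need_resched) {
            *need_resched = 0;
            sched_yield();
            yields++;
        }
    }
    return yields;
}
```

A caller would pass the address of whatever flag the rest of the program sets when a higher-priority task is waiting. The chunk size bounds how long the copy can run before it notices the flag.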

Note that regardless of _what_ the problem is, I always prefer incremental patches anyway. Maybe people in the end can convince me that every single scheduling point makes 100% sense, and is not a hack at all but a natural thing. Even if that were the case, I'd like to get the thing in smaller and explainable pieces..

Paul replied here, finding Linus' statement about accepting scheduling points to be possibly inconsistent with an earlier statement that he refused to have a kernel that was "bogged down with random crap all over the place". Paul went on:

Maybe its just that I'm too philosophical and you're too pragmatic. I can see 2 possibilities from here:

  1. your revulsion at "random" scheduling points is a really strong belief that would likely make convincing you of the value of each particular point impossible,
  2. you you accept the idea that there may need to be a bunch of "random" scheduling points for this to work, and whilst you consider this ugly, you accept that there isn't much of an alternative. people will have to have a lot of good numbers to convince you to apply a patch that adds a scheduling point.

Do either of these sound like a reasonable summary of your position, or is there some other precis?

Linus replied:

I _am_ pragmatic. That which works, works, and theory can go screw itself.

However, my pragmatism also extends to maintainability, which is why I also want it done well.

He also replied to each of Paul's two interpretations of his position. To the first, Linus said:

I want more than an explanation like "I ran our latency tester, and this point seemed to be really bad, so I added a scheduling point here".

For example, for the specific read/write user-copy code, I don't even need to get numbers. It's clear that with a machine that has tons of memory, and where the data is cached, we can generate a "read()" system call that spends quite a lot of time without needing any active re-scheduling. This is not a random point: it's something that can be clearly explained from the sources, and a case where there clearly is no better solution unless you fully thread the thing.

To the second, Linus said, "More than just numbers, but yes. I'd like to know that the code isn't just crap. For example, let's say that something uses an O(n^3) algorithm, and to "overcome" the expense of this thing we add scheduling points in it. That's the easy way to do it. But maybe the right thing to do is to realize that the code may be badly structured in the first place?"

So to sum up the discussion, it appears that Linus is at least willing to consider patches that would make the petitioners happy, but he has some strict views on what would be acceptable. It's not clear that any solution will definitely be found or is even possible, but the guidelines have been set for future development. In the meantime, RTLinux remains at least a plausible interim solution for people needing to do their streaming audio/video work under Linux now.

RTLinux first came up in Issue #3, Section #5 (21 Jan 1999: FUD From WindowsNT Magazine). It did not come up again (aside from a couple of very brief mentions) until Issue #34, Section #29 (4 Sep 1999: User-Mode Kernel V. 2.3.15-2um Announced), where it was asserted that RTLinux would not always be needed for real-time audio. RTLinux was involved in a bit of a trademark discussion in Issue #54, Section #10 (6 Feb 2000: Linux Trademarks). RTLinux was most recently put forward as a solution for real-time needs in Issue #68, Section #9 (10 May 2000: Standard Kernel Or RTLinux For Real-Time Needs?).

Other discussions of real-time needs were covered in Issue #5, Section #9 (7 Feb 1999: Process Scheduling). A battle to make the standard Linux scheduler more real-time was fought in Issue #8, Section #2 (24 Feb 1999: Real-Time Scheduling Flame War). Most recently the need for real-time scheduling was again suggested in Issue #10, Section #20 (9 Mar 1999: Real Time And MP3 Skipping).

 

7. Linus Inadvertently Steps Into Minor Filesystem Dispute
30 Jun 2000 - 6 Jul 2000 (5 posts) Archive Link: "[patch-2.4.0-test3-p2] misc fixes."
People: Tigran Aivazian, Linus Torvalds, Richard Gooch, Alexander Viro, Philipp Rumpf

Tigran Aivazian posted a patch, and explained:

This patch does:

  1. small fix to microcode driver - the "cc clobber" directive is not needed for cpuid instruction as it doesn't change EFLAGS. Neither it is needed in general in asm-i386/processor.h:cpuid() inline (this was pointet out by Philipp Rumpf but for some reason didn't appear in test3-preX)
  2. amended comment in fs/inode.c:iput() to reflect the truth, i.e. it is not a "magic nfs path" but the fact that anonymous inodes are not put on unused list
  3. documented the fact that for FS_SINGLE filesystems the driver must call kern_mount() after register_filesystem(). This already confused others (e.g. Richard Gooch) so imho it is worth pointing out.
  4. kern_mount() is supposed to be used only with FS_SINGLE filesystems but the code doesn't enforce it. This patch makes kern_mount() fail with EINVAL on attempt to call it on non-FS_SINGLE filesystems.
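
The fourth item can be pictured with a simplified user-space sketch in C. Everything here is hypothetical and chosen for illustration: the FS_SINGLE_SKETCH value, struct fs_type_sketch, and kern_mount_sketch() are not the kernel's definitions, and the sketch returns a plain error code rather than whatever the real kern_mount() returns. It shows only the shape of the check Tigran describes, refusing the call with EINVAL unless the filesystem type is flagged as single-instance.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag value; the real FS_SINGLE lives in the kernel headers. */
#define FS_SINGLE_SKETCH 0x1

/* Hypothetical, much-reduced stand-in for a filesystem type record. */
struct fs_type_sketch {
    const char *name;
    int fs_flags;
    int mounted;   /* set once the internal mount has been made */
};

/*
 * Illustrative version of the enforced rule: an internal mount is only
 * allowed for single-instance filesystems. Returns 0 on success and
 * -EINVAL for any other filesystem type.
 */
int kern_mount_sketch(struct fs_type_sketch *type)
{
    assert(type != NULL);

    if (!(type->fs_flags & FS_SINGLE_SKETCH))
        return -EINVAL;

    type->mounted = 1;
    return 0;
}
```

Linus' suggestion in the reply that follows amounts to folding this step into registration itself, so that a driver could not forget it.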

Linus Torvalds replied to the fourth item, "Ugh. Would it not be 100% cleaner to just do this automatically for FS_SINGLE filesystems upon register/unregister?" Richard Gooch replied, "That's what I said weeks ago! But Al preferred to keep the two operations separate," and Tigran quoted Alexander Viro in a discussion on the fsdevel mailing list, where Alexander said, "I don't think so. They are different operations and I'm not too happy about mixing them together. Matter of taste, but..."

There was no reply to Tigran on linux-kernel, but in that fsdevel discussion, Richard had replied to Alexander, "Yeah, I know. Having it documented would satisfy me. Getting a kernel BUG after adding FS_SINGLE was a shock: "what the %@$& ?!?"."

 

8. More Flames Over Latency
29 Jun 2000 - 9 Jul 2000 (78 posts) Archive Link: "Low Latency Patch"
Topics: SMP
People: Victor Khimenko, Robert Dinse

Robert Dinse had been paying attention to the low latency discussion (see Issue #76, Section #6 (28 Jun 2000: Petitioners Request Real-Time Latency In The Kernel)), and decided to try out Ingo's patch. He found interactive response on heavily loaded machines to be drastically improved. The only problem he found was that it worsened the existing likelihood of spin-lock deadlocks on Sparc SMP. He voted for including the patch in the main kernel as a config option. Victor Khimenko replied in frustration, "Argh. Have you READ any Linus's letter on subject ? "It's all about maintability, stupid". If you think tens (if not hundreds) ifdef's spread all over kernel three is easier to maintain then external patch then think again. Patch is not accepted NOT since goal is wrong. It's completely other story: this patch makes maintaining of kernel sources nightmare. Configuration option can not help maintainers. Not at all." Robert felt this was being a bit harsh, and reiterated that the low-latency patch had a huge impact on performance under load. There were a lot of angry words between them, and others joined in as well. A lot of folks pointed to Linus' remarks in other discussions as settling the issue, and eventually the thread petered out.

 

9. Spinlocks Broken In Some Distributions
30 Jun 2000 - 6 Jul 2000 (32 posts) Archive Link: "spinlocks() are severely broken in 2.2.X and 2.4.X for modules"
People: Jeff V. Merkey, Manfred Spraul, Andi Kleen

Jeff V. Merkey reported:

I have spent the past five and one half weeks chasing down a severe memory corruption problem in the NWFS LRU. The problem I am reporting is what has caused ALL the bugs folks have seen over the past couple of months in NWFS with memory corruption. It's what's held up our next release by over two months. Despite screaming customers, I have held off on posting the next release until I understood clearly what was going on -- now I know -- Linux spinlocks() don't work when used by kernel modules in linux that include spinlock.h.

NWFS is fully multi-threaded and uses fine grained locking. Much of Linux is not fine grained, which would explain why I may be one of the few folks to first see this problem. Initially, I was under the assumption that the spinlocks() in Linux worked, and because of this, focused my attention on my own code rather than scrutinize the code in Linux. I rewrote large sections of NWFS, put in traps, and checks, list consistency routines, etc. It was not until I put in a check for multiple users being inside the same lock that I located the problem. Well, I have completed a very thorough analysis of generated IA32 code, and I have discovered something rather shocking, which is that the spinlock code in Linux is severely broken, and this is not due to a coding error, but a problem with the GCC compiler apparently generating garbage. There's also several issues with using "lock bts" instead of "xchg eax, 1", which is the recommended method for implementing spinlocks() on IA32 intel systems.

What's really scary here is that a great deal of new code has been written that depends on this spinlock code, and once the spinlock code gets fixed properly, we may see tons of deadlocks and lockups all over the place since this code has been there for a very long time.
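
For readers unfamiliar with the exchange-based lock Jeff mentions, here is a minimal user-space sketch in C11. It is an illustration of the general pattern only, not the kernel's spinlock code, and the names xchg_spinlock, xchg_trylock(), xchg_lock(), and xchg_unlock() are hypothetical. An atomic exchange writes 1 into the lock word and reports what was there before; the caller owns the lock only if the old value was 0. On x86, compilers typically implement such an exchange with the xchg instruction.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock type: 0 means free, 1 means held. */
typedef struct {
    atomic_int locked;
} xchg_spinlock;

/* Try once: swap in 1 and succeed only if the previous value was 0. */
static inline int xchg_trylock(xchg_spinlock *l)
{
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}

/* Spin until the exchange succeeds. */
static inline void xchg_lock(xchg_spinlock *l)
{
    while (!xchg_trylock(l)) {
        /* Wait with plain loads so the spin does not keep writing. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
    }
}

/* Release the lock; releasing a lock that is not held is a caller bug. */
static inline void xchg_unlock(xchg_spinlock *l)
{
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The sketch says nothing about whether the kernel's own locks were at fault; it only shows what an exchange-based spinlock looks like.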

He posted a lot of code and analysis to prove his claim, and various folks hunted for alternative explanations. At one point the seriousness of the situation was made clear when Manfred Spraul remarked, "Changing the spinlock code is IMHO not a solution: we rely on .text.somewhere_else very often (spinlock, semaphore, exception handler table, init functions, initcall,...)"

Eventually, Jeff reported that he and Andi Kleen had tracked the problem to a "bug in gas/ld with relocation records for the .text.lock section being created by the spinlock macro." After some more discussion, folks seemed to settle on the idea that the situation should just be fixed in 'binutils'. But Jeff also remarked, "The bad news is that the binaries in OpenLinux 2.4 (GLIB/EGCS) seem to cause this problem when installed by default -- RedHat 6.2 and the latest Suse Linux were both clean."

Finally, Jeff concluded, "Consider this problem report to be potentially "bogus" since it was apparently isolated to OpenLinux 2.4 and does not seem to affect other Linux versions. The spinlocks aren't broken generally, just in the OpenLinux 2.4 case."

 

10. Small Latency Patch Has Ambiguous Results
1 Jul 2000 - 8 Jul 2000 (6 posts) Archive Link: "[PATCH] latency improvements, one reschedule moved"
Topics: Real-Time
People: Zlatko Calusic, Roger Larsson

For more on latency and real-time issues, see Issue #76, Section #6 (28 Jun 2000: Petitioners Request Real-Time Latency In The Kernel) and Issue #76, Section #8 (29 Jun 2000: More Flames Over Latency).

Roger Larsson posted a patch to clean up 'kswapd' and move its reschedule point. As far as he could tell, the patch really improved latency. He replied to himself a few days later with a new patch to correct some bugs and other problems with the previous patch. He reported a slight performance drop, but an even better latency improvement. Linus included the patch in 2.4.0-test3-pre4, and Zlatko Calusic reported a pleasing tremor of the soul. He said:

The I/O bandwidth has greatly improved and I'm still trying to understand how can patch this simple be so effective. :)

Great work Roger!

I see this as the first (and most critical) step of returning my faith in good performing 2.4.0-final.

Roger replied that actually, the performance boost Zlatko saw was unintentional. He speculated briefly, then rushed off to code.

 

11. Latency Profiling
3 Jul 2000 - 7 Jul 2000 (4 posts) Archive Link: "[PATCH] latency-profiling for 2.4.0-test3-pre2"
Topics: Disks: IDE, Real-Time, SMP, Sound: ALSA
People: Samuel S Chessman, Benno Senoner, Andrew Morton, Roger Larsson

For more on latency and real-time issues, see Issue #76, Section #6 (28 Jun 2000: Petitioners Request Real-Time Latency In The Kernel), Issue #76, Section #8 (29 Jun 2000: More Flames Over Latency), and Issue #76, Section #10 (1 Jul 2000: Small Latency Patch Has Ambiguous Results).

Roger Larsson responded to all the recent latency discussion by updating his latency profiling patch to the latest development kernel, 2.4.0-test3-pre2. He posted the patch, then replied to himself with an update and a pointer to more info (http://www.norran.net/nra02596/latency-profiling.html). After yet another update, Samuel S Chessman reported:

I have been following the latency issue for some time now and I am happy to report great strides have been taken in improving latency in the Linux 2.4.0-test2-pre? series of patches.

Here is a snapshot of some results that show promise! I hope to repeat them on an SMP system soon.

Benno Senoner's (sbenno at gardena dot net) latencytest used for these charts: See http://www.gardena.net/benno/linux/audio/latencytest-0.42.tar.gz for the source.

http://www.tux.org/~chessman/bench/latencytest/ has preliminary results with charts showing the differences when hdparm unmaskirq and 32bit are toggled.

I varied the hdparm 16/32 bit I/O and interrupt unmasking, and got a non intuitive result, better results for 16 bit I/O, unmasking disabled. Tonight I will run the test with the other two cases and see if it correlates to one of them.

System under test is a Dell GX1 400MHz PII IDE disk, running

  • linux-2.4.0-test2-pre5
  • mm/filemap.c patches from Andrew Morton dated Thu Jul 06 2000 - 23:21:48 EDT
  • ALSA-driver 0.5.8b
  • XFree86 Version 3.3.6 w/ Mach64, glx.o, agpart kernel module

All tests are showing greater than 99% +/- 2ms for the disk tests. The overruns appear to be mm related, Andrew's patch indicated sys_close() and sys_exit() still need attention.

This is getting close to usable for my purposes!

End Of Thread (tm).

 

12. Latency Benchmarks And Prognosis
3 Jul 2000 - 7 Jul 2000 (28 posts) Archive Link: "[DATAPOINT] kernels and latencies"
Topics: Real-Time
People: Roger Larsson, Andrew Morton, Alan Cox, Rik van Riel

For more on latency and real-time issues, see Issue #76, Section #6 (28 Jun 2000: Petitioners Request Real-Time Latency In The Kernel), Issue #76, Section #8 (29 Jun 2000: More Flames Over Latency), Issue #76, Section #10 (1 Jul 2000: Small Latency Patch Has Ambiguous Results), and Issue #76, Section #11 (3 Jul 2000: Latency Profiling).

Roger Larsson had gone through his collection of data and found that, in terms of latency, the best kernel not patched with Ingo's latency patches was the combination of:

  • Linus......... linux-2.4.0-test1
  • Alan Cox...... linux-2.4.0test1-ac22-riel
  • Rik van Riel.. mail sent to linux-kernel and linux-mm with a subject of "[PATCH] -ac22-riel++"

He reported, "It works perfectly for streaming read/write/copy (latencies to SCHED_FIFO process below 5 ms !) but fails for mmap002 with latencies > 180 ms" and gave a pointer to his audio page (http://www.norran.net/nra02596/) which included his latency profiling patches.
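
To give a feel for what a figure like "below 5 ms" refers to, here is a rough user-space sketch in C. It is not the method used by Roger's profiling patch or by Benno's latencytest; the function worst_wakeup_latency_ns() and its parameters are hypothetical. It sleeps for a fixed period many times and records the worst amount by which a wakeup arrived late, which is one crude way to observe scheduling latency from a process.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stddef.h>
#include <time.h>

/*
 * Sleep period_ns nanoseconds, iterations times, and return the worst
 * observed lateness of a wakeup in nanoseconds. Returns -1 if the
 * clock cannot be read. period_ns must be less than one second.
 */
long long worst_wakeup_latency_ns(int iterations, long period_ns)
{
    long long worst = 0;

    assert(period_ns > 0 && period_ns < 1000000000L);

    for (int i = 0; i < iterations; i++) {
        struct timespec before, after;
        struct timespec req = { 0, period_ns };

        if (clock_gettime(CLOCK_MONOTONIC, &before) != 0)
            return -1;
        nanosleep(&req, NULL);
        if (clock_gettime(CLOCK_MONOTONIC, &after) != 0)
            return -1;

        long long elapsed =
            (long long)(after.tv_sec - before.tv_sec) * 1000000000LL +
            (after.tv_nsec - before.tv_nsec);
        long long late = elapsed - period_ns;

        /* A sleep cut short by a signal gives a negative value; skip it. */
        if (late > worst)
            worst = late;
    }
    return worst;
}
```

Running this while the machine is busy with disk or memory-heavy work, and again while it is idle, gives two numbers to compare. Results vary from run to run and depend heavily on scheduling policy and hardware.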

Andrew Morton replied with some very promising benchmarks of his own, though he was not in full agreement with Roger's methods. They went back and forth a bit, and other folks joined in, the upshot being that maintaining low latency will probably remain fairly difficult for the foreseeable future, though it should be possible with the proper configuration and hardware (at one point, Andrew remarked, "In the meanwhile, this probably means that people who require low latency will need to buy an RS232 mouse. That's OK.")

 

13. Dynamic Inode Allocation
5 Jul 2000 - 8 Jul 2000 (9 posts) Archive Link: "BK performance tip (22x faster)"
Topics: BitKeeper, Version Control
People: Chuck Lever, Andrea Arcangeli, Larry McVoy, Richard Gooch

Larry McVoy reported that a BitKeeper performance problem could be traced to too few inodes on the system. By default, Linux set the limit at 16384 inodes, which was just under the amount needed by BitKeeper to house the entire kernel. By issuing the command 'echo 65536 > /proc/sys/fs/inode-max', Larry was able to raise the number of inodes and reduce BitKeeper's running time by a huge amount. He suggested raising the default number of inodes to 32K or more. Richard Gooch replied that he'd had a similar problem, and had used the same solution, but had noticed that 2.3.99 and later kernels didn't have the control Larry had used. He asked if inode generation had become dynamic in the later kernels, and Andrea Arcangeli replied that yes, it had; and Chuck Lever explained, "2.3/4 kernels use a SLAB cache for inodes, instead of an ad hoc cache. the old limit was used to determine when to reap inodes. the new system reaps them automatically when system memory is short." Richard was thrilled to hear this, and the thread skewed off and petered out.

 

14. Configuring Number Of CPUs On SMP Systems
9 Jul 2000 - 10 Jul 2000 (8 posts) Archive Link: "CONFIG_SMP_CPUS"
Topics: SMP
People: Dimitris Michailidis, Matthew Wilcox, Ingo Molnar

Matthew Wilcox posted a small patch to make the number of processors on SMP systems a config option, rather than a hard coded value in the source code. Ingo Molnar replied that he'd been doing this for a long time himself, and could vouch for the stability of the patch; but he cautioned that if the number of CPUs exceeded the number configured, the machine could crash in subtle and hard-to-debug ways. He posted a patch of his own to take care of that, and Dimitris Michailidis proposed, "A better solution is to group the various per-cpu data into a single structure instead of having them scattered as they are now, and then allocate as many of these structures as there are CPUs. This allocation can be done automatically by the CPU detection and boot-up code, you don't have to specify manually the number of CPUs. This also allows the same compiled kernel to be used on machines with different numbers of CPUs. Have a look at my PDA patch at http://reality.sgi.com/dimitris_engr/pda_patch-2.4.0-1 for how this can be done." There was no reply.
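
The idea Dimitris describes can be sketched in a few lines of user-space C. This is not taken from his PDA patch: struct cpu_data_sketch, its fields, and alloc_cpu_data() are hypothetical names chosen for illustration. The per-CPU items are gathered into one structure, and an array of those structures is sized from the CPU count found at run time instead of from a compile-time constant.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical grouping of data that would otherwise be scattered. */
struct cpu_data_sketch {
    int cpu_id;
    unsigned long irq_count;
    unsigned long context_switches;
};

/*
 * Allocate one zeroed structure per detected CPU. The count comes from
 * the caller (standing in for boot-time CPU detection), so the same
 * binary works whatever the number of processors. Returns NULL if the
 * allocation fails; the caller frees the array with free().
 */
struct cpu_data_sketch *alloc_cpu_data(int ncpus)
{
    assert(ncpus > 0);

    struct cpu_data_sketch *pda = calloc((size_t)ncpus, sizeof *pda);
    if (pda == NULL)
        return NULL;

    for (int i = 0; i < ncpus; i++)
        pda[i].cpu_id = i;

    return pda;
}
```

With this layout there is no fixed array bound to overrun, which is the failure Ingo warns about when more CPUs are present than were configured.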

 

 

 

 

 

 

We Hope You Enjoy Kernel Traffic
 

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License, version 2.0.