Kernel Traffic
Latest | Archives | People | Topics
Latest | Archives | People | Topics
Latest | Archives | People | Topics
Home | News | RSS Feeds | Mailing Lists | Authors Info | Mirrors | Stalled Traffic

Kernel Traffic #93 For 13 Nov 2000

By Zack Brown

linux-kernel FAQ | subscribe to linux-kernel | linux-kernel Archives | | LxR Kernel Source Browser | All Kernels | Kernel Ports | Kernel Docs | Gary's Encyclopedia: Linux Kernel

Table Of Contents

Mailing List Stats For This Week

We looked at 1123 posts in 4434K.

There were 388 different contributors. 148 posted more than once. 143 posted last week too.

The top posters of the week were:

1. Redesigning 'poll()'

21 Oct 2000 - 1 Nov 2000 (73 posts) Archive Link: "Linux's implementation of poll() not scalable?"

Topics: BSD: FreeBSD, Big O Notation, Ioctls, Real-Time, SMP

People: Dan KegelLinus TorvaldsFelix von LeitnerDave ZarzyckiMitchell Blank JrJordan MendelsonLyle CoderAndi KleenH. Peter Anvin

Dan Kegel reported:

I ran a benchmark to see how long a call to poll() takes as you increase the number of idle fd's it has to wade through. I used socketpair() to generate the fd's.

Under Solaris 7, when the number of idle sockets was increased from 100 to 10000, the time to check for active sockets with poll() increased by a factor of only 6.5. That's a sublinear increase in time, pretty spiffy.

Under Linux 2.2.14 [or 2.4.0-test1-pre4], when the number of idle sockets was increased from 100 to 10000, the time to check for active sockets with poll() increased by a factor of 493 [or 300, respectively]. This is terribly, horribly bad scaling behavior, since the time required appears to increase three times faster than the number of sockets.

Detailed measurements and the source code used to make them are available at

Linus Torvalds replied:

Yeah. It's pretty spiffy.

Basically, poll() is _fundamentally_ a O(n) interface. There is no way to avoid it - you have an array, and there simply is _no_ known algorithm to scan an array in faster than O(n) time. Sorry.

(Yeah, you could parallellize it. I know, I know. Put one CPU on each entry, and you can get it down to O(1). Somehow I doubt Solaris does that. In fact, I'll bet you a dollar that it doesn't).

So what does this mean?


  1. Solaris has solved the faster-than-light problem, and Sun engineers should get a Nobel price in physics or something.
  2. Solaris "scales" by being optimized for 10000 entries, and not speeding up sufficiently for a small number of entries.

You make the call.

Basically, for poll(), perfect scalability is that poll() scales by a factor of 100 when you go from 100 to 10000 entries. Anybody who does NOT scale by a factor of 100 is not scaling right - and claiming that 6.5 is a "good" scale factor only shows that you've bought into marketing hype.

In short, a 6.5 scale factor STINKS. The only thing it means is that Solaris is slow as hell on the 100 descriptor case.

He went on:

what you're showing is that Linux actually is _closer_ to the perfect scaling (Linux is off by a factor of 5, while Solaris is off by a factor of 15 from the perfect scaling line, and scales down really badly).

Now, that factor of 5 (or 3, for 2.4.0) is still bad. I'd love to see Linux scale perfectly (which in this case means that 10000 fd's should take exactly 100 times as long to poll() as 100 entries take). But I suspect that there are a few things going on, one of the main ones probably being that the kernel data working set for 100 entries fits in the cache or something like that.

And continued:

I suspect we could improve Linux in this area, but I hope that I pointed out the most fundamental mistake you did, which was thinking that "scalability" equals "speed". It doesn't.

Scalability really means that the effort to handle a problem grows reasonably with the hardness of the problem. And _deviations_ from that are indications of something being wrong.

Some people think that super-linear improvements in scalability are signs of "goodness". They aren't. For example, the classical reason for super-linear SMP improvement (with number of CPU's) that people get so excited about really means that something is wrong on the low end. Often the "wrongness" is lack of cache - some problems will scale better than perfectly simply because with multiple CPU's you have more cache.

The "wrongess" is often also selecting the wrong algorithm: something that "scales well" by just being horribly slow for the small case, and being "less bad" for the big cases.

In the end, the notion of "scalability" is meaningless. The only meaningful thing is how quickly something happens for the load you have. That's something called "performance", and unlike "scalability", it actually has real-life meaning.

Under Linux, I'm personally more worried about the performance of X etc, and small poll()'s are actually common. So I would argue that the Solaris scalability is going the wrong way. But as performance really depends on the load, and maybe that 10000 entry load is what you consider "real life", you are of course free to disagree (and you'd be equally right ;)

Lyle Coder asked for some hard numbers comparing Solaris' 100-entry results with Linux's 100-entry results, and same with the 10000-entry case, and Dan replied that the URL he'd posted gave some of those benchmarks. He posted some to the list, that showed, "When finding 1 active fd among 10000, the Solaris implementation creamed the living snot out of the Linux one, even though the Solaris hardware was 5 or so times slower" He concluded, "If you have to poll 100 or fewer sockets, Linux's poll is quite good. If you have to poll 1000 or so sockets, Linux's /dev/poll works well (I wish it were available for the current kernels). But if you have to poll 10000 or sockets on Linux, neither standard poll nor /dev/poll as currently implemented performs adequately." Shortly thereafter, Linus said:

I hope nobody took my rant against "scalability" to mean that I consider the Solaris case to necessarily be bad engineering. My rant was more about people often thinking that "scalability == performance", which it isn't. It may be that Solaris simply wants to do 10000 entries really fast and that they actually do really well, but it is clear that if so they have a huge performance hit for the small cases, and people should realize that.

It's basically a matter of making the proper trade-offs. I think that the "few file descriptors" case is actually arguably a very important one. Sun (who has long since given up on the desktop) probably have a different tradeoff, and maybe their big constant setup-time is worth it.

Andi Kleen pointed out that Linux scanned the full entry list at least four times, in its current implementation. He added that for small numbers of entries, the current implementation was also wasteful in terms of memory and CPU cycles. Linus argued that 'poll()' was simply a bad interface in terms of scalability (Felix von Leitner remarked, "Is that a reason to implement it badly?" ), and added, "I'm sure it can be speeded up, I'm just not sure it's really worth it if you'd instead just have a better interface that wouldn't need this crap, and consider poll() as just a compatibility layer." Andi asked what kind of interface Linus would prefer, and Linus replied:

I suspect a good interface that can easily be done efficiently would basically be something where the user _does_ do the equivalent of a read-only mmap() of poll entries - and explicit and controlled "add_entry()" and "remove_entry()" controls, so that the kernel can maintain the cache without playing tricks.

Basically, something like a user interface to something that looks like the linux poll_table_page structures, with the difference being that it doesn't have to be created and torn down all the time because the user would explicitly ask for "add this fd" and "remove this fd" from the table.

Th eproblem with poll() as-is is that the user doesn't really tell the kernel explictly when it is changing the table..

But he replied to himself an hour and a half later, with a new proposal:

Actually, forget the mmap, it's not needed.

Here's a suggested "good" interface that would certainly be easy to implement, and very easy to use, with none of the scalability issues that many interfaces have.

First, let's see what is so nice about "select()" and "poll()". They do have one _huge_ advantage, which is why you want to fall back on poll() once the RT signal interface stops working. What is that?

Basically, RT signals or any kind of event queue has a major fundamental queuing theory problem: if you have events happening really quickly, the events pile up, and queuing theory tells you that as you start having queueing problems, your latency increases, which in turn tends to mean that later events are even more likely to queue up, and you end up in a nasty meltdown schenario where your queues get longer and longer.

This is why RT signals suck so badly as a generic interface - clearly we cannot keep sending RT signals forever, because we'd run out of memory just keeping the signal queue information around.

Neither poll() nor select() have this problem: they don't get more expensive as you have more and more events - their expense is the number of file descriptors, not the number of events per se. In fact, both poll() and select() tend to perform _better_ when you have pending events, as they are both amenable to optimizations when there is no need for waiting, and scanning the arrays can use early-out semantics.

So sticky arrays of events are good, while queues are bad. Let's take that as one of the fundamentals.

So why do people still like RT signals? They do have one advantage, which is that you do NOT have that silly array traversal when there is nothing to do. Basically, the RT signals kind of approach is really good for the cases where select() and poll() suck: no need to traverse mostly empty and non-changing arrays all the time.

It boils down to one very simple rule: dense arrays of sticky status information is good. So let's design a good interface for a dense array.

Basically, the perfect interface for events would be

struct event {

unsigned long id; /* file descriptor ID the event is on */
unsigned long event; /* bitmask of active events */
int get_events(struct event * event_array, int maxnr, struct timeval *tmout);

where you say "I want an array of pending events, and I have an array you can fill with up to 'maxnr' events - and if you have no events for me, please sleep until you get one, or until 'tmout'".

The above looks like a _really_ simple interface to me. Much simpler than either select() or poll(). Agreed?

Now, we still need to inform the kernel of what kind of events we want, ie the "binding" of events. The most straightforward way to do that is to just do a simple "bind_event()" system call:

int bind_event(int fd, struct event *event);

which basically says: I'm interested in the events in "event" on the file descriptor "fd". The way to stop being interested in events is to just set the event bitmask to zero.

Now, the perfect interface would be the above. Nothing more. Nothing fancy, nothing complicated. Only really simple stuff. Remember the old rule: "keep it simple, stupid".

The really nice part of the above is that it's trivial to implement. It's about 50 lines of code, plus some simple logic to various drivers etc to actually inform about the events. The way to do this simply is to limit it in very clear ways, the most notable one being simply that there is only one event queue per process (or rather, per "struct files_struct" - so threads would automatically share the event queue). This keeps the implementation simple, but it's also what keeps the interfaces simple: no queue ID's to pass around etc.

Implementation is straightforward: the event queue basically consists of

Advantage: everything is O(1), except for "get_event()" which is O(n) where 'n' is the number of active events that it fills in.

Basically, I can't see that it can be done much simpler, and I don't see that you can have much better performance.

He concluded, "Yes, I'm probably missing something. But this is the kind of thing I think we should look at (and notice that you can _emulate_ this with poll(), so you can use this kind of interface even on systems that wouldn't support these kinds of events natively)."

Dave Zarzycki was very excited by this, and said he'd been wishing for a user-space interface of that kind for a long time. But he added, "The only thing I might have done differently is allow for multiple event queues per process, and the ability to add event queues to event queues. But these features might not be all that useful in real life." H. Peter Anvin liked Dave's suggestion, but wasn't sure if it could be done without harming the rest of the implementation. He explained that Linus' point was that 'select()' and 'poll()' had one large cost, which was setting up and destroying the set of events, and that his solution had been to have each of those actions persist, and not have to be done over and over. So as H. Peter argued, if Dave's suggestion of multiple event queues would entail a lot of construction and destruction of the queues, it would defeat the purpose of the original proposal.

Also in reply to Linus' proposal, Mitchell Blank Jr gave a pointer to Solaris 8's '/dev/poll/index.html' interface, adding, "This discussion is reinventing the wheel, the lever, and the inclined plane." He explained:

I think this is a lot cleaner than your approach:

A few simple improvements can be made to the Sun model, though:

The linux patch of /dev/poll implements mmap'ing of the in-kenrel poll table... I don't think this is a good idea. First, the user just wants to be able to add events and dequeue them - both linear operations. Second, why should the kernel be forced to maintain a fixed-size linear list of events we're looking for... this seems mindlessly inefficient. Why not just pull a "struct pollfd" out of slab each time a new event is listened for?

My unresolved concerns:

To the whole idea that Solaris' implementation could be better, Linus replied, "You must be kidding. /dev/poll is the interface from _hell_. It's absolutely disgusting." He quoted Mitchell's argument that Solaris didn't create any new system calls for their implementation, and replied:

Sure it does.

What do you think ioctl's are? Every single ioctl number is an extra system call. It's just indirected inside the device driver instead of at the normal system call entry point. The advantage? It's noticeably slower than a real system call.

Regarding the 'ioctl' question, Mitchell replied, "As I explained a few lines down from where you stopped quoting (and probably stopped reading) the ioctl() use is just an artifact of Solaris's icky implementation. It could and should use read(). Not an extra syscall." There was no reply.

Far elsewhere, Jordan Mendelson gave a link to a Linux implementation of '/dev/poll/index.html' for 2.2 and 2.4, and Dan also gave a link to FreeBSD's 'kcqueue' interface for events. To this, Linus replied:

I've actually read the BSD kevent stuff, and I think it's classic over-design. It's not easy to see what it's all about, and the whole <kq, ident, filter> tuple crap is just silly. Looks much too complicated.

I don't believe in the "library" argument at all, and I think multiple event queues completely detract from the whole point of being simple to use and implement.

Linus led the thread back to his own proposal, and there followed a bit of implementation discussion.

2. Which Compiler To Use?

24 Oct 2000 - 31 Oct 2000 (65 posts) Archive Link: "[patch] kernel/module.c (plus gratuitous rant)"

Topics: Version Control

People: Linus TorvaldsAndrew MortonAlan CoxRichard HendersonDominik KublaPeter SamuelsonMartin DaleckiPavel MachekKeith OwensDan AloniBarry K. Nathan

Andrew Morton disgruntledly posted a quick patch to let gcc- compile kernels again, and complained that it never would have been broken in the first place if the person who'd sent Linus Torvalds the initial patch had Cced linux-kernel (Dan Aloni authored the patch in question, as he told me by private email, and added that he'd simply forgot to Cc: linux-kernel -- Ed: [13 Nov 2000 08:45:00 -0800]; but Linus replied:

It seems that gcc- is terminally ill. I'd rather change Documentation/Changes, and just document the fact.

These kinds of subtle work-arounds for gcc bugs are not really acceptable, nor is it worthwhile complaining when somebody does development with a gcc that is _not_ broken, and doesn't notice that some random gcc bug breaks the kernel for others.

Andrew agreed that changing the documentation would probably be best, and added, "My gripe is unrelated to gcc - it concerns people who send you patches and keep everyone else, including the nominal maintainer in the dark. I challenge anyone to demonstrate how the evasion of peer review strengthens Linux." Elsewhere, Barry K. Nathan posted the patch to fix up the docs, so that instead of recommending that very compiler, it would recommend egcs 1.1.2 (A.K.A. gcc 2.91.66), and warn against using; Andrew liked the patch, but bemoaned, "all the documentation has for years been saying that is the one true compiler, so we are now in for 12 months worth of bogus oops reports." He posted another patch to add more warnings, and also pointed out that the kernel FAQ needed updating as well.

Elsewhere, Pavel Machek asked if it might not be best to continue to support gcc, since it was still needed for kernel 2.0 and many 2.2 as well. But regarding 2.2's gcc requirements, Alan Cox replied, "There has only been one know egcs 1.1 build problem found in the last 9 months or so (the fpu emu one). I really dont think using egcs 1.1.2 to build 2.2 kernels is a problem. In fact its probably the default nowdays."

Keith Owens suggested that Pavel (or anyone using 2.0) could install multiple versions of gcc, and select the desired one by setting the 'CC' makefile variable before compiling (Dominik Kubla added that gcc accepted the '-V' command line option to specify which version to use; although Richard Henderson pointed out, "a nice idea, but it doesn't actually work. Changes in spec file format between versions makes this fall over." Dominik replied, "Wow. So much for reading the manual..." ).

At one point someone asked what the recommended compiler was for all the various kernel versions, and Peter Samuelson listed:

Martin Dalecki took exception to Peter's saying that Red Hat's 2.96 would break any known kernel, objecting, "Works fine for me and 2.4.0-test10-pre5... however there are tons of preprocessor warnings in some drivers."

3. Root Filesystem In 'ramfs'

29 Oct 2000 - 1 Nov 2000 (12 posts) Archive Link: "/ on ramfs, possible?"

Topics: FS: ramfs

People: Alan CoxStuart LynneDavid Woodhouse

Anders Eriksson had had no luck at all mounting his root filesystem via 'ramfs'. He asked if there was any hope, and several folks brought succor. Alan Cox replied with his usual concision, "Firstly make sure you get the patches that make ramfs work if they arent all yet applied, then build your initrd that populates it on boot. Now you can make use of the pivot_root() syscall (you may need to generate your own syscall wrapper for that one as its very new). That lets you flip your root and initrd around." Stuart Lynne recommended pivot_root() as well, and suggested:

Something like the following at the end of your initrd /linuxrc script should mount your ramfs, copy the existing root fs files to it, pivot and unmount your old root. YMMV

mkdir -p /ramfs /ram1
mount -t ramfs /ramfs /ramfs
find / | sed '/^\/ramfs/d;/^\/proc\/.*/d/index.html' | cpio -pdmV /ramfs
cd /ramfs
pivot_root . ram1
exec chroot . sh -c 'umount /ram1; exit' < dev/console >dev/console

David Woodhouse wondered why bother with 'ramfs' at all, and recommended just using a ramdisk, but several folks pointed out that ramdisks were limited to a fixed size, while 'ramfs' could grow and shrink.

4. Configuring The Proper CPU

2 Nov 2000 (6 posts) Archive Link: "test10 won't boot"

People: Arajan van de VenHans-Joachim BaaderArjan van de VenJeff Garzik

Hans-Joachim Baader reported that 2.4.0-test10 wouldn't boot on his AMD K6-2/400, and posted his '.config' file. Jeff Garzik offered a patch, but Hans reported no success. Then Arjan van de Ven noticed that Hans had configured the kernel for the Pentium II/III CPU type, which wouldn't work on the K6. He explained, "The compiler (and the kernel) will use the "new" Pentium II instructions (such as "cmov") which are not supported by the K6, leading to "illegal instruction" usage very early." Hans replied:

Ouch. This was it. I simply overlooked this option and used the one that worked with 2.2...

But it would be nice if the kernel could detect the wrong CPU type and print a message before it stops. Perhaps compile the init section without CPU specific optimization?

There was no reply.

5. OSS API Bug

2 Nov 2000 - 4 Nov 2000 (12 posts) Archive Link: "Poll and OSS API"

Topics: Sound: OSS

People: Thomas SailerLinus TorvaldsAlan Cox

Thomas Sailer reported:

The OSS API (, page 102ff) specifies that a select _with the sounddriver's filedescriptor set in the read mask_ should start the recording.

Implementing this is currently not possible, as the driver does not get to know whether the application had the filedescriptor set in the select call. Similarily for poll, the driver does not get the caller's events.

Different drivers do it differently. The ISA SB driver just unconditionally starts recording on select, whether the bit in the read mask is set or not. es137* unconditionally does not start recording. Both drivers violate the API.

I don't think this is all that important though, because it's that way for more than a year and the first complaint reached me yesterday.

Amid general confusion, Linus Torvalds replied, "So fix the stupid API. The above is just idiocy." Alan Cox replied, "Its a documentation error. The implemented API does not follow it." Elsewhere, Linus went on:

Considering that about 100% of the sound drivers do not follow that particular API damage anyway (they can't, as has been pointed out: the driver doesn't even receive enough information to be _able_ to follow the documented API), I doubt that there are all that many programs that depend on it.

Yes, some drivers apparently _try_ to follow the spec to some degree, but we should just change the documentation asap.

6. Security Hole In Latest Developer Release

3 Nov 2000 - 4 Nov 2000 (5 posts) Archive Link: "SETFPXREGS fix"

People: Andrea Arcangeli

Andrea Arcangeli discovered that 2.4.0-test10 had an interesting security hole, allowing any user to disable the Floating Point Unit by giving an incorrect value to the 'mxcsr' instruction. He posted some code to demonstrate the problem, and included a patch to fix it. Gareth Hughes had some problems with the code - not with correctness, but just with overall complexity - and felt it could be simplified. Andrea agreed, but added, "my first priority was that the the code was safe, and the previous one was completly safe too (ok, I admit I had to check out the asm generated before trusting it 8)." End of thread.

7. Migrating To A More Modular Kernel

4 Nov 2000 - 6 Nov 2000 (5 posts) Archive Link: "modular kernel"

Topics: Bug Tracking

People: Taco WittePeter SamuelsonHorst von BrandDavid FortEric S. RaymondLinus Torvalds

Taco Witte liked an idea he'd read recently, about making the kernel completely modular. He argued, "it would make it easier to get more people work at the same moment, development would go faster. It would be possible to make groups for a certain part of the kernel (for example sound, or filesystems, or main) with own group pages with status info and todo's and own mailinglists (it would divide this enourmous flow of mail into smaller parts). It would decrease the download size. I believe it would make bug tracking easier as well." Peter Samuelson replied, "I contend that the barrier to entry is already quite low, as proven by the fact that *I* contribute to kernel development, albeit rather little. What evidence do you have to the contrary?" David Fort thought he saw all the fixin's of a flame war, and Horst von Brand added, "Yep. About a very old bone, to boot." Apparently, these posts were enough to allay any other flames that might have been burning on the horizon.

Kernel Traffic most recently covered this kind of proposal in Issue #91, Section #8  (17 Oct 2000: Proposal To Speed Up Release Cycle) . It also came up in Eric S. Raymond's famous critique of Linus Torvalds, in Issue #83, Section #4  (21 Aug 2000: Driver Organization; Serial Devices And X; Sharing Code; Philosophy Of Development) . Still earlier, Linus supported this very idea, and projected a possible timeline whereby greater modularity might be achieved. This was covered in Issue #77, Section #7  (13 Jul 2000: Improving The Kernel Release Schedule)

8. Addresses Kicked Off linux-kernel

4 Nov 2000 - 5 Nov 2000 (2 posts) Archive Link: "[NOISE] Is the mailing list dead?"

People: Matti Aarnio

Mathieu ChouquetStringer complained that he hadn't gotten any linux-kernel mail in about two days, and Matti Aarnio explained:

Looks like all addresses have been kicked out.

That ISP has a habit of returning 500-series reponse codes with supplementary text of style: "System is too busy, please come back latter"

Which of course generates error bounce (like all 500 series codes are supposed to by the RFC 821) leading to removal of erroring address from the list.

Had we fully automated the removal process, both WANADOO.FR and GMX.NET/GMX.DE/GMX.AT users would get kicked out every day around 18-24 CET (UTC+1h). Now it is matter of manual oversight, and while I have been lenient letting those users be, listowners have different minds.

End Of Thread.

9. Preferred Way To Hang A System

6 Nov 2000 (6 posts) Archive Link: "Play Kernel Hangman!"

Topics: BSD: OpenBSD

People: Jeff DikeFrank v WaverenLeen BesselinkJoanne DowEli Carter

Jeff Dike finally snapped, and announced:

After a stranger than usual late-night #kernelnewbies session on Thursday, I was inspired to come up with Kernel Hangman. This is the traditional game of hangman, except that the words you have to guess are kernel symbols.

So, test your knowledge of kernel trivia and play it at

Eli Carter became immediately addicted and had to seek help, though he did not seem at all angry about it. Frank v Waveren stomped in with, "Pah! I'll be impressed when you code it as a kernel-module, activating with left-alt scrolllock or something (that still appears to be free)." But Leen Besselink pointed out, "Actually, OpenBSD already has this (in the kernel !) After a kernel crash once, I got in the kerneldebugger. I didn't really know how to use it, but I could play hangman." Joanne Dow couldn't resist, and replied mischievously, "Now that might be the best argument for a kernel debugger we've seen yet." While Jeff also replied, "That's what prompted this. My little mind got to working after someone on #kernelnewbies told me about the hangman program in the OpenBSD kernel debugger, and the rest is history..."







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.