Kernel Traffic #49 For 3 Jan 2000

By Zack Brown

Mailing List Stats For This Week

We looked at 1023 posts in 4000K.

There were 366 different contributors. 171 posted more than once. 156 posted last week too.

1. Thread-Private Mappings; Linus On Unix

13 Dec 1999 - 21 Dec 1999 (67 posts) Archive Link: "Thread-private mappings and graphics (was Re: Per-Processor Data Page)"

Topics: PCI, SMP, Virtual Memory

People: Jon Leech, Linus Torvalds, Larry McVoy, James Simmons, David S. Miller

Jon Leech pointed out that thread-private mappings were very useful for 3D graphics applications. He explained, "The OpenGL API relies on an implicit graphics context, so that multithreaded apps need to do some sort of thread-specific lookup at each call." He added, "Many GL calls can be just a few instructions long on suitable hardware (e.g. shove parameters into a DMA buffer or FIFO), so the lookup needs to be very fast. Brian Paul has done some preliminary testing demonstrating a roughly 3:1 performance hit for performing thread-specific lookup via pthread_getspecific() on an otherwise empty call, giving some guidance as to impact in a real driver." He suggested that some sort of kernel implementation would be much easier than requiring changes to the OpenGL API, which would in turn necessitate rewriting a lot of programs and documentation.
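To make the cost Jon describes concrete, here is a minimal sketch in C of an API entry point that finds its context implicitly through pthread_getspecific(), next to one that takes the context as an explicit argument. The names gfx_context, make_current, draw_implicit and draw_explicit are hypothetical and are not part of OpenGL or of any driver discussed in the thread; the sketch shows only the shape of the per-call lookup and makes no claim about the measured 3:1 figure.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-thread graphics context. */
struct gfx_context {
    int draw_calls;   /* number of calls issued through this context */
};

static pthread_key_t ctx_key;
static pthread_once_t ctx_key_once = PTHREAD_ONCE_INIT;

/* Create the thread-specific key exactly once per process. */
static void make_ctx_key(void)
{
    int rc = pthread_key_create(&ctx_key, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Bind a context to the calling thread (hypothetical "make current"). */
int make_current(struct gfx_context *ctx)
{
    pthread_once(&ctx_key_once, make_ctx_key);
    return pthread_setspecific(ctx_key, ctx);
}

/*
 * Entry point with an implicit context: every call pays for a
 * thread-specific lookup before it can do any real work.
 * Returns the running call count, or -1 if no context is bound.
 */
int draw_implicit(void)
{
    struct gfx_context *ctx;

    pthread_once(&ctx_key_once, make_ctx_key);
    ctx = pthread_getspecific(ctx_key);
    if (ctx == NULL)
        return -1;
    return ++ctx->draw_calls;
}

/* Entry point with an explicit context: no lookup, but a different API. */
int draw_explicit(struct gfx_context *ctx)
{
    return ++ctx->draw_calls;
}
```

A thread would call make_current() once and then draw_implicit() many times. The explicit variant avoids the lookup, but it is exactly the kind of API change Jon wanted to avoid.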

James Simmons agreed, but Linus Torvalds said:

It will never happen on Linux.

I'm surprised an SGI person hasn't learnt from past mistakes. IRIX is unstable, and unmaintainable, and please just face it - it's because SGI had the "cool feature of the day" disease.

Thread-private mappings WILL NOT HAPPEN. You can obviously do them in SGI Linux, but that way lies madness, and it's something I'll keep pointing out in public.

You can have thread-private _pointers_. Just have different mappings of the same hardware context if you have to, and just have different pointers to it in different threads.
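One way to read the "thread-private pointers" suggestion is sketched below in C, under the assumption that C11 thread-local storage is an acceptable way to hold the pointer. The memory itself (context_slots, a hypothetical stand-in for mappings of a hardware context) is shared by every thread; only the pointer my_context differs per thread. All names here are illustrative and do not come from the thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NSLOTS 4

/* Shared by every thread: ordinary memory in the one shared address space. */
static int context_slots[NSLOTS];

/* Private per thread: only the pointer differs, not the mapping. */
static _Thread_local int *my_context;

/* Point the calling thread at one slot of the shared array. */
void bind_slot(int slot)
{
    assert(slot >= 0 && slot < NSLOTS);
    my_context = &context_slots[slot];
}

/* Thread body: bind to the requested slot and write through the pointer. */
static void *worker(void *arg)
{
    int slot = *(int *)arg;

    bind_slot(slot);
    *my_context = 100 + slot;
    return NULL;
}

/* Run one worker thread bound to `slot`; returns 0 on success. */
int run_worker(int slot)
{
    pthread_t t;

    if (pthread_create(&t, NULL, worker, &slot) != 0)
        return -1;
    return pthread_join(t, NULL);
}

/* Accessors so callers can inspect the state. */
int *current_context(void) { return my_context; }
int slot_value(int slot)   { return context_slots[slot]; }
```

After the main thread binds slot 0 and a worker binds slot 1, each thread's pointer leads to its own slot, yet either thread could still read the other's slot, because nothing about the address space is private.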

In the same post, he went on:

thread-private mappings are FUNDAMENTALLY broken. They completely break the whole point of having a thread in the first place, and I can only suggest that if you want them you should look at a great concept that was invented, oh, thirty years ago by people more intelligent than you or I.

Namely "fork()". Which gives you the thread-private mapping you want.

If you want threads with shared address spaces, then make them SHARED. None of this private stuff. It's not on the table, and it won't be. I've explained to people why before, and I bet I'll end up explaining again, but the short and sweet of it is that you CANNOT do thread-private mappings without losing most of the performance advantages of a thread in the first place.

And once you lose the performance-advantage of threads, not all that much else remains. Threads certainly aren't any easier to program than processes...
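The point about fork() can be illustrated with a short C sketch: after fork(), the child's writes to ordinary memory are private to the child, so the parent never sees them. The function names are hypothetical, and passing the child's value back through the exit status is just a convenience for the demonstration.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ordinary global: private to each process after fork(). */
static int counter = 1;

/*
 * Fork a child that modifies its own copy of `counter` and reports
 * the result through its exit status.
 * Returns the child's value, or -1 on error.
 */
int child_counter_after_fork(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        counter += 41;      /* touches only the child's private copy */
        _exit(counter);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* The parent's copy, which the child's write never reaches. */
int parent_counter(void)
{
    return counter;
}
```

Here the child ends up with 42 while the parent's counter is still 1, which is the private-mapping behaviour Linus says processes already provide.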

Larry McVoy replied:

I agree with Linus. IRIX is full of things that no one but a marketing bunny would like, and Linux shouldn't follow in those misguided footsteps.

My understanding of what IRIX had to do to get their semantics to work, based on having been a kernel engineer at SGI for 3 years, so I've looked at this code, is that when you have private mappings, you typically end up replacing the TLB miss handler. The details get sort of messy, but the problem can be summed up as the address space does not belong to a thread, in fact it is the other way around. As long as your threads either share completely or don't share completely, then you only have one VM struct and one set of page tables to walk. As soon as you start having thread private mappings, then you end up having to have different (usually overlapping) page tables. Hence the new TLB miss handler which sorts out the mess.

You really really don't want to do what SGI did. What you do want to do is clearly state what your needs are - leaving out any details about how you want those needs solved - and keep restating them until someone says "ah - you do that like this" and Linus likes it.

James also replied to Linus, arguing in favor of thread-private mappings. He said:

I'm not talking about implementing Thread-private mappings for the entire system. What I have done is when the process mmaps the accel region set a flag. This way if the process creates new threads or forks it will have a private mapping. The impact of private mapping is thus minimized. Once a process unmaps the accel region the flag is turned off. The only way performance can be killed on the system is if the process that mmapped the accel region creates a bunch of threads and none of the threads use the accel region. I just don't see that happening. Also how many threads will a well designed OpenGL library have. You don't want to go crazy here creating a bunch of threads.

The question we have to ask why do we want private mapping. The reason DRE (Direct Rendering Engine) needs this is to ensure a page fault happens. If you don't want private mapping for thread in this case then tell me a way to ensure every thread or forked process that uses the accel engine page faults. Why do you need a page fault? Because the page fault is what I used to serialize access to the accel engine on SMP machines and to save and restore the graphics context. As Jon Leech pointed out each threads need to refer to *different* graphics contexts. OpenGL is based on the graphics pipeline. Ideally each part of the pipeline would be used by a different process. Say we have SMP machine with one video card. Ideally a different process on each CPU would be in different places in the graphics pipeline. The problem is one thread could have the hardware in a state that could alter the desired results for the other threads since the other threads most likely need the hardware in another state.

Also as Jon pointed out the API requires that these contexts be identified by some thread-specific mechanism available to the graphics library, not by explicit stack pointers in the application. Either the threaded OpenGL API is broken or DRE for linux-SGI is. If this the case then linux will need its own special threaded OpenGL library compared to all the other platforms which don't require this special rewrite.

Jon replied, "Just to be absolutely clear, the topic I brought up initially has to do only with use of the OpenGL API from multithreaded applications; it has nothing whatsoever to do with "DRE", which is not a project SGI is involved in, or with other aspects of the graphics hardware access model."

Linus also replied to James, with:

The thing that makes me admire UNIX is not that it's UNIX. The thing I like about UNIX is that even after 30 years, the fundamentals stand out. The basic notion of files as unstructured streams of data, and the UNIX fork/exec way of doing things (most everybody else does a "spawn()", which does not have the same philosophy at all).

And quite frankly, especially when it comes to threads, Linux has a DESIGN. It's not the hodge-podge of threading issues that we call "pthreads" that came about from different vendors trying to solve the same problem in different ways (including user-mode only solutions). It's a CONCEPT, the same way "fork()" is a concept.

The difference between a "concept" and a "random collection of routines" is that the concept will survive, and the routines won't.

But that also requires that people don't mess up the concept by thinking that it's "just an implementation". Because once lost, you won't be able to undo the changes. You can fix bugs, but you can't easily remove the "feature of the day".

Linus went on:

You've broken the concept of shared TLB's. You've then superglued the broken parts together and said "if you look at it from the right angle you cannot SEE the cracks".

But the cracks ARE THERE! They make the system more complex internally, and even when you don't see the cracks you may notice that the thing doesn't quite stand up straight.

You broke the whole notion of interchangeable, anonymous, shared TLB mappings.

You broke the notion that when a thread is switched, no MM work needs to be done.

You broke the notion that you can simply look at "process->mm" and determine whether it can share the TLB state with another thread (even if that TLB state in between was used by an anonymous kernel thread).

In short, you took a notion and dirtied it until it was just a random collection of routines that superficially LOOK like it has a design.

In short, you didn't care about the BEAUTY.
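The invariants Linus lists can be sketched as a toy context-switch routine in C. The structures below (mm, task, cpu) are hypothetical, heavily simplified stand-ins and are not the kernel's real data structures; the sketch only shows why comparing one pointer is enough when threads either share an address space completely or not at all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's real structures. */
struct mm {
    int id;
};

struct task {
    struct mm *mm;          /* NULL for an anonymous kernel thread */
};

struct cpu {
    struct mm *active_mm;   /* address space whose translations are loaded */
    int mm_switches;        /* how often MM state had to be replaced */
};

/*
 * Switch the CPU to `next`. MM work is needed only when the incoming
 * task has an address space that differs from the one already loaded.
 */
void switch_to(struct cpu *cpu, const struct task *next)
{
    if (next->mm == NULL)
        return;             /* kernel thread borrows whatever is loaded */
    if (next->mm == cpu->active_mm)
        return;             /* same address space: no MM work at all */
    cpu->active_mm = next->mm;
    cpu->mm_switches++;
}
```

With thread-private mappings, two threads with the same mm pointer could no longer be assumed to share translation state, so the single pointer comparison above would stop being sufficient.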

To James' statement that the question to ask was why we wanted private mappings, Linus continued:

No. You're asking the WRONG question.

There is no way in HELL we want private mappings. End of story.

Use processes and SysV shared memory if that is what you're after. It works today, and it gets you EXACTLY the same semantics that you are apparently after. Sure, you have to think about the problem another way, but it's just a mirror image.

Instead of saying "ok, I want private mappings in a shared address space", you say "ok, I want shared mappings in private address spaces".

What's the difference? Doesn't sound like much, no?

But look at it from a NOTIONAL standpoint. You don't break anything by taking a private address space and adding a shared object to it - we've had that notion for a long time, and then you HAVE a private TLB and a private page table to play with.

In contrast, if you break the sharedness of a shared address space, YOU DON'T HAVE ANYTHING LEFT! You just broke the bubble, and it popped. You turned it into a private address space, for no gain (you already HAD private address spaces, so you just degenerated the whole system).
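The "shared mappings in private address spaces" arrangement can be sketched in C with SysV shared memory and fork(). This is a minimal illustration under the assumption that SysV IPC is available on the system; the function names are hypothetical. The child writes to both a shared segment and an ordinary global, and only the shared write is visible to the parent.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ordinary global: private to each process after fork(). */
static int private_value = 1;

/*
 * Attach a SysV shared segment, fork, and let the child write to both
 * the segment and the global. Returns the value the parent then sees
 * in the shared segment, or -1 on error.
 */
int shared_value_after_fork(void)
{
    int shmid, status, result;
    int *shared;
    pid_t pid;

    shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    if (shmid < 0)
        return -1;
    shared = shmat(shmid, NULL, 0);
    /* Mark the segment for removal; it lives until the last detach. */
    shmctl(shmid, IPC_RMID, NULL);
    if (shared == (void *)-1)
        return -1;
    *shared = 0;

    pid = fork();               /* the attachment is inherited by the child */
    if (pid < 0) {
        shmdt(shared);
        return -1;
    }
    if (pid == 0) {
        *shared = 42;           /* visible to the parent */
        private_value = 99;     /* not visible to the parent */
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        shmdt(shared);
        return -1;
    }
    result = *shared;
    shmdt(shared);
    return result;
}

/* The parent's copy of the ordinary global. */
int parent_private_value(void)
{
    return private_value;
}
```

Each process keeps its own page tables, and the shared object is added to them explicitly, which is the direction Linus argues does not break anything.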

Linus concluded, "I'm not asking you. I'm TELLING you that your idea will not be accepted in the standard kernel. I can go on explaining all day WHY, but you don't seem to care. You're ignoring the bigger picture, and that's your right. It's also your right to f*ck up your own version of Linux, but you're not getting close to mine."

James replied, "You are right about the idea of not breaking a concept. It's true I shouldn't break a standard concept to do what a single driver/project wants done. If this was the case there would be no standards. Egg on my face :(" He added that he did care about other people's opinions, and went on to say that he would keep working on the problem, keeping Linus' admonitions in mind.

David S. Miller also replied to Linus with some code, based on an implementation suggestion in Linus' previous post. James replied that he'd work on coding that into the kernel. Linus replied, "Note that you really should look at what DRI did with the 3dfx driver, that does all of this =and= tries to keep much of the context locking in user space (so that only on clashes does it go to the kernel to fix things up). The kernel side is in the standard 2.3.x kernel these days, the X server side is in the unofficial 3dfx X server."

David replied:

Please keep in mind that what the SGI folks are complaining about here and how the 3dfx has to do are radically different issues.

Most commodity 3D graphics hardware these days cannot be interrupted mid-operation, have its state fully saved, have another renderer's state fully restored, and let the latter continue where he left off. You must complete the full operation you are currently in the middle of before you can let someone else have the card.

Whereas most SGI, Sun, and other vendor's higher end cards allows you to arbitrarily stop a renderer mid-place, save the card state, and restore the card state another thread has. This can happen at nearly arbitrary locations, so the following works:

Thread 1            Thread 2
OP = RECTANGLE
X = 0
Y = 0
                    OP = LINE
                    (takes fault, thread1's graphics card state is saved, thread2's is restored and mappings are removed from thread1's space)
                    X1 = 5
                    Y1 = 10
                    X2 = 10
                    Y2 = 10
                    (draws the line)
W = 50
(takes fault, graphics state restored, thread2's state saved and his mappings removed)
H = 50
(draws the rectangle)

Commodity PCI/AGP 3D graphics cards cannot do this, which is why the userland locking solution exists at all. However for cards that can do the above, this is what people want.

To David's "Most commodity 3D graphics hardware these days" comment, Linus replied:

Look out when you're condescending.

Commodity hardware is where it is at when it comes to 3D. Forget about CAD work etc - 3D is all about games, and probably always will be. It's just a fact that the game market is about a million times larger than the traditional 3D market ever was, and as a result they have more resources.

It's the i386 all over again. The "non-professional" 3D solutions are already getting level with the "professional" ones.

Raul Miller also replied to David, asking if there were benchmarks comparing on-board and off-board state swapping. David mentioned one card that would cache multiple rendering contexts' states; and described the page fault handler for such a thing. Linus took up the new subject, saying:

I don't understand why people are so hung up about page faults.

I think it's ENTIRELY because of historical baggage, and the particular implementation under Irix.

What I'm surprised about is that nobody seems to just come out and say:

Page faults are BAD. Playing with the page tables is EXPENSIVE. Page faults fundamentally are NOT thread-safe, because page tables are fundamentally shared among threads.

Ok. Nobody else said it, so now I have.

YOU SHOULD NOT PLAY MM GAMES! They do not scale in SMP, they do not scale with threads, and the costs of missing are absolutely huge. The whole thing is also extremely hard to debug, and implies a much tighter coupling between the kernel and the X server than there should ever be!

You can do a _regular_ SMP-safe lock with _real_ thread safety and no faulting behaviour in a few instructions. We're talking maybe 50 cycles here - about 40 cycles for the actual two locked instructions, and a very generous 10 cycles to check whether you are the old owner and going to the switch routine if not).

Note that IF you have to switch contexts, the regular lock will be a hell of a lot faster than taking page faults, so let's just ignore that case: page faulting obviously loses, and there's no way anybody can seriously claim anything else.

So let's look at the no contention case, where you got the lock, and everything was fine. You spent 50 cycles on verifying it. Big deal. That's 50 _CPU_ cycles. Not memory cycles, not PCI cycles. In exchange for those 50 cycles you get:

Oh. And btw. It's already been done. See the 3dfx driver.

So forget this playing with mmap and page faults. Use mmap() as a way to access the physical hardware, but NOT as a way to switch contexts. Ok?
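The "regular lock plus owner check" Linus describes might look roughly like the following C sketch, which uses C11 atomics. It is not the 3dfx driver's code: the names hw_lock, hw_lock_acquire, hw_lock_release and switch_hw_context are hypothetical, contention is handled by naive spinning rather than by going to the kernel, and the context switch is only counted, not performed.

```c
#include <assert.h>
#include <stdatomic.h>

struct hw_lock {
    atomic_flag held;       /* initialise with ATOMIC_FLAG_INIT */
    int last_owner;         /* protected by `held` */
    int context_switches;   /* how often the hardware state was swapped */
};

/*
 * Hypothetical switch routine: a real driver would save the old
 * owner's hardware state and load the new owner's. Here we only count.
 */
static void switch_hw_context(struct hw_lock *l, int new_owner)
{
    l->last_owner = new_owner;
    l->context_switches++;
}

/* Take the lock (one locked instruction when uncontended). */
void hw_lock_acquire(struct hw_lock *l, int me)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;   /* naive spin; a real driver would not busy-wait like this */

    /* Cheap check: only switch contexts if someone else used the card. */
    if (l->last_owner != me)
        switch_hw_context(l, me);
}

/* Drop the lock (the second locked instruction). */
void hw_lock_release(struct hw_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

In the uncontended, same-owner case the cost is the test-and-set, one integer comparison and the clear, with no page fault and no page table manipulation.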

2. Preparing For Code Freeze

14 Dec 1999 - 23 Dec 1999 (167 posts) Archive Link: "Ok, making ready for pre-2.4 and code-freeze.."

Topics: Code Freeze, Disk Arrays: RAID, Disks: SCSI, FS: NTFS, FS: procfs, I2C, Kernel Release Announcement, Networking, PCI, SMP

People: Linus Torvalds, Ron Flory, Riley Williams, David S. Miller, Alexander Viro, Alan Cox, Tigran Aivazian, Roman Zippel, Ingo Molnar, Jes Sorensen, Dominik Kubla, Henrik Olsen

Linus Torvalds announced Linux 2.3.33, saying:

After doing too many last-minute updates of critical code that we really shouldn't have left this late (Both the mm layer and the SCSI layer was changed quite a lot: we'll be better for it, but I'd have been happier if we hadn't needed to), I'm going to calm things down. I've released 2.3.33 which fixes a few smaller problems with 2.3.32, and I'll let it quiet down a bit for a while.

We're obviously not going to have a 2.4 this millennium, but let's get the pre-2.4 series going this year, with the real release Q1 of 2000.

Henrik Olsen, Tigran Aivazian, Dominik Kubla, Henning P. Schmiedeh, Catalin BOIE, and Ron Flory all pointed out that the millennium actually wouldn't pass until next year. Ron said, "Contrary to popular misconceptions, the year 2000 is actually the LAST year of this millennium. After Dec 31 1999 we will have completed 1999 full years. Jan 1 2001 is the first day of the next millennium." Linus replied:

Contrary to popular misconceptions, PEOPLE DON'T CARE!

The fact that our forefathers were Pascal-programmers, and started counting from one does not mean that we have to continue that mistake forever. We've since moved on to C, and the change from 1999->2000 is a lot more interesting in a base-10 system than the change from 2000->2001.

The reference point of our timekeeping is based on an event where the uncertainty about the timing is much more than a year, and was made up several hundred years AFTER the fact. As such, if you want to be a stickler, you might as well say that the next millennium may have started several years ago.

So please stop sending me email. You don't have to celebrate if you don't want to. But let the rest of the world who doesn't care about silly irrelevant details (what's a millennium to you anyway) just go on with our life.

NEXT year I may agree with you. I'll join the ranks of people with no life but the ability to count in another 360 days or so. But that's mainly in order to have an excuse to go out to town.
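The two counting conventions being argued over differ by a single off-by-one, as this small C sketch shows. The function names are hypothetical; the one-based version follows Ron's reasoning (there is no year 0, so each millennium ends with a year divisible by 1000), and the zero-based version follows the "roll over when the digits do" view Linus is joking about.

```c
#include <assert.h>

/*
 * One-based calendar convention: years are counted from 1, so
 * millennium n covers years 1000*(n-1)+1 through 1000*n.
 */
int millennium_one_based(int year)
{
    assert(year >= 1);
    return (year - 1) / 1000 + 1;
}

/*
 * Zero-based convention: pretend counting started at 0, so the
 * millennium changes whenever the thousands digit changes.
 */
int millennium_zero_based(int year)
{
    assert(year >= 0);
    return year / 1000 + 1;
}
```

Under the one-based rule the year 2000 is still in the second millennium and 2001 opens the third; under the zero-based rule the change happens between 1999 and 2000.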

Riley Williams added:

For those of you interested in this:

These two facts imply that the second millennium ended on 31st December 1993, and we have been in the third millennium for nearly six years now.

David S. Miller had a more on-topic reply, saying, "I'm still a few weeks away from getting my platforms working again, currently I'm wedged at 2.3.27 with some weird perhaps Sparc-specific issue that is preventing user apps from starting up after boot. Could be the new zone code, who knows, no hard clues... been on this for 4 days now."

Alexander Viro added:

With the filesystems/VFS situation looks so (and I'm not going into IWBNI area, only code that needs fixing):

That's just the most pressing stuff. I'm going to fork the -bird (aka VFS-CURRENT) after 2.4.0 and will feed the stable/well-tested stuff back into the main tree (with intention to collapse it after 2.5 will open), but it would be nice if we could fix at least the stuff mentioned above _before_ 2.4.

Jes Sorensen mentioned that he thought Roman Zippel was working on the AFFS code. Alan Cox also replied to Alexander. He said Ingo Molnar was working on RAID and PIII patch merging. He added that CODA could wait until later. Regarding the /proc issues, he went on, "We have a ton of other user exploitable races with module load/unload including basic stuff like open which with a bit of care you can use to crash the machine as any user. Procfs is only a part of this." He continued:

Taking my working list the key items seem to be

Less critical stuff

I have the i2c stuff sorted mostly (thanks to Gerd and co), the 2.2.13/14 stuff I will do next week and is all bug fixes, I'm still trying to stomp all the isa memory mapped I/O issues. I'll also sort the vmalloc cases out.

Most of the above is bug fixing or driver stuff so is post freeze work. The raid and PIII stuff are not. I'd also prefer you to look at the core softnet changes and say yes/no, then draw your line the side of it you choose. The softnet work is the other 50% of the scaling work, without it 2.4 won't scale much better than 2.2 for real world situations. That bothers me.

I've got a list of other stuff (Erez stackable fs, telephony API work, performance counters, lm-sensors, ibcs) that are in the nice but so be it category. I have no problem with those landing outside of 2.4.0 or staying as add ons. Skipping softnet though would I think be a mistake.

3. ReiserFS Or Ext3 In Standard Kernel?

17 Dec 1999 - 23 Dec 1999 (4 posts) Archive Link: "JFS"

Topics: Disk Arrays: LVM, FS: ReiserFS, FS: XFS, FS: ext2, FS: ext3, Ioctls

People: Hans Reiser, Stephen C. Tweedie, Theodore Y. Ts'o

Ted Sikora knew that folks were talking about a journalling filesystem, and asked if ReiserFS or ext3 would eventually be included in the Linux tree. Stephen C. Tweedie, author of ext3, suggested that it would be best to have both, and Hans Reiser, author of ReiserFS, agreed. In a different post, Hans added, "We are working through the holidays to port reiserfs to 2.3. Our not finishing the port is what is holding up our introduction into 2.3. We are late but working hard at it...." EOT.

The possibility of an Ext3 was barely a glimmer in a developer's eye back in Issue #5, Section #1 (3 Feb 1999: Capabilities And ACLs). The flamewar produced by Stephen's early journalling work was covered in Issue #7, Section #6 (12 Feb 1999: fsync(); syslogd; Ext2 Extensions; Linus Chastized). Then in Issue #15, Section #2 (31 Mar 1999: Journalling And 'Capabilities' In ext3), the possibility of including 'capabilities' in ext3 was discussed, including a long analysis by Theodore Y. Ts'o. In Issue #21, Section #2 (20 May 1999: XFS Going Open Source), SGI's promise to release XFS under an Open Source license made some folks wonder whether any other journalling filesystem should be bothered with. This article was also Kernel Traffic's first mention of ReiserFS. ReiserFS was next mentioned in Issue #24, Section #6 (11 Jun 1999: New ioctl For Advanced Filesystems), in the context of migrating the kernel from the FIBMAP to FIONDEV ioctl. Next, in Issue #27, Section #10 (4 Jul 1999: Legacy Compatibility), ReiserFS was listed as a valuable 2.2.x feature, and a good reason to upgrade from 2.0 or 1.2; its next appearance was in Issue #34, Section #4 (29 Aug 1999: ReiserFS Nears Readiness; Difficulties Discussed), where Hans claimed it was almost ready for inclusion in the 2.3.x series. Then ext3 came back under discussion in Issue #38, Section #2 (16 Sep 1999: ext3 Filesystem Status; ACLs), where it was revealed that Stephen had released version 0.01 for kernel 2.2.2; both ext3 and ReiserFS then came up in Issue #43, Section #2 (28 Oct 1999: Journalled Filesystem For Linux), in which (aside from another historical summary like this one) it came out that the latest SuSE was shipping with ReiserFS. In the same issue, ext3 also got a brief mention in Issue #43, Section #5 (29 Oct 1999: Mirroring Via The Buffer Cache) as part of a different argument. Issue #44 had an ext3 status report in Issue #44, Section #2 (8 Nov 1999: ext3 Status Report), and some possible ReiserFS licensing conflicts in Issue #44, Section #4 (8 Nov 1999: Possible GPL Conflicts In Reiserfs License). They both came up again in Issue #45, Section #11 (19 Nov 1999: When LVM And Others Will Go Into The Main Tree), in the context of folding them (and LVM) into the 2.4 kernel. Finally, in Issue #47, Section #3 (1 Dec 1999: ext2/ext3 Compatibility), there was some discussion of compatibility between ext2 and ext3.

4. How To Be A Kernel Hacker

18 Dec 1999 - 21 Dec 1999 (4 posts) Archive Link: "Kernel design"

People: Bill Wendling, Rik van Riel, Riley Williams

Sherif Abou Seda asked how to get the "algorithms of the kernel design." Bill Wendling replied, "First, get a scalpel and a saw. Then, once you have the kernel hacker strapped down to the table and under anesthetized, you can remove his/her brain. Of course, you'll have to translate the wet-ware to some other form of your choosing..."

Rik van Riel also replied to Sherif, saying, "Algorithms for designing the kernel (for doing the actual design and stuff) vary from developer to developer, but most seem to involve staring, deep thought and lots of caffeine. The "graduate student algorithm" should prove a sufficiently close approximation :)" and Riley Williams added:

The algorithms I know about seem to be along the lines of the following:

  1. Read a dozen or so pages of source code looking for whatever bug one is after. Drink several large mugs of coffee whilst doing so.
  2. Toss a coin. If heads, go to step 3; if tails, go to step 4; if it lands on its edge, go to step 5.
  3. Note a bug one wasn't looking for, and fix it. Go to step 1.
  4. Overlook the bug one was looking for. Go to step 1.
  5. Have a brainwave, and go directly to the bug and fix it, even though it was nowhere near the code one was looking at. Go to step 1.

Put any timescale you care to against that lot.

5. Protecting Permissions In NFS

20 Dec 1999 - 22 Dec 1999 (18 posts) Archive Link: "2.3.30 linuxNFS import is broken (Screwed up NFS/RPC credentials)"

Topics: FS: NFS, POSIX

People: Trond Myklebust, Alexander Viro, Linus Torvalds, Horst von Brand

Trond Myklebust noticed that the API for readpage() and writepage() had been changed to no longer pass the file pointer. He explained, "This screws up any attempt to cache the RPC credentials at file opening, since there's no longer any way to pass the credential down to the read/write." The result was a broken 2.3.30 LinuxNFS tree. Alexander Viro asked where the file pointer was used; he said the behavior seemed okay to him. Trond elaborated:

The problem is that NFS relies on the user sending the RPC authentication each and every time we access data on the server. In order not to get a permissions error suddenly if the user changes euid while s/he is reading/writing to the open file, we therefore want to use the same RPC authentication info throughout the file's lifetime. Ideally that means taking the RPC auth info that was valid when opening the file (since this is more or less in line with a POSIX filesystem's behaviour with permission checking at file open only) and caching it somewhere.

The most practical way of implementing this policy is therefore to hide the RPC auth in the file descriptor structure (I use the private data field), and pass that info via the file pointer to readpage/writepage/whatever else needs it.

He added that Linus Torvalds had rejected the patch for this as not yet clean enough.

Horst von Brand objected that if several people open the same file, the permissions might have changed in between, forcing the system to cache the RPC auth information once for each open(), which would get complicated. Trond felt this was a non-problem: he explained that the file structure was allocated when the file was opened, and was not shared between users. Each user could therefore cache their own set of permissions. This was why he preferred to pass the file structure rather than a dentry (directory entry structure), which would be common to all users and would prevent this treatment.
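Trond's argument can be sketched in user-space C. The structures and functions below (rpc_cred, dentry, file, sketch_open, sketch_readpage, sketch_close) are hypothetical simplifications and are not the kernel's VFS or the LinuxNFS code; they only show why a credential cached in the per-open file structure stays distinct for each opener, while anything hung off the shared dentry could not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the kernel's real structures. */
struct rpc_cred {
    int uid;                 /* identity captured at open time */
};

struct dentry {
    const char *name;        /* shared by everyone who opens the file */
};

struct file {
    struct dentry *dentry;
    void *private_data;      /* per-open: holds the cached credential */
};

/* Open: allocate a per-open file structure and cache the caller's euid. */
struct file *sketch_open(struct dentry *d, int euid)
{
    struct file *f = malloc(sizeof *f);
    struct rpc_cred *c = malloc(sizeof *c);

    if (f == NULL || c == NULL) {
        free(f);
        free(c);
        return NULL;
    }
    c->uid = euid;
    f->dentry = d;
    f->private_data = c;
    return f;
}

/*
 * readpage sketch: because the file pointer is passed down, the
 * credential cached at open is available. Returns the uid that
 * would be sent with the RPC.
 */
int sketch_readpage(const struct file *f)
{
    const struct rpc_cred *c = f->private_data;

    return c->uid;
}

/* Close: release the per-open state. */
void sketch_close(struct file *f)
{
    if (f != NULL) {
        free(f->private_data);
        free(f);
    }
}
```

Two opens of the same dentry by different users yield two file structures with different cached credentials. If readpage received only the dentry, there would be no per-open place to find them, which is the breakage Trond reported.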

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.