Kernel Traffic #172 For 23 Jun 2002

By Zack Brown

Table Of Contents

Mailing List Stats For This Week

We looked at 1483 posts in 7774K.

There were 390 different contributors. 205 posted more than once. 163 posted last week too.

The top posters of the week were:

1. New Fast Mutex Implementation For 2.5

5 Jun 2002 - 13 Jun 2002 (45 posts) Archive Link: "[PATCH] Futex Asynchronous Interface"

Topics: FS, Locking

People: Linus TorvaldsPeter WæchtlerOliver XymoronAlan CoxAlexander ViroRusty Russell

Rusty Russell posted some patches to implement /dev/futex, a file-based mutex locking handler. It could be opened, closed, and read via normal system calls, which he said would be very convenient for applications. But Linus Torvalds cried out:

STOP!

What madness is this?

You have a damn mutex system call, don't introduce mode crap in /dev.

Do we create pipes by opening /dev/pipe? No. Do we have major and minor numbers for sockets and populate /dev with them? No. And as a result, there has _never_ been any sysadmin problems with either.

You already have to have a system call to bind the particular fd to the futex _anyway_, so do the only sane thing, and allocate the fd _there_, and get rid of that stupid and horrible /dev/futex which only buys you pain, system administration, extra code, and a black star for being stupid.

Rusty felt that for network programming at least, the socket API was not the best model to copy (close by, Alan Cox said much the same thing); and Peter Wæchtler suggested /proc/futex instead, which he said would involve "Less adminstrative work, clean interface (also for shell scripts like Alan suggested). Al Viro would like this, it's more like Plan9 or QNX6." But Linus replied:

Tell me _one_ advantage from having the thing exposed as a filename?

The whole point with "everything is a file" is not that you have some random filename (indeed, sockets and pipes show that "file" and "filename" have nothing to do with each other), but the fact that you can use common tools to operate on different things.

But there's absolutely no point in opening /dev/futex from a shell script or similar, because you don't get anything from it. You still have to bind the fd to it's real object.

In short, the name "/dev/futex/index.html" (or "/proc/futex/index.html") is _meaningless_. There's no point to it. It has no life outside the FUTEX system call, and the only thing that you can do by exposing it as a name is to cause problems for people who don't want to mount /proc, or who do not happen to have that device node in their /dev.

Elsewhere, Rusty saw the writing on the wall, and tried to implement futexes in the filesystem itself, copying a batch of code from sockfs. In fact, he felt he'd copied a bit too much code, and asked Linus and Alexander Viro if they could spot any areas he could prune. Linus replied:

I don't think it's a matter of cutting, as much as possibly a matter of tryign to share some common code. pipefs, sockfs and now this: they all do pretty much exactly the same thing, and there is nothing that says that they should have separate super_operations, for example, since they are all identical.

And once you have the same super_operations, you really have the same "fill_super" functions too. The only thing that separates these superblocks is the root name, so that /proc gets nice output. So it should be fine to just have

>sb = create_anon_fs("futex");

and share all of the setup code across futex/pipes/sockfs.

Which still leaves you with the

get_unused_fd();
get_empty_filp();
filp->f_dentry = dget(sb->s_root);
.. fill it ..
fd_install(fd, filp);

but by then we're talking single lines of overhead.

Rusty took a look, but decided that the futex code he was trying to write was not quite identical enough to sockfs and pipefs. He asked Alexander for an opinion, but got no reply. Meanwhile Rusty, Linus, and various other folks discussed implementation details for awhile. At one point Peter asked, "What are the plans on how to deal with a waiter when the lock holder dies abnormally?" And Linus said, "That's why they are called FUTEX'es - they're fast. They're NOT SysV semaphores, and they are done 99% in user space. The kernel doesn't even _know_ about them until contention happens, and even then only in a rather dim "somebody wants me to do this, but I don't know _what_ he is doing" way." Later, he added:

the point is not that it would be a performance hit on "exit()", but that WE DON'T TRACK THE LOCKS in the kernel in the first place.

Right now the kernel does _zero_ work for a lock that isn't contended. It doesn't know _anything_ about the process that got the lock initially.

Any amount of tracking would be _extremely_ expensive. Right now getting an uncontended lock is about 15 CPU cycles in user space.

Tryin to tell the kernel about gettign that lock takes about 1us on a P4 (system call overhead), ie we're talking 18000 cycles. 18 THOUSAND cycles minimum. Compared to the current 15 cycles. That's more than three orders of magnitude slower than the current code, and you just lost the whole point of doing this all in user space in the first place.

Oliver Xymoron suggested, "That doesn't rule out approaches like storing a cookie alongside the lock once it's acquired (or in a parallel space). Which can easily be done with a wrapper around lock acquisition. And stale lock detection needn't be done in kernel space either." And Linus replied:

Oh, agreed, you can do debugging locks in user-space, but it won't be the kernel that does anything, it will instead have to depend on something like "if I time out on the lock, I can explicitly test if the previous holder (who saved his thread ID in memory after getting the lock) is still alive and try to clean up after him".

This is what a lot of the filesystem locking code does for the things in /var/lock/xxx, of course.

No kernel necessarily involved.

But it's going to have to depend on the politeness of all threads that partake in the locking.

2. Per-Socket Statistics Proposed And Rejected

6 Jun 2002 - 14 Jun 2002 (50 posts) Archive Link: "RFC: per-socket statistics on received/dropped packets"

Topics: Development Philosophy, Ioctls, Sockets

People: Chris FriesenDavid S. MillerBill DavidsenMark MielkeBen Greear

Chris Friesen proposed:

For a while I've been wanting a way for a program to find out if any of its socket buffers were overflowing due to too much incoming traffic. Finally, I decided to code it up and try it out.

As it turns out, it was relatively simple to add, although it required the addition of two new entries in sockios.h.

Basically, inside of sock_queue_rcv_skb() and sock_queue_err_skb() the receive counter gets incremented unconditionally, and then if there is no free space in the socket buffer then we also increment the counter for messages dropped due to out of memory.

The stats are stored as part of a socket_stats struct, making it easy to add other counters in the future.

To access and reset the counters, two ioctl commands were added to the socket ioctl. GIOCSOCKSTATS is used to get the stats, while SIOCZEROSOCKSTATS is used to reset them. I haven't bothered with trying to reset them both atomically as I don't think it's that critical.

The patch was coded and tested for 2.4.18, but it is known to at least apply (with offsets) on 2.5.20.

David S. Miller completely disapproved of the patch, saying that the patch was only useful for datagram sockets. Ben Greear objected that datagram sockets were the only ones that had the data-dropping problem the patch dealt with, but Mark Mielke suggested that applications could handle this themselves by collecting and analyzing the statistics provided by Chris' patch. But David said that SNMP provided all the statistics any application would need; and that if there were any areas that SNMP didn't adequately cover, it only meant that more SNMP events should be added. The bunch of them went around on it for awhile, with David saying at one point, "Every argument I hear is one out of lazyness. And that is not a reason to add something. Simply put, I don't want to add all of this per-socket counter bumping that only, at best, 1 tenth of 1 percent of people will use. This means that the rest of the world eats the overhead just for this small group that actually uses it." Bill Davidsen said to David, "your arguments sound like you have a solution to your problem and you want everyone to use it even if it doesn't help them. Have you some emotional tie to SNMP, like being an author?" And to this, David replied, "After a comment like this, I have no interest in listening to anything else you have to say. I've been maintaining the Linux networking for 5 or more years now, and the most important thing I do is say no to changes."

3. Coding Style

9 Jun 2002 - 13 Jun 2002 (36 posts) Archive Link: "[PATCH][2.5] introduce list_move macros"

Topics: Coding Style, USB

People: Linus TorvaldsThomas Mirlacher

In the course of discussion, Linus Torvalds went over some of his coding preferences:

The linux coding style _tends_ to avoid using typedefs. It's not a hard rule (and I did in fact apply this patch, since it otherwise looked fine), but it's fairly common to use an explicit "struct xxxx" instead of "xxxx_t".

I dislike type abstraction if it has no real reason. And saving on typing is not a good reason - if your typing speed is the main issue when you're coding, you're doing something seriously wrong.

(Similarly, if you are trying to compress lines to be shorter, you have other problems, nothing to do with type names).

Does code look "prettier" with a "_t" rather than a "struct "? I don't know. I personally don't think so, and I also hate the "_p" (or even more the just plain "p") convention for "pointer".

Hiding the fact that it's a struct causes bad things if people don't realize it: allocating structs on the stack is something you should be aware of (and careful with), and passing them as parameters is is much better done explicitly as a "pointer to struct".

There are _some_ exceptions. For example, "pte_t" etc might well be a struct on most architectures, and that's ok: it's expressly designed to be an opaque (and still fairly small) thing. This is an example of where there are clear _reasons_ for the abstraction, not just abstraction for its own sake.

But in the end, maintainership matters. I personally don't want the typedef culture to get the upper hand, but I don't mind a few of them, and people who maintain their own code usually get the last word.

Thomas Mirlacher tried to sum it up by saying, "using the "struct mystruct" is _recommended_, but not a must." But Linus replied:

Well, it's more than just "struct xx". It's really typedefs in general.

For example, some people like to do things like

typedef unsigned int counter_t;

and then use "counter_t" all over the place. I think that's not just ugly, but stupid and counter-productive. It makes it much harder to do things like "printk()" portably, for example ("should I use %u, %l or just %d?"), and generally adds no value. It only _hides_ information, like whether the type is signed or not.

There is nothing wrong with just using something like "unsigned long" directly, even if it is a few characters longer than you might like. And if you care about the number of bits, use "u32" or something. Don't make up useless types that have no added advantage.

We actually have real _problems_ due to this in the kernel, where people use "off_t", and it's not easily printk'able across different architectures (we used to have this same problem with size_t). We should have just used "unsigned long" inside the kernel, and be done with it (and "unsigned long long" for loff_t).

We should also have some format for printing out "u32/u64" etc, but that's another issue and has the problem that gcc won't understand them, so adding new formats is _hard_ from a maintenance standpoint.

PS. And never _ever_ make the "pointerness" part of the type. People who write

typedef struct urb_struct * urbp_t;

(or whatever the name was) should just be shot. I was _soo_ happy to see that crap get excised from the kernel USB drivers.

4. Binary Files Found In The Kernel Sources

10 Jun 2002 - 13 Jun 2002 (6 posts) Archive Link: "2.5.21 no source for several objects"

Topics: Kernel Build System, Licencing, Source Distribution

People: Keith Owens

Keith Owens reported that, unlike the build systems currently in use, kbuild 2.5 detected a number of binary-only files in the kernel sources. He listed them off, and added, "This is one of the advantages of having a build system that knows about everything." Various folks took responsibility for the binary files, saying in all cases that they had been included by mistake and could safely be deleted.

5. Status Of FAT CVF

11 Jun 2002 - 13 Jun 2002 (5 posts) Archive Link: "Status of FAT CVF?"

Topics: FS: FAT

People: Matti AarnioAndreas DilgerOGAWA Hirofumi

Someone asked about the status of fat_cvf is 2.4; Matti Aarnio replied, "Over about past week there has been some talk with that topic. The conclusion appeared to be "kernel driver is abandoned, future support belongs into mtools"." Elsewhere, Andreas Dilger also said to the original poster, "There was a patch posted last week to l-k which basically removed CVF support from the kernel entirely, because it was totally non-functional." And OGAWA Hirofumi added, "I got direct email about it. Then he said, he ports and uses dmsdos on 2.4. And I asked if he can port dmsdos to 2.5 or not. So, currently that patch is pending."

6. Developer Disconnect

18 Jun 2002 - 19 Jun 2002 (12 posts) Archive Link: "[PATCH+discussion] symlink recursion"

Topics: Developer Interaction, FS: ext2, Security

People: Andries BrouwerLinus TorvaldsDaniel PhillipsAndries Browuer

Andries Brouwer announced, "As promised below an implementation of nonrecursive symlink resolution." But Linus Torvalds replied bluntly:

There is no such thing as a non-recursive symlink resolution.

Symlink walking is by it's very nature recursive, since we have to be able to look a symlink up in the middle of another one.

So either it's recursive in C (caller ends up calling itself) or it linearizes the recursion by hand (caller keeps track of the stack by hand using a linked list or by expanding the pathname in place or whatever, instead of using the C stack).

Both are recursive, it's only a question of whether the recursion is handled by the language or by hand, and whether the interim state is held on the stack or in explicit data structures.

I see no advantages to handling it by hand, since this isn't even a very deep recursion, and since even if you do the recursive part by hand by a linked list you still need to limit the depth _anyway_ to avoid DoS attacks.

In fact, we avoid following symlinks too deeply even for the _non_recursive_ case (see "total_link_count") exactly because of these DoS issues.

Could we allow deeper recursion if we did it by hand? Sure. Are there any real advantages in 15 levels of recursion over 5 levels of recursion? I don't see any.

Andries pointed out that his patch allowed the sysadmin to choose the allowed level of recursion, so they wouldn't have to worry about the possibility of stack overflows. Daniel Phillips also objected to Linus:

It's the recursive trip through the filesystem that causes the problem. Suddenly the stack space consumption becomes (N * greediest_filesystem) and that has to be factored into all worst case calculations. It's a huge hole to blow out of this severely restricted resource, so reducing N to 1 is a big deal.

Also, as a practical matter, it's much easier to lay down a rule for filesystem developers that reads: "thou shalt in no case use more than N bytes of stack on your longest path" than "in no case use more than N bytes of stack except on any symlink resolution path, in which case see the limit on recursive symlinks to know how to analyze that path".

But Linus said:

Actually, the trip to the filesystem itself is not recursive. We only have one lookup _active_ at a time, so the stack depth is fairly well bounded. I think at some point it was on the order of 64 bytes per invocation on x86. It really isn't as bad as people have made it out to be.

(But yes, the x86 is denser on the stack than many architectures)

Note that I'm absolutely not adverse to have people test Andries patch, and I don't _mind_ it. I'm really arguing not so much against the patch itself, as against the _religious_ and unthinking reaction against a perfectly fine programming construct (limited recursion).

PS. Yes, a filesystem can do stupid things, and make the actual single level function have a huge stack-frame: the example was for the normal "page_symlink" thing.

Andries took him to task, saying:

In your previous letter you wanted to play semantical tricks with the word recursive, even though you understood perfectly well what I meant with "nonrecursive" (and you yourself used the same terminology in older posts).

Now I hesitate whether I should react to the above statement. Maybe this time there are semantical tricks with the word active, but it sounds a bit as if you misunderstand the situation.

Let me state the facts instead of worrying about semantics.

The routine link_path_walk() in namei.c will call do_follow_link() in case of a symlink, and this routine will call dentry->d_inode->i_op->follow_link(), say, nfs_follow_link(), which calls vfs_follow_link(), which calls link_path_walk(), etc.

You see that in a stack of N invocations, there will also be N stack frames of foofs_follow_link(). So, yes, in the way I use recursive, routines like nfs_follow_link() are indeed recursive: they end up calling themselves.

Last Sunday or so I gave a demo patch that takes the filesystems out of the loop. Then symlink resolution is still recursive but there will be at most one invocation of foofs_follow_link().

Yesterday I showed that it is also easy to avoid recursion altogether.

These are independent stages, and one might consider doing one and not the other.

Linus replied to Andries' 4th paragraph:

Yes. But did you look at the stack frames of those things? It's something like 16 bytes for ext2_follow_link (it just calls directly back to the VFS layer), 20 bytes for vfs_follow_link(), and 56 for link_path_walk.

(yeah, it obviously isn't just 64 bytes any more, it's 92 bytes. Maybe I just counted wrong last time. Or maybe link_path_walk is different enough).

Oh, and I think the actual ->follow_link pushes 8 bytes of arguments.

So doing a recursion 5 deep is ~500 bytes of stack space.

That was my point: the _only_ difference between "explicit recursion" and "recurse by hand" is where and how those intermediate bytes are allocated. There is nothing inherently "evil" in letting the compiler do it for you.

And as you found out yourself, a recursion limit of 5 is actually a lot more than people normally use.

But hey, guys, if you want to linearize the recursion, I'm easily swayed by numbers. I've actually done the numbers for stack usage (exactly because I worried about it some time ago), and I don't worry too much about that number. I also don't worry about the number "5", simply because I don't think I've _ever_ gotten a complaint about it that I remember.

But there are other numbers, like performance (sometimes linearizing recursion loses, sometimes it wins), or somebody doing the math on ia-64 and showing that the 100 bytes/level on x86 is actually more like 2kB on ia-64 and totally unacceptable.

But trying to sell this thing to me as a "recursion is evil and must be stamped out" just doesn't work.

Andries snapped his fingers, and said:

I realize you probably missed the history. There was a discussion about large stack variables found by Checker, and how Checker might have difficulties getting the max stack right in the presence of this recursion.

I remarked that it is trivial to remove the recursion, should anyone be interested in that, but Al called such claims false and wrong, so I did half of the work (removing the filesystems from the loop) three days ago, and Al said that it was nothing yet, actually removing the recursion would be hell. So I removed the recursion yesterday evening. A demo project.

(And you were cc'ed only because I happened to come across what I think is a refcounting buglet in the present code.)

Have not heard from Al yet, but my secret hope is that he'll soon come back and tell me that my code is ugly and contains seventeen bugs but that he has done it right with elegant, fast and maintainable code.

In the meantime, no, this was not a patch submission, it was for discussion and because Al wanted to see actual code.

There was no reply.

 

 

 

 

 

 

Sharon And Joy
 

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.