Kernel Traffic #65 For 1 May 2000

By Zack Brown

Table Of Contents


I'd like to thank all the people who emailed me about the semi-broken link to the generic latest issue. The URL changed recently, so anyone bookmarking the old link should update their bookmarks, as it won't be maintained. I think I emailed everyone back who wrote to me about this, but profuse apologies if I missed anyone. In any case, this is exactly the kind of thing I'd like to hear about, and many thanks to all who wrote in.

Special thanks also go to Patrick Erler, who pointed out that the printer-friendly version still used colors for quoted text, which wouldn't actually show up in the print-out. So I've changed the format of quoted text in the printer-friendly version to be bold and italic, instead of dark red. At the moment this is an imperfect solution because of the way the pages are generated, so I'd appreciate any bug reports or other comments. In the meantime, thanks for the suggestion, Patrick!

Special thanks also go to Jerome Bertorelle and Chris Baker, who both emailed me about the notation described in Issue #64, Section #3 (10 Apr 2000: Cornering A Slowdown) for measuring the speed of algorithms. Jerome said to me:

I have some precision on the O(...) notation -- if you care ;-) It is due to the Russian mathematician Landau, and is widely used in mathematics (function analysis). It comes in two flavors:

What does it mean in practice -- applied to computer science?

A function that belongs to o(f(n)) has a cost which is negligible compared to f when n gets large. A function that belongs to O(f(n)) has a cost which is similar to f when n gets large enough.


Note that this is true *for large values of n*, and that the constant implied by the O(...) notation is *arbitrary*. So for small value of n, a function which is O(n^2) might be much faster than an O(1) function.
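Jerome's caveat about the arbitrary constant is easy to make concrete. Below is a small illustration (the cost functions are hypothetical, not from the thread): a routine costing 5n + 3 steps is O(n) with C = 6 and n0 = 3, yet an O(1) routine that always costs 1000 steps loses to it for small n.

```c
/* Landau's definition: f is in O(g) if there exist constants C and n0
 * such that f(n) <= C * g(n) for all n >= n0.  The constants are
 * arbitrary, which is why big-O says nothing about small n. */

static long cost_linear(long n)   { return 5 * n + 3; }     /* O(n)     */
static long cost_constant(long n) { (void)n; return 1000; } /* O(1)     */
static long ident(long n)         { return n; }             /* g(n) = n */

/* Check f(n) <= C * g(n) for every n in [n0, limit] */
static int is_big_o(long (*f)(long), long (*g)(long),
                    long C, long n0, long limit)
{
    for (long n = n0; n <= limit; n++)
        if (f(n) > C * g(n))
            return 0;
    return 1;
}
```

Here is_big_o(cost_linear, ident, 6, 3, 1000000) holds, since 5n + 3 <= 6n for all n >= 3, so cost_linear is O(n); yet at n = 10 the "slower" O(n) routine costs 53 steps against the O(1) routine's 1000.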

Mailing List Stats For This Week

We looked at 1038 posts in 4526K.

There were 385 different contributors. 172 posted more than once. 145 posted last week too.

The top posters of the week were:

1. Loopback Device Broken In Latest Kernels

11 Apr 2000 - 20 Apr 2000 (32 posts) Archive Link: "PROBLEM: ls gets stuck in D state on 2.3.99pre4-5"

Topics: SMP

People: Steve Dodd, Stephen C. Tweedie, Tigran Aivazian, Mike Galbraith, Mircea Damian, David Ford, Theodore Y. Ts'o

In the course of discussion, David Ford mentioned that loopback had been broken since 2.3.3x; Mircea Damian asked if anyone would fix it, and David gave the only reply, saying he didn't have time. Elsewhere, Peter T. Breuer asked why Theodore Y. Ts'o didn't seem to be working on it, since it was his driver; but there was no reply from Ted or anyone else. Elsewhere there was some discussion of how to go about fixing it, and at some point Steve Dodd asked Stephen C. Tweedie, "Do you have any hunch as to what _is_ causing the loop device deadlocks?" Stephen replied, "No, but I know how I'd go about searching. Use the SGI kdb debugger code. Once you hit the deadlock, break out into the debugger. That not only lets you take a backtrace of whatever code is executing at a given point in time, it also lets you take a backtrace of any sleeping process. Run "btp <pid>" on any of the processes stuck on the loop device and you'll soon find out where they are blocked at least."

Tigran Aivazian replied that KDB didn't seem to work on the latest kernels. Stephen said he'd only tried it on the earlier 2.3 versions, but Tigran then said:

just to let everyone know - I tried the kdb-v1.1-2.3.48 from SGI against 2.3.99-pre6-3 on SMP and it works just fine (the modifications I had to make were trivial).

It definitely didn't work against 2.3.49-52 but that may have been a different kernel bug triggered by kdb. (on a different SMP machine)

Mike Galbraith asked what changes Tigran had actually made to get KDB working, and Tigran posted a one-line patch to smp.c; there followed some criticism of the patch (though everyone agreed it would work), but the discussion did not return to the loopback device deadlocks.

2. Proposal: LUID For Secure Auditing

14 Apr 2000 - 19 Apr 2000 (90 posts) Archive Link: "Proposal "LUID""

Topics: Capabilities, Samba

People: Linda Walsh, Alan Cox, Rik van Riel, Jamie Lokier, Brandon S. Allbery

Linda Walsh proposed:

How do people feel about the following proposal:

Adding support for login user id (auditable user id).

  1. adding a variable "luid" to the uid_t line in the task struct
  2. adding two system calls - 1 to 'set' and one to 'get' the value.
  3. adding CAP_SET_LUID that allows setting the luid.

Set points would be at 'login', cron/at (running as a user), r(sh,cp,login), and s(sh,..?). Implementation at user level would probably be in a pam library. This wouldn't change over exec's/forks nor would it change with 'su' nor with SUID programs.

This id would be used to track a user from the point of access to the system to their ending contact which is required for C2 (now CAPP) auditing.
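Linda's three pieces, a task-struct field, get/set system calls, and a capability gating the setter, can be sketched as follows. This is a hypothetical userspace model of the idea, not her patch; the names and the struct layout are stand-ins, and LUID_UNSET plays the role of the "null" luid Brandon describes later in the thread.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical model of the LUID proposal (illustration only) */
#define LUID_UNSET ((uid_t)-1)

struct task {               /* stand-in for the kernel task struct */
    uid_t uid, euid, suid;  /* the existing uid_t line ...         */
    uid_t luid;             /* ... plus the proposed login uid     */
    int cap_set_luid;       /* stand-in for CAP_SET_LUID           */
};

/* sys_getluid(): anyone may read their own login uid */
static uid_t sys_getluid(const struct task *t)
{
    return t->luid;
}

/* sys_setluid(): requires CAP_SET_LUID, and only works once -
 * a process whose luid is already set may not change it, so the
 * id survives su, setuid execs, and forks unchanged. */
static int sys_setluid(struct task *t, uid_t luid)
{
    if (!t->cap_set_luid)
        return -EPERM;
    if (t->luid != LUID_UNSET)
        return -EPERM;
    t->luid = luid;
    return 0;
}
```

On this model, login (via PAM) would call sys_setluid() once after authentication, and every later audit record could be traced back to the authenticated user regardless of uid changes.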

A number of fairly short (and some longer) subthreads branched off from Linda's post. Alan Cox gave his approval, "I actually implemented it for some experimental stuff I was doing (resource tracking). Certainly doesnt bother me. It should be 'obvious' code."

Elsewhere, Rik van Riel objected that user-land daemons like httpd, sendmail, procmail, ftpd, and many others would be affected by this change, and would require fixing. He added, "Unless somebody volunteers to rewrite those daemons, it may be best to keep the change transparent." But Linda pointed out, "Httpd, sendmail and all the daemons you mention would be run with the default system ID of 'init'. They are 'system' processes and as such, in a 'trusted' Computing base (TCB) they would not have a 'login' id associated with them. ftpd/rtelnetd should theoretically be using 'pam' when they start a login session. I've been told by someone else in my group, who is analyzing these functions, that rtelnetd calls login (which uses pam). On my system there are entries for both rlogin and ftpd and samba, etc in pam. So none of the daemons you mention would be affected."

Rik had suggested using EUID instead, and now Jamie Lokier suggested, "why not just use the time-honoured "real user id"?" But Brandon S. Allbery replied, "I think you're misunderstanding; this is a "new idea" only for Linux. LUIDs are part of CAPP, which used to be called "C2" security. IOW it's an existing standard, and one that some places insist on." Jamie asked:

Ok, but what's the point? There is a perfectly functional "real user id", and since you have to audit daemons to ensure LUID is properly tracked according to your preferred definition of "session", why not just ensure that the ruid tracks the same way. I.e., just use ruid and call it LUID for CAPP purposes.

If you're thinking about capabilities to restrict LUID changes to the select few daemons (e.g. just "login" and then only via a physical console) -- well, once you've gone that far, the ruid isn't used for anything else. It's available for use as the CAPP LUID :-)

Brandon had the last word of the subthread, with:

No. ruid changes over su; it has to, so setuid programs do the right thing (if you su, you do not want setuid programs to switch between their set uid and your original real uid). You need a separate uid to track who you logged in as for security auditing.

Also, CAPP/C2 security auditing/logging doesn't work in terms of sessions. LUID isn't set by login to enforce some kind of session mechanism, but solely to indicate that some process which does something security-auditable was ultimately initiated, directly or indirectly, by a user who authenticated to the system as the user with that (l)uid. Cron also sets it for cron jobs, because it's acting as a proxy to run things for the user that installed the crontab.

And as far as capabilities went, he concluded, "Capabilities also aren't related to luids. They're quite simple: only a process with a "null" luid (for some suitable definition of "null"; 0 doesn't qualify if root is allowed to login) can change its luid. This is part of the official CCAP/C2 definition of luid."

There was quite a bit more discussion, divided between proponents, opponents, and folks who didn't seem to know the full details of CAPP, but who were interested in discussing it anyway. Eventually Linda posted this explanation (quoted in full):

These are some basics from CAPP that may explain

The Common Criteria (CC) Controlled Access Protection Profile, hereafter called CAPP, specifies a set of security functional and assurance requirements for Information Technology (IT) products. CAPP-conformant products support access controls that are capable of enforcing access limitations on individual users and data objects. CAPP-conformant products also provide an audit capability which records the security-relevant events which occur within the system. The CAPP provides for a level of protection which is appropriate for an assumed non-hostile and well-managed user community requiring protection against threats of inadvertent or casual attempts to breach the system security. The profile is not intended to be applicable to circumstances in which protection is required against determined attempts by hostile and well funded attackers to breach system security. The CAPP does not fully address the threats posed by malicious system development or administrative personnel. CAPP-conformant products are suitable for use in both commercial and government environments.


The CAPP is for a generalized environment with a moderate level of risk to the assets. The assurance requirements and the minimum strength of function were chosen to be consistent with that level of risk. The assurance level is EAL 3 and the minimum strength of function is SOF-medium.



An _authorized_user_ is a user who has been properly identified and authenticated. These users are considered to be legitimate users of the TOE.

An _authorized_administrator_ is an authorized user who has been granted the authority to manage the TOE. These users are expected to use this authority only in the manner prescribed by the guidance given them.


All individual users are assigned a unique identifier. This identifier supports individual accountability. The TSF authenticates the claimed identity of the user before allowing the user to perform any actions that require TSF mediation, other than actions which aid an authorized user in gaining access to the TOE.

The LUID concept is meant to address the needs of a particular multi-national security specification (the Common Criteria). The countries that co-developed the Criteria are: the UK, France, Germany, Canada, Netherlands and the US. It has been agreed that a CC system evaluated in 1 country will be accepted in the other 5 countries.

The complete document, and the latest Common Criteria documents, can be found online.

It is my reading of the above that a CAPP system would not be attached to the internet but only a local 'intranet' composed of other similarly controlled systems.

This certainly isn't meant to be a be-all, end-all answer to security. This is a first step for a well defined environment (like a corporate intranet).

A CAPP compliant system meets "Evaluation Assurance Level 3" which describes (in exhaustive detail) the level of testing and proof of Assurance. It is *not* a formal mathematical proof of correctness (EAL7). For more on EALs, see Document CCV2.1, Part 3.

CAPP compliant systems are thought to be the minimum security needed for many commercial vendors and will be the minimum requirement for DoD systems next year. These are not firewall systems and are unlikely to be used outside an 'internal environment'.

Brandon, who had supported the proposal in earlier posts, now responded with these comments:

Hm. CAPP appears to have altered goals from C2: C2 was to detect *internal* security breaches on an assumed-externally-secure system, and indeed its auditing provisions are not suitable for detection of external threats.

If CAPP truly targets external threats, then I think we should indeed bypass it as a completely misdesigned system for such.

Elsewhere in the same post:

C2, from which CAPP is descended, explicitly *does not cover* networked systems. Not even "intranet"-connected systems. C2 would need a major reworking to even begin to address such.

I am starting to agree that CAPP is a trainwreck....

I will concede the point about LUIDs per se; it looks like CAPP has generalized them somewhat. However, keep in mind that the more complex the mapping of a user identity, the more suspect it is from an auditing standpoint.

The debate went on for a while, with points on both sides.

3. Status Of PPP Over Ethernet

15 Apr 2000 - 20 Apr 2000 (15 posts) Archive Link: ">=pre5 OOPS on boot failure to open /dev/console"

Topics: Networking

People: Andrew Morton, Henner Eisen, Bert Hubert, Jens Axboe

In the course of discussion, Bert Hubert asked about the status of PPP over ethernet. Andrew Morton gave a link to a netdev mailing list post and explained, "Michael Ostrowski dropped a lump of nice-looking PPPoE code onto netdev last week." Someone else added that the code Michael had posted would be important for anyone trying to do "PPP over X". Jens Axboe was very happy to hear that people were working on PPP over X, and Henner Eisen gave his take on the situation:

The PPPOX code certainly seems to be a good framework for PPPoE and similar, where some new and somewhat complex ppp-specific encapsulation methods for ppp frames are needed. But for tunneling ppp over existing protocols, where the additional ppp-related encapsulation is simple, maybe another approach might be more appropriate.

I've started to implement ppp over the AF_X25 socket layer. Here, ppp frames are carried directly inside the X.25 payload, without any additional encapsulation headers. The RFCs for ppp over other protocols are frequently similar (no additional ppp frame encapsulations or just some pad bytes). For such protocols, the encapsulation of the ppp frames is trivial and the PPPOX framework (which makes the task of implementing complex ppp encapsulations schemes more easy), does not give any advantage. Instead, the hard problems faced when interfacing such a protocol to ppp are the interactions with the protocol internals. Existing network protocol stacks differ in various areas (e.g. which parts of the protocol processing need process context, how can ppp_channel flow control be interfaced to the carrier protocol's flow control mechanism).

Thus, the design chosen is a library approach, providing services which make interfacing of existing socket code to the ppp_generic code, as well as to generic tunnel network interfaces (e.g. raw-IP on top of X.25 payload), easy. It aims at being re-usable for other protocol stacks.

Currently, I've just finished the design and some of the prototype coding, but it is totally untested until now. Thus, I did not upload it yet. But if you (or somebody else) is interested in having a look at the prototype code (I don't know whether this or the PPPOX approach will be more appropriate for your PPPoATM), please drop me a mail.

There followed a bit of implementation discussion.

4. Unexplained Memory Overuse In Development Kernels

16 Apr 2000 - 22 Apr 2000 (45 posts) Archive Link: "memory handling in pre5/pre6"

People: Mark Hahn, Andrea Arcangeli, Ulrich Drepper, Rik van Riel

Ulrich Drepper and a lot of other people noticed that with 2.3.99-pre5 and 2.3.99-pre6-3, the system would use a lot more memory than with previous versions. After a lot of "me too"s, Rik van Riel finally pointed out that none of this mattered unless folks were also seeing an actual performance hit. A lot of people sang out "yes!" and as Mark Hahn put it, "MAJOR performance hit. on my machine, for instance, I wind up with something like 240M of 256 being wasted by big data files that I'm done with, and a nice, cyclic pattern of ~4k swapouts, followed immediately by a long stream of swapins. throughput drops to around 40% of peak." After this second batch of "me too"s, Rik posted a patch that he thought might help. Several people reported no (or very little) improvement with the patch, and Andrea Arcangeli also added that the code tried to fix nonexistent problems and had bugs of its own. Rik and Andrea had fun with their nerf baseball bats while debugging, but were unable to conclusively identify the problem.

5. Philosophy Of Backward Compatibility In The Kernel

16 Apr 2000 - 18 Apr 2000 (16 posts) Archive Link: "Fix for maestro in 2.3.99-preX"

Topics: Backward Compatibility, Sound: Maestro

People: Jes Sorensen, Jeff Garzik, Linus Torvalds, Pavel Machek

In the course of discussion, Pavel Machek suggested stripping 2.2 compatibility from the maestro driver; which in his opinion would allow the code to be cleaned up properly. But Jes Sorensen objected:

Some of us are trying to maintain drivers that are being used both in 2.2.x and 2.3.x - having the compatibility code stripped out regularly by someone making a quick hack in the 2.3.x tree is a major pain.

Dropping 1.2.x support is very reasonable and 2.0.x is certainly debatable, but if an author is actively maintaining the code, it should be up to him/her to decide.

Jeff Garzik agreed, "you shouldn't strip 2.2.x compatibility unless the author says it's ok." But Linus Torvalds came in, with:


It's a major pain just because you do it wrong.

A lot of network drivers have this bass-ackwards way of maintaining both 2.3.x and 2.2.x drivers- they do something like

#if LINUX_VERSION_CODE > 0x20139
struct net_device_stats stats;
#else
struct enet_statistics stats;
#endif

or something like

#if (LINUX_VERSION_CODE < 0x02030d)
dev->base_addr = pdev->base_address[0];
#else
dev->base_addr = pdev->resource[0].start;
#endif

and they do this in the middle of code that makes it fairly unreadable.

And then people complain when 2.3.x cleans stuff up, and removes those stupid things. Without realizing that the cleaned up version is often much better, and _can_ be made to work with the older kernels too.

What you can much more cleanly do in almost all cases is to have a "forwards compatibility" layer, so that you can write the driver for the current code, and then still be able to compile it for 2.2.x. And that forwards compatibility code doesn't even need to be in the recent kernels: after all, it is really only needed on the =old= setups.

Why do it that way? The BIG reason for doing it this way is that by putting the compatibility code into _old_ kernels when supporting a driver like that, you don't perpetuate the code. It automatically and magically just gets dropped whenever people don't care enough about older versions, because it's not carried around in the new code.

Example of how this _can_ be done (network drivers are the best example, simply because there are so many of them, and because there are some recent changes to how they operate that people still remember):

So for example, the backwards compatibility crud in 2.2.x (which doesn't even need #ifdef's, because it only exists in 2.2.x) would look something like:

#define net_device device
#define net_device_stats enet_statistics
#define dev_kfree_skb_irq(a) dev_kfree_skb(a)
#define netif_wake_queue(dev) clear_bit(0, &dev->tbusy)
#define netif_stop_queue(dev) set_bit(0, &dev->tbusy)
#define netif_queue_stopped(dev) ((dev)->tbusy != 0)
#define netif_running(dev) ((dev)->start != 0)

static inline void netif_start_queue(struct net_device *dev)
{
	dev->tbusy = 0;
	dev->start = 1;
}

#define pci_resource_start(dev,bar) ((dev)->resource[(bar)] & PCI_BASE_ADDRESS_MEM_MASK)
#define module_init(x) ...

and you're now almost done. With a clean driver, and without any #ifdef's AT ALL.

(Yes, there are still going to be details, but you're getting the picture on how this should work).

Now, one argument is commonly that "but you can't change old kernels, so the old kernels cannot have new interfaces added to them when a development kernel changes". And that argument is _bogus_. Because if you truly don't change old kernels, then you also don't need to have any backwards compatibility AT ALL, because the old driver will obviously continue to work (or not work) forever unchanged.

This is why I do not want to add compatibility files to development kernels. It's the wrong thing to do, because adding compatibility files to new kernels always implies carrying baggage around forever. Adding the compatibility files to old kernels is conceptually the right thing to do: it tells you (in the right place) that old kernels are still maintained.

6. The Future Of The MIN() Macro

17 Apr 2000 - 19 Apr 2000 (18 posts) Archive Link: "small patch for pty.c"

Topics: SMP

People: Paul Mackerras, Horst von Brand, Linus Torvalds, Stephen Rothwell

Paul Mackerras reported, "In looking through drivers/char/pty.c, I saw a couple of places where the MIN() macro is used and one of its arguments is a function call which could potentially return a different number each time it is called, particularly on SMP systems." He posted a patch to get rid of the bad references to MIN(), but replied to himself the next day, "Stephen Rothwell just pointed out to me that my pty.c patch was bogus. The first section was fine, the second section was just plain wrong. Here is a better patch. :-)" He posted the new version, but Horst von Brand objected that someone was just going to stumble on this code, wonder why it didn't use MIN(), and change it back. He added, "Better fix MIN to work right, and get rid of _all_ these problems for good." But Linus Torvalds replied:

No. Wrong moral. You should not "fix" MIN().

MIN() is a _bad_ thing to use. It _always_ gets the sign wrong. Sorry, but that's how it is. It makes people think that "oh, it takes the minimum of two numbers" and at the same time it makes people completely forget signedness issues.

In short, it's a really stupid macro, for something that is usually simpler to do by hand anyway. And doing it by hand you can make it work right, without confusion like this.

MIN() should go. If people really want to use a macro for something as simple as MIN(), then you should always use UMIN() and SMIN(), and make people have to think about signedness issues (but even then you tend to be better off just doing it explicitly by hand).

Horst posted a version of MIN() that he felt solved the problems:

This one doesn't get it wrong:

#define min(x, y) ({typeof (x) x_ = (x); typeof (y) y_ = (y); x_ < y_ ? x_ : y_;})

The variables x_ and y_ are the _same_ types as the x and y the macro gets, the sizeof-ish behaviour of typeof ensures no evaluation of the expression (i.e., no side effects); and then it does the comparison as it would with the original values, just never evaluating anything twice. No (explicit) mess with temporaries that have to be declared, and no huge expressions that have to be written twice (only to get one of them wrong, as Murphy's law assures will happen where you can't see it).

There was no reply.

7. Makefile Patch To Ease Module Compilation

17 Apr 2000 - 19 Apr 2000 (4 posts) Archive Link: "[Patch] module compilation infrastructure"

People: Olaf Titz, Jeff Garzik

Olaf Titz posted a patch, and explained:

please consider the following patch (against 2.3.99-pre3) or a similar thing for inclusion into the kernel. It fixes the long-standing problem of finding out the right compiler options to compile external modules.

This patch causes "make dep" to write a makefile fragment which can be included from a module makefile.

Jeff Garzik replied, "Nice. The kernel has needed something like this for quite a while," but replied to himself a couple of days later with some compilation errors. Olaf replied with a question or two, but there was not much discussion.

8. Block Allocation In ext2

18 Apr 2000 (9 posts) Archive Link: "Block fragments in ext2"

Topics: BSD: FreeBSD, FS: ext2

People: Stephen C. Tweedie, Alexander Viro, Linus Torvalds

Justin Hopkins was working on a paper for school, and asked for a little help. He gave a pointer to Linus Torvalds' statement against block fragments in ext2, and asked if block fragments really were deprecated. Stephen C. Tweedie replied, "Yes. The ext2 block allocation algorithms are a lot better than those in the original FFS, and we can get good performance with smaller block sizes." Alexander Viro took up the issue, with, "Stephen, I seriously suspect that larger allocation block sizes would buy us better speed. Reason: allocation algorithms in Linux ext2 and FreeBSD UFS implementations are very similar. Ditto for layouts, indeed. And on FreeBSD 16Kb blocks give visible win over the 4Kb ones." Stephen agreed with this, adding that this would speed up large file deletion. He went on, "Most of the advantage, though, lies in the lower amount of metadata required for the mapping tree if you have a larger blocksize. I'd much prefer to see us end up with btree-based mapping trees and a small blocksize rather than large blocks with standard indirection mapping tables as a final solution, as that really ought to gain the best of both worlds: small-blocksize allocation efficiency with large-blocksize metadata performance." There followed a brief discussion of various possible implementations.
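Stephen's metadata point can be quantified with rough arithmetic (a sketch, not from the thread; it ignores ext2's 12 direct pointers and double/triple indirection): classic indirect mapping needs one 4-byte pointer per data block, packed into indirect blocks holding blocksize/4 pointers each.

```c
/* Rough metadata arithmetic for classic indirect mapping
 * (illustration only; ignores direct pointers and higher
 * levels of indirection). */
static long data_blocks(long file_bytes, long blocksize)
{
    /* one block pointer is needed per data block */
    return (file_bytes + blocksize - 1) / blocksize;
}

static long indirect_blocks(long file_bytes, long blocksize)
{
    long ptrs_per_block = blocksize / 4;  /* 4-byte pointers */
    long ptrs = data_blocks(file_bytes, blocksize);
    return (ptrs + ptrs_per_block - 1) / ptrs_per_block;
}
```

For a 1GB file this works out to 256 indirect blocks at a 4Kb blocksize but only 16 at 16Kb: the sixteen-fold metadata saving that larger blocks buy, and that a btree-based mapping would aim to get without the allocation slack.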

9. Experiment In Linux Deprivation

19 Apr 2000 (1 post) Archive Link: "Moving.."

People: Linus Torvalds

Linus Torvalds mentioned:

Just a quick heads up that I haven't had access to a computer for two days (and yes, my hands _are_ shaking), because we're in the middle of a move. We've gotten everything but the cats to the new house, but all our stuff is still in big boxes, and I will probably remain unable to read email from home for at least another week.

So this is just a notice to people to not worry - I'm just snowed in and will probably require some re-sends after it all clears up..

10. Problems Connecting To Web Sites

20 Apr 2000 - 21 Apr 2000 (5 posts) Archive Link: "Problems with MTU?"

Topics: FS: sysfs, Modems, Networking

People: Hans-Joachim Baader, Theodore Y. Ts'o

This thread featured a long, characteristic post by Theodore Y. Ts'o. Initially, Hans-Joachim Baader reported:

for several months I had the problem that access to some web sites fails mysteriously. This means I never got a reply although the servers were reachable with ping and telnet.

Today I did

echo 1 > /proc/sys/net/ipv4/ip_no_pmtu_disc

and suddenly all these sites worked again for me. I use kernel 2.2.x (2.2.pre15-19 currently). Some of the affected sites are (in fact, the whole and

Are these web sites broken?

Theodore replied:

Actually, yes, the web sites are broken; but it's a relatively subtle problem. Here's what's going on.

The web sites are probably behind a firewall which is filtering all ICMP packets (since some dumb firewall administrator read somewhere that all ICMP packets are evil). Somewhere in the network path between you and the web site, there's a link which has a maximum MTU which is smaller than the ethernet default of 1500 bytes.

Normally, dumb hosts would just send bytes using their local max MTU, and then when the packets hit the link with the restricted MTU, the packet would get fragmented at that point, and then it would get reassembled at the receiver. Fragmentation has the problem, though, that it only takes one fragment to get lost in order to require the retransmission of the entire packet, thus wasting network bandwidth --- and if the reason one of the fragments was dropped was link congestion, the fact that all of the fragments have to get retransmitted can actually worsen the situation.

Hence, good network citizens (of which Linux is one) use Path MTU discovery to try to determine the appropriate size to send over any arbitrary link. The way it works is very simple; in each direction, the hosts send packets with the Don't Fragment bit set. When a packet which is too large reaches the router just before the link with the restricted MTU, since the don't fragment bit is set, the router drops the packet on the floor and sends back an ICMP Destination unreachable with a code which means "fragmentation needed and DF set", along with the maximum MTU of the constricting network link. This allows the sender to automatically determine the maximum MTU for the hop, and the sender can then resend the TCP segment using a smaller packetsize, now that the maximum path MTU is known.

The problem with this comes with, as I mentioned earlier, bone-headed firewall maintainers who believe that all ICMP packets are bad and filter all of them. This includes the ICMP Destination unreachable packets which are needed to make path MTU discovery function correctly. As a result, a site which is behind one of these firewalls will continually send big packets with the don't fragment bit set, which then get rejected when they hit the constricting link, but since the firewall filters out the ICMP "too big" message, the sending site never knows that the packets are getting rejected, and so they can never send you anything.

This doesn't come up in normal operation for most hosts because most links support at least the ethernet maximum MTU of 1500, and if there is a constricting link, it is at the client endpoint (for example, the client is dialing up with PPP and so has a restricted MTU). At the client end point, it's not a problem, since the client knows that it's sending packets to an interface with a restricted MTU, and so it sends small packets. The problem comes when the constricting link is in the *middle* of the network path, and so Path MTU discovery is required in order to make things work. (Either that, or you have to learn to live with fragmentation and its attendant disadvantages).
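The /proc toggle Hans-Joachim used turns path MTU discovery off system-wide; Linux also exposes a per-socket analogue via the IP_MTU_DISCOVER socket option. A minimal sketch (not from the thread):

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Disable path MTU discovery for one socket: its packets go out
 * without the Don't Fragment bit, so a constricting link fragments
 * them instead of sending the ICMP "fragmentation needed" message
 * that broken firewalls drop. */
static int disable_pmtu_disc(int fd)
{
    int val = IP_PMTUDISC_DONT;
    return setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER,
                      &val, sizeof(val));
}
```

This only helps the local side, of course; as Ted explains, the real fix is for the remote site's firewall to stop filtering the ICMP messages that path MTU discovery depends on.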

So this will come up if you are using a PPP to connect to your gateway, and then you use NAT to allow machines on the local ethernet to gain access to the network via your singleton PPP connection. It also comes up with those folks using DSL where the providers are using the abomination also known as PPP over Ethernet (PPPOE).

There are a couple of solutions you can use to solve this problem. One is to ask nicely to the web site administrators that they fix their firewall. Unfortunately, this doesn't always work. I am told that Amazon's web programmers were told about this problem, and they acknowledged it *as* a problem, but said they weren't allowed to fix it. (Probably because the bone-headed firewall administrator in their case was clueless, and they didn't have the power to override him.)

The second approach, assuming that the constricting link is close to you (usually it's the second-to-last hop in most scenarios), you can simply set a per route max segment size (MSS) parameter, which will force TCP packets using that particular route to be no larger than the MSS size --- in both directions. Given that the problem is usually on the link to the outside internet, and that connections to other hosts within the subnet are OK, it's usually a matter of setting the mss option only on the default route:

route add default gw mss 1400

The final approach, which apparently some of the DSL providers use, is that on the cable modem box or on the DSLAM in the telco's central office, they are actively messing with the outgoing packets, by looking for TCP packets with the SYN bit set (which indicates the beginning of a TCP stream), and change the max MSS option in the IP header to be smaller than max MTU caused by the PPP over Ethernet overhead. This is incredibly ugly, and violates the IP protocol's end-to-end argument. It also breaks in the presence of IPSEC, since the DSL provider won't be able to muck with the packet without breaking the cryptographic checksums.

So it's completely ugly. Still, I imagine that Rusty will no doubt be writing a new "packet fucking" module in ipchains to support this kind of TCP syn option rewriting. :-)

11. /usr/include/linux And /usr/include/asm Symlinks

20 Apr 2000 - 24 Apr 2000 (12 posts) Archive Link: "PROBLEM: kernel 2.3.99-pre5 does not compile without system-wide kernel headers"

People: Theodore Y. Ts'o, Andries Brouwer, Jeff Garzik, Olaf Titz

Romain Vignes was trying to compile multiple versions of the kernel, without having to update the /usr/include/linux and /usr/include/asm symlinks each time. He posted a 'Makefile' patch to work around the problem by explicitly including the required directories on the 'gcc' command line, but acknowledged that this wasn't a complete fix. Jeff Garzik replied that changing the symlinks was not necessary; one could just unpack the kernel tarball and start compiling. But Theodore Y. Ts'o explained and proposed:

.... except if you want to build stand-alone kernel modules that work against the kernel that you are currently booting.

This still requires that /usr/include/linux and /usr/include/asm match the kernel you are currently building. Some packages will now use /usr/src/linux/include/{linux,asm} in preference if they exist, or allow the user to manually specify the location of the kernel source tree to use, but these conventions haven't been universally standardized yet.

Doing so would be a good idea, BTW. If we can all agree that "this is the way to do things in Linux 2.4", and make sure all of the distributions are informed of that fact, that would be a Good Thing (tm). My personal nomination of the standard way to do things is to assume that /usr/src/linux is a symlink to kernel sources corresponding to the default kernel being booted on that machine, and that stand-alone device drivers that need to compile against a kernel should default to /usr/src/linux, but there should be an easy way for users to override that and specify some other kernel source tree if necessary. Any objections to such an approach?

Some folks did object. Andries Brouwer, Olaf Titz, and Jeff all pointed out that the "kernel that you are currently booting" was not as clear a concept as it might seem, since anyone compiling the kernel would actually have two kernels -- the previous and the current. There was a very small amount of discussion, but nothing conclusive.

12. New Hacker Posts First Patch

20 Apr 2000 - 23 Apr 2000 (8 posts) Archive Link: "[PATCH] Fix for rtc.c non-atomic access to rtc_status"

Topics: SMP

People: Cesar Eduardo Barros, Manfred Spraul

Cesar Eduardo Barros posted his first ever kernel patch, to fix some SMP races in 'drivers/char/rtc.c'. There were a couple of replies criticising the patch, and Cesar also replied to himself 4 times (mostly with improved patches). In the first self-reply, he said, "I added code to avoid the xchg() race (thanks to Manfred Spraul for hinting me on that one). Couldn't avoid using spinlocks." In the final version he also credited Manfred with making the code much simpler than the original submission.

13. Stable Pre-Patches Stall Console Output

20 Apr 2000 - 21 Apr 2000 (9 posts) Archive Link: "2.2.15pre19 and screen"

People: Petri Kaukasoina, Alan Cox, Manfred Spraul, Philippe Troin, Robert L. Harris

Petri Kaukasoina reported, "2.2.15pre19 and 2.2.16pre1 have problems with "screen" (the text window multiplexer program). Under screen, the output of some programs (e.g. less) only show about 1024 characters at a time and then stop. When I type some key, it shows the next 1024 chars of the output and so on. 2.2.15pre18 and earlier are ok." Manfred Spraul asked for an 'strace' of 'screen' or 'less'; Petri couldn't reproduce the problem on an 'straced' process, but Philippe Troin confirmed that he too saw the problem; and posted 'strace' output. Robert L. Harris also experienced the same problem and posted some 'strace' output. Alan Cox posted a one-line patch to 'drivers/char/pty.c', and that ended the thread.

14. Renovating File Locking Code

21 Apr 2000 (6 posts) Archive Link: "[PATCH] /proc/locks bugfix"

People: Matthew Wilcox, Manfred Spraul, Alan Cox

Manfred Spraul posted a couple bugfixes for 'fs/proc/proc_misc.c' in the 2.3 series, and Matthew Wilcox replied, "There are actually worse problems than this in locks.c. It's called from nfs lockd without the big kernel lock held at all (it holds a semaphore on the inode, but this is inadequate). I've fixed this in my tree by putting all accesses to the lock data structures under its own semaphore. I hadn't spotted the problem with /proc/locks, but I wasn't looking for it :-) I noticed a race condition with my solution yesterday, so I'm fixing that and hopefully I should be able to get a rearchitected locks.c out in a couple of weeks time. I'm doing a few other things to it at the same time..."

Manfred reminded Matthew, "It's dangerous to replace lock_kernel() with a single semaphore: lock_kernel() means one cpu, down() means one thread. Did you double check that you don't cause a major slowdown/reschedules, e.g. if kmalloc() sleeps?" He added that he always preferred a single spinlock (where the process loops until a resource becomes available) instead of a semaphore (where the process sleeps until a resource becomes available, at which point it is awakened) in those situations for performance reasons. Matthew replied that the file locking code wasn't performance critical, but that in terms of guarding against slowdown using his method, "yes, the code only sleeps in very specific places, which are while the semaphore either isn't held, or is dropped while we sleep, and the code then restarts its scan if we did have to block."

In terms of preferring spinlocks to semaphores, Matthew said he could go either way. He added, "I did think about making the locking more finegrained here, but I really don't think it's worth it"

Alan Cox came in at this point, with, "The 2.3.9x file locking code is fairly broken anyway. Actually going over it with a large pickaxe wouldnt do any harm." Matthew replied, "`A large pickaxe' is a pretty good description of what I've done to it," but added that his changes weren't ready yet.

15. Organization Of Kernel Modules

21 Apr 2000 - 23 Apr 2000 (10 posts) Archive Link: "Announce: modutils 2.3.11 is available - the debugger's helper"

Topics: Networking, PCI

People: Richard Gooch, Jamie Lokier, Dominik Kubla, Victor Khimenko, Matthew Wilcox, Keith Owens

Keith Owens announced 'modutils' version 2.3.11, and there followed a debate over whether to continue keeping groups of related modules in separate subdirectories, or to move all modules into a single flat directory. Matthew Wilcox pointed out that all modules were symlinked at compile-time into '/usr/src/linux/modules' anyway, and therefore having separate subdirectories didn't even protect against namespace conflicts. And, he added, given a flat namespace, all those subdirectories only amounted to obfuscation.

Richard Gooch countered with the idea that subdirectories made it easier for humans to find particular modules or to view groups of related modules. He said, "When I ask myself "do I have the ne2k-pci module", it's obvious to just look in the "net" directory. The subdirectories are a natural categorisation that makes life easier. It's not about pretending there is a hierarchical namespace." Matthew replied that, for the 'ne2k' example, it was just as easy to do an 'ls *ne2k*' from a flat directory and be done with it. He went on to point out that the hierarchical directory structure made life more complex both for the kernel build scripts and for 'modutils' itself. But Richard felt that the added complexity was not significant, saying, "I can't see why we would change things in the direction of making it harder for humans, since we already have a reasonable system implemented."

Victor Khimenko came in at this point from a different angle, pointing out that the existing directory structure had serious problems; and it was often difficult to find particular drivers if one didn't know before-hand where to look. Jan Evert van Grooth agreed with Victor, as did Jamie Lokier. Jamie put it:

I went looking for ipfwadm.o in net. Nope, it's in misc. What about af_packet.o? That's in misc too.

It makes some sort of sense: net/ seems to contain network *device* drivers. This now includes ppp (but it didn't before).

And why isn't there a sound directory? Etc.

If there must be directories, IMO "video" and "sound" should have priority over "ieee1394".

Dominik Kubla also sang in this chorus, saying that if there was going to be a hierarchical directory structure, it should be done properly (he posted his own suggested hierarchy); but that actually, "we should ditch the subdirectories and just adopt a flat single directory: it makes things simpler and namespace collision will happen at compile time anyway, so we need not worry about this." The discussion ended there.

16. Handling Large Files On 32-bit Systems

24 Apr 2000 (8 posts) Archive Link: "llseek and 64 bit return..."

People: Matti Aarnio, Linda Walsh, Oliver Xymoron

A one-day thread. Linda Walsh pointed out that lseek() gave an error if its return value required more than 32 bits. She suggested having lseek() return a 64 bit value on Intel machines. Matti Aarnio gave a pointer to The Large File Summit and replied, "Use lseek64() and be happy. On intel linux it is implemented at glibc by means of llseek(), but that is then a side issue." Linda replied, "Is there some good reason why the interface shouldn't be cleaned up on Intel? Seems like it would be much more maintainable, less ugly -- rather than have various user-land kludges that take advantage of extra kernel calls. Wouldn't it be more straightforward to have lseek return a 64 bit value on ia32 (et al. 32 bit platforms) as it does on the 'big machines'? Seems like it could result in simpler glib code as well..." After some more discussion, Oliver Xymoron also gave a pointer to the same page, and said, "Linux is following a fairly well thought out scheme called the Large File Summit for dealing with 64-bit files on 32-bit systems." That was the end of the thread.







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.