Kernel Traffic #59 For 20 Mar 2000

By Zack Brown


Mailing List Stats For This Week

We looked at 1543 posts in 6474K.

There were 466 different contributors. 223 posted more than once. 184 posted last week too.

The top posters of the week were:

1. Capabilities

9 Feb 2000 - 8 Mar 2000 (182 posts) Archive Link: "Capabilities"

Topics: Access Control Lists, Capabilities, FS: ReiserFS, FS: XFS, FS: ext2, POSIX

People: Peter Benie, Chris Evans, Christopher Allen Wing, Matthew Kirkwood, Hans Reiser, Victor Khimenko, Theodore Y. Ts'o, Pavel Machek, Jesse Pollard, Andreas Gruenbacher, Jason McMullan, Gregory Maxwell, Casey Schaufler, Horst von Brand

Peter Benie started it off with:

I've tried to use capabilities to run xntpd without excessive privilege. Not surprisingly, the only capability xntpd requires is cap_sys_time.

For this change to be useful, xntpd needs to run as a uid other than 0, otherwise it can overwrite files owned by root, regardless of what capabilities it has. This is no big deal - we just have to call setuid to change uid.

Here's the problem - if you have any programs that don't understand capabilities, you have to run without SECURE_NO_SETUID_FIXUP or else they won't throw away privilege correctly. In that mode, changing to a non-zero uid clears the effective and permitted capability sets. This is no good since you now have insufficient privilege to do what you want, so you employ the following horrible hack. (Have your Unix barf-bag ready...)

                        fork
[Parent]                        [Child]
setgroups                       block, waiting for parent
setgid
setuid
unblock child
wait for child                  capsetp(getppid, capabilities)
                                exit

The child is still running as root with all capabilities, so it can hand over cap_sys_time to the parent. The parent will then have a non-zero uid and a non-empty capability set.

This must have worked once since sucap relies on a similar trick, but it doesn't work now because of how init is started.

#define CAP_INIT_EFF_SET to_cap_t(~0 & ~CAP_TO_MASK(CAP_SETPCAP))
#define CAP_INIT_INH_SET to_cap_t(~0 & ~CAP_TO_MASK(CAP_SETPCAP))

This results in all children of init running with cap_setpcap-i, so the child process cannot hand over cap_sys_time to the parent.
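To make the effect of those two macros concrete, here is a small sketch of the bit arithmetic. It assumes a single 32-bit capability word and the capability numbers from <linux/capability.h> (CAP_SETPCAP is 8, CAP_SYS_TIME is 25); the helper names cap_to_mask and has_cap are illustrative Python stand-ins, not kernel code.

```python
# Capability numbers as defined in <linux/capability.h>.
CAP_SETPCAP = 8
CAP_SYS_TIME = 25

MASK32 = 0xFFFFFFFF  # assumption: a capability set is one 32-bit word


def cap_to_mask(cap: int) -> int:
    """Python stand-in for the kernel's CAP_TO_MASK(): one bit per capability."""
    return 1 << cap


# ~0 & ~CAP_TO_MASK(CAP_SETPCAP), truncated to 32 bits.
CAP_INIT_EFF_SET = MASK32 & ~cap_to_mask(CAP_SETPCAP)


def has_cap(cap_set: int, cap: int) -> bool:
    """True if the given capability bit is present in the set."""
    return bool(cap_set & cap_to_mask(cap))


print(hex(CAP_INIT_EFF_SET))                    # 0xfffffeff
print(has_cap(CAP_INIT_EFF_SET, CAP_SYS_TIME))  # True
print(has_cap(CAP_INIT_EFF_SET, CAP_SETPCAP))   # False
```

Every bit except bit 8 is set, so a process starting from this mask keeps cap_sys_time but lacks cap_setpcap, which is the capability the child in Peter's hack would need in order to hand anything to its parent.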

Chris Evans said he had a patch for this, and said, "I'm very glad to hear you are de-privving xntpd. I'd like to see that change in distributions ASAP! People de-privving bind (thank God) have also hit this issue". He went on to explain, "I discussed the issue with the capabilities maintainer (Andrew Morgan) and we decided upon a simple solution; If a process has its capabilities changed via sys_capset(), it is marked as capability aware. When a "capability aware" process does setuid(0 -> !=0), capabilities are not cleared. The "capability aware" flag is cleared on exec()." Christopher Allen Wing replied, "Allow me to second this suggestion. In the present state capabilities are useless on Linux as a means of privilege isolation, since they can't be used by anyone besides root on any standard Linux distribution." He added, "I want to be able to start a daemon, drop all capabilities except the one I need, and then setuid() to a non-privileged user," and exhorted, "Linux needs your patch!" Chris E. replied, "I've tested my patch fairly carefully in the scenario you mention. I'll rediff and submit to Linus once the rabid 2.3 kernel patching calms down."
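The rule Chris E. describes can be illustrated with a toy state machine. This is only a sketch of the behavior as stated in his email, not his patch: the Process class, the cap_aware field name and the do_exec method are hypothetical, and capability sets are modeled as plain Python sets.

```python
from dataclasses import dataclass, field

CAP_SYS_TIME = 25  # value from <linux/capability.h>


@dataclass
class Process:
    """Toy process holding only the fields needed to illustrate the rule."""
    uid: int = 0
    effective: set = field(default_factory=set)
    permitted: set = field(default_factory=set)
    cap_aware: bool = False  # hypothetical name for the proposed flag

    def capset(self, effective, permitted):
        # Changing capabilities explicitly marks the process capability aware.
        self.effective = set(effective)
        self.permitted = set(permitted)
        self.cap_aware = True

    def setuid(self, new_uid):
        leaving_root = self.uid == 0 and new_uid != 0
        self.uid = new_uid
        if leaving_root and not self.cap_aware:
            # Legacy fixup: dropping root clears effective and permitted sets.
            self.effective.clear()
            self.permitted.clear()

    def do_exec(self):
        # The flag does not survive exec().
        self.cap_aware = False


legacy = Process(effective={CAP_SYS_TIME}, permitted={CAP_SYS_TIME})
legacy.setuid(123)
print(legacy.effective)  # set()

aware = Process()
aware.capset({CAP_SYS_TIME}, {CAP_SYS_TIME})
aware.setuid(123)
print(aware.effective)   # {25}
```

In this model a daemon that never calls capset loses everything on setuid, as before, while one that has trimmed its own sets keeps what it chose to keep.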

Peter also replied to Chris E.'s patch announcement, saying that it did meet his needs, allowing him to avoid the horrible hack he'd described. On the other hand, he pointed out, the patch would be unclean in his opinion, because "the side-effect of sys_capset is extremely non-obvious; I'd probably be happier making the state change explicit with prctl()." Matthew Kirkwood seconded this, and Chris E. replied, "You'll get that when the filesystem support for capabilities goes in. Alternatively, tighten up the bounding set as part of your system initialisation scripts."

Matthew felt this was a misunderstanding, and replied, "Read what the man says, Chris. He wants to be able to decree that setuid programs (for example) don't get CNBS without breaking inetd. I don't believe that this is functionality for its own sake. If you think of it as a sysctl which allows you to turn off bits of SECURE_NO_SETUID_FIXUP." Chris E. asserted that this was a case of functionality for its own sake, because once filesystem support came along (which he felt would be soon), "the solution becomes one of userspace setup rather than kernel support. Complexity is better in userspace than the kernel. We don't want to introduce temporary kernel tweaks between now and such time as we have filesystem support for capabilities, because then people will _use_ that support and we could get stuck with it." Matthew came back with, "This is the same argument which we heard against a capability bounding set, and I consider it thoroughly bogus in both cases." He added that he saw no evidence of filesystem support coming any time soon, and went on, "The last patch I saw" [...] "was against 2.1.xx, and used reserved space in the ext2 inode which has since been used for other things. Linux has available a pair of __u32s in the on-disk inode which is not sufficient, I believe. reiserfs has no inode space reserved for future expansion." He also said that the functionality would not add much complexity, and appended a short, admittedly untested patch to his email to prove it. There were a number of replies.

Hans Reiser replied, "if you want an optional field added" [to reiserfs] "talk to us, dynamic space allocation is a strength of our approach, and putting support for you in a next major version of reiserfs is something we will do if you send me an email/URL that convinces me your work is valuable (I am ignorant of it, educate me.)" Matthew thanked him, and added that he hadn't been aware that reiserfs was flexible enough to do this. He went on:

That being the case, I would like:

Chris E. objected that 32-bit capability fields might not be big enough to handle all future needs. He felt that 64-bit capability fields would be much better in any filesystem. Victor Khimenko felt that even 64-bit fields would not be enough. He bemoaned:

WHY everyone is trying to add low limits in design just to try to break then later ? 32, 64 or 128 WILL NOT be enough in long run. I can understood why capabilities in kernel are 32bit word:

  1. they are checked quite often and so it should be doable FAST and
  2. they can be extended in future without big fuzz (it's internal kernel structures so only few system calls should be changed)

On other hand once something added to filesystem, it's added to FILESYSTEM. Read "set in stone". It's VERY hard to fix something there. And starting of program is NOT time-critical operation. Even if you put capabilities in separate file and store inode number of that file in inode of capability-enabled file this will not slow process starting much (even if IMO it's overkill). So 32bit is not way to go and even 64 is not enough. IMHO anyway.

Casey Schaufler felt that having too many different capabilities would only make it more difficult to set up a secure system; and there was some small discussion. Elsewhere, the issue of how many bits to reserve came up as well, with Chris E. urging people to reserve at least 64 bits. He felt this would be enough for a fairly long time. Matthew felt that even 40 or 48 would be enough, but he had no objection to 64. Theodore Y. Ts'o observed:

Well, there's a trade off here. If you could have 32 bits basically almost right away, and more would take longer, which would you choose? Also, keep in mind that more bits is not necessarily good. There is a *huge* complexity cost in maintaining capabilities. People have enough trouble keeping track of the 12 bits of permissions on a per file basis. This adds one or two orders of magnitude of more bits for every executable.

However, my knowledge of human nature being what it is, I agree with you that unless very strong measures are taken to control the virtual explosion of new capabilities people will want to add, we will need more bits. So I'd suggest either putting a hard limit on 32 bits, or budgeting 128 or 256 bits, since bits are relatively cheap once you exceed 32.

People who are interested developing capability source should seriously think about ways to control the complexity, though. If the user-mode management tools aren't good enough, capabilities will be a disaster, and their use could actually decrease the overall security on a system.

Gregory Maxwell put in another voice for limiting the number of capabilities, pointing out that it could become virtually impossible for anyone but a true guru to implement security effectively on a system with many capabilities. And Pavel Machek, also replying to Theodore, added, "Well, 32 will be almost certainly exceeded: we are using 28 NOW."

Jesse Pollard also replied to Theodore, saying that 32 would definitely not be enough. He proposed:

I'd like to (potentially) be able to control every system call through a capability. I'm also a strong fan of MLS systems, it lets my paranoid side out when protecting systems:).

I'd suggest using an index reference to a table containing multiple capability lists. The set of usable capability lists is limited. Many inodes, but the number of unique capability lists would be rather low (20-30 most likely). Using a reference:

  1. reduces the impact on an inode (8 bits would allow for 255 different capability lists - reference 0 represents no list)
  2. centralizes control over what capability lists are allowed; multiple inodes would be able to reference the same capability list.
  3. allows customization by the security administrator (users would not be able to create arbitrary lists)
  4. Allows more capabilities to be added without impact to the inode structure, only the reference table support.

Some additional administrative information could be determined easily too - if a "link count" is updated each time an inode is given/removed a capability reference, then the total number of files making use of the capability list would be known; as well as any unused definitions.

I'm in the process of setting up a secured server system (using RSBAC) and I would like to run the web server in a compartment, without the ability use the exec system call. Yes this restricts a lot of CGI capability, but does allow the use of Java (as part of the server) and mod_perl. The web pages (and data files) will be stored at a security level below that of the server process.

This configuration would prevent any hack entry into the server (via bugs/ stack overflow, etc) from being able to do anything to the data (no write down). Without the exec, no shell process could be generated. The most that could be done is aborting the server process. Since these are usually children of a parent server, they would normally be restarted when needed (parent would have fork, and listen).

Extended capabilities would allow me to detect/prevent hacking output data by changing the server. Admittedly, it would not prevent replacing the functionality of a web server (say by loading a bogus Java/perl server and running it) but if the remaining capabilities prevented the use of the "listen" system call (only the parent server listens) then even this attack would be defeated. (The Apache server would have to be patched to drop the capability to use the listen system call after the fork).
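Jesse's indexed-table idea, including the "link count" bookkeeping he mentions, might look roughly like the following. This is a toy in-memory model under stated assumptions: the CapabilityTable class, its method names, and the use of a dict with a cap_index key to stand in for an inode are all hypothetical, and nothing here reflects an actual filesystem format.

```python
class CapabilityTable:
    """Toy table of capability lists, referenced from inodes by an 8-bit index."""

    MAX_INDEX = 255  # 8 bits; index 0 is reserved for "no list"

    def __init__(self):
        self._lists = {}  # index -> frozenset of capability names
        self._links = {}  # index -> number of inodes referencing that list

    def define(self, caps):
        """Register a new capability list (an administrator action in the proposal)."""
        for index in range(1, self.MAX_INDEX + 1):
            if index not in self._lists:
                self._lists[index] = frozenset(caps)
                self._links[index] = 0
                return index
        raise RuntimeError("capability table is full")

    def attach(self, inode, index):
        """Point an inode at a list (0 detaches) and keep the link counts current."""
        if index != 0 and index not in self._lists:
            raise KeyError(index)
        old = inode.get("cap_index", 0)
        if old:
            self._links[old] -= 1
        if index:
            self._links[index] += 1
        inode["cap_index"] = index

    def capabilities(self, inode):
        """Capability list for an inode; empty when it references index 0."""
        return self._lists.get(inode.get("cap_index", 0), frozenset())

    def link_count(self, index):
        return self._links[index]

    def unused(self):
        """Defined lists that no inode currently references."""
        return sorted(i for i, n in self._links.items() if n == 0)


table = CapabilityTable()
timekeeper = table.define({"CAP_SYS_TIME"})
spare = table.define({"CAP_NET_BIND_SERVICE"})
xntpd, date = {}, {}
table.attach(xntpd, timekeeper)
table.attach(date, timekeeper)
print(table.link_count(timekeeper))  # 2
print(table.unused())                # [2]
```

The inode only ever stores the small index, so adding new capabilities changes the table entries rather than the inode layout, which is the trade-off the replies below argue about.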

Jason McMullan said this system made much more sense to him than having a bitmask per inode. He argued that administration of capabilities would be improved on such a system. But down the line in the argument, Andreas Gruenbacher argued against Jesse's proposal:

Capabilities, Access Control Lists, Auditing, Mandatory Access Control and Information Labeling are specified in Posix 1003.1e / 1003.2c Draft Standard 17. DS17 was withdrawn. Nevertheless, implementations of other Unixes are based on DS17 or earlier drafts. While DS17 is not perfect, a lot of work surely has gone into Capabilities and ACLs. Most of it actually makes sense.

The specs are publicly available. I guess Casey Schaufler already posted the link; here it is again: http://www.guug.de/~winni/posix.1e/download.html.

As far as capabilities are concerned, it's important at least the user interface for manipulating ACLs is close to DS17. Having a completely different capability scheme on Linux makes no sense. Linux tries to be compatible with POSIX and other Unixes; I see no reason why capabilities should be an exception.

Provided that the user interface is similar to DS17, a capabilities table just adds complexity that doesn't pay off. The setfcap utility specified in 1003.2c manipulates individual capabilities.

The owner of a file may change the capabilities of a file (provided that he/she is capable of CAP_SETFCAP). This also affects the per-filesystem capabilities table. A capability table causes the same trouble when backing up / restoring a filesystem. Each file and its associated capabilities need to be backed up. Just backing up the index into the capabilities table doesn't make much sense. When restoring the file to another filesystem, a capability set corresponding to the capabilities of the file will probably need to be created ...

I am convinced a fixed set of capabilities is all we need. A limit of 32 may be too low; 64 seems perfectly reasonable to me. Capabilities are not meant to cure each and every security problem. Things like protecting /etc/shadow are really better dealt with by filesystem permissions.

64 capabilities require 3x64 bits for each inode. For future expansion, 3x128 seems a safe limit. Storing 128 more bytes in each inode is beyond limits. Most files won't use capabilities, so that would be a massive waste of disk space.

I propose to store a pointer to the capabilities of a file in each inode. On ext2, we already have i_file_acl, which points to the ACL of a file. Capabilities could be stored at the same location.

One possible implementation:

The ext2 part of the ACLs for Linux project at <acl.bestbits.at> uses i_file_acl (and i_dir_acl) as a pointer to a disk block that contains the ACL of a file. The very same disk blocks seem a natural place for storing capabilities. This adds some trouble in the ACL code (it interferes with the ACL cache), but is doable.

[Inodes with identical ACLs frequently share an ACL disk block. When an inode is associated with an ACL, the ACL is looked up in a hash table. If a suitable ACL disk block is found in the cache, that block is reused; otherwise, a new ACL disk block is created.]

An alternative for storing capabilities, ACLs, etc. is to implement a mechanism for storing arbitrary meta information (i.e., attribute lists) for inodes. Irix XFS supports that. Ext2 has no comparable mechanism, but the i_file_acl block approach may be good enough.

The SunWorld article ``Controlling Permissions with ACLs'' at http://www.sunworld.com/swol-06-1998/swol-06-insidesolaris.html and the Linux ACL project at http://acl.bestbits.at/ contain some more implementation ideas.
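The block-sharing scheme Andreas describes in brackets (inodes with identical ACLs sharing one disk block, found through a hash table) can be sketched as below. This is a simplified illustration, not the code from the ACL project: the AclBlockCache class is hypothetical, "disk blocks" are just integers, ACLs are assumed to be order-insensitive lists of (who, permissions) pairs, and replacing an inode's existing ACL is not handled.

```python
class AclBlockCache:
    """Toy model in which inodes with identical ACLs share one 'disk block'."""

    def __init__(self):
        self._by_content = {}  # hash-table lookup: ACL content -> block number
        self._refcount = {}    # block number -> number of inodes sharing it
        self._next_block = 1

    def set_acl(self, inode, acl_entries):
        """Attach an ACL to an inode, reusing an existing block when possible."""
        key = tuple(sorted(acl_entries))
        block = self._by_content.get(key)
        if block is None:
            # No suitable block is cached, so allocate a new one.
            block = self._next_block
            self._next_block += 1
            self._by_content[key] = block
            self._refcount[block] = 0
        self._refcount[block] += 1
        inode["i_file_acl"] = block
        return block

    def sharers(self, block):
        return self._refcount[block]


cache = AclBlockCache()
a, b, c = {}, {}, {}
cache.set_acl(a, [("user:alice", "rw-"), ("group:staff", "r--")])
cache.set_acl(b, [("group:staff", "r--"), ("user:alice", "rw-")])
cache.set_acl(c, [("user:bob", "r--")])
print(a["i_file_acl"] == b["i_file_acl"])  # True
print(a["i_file_acl"] == c["i_file_acl"])  # False
```

Storing capabilities in the same blocks would presumably mean making them part of the lookup key as well, which hints at the interference with the ACL cache that Andreas mentions.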

And Pavel also replied to Jesse's proposal, pointing out that disabling exec() amounted to security-by-obscurity. He acknowledged that it would probably still deter 95% of all attacks, but explained, "you can do exec without actually invoking exec system call -- you close some fds, mmap executable somewhere into your address space, unmap old files ... and you've done exec() without actually doing exec." Jesse protested that there were some security benefits in any case. He took 'passwd' as an example, explaining that unless the program was exec()ed, it would not be given the privileges needed to change the password file. In general, he said, "it is not possible to gain additional capabilities by a subsequent attack on privileged utilities." He added that:

On several B2 rated systems, the shell IS capable of requesting level changes, and unless the shell is operating in the proper environment (capabilities, level, compartment, and process tree) then even the request for security change can be a violation. (BTW, the system calls for changing security environment has to be built into the shell - they don't work otherwise - see below)

On one system I use (Cray UNICOS), the shell cannot change security classifications without:

  1. be a login shell with a parent process that is flagged as a security user entry point (telnetd/sshd receive these privileges).
  2. have no subprocesses
  3. have no open output files other than the controlling terminal (ie. don't redirect stderr to a disk file...)
  4. have the permission to change (elevate) security access.
  5. can not raise level above that of the user connection (ie. secure wire, labeled network connection).

At least two of these get broken with a web server:

1. a web server should not be labeled as a user entry point
4. the web server should not be labeled as such.

There are other restrictions that can be applied by modifying the web server itself:

  1. deny fork capability in children of the listening web server.
  2. deny any open for write capability.
  3. deny listen, bind, connect... capability in children of the parent web server (no new network connections will be allowed after fork).

Horst von Brand pointed out that without CGIs, without a way to record user input, and without client/server database etc., the web site would not have much to offer. But Jesse reiterated that embedded Java and modperl would still be available as part of the server, so there were still certain things the web site would be good for. Those would be, he said:

  1. catalog look-up via modperl or java module
  2. redirect the browser to another site for business etc. applications
  3. serve read-only data

At this point the discussion veered off.

2. Symlink Permissions In devfs

1 Mar 2000 - 7 Mar 2000 (9 posts) Archive Link: "[PATCH] devfs and symlinks--2.3.48"

Topics: FS: devfs, FS: ext2, FS: procfs

People: Richard Gooch, Jamie Lokier, Alan Cox, Linus Torvalds

Matthew Vanecek complained that with devfs installed, symlinks in /dev would all have permissions lr-xr-xr-x instead of lrwxrwxrwx as was standard throughout the rest of the system. He posted a one line patch to fix the inconsistency, but Richard Gooch wouldn't accept it, because "Symlink permissions should not matter. The kernel doesn't care, and neither should applications. If some application out there is doing lstat(2), I'd rather break it and see it fixed, since it's probably broken in other ways too." Jamie Lokier pointed out that in /proc, symlink permissions did matter, in that you couldn't read from symlinks that were not readable by you. Richard replied that procfs was a special case, and that in devfs, symlink permissions were simply ignored by the kernel. Matthew replied that if it didn't matter, why not just fix the inconsistency? Richard reiterated that the inconsistency would help identify broken programs. Jamie analogized that this was the same as burgling someone's house to teach them to keep their door locked. He suggested facetiously, "Why don't we read 9 bits from /dev/urandom and store them in the ext2 symlink mode?" Richard invited him to submit a patch to Linus Torvalds; and Alan Cox added that those bits were just not used.

3. 64-bit Linux

3 Mar 2000 - 8 Mar 2000 (39 posts) Archive Link: "Linux 64 bit - Trillium"

Topics: BSD: NetBSD, FS: NFS, Microkernels: Mach, Networking, SMP, Virtual Memory

People: Jeff V. Merkey, Matti Aarnio, David Weinehall, Zach Brown, Alan Cox, Drew Sanford, Matthew Wilcox, Oliver Xymoron, Victor Khimenko, Paul Jakma, James W. Laferriere, Gregory Maxwell, Brian Pomerantz, Jeff Dike, Florian Weimer

Someone had heard a rumor about 64-bit Linux, and asked about it on the list. Gregory Maxwell mentioned that Linux had been 64-bit for a while already, and pointed to the Alpha and Sparc ports; but Brian Pomerantz, also replying to the original poster, added that those ports still had significant problems. Jeff V. Merkey replied, "You can download the source code for IA64 Linux from VA Linux's Website. There are not any Merced IA64 boxes out at present except for Intel partners, but will be soon (2Q 2000). They will probably cost about as much as the income of a small third world nation when they come out." James Manning pointed out that the sources Jeff referred to had been in the official 2.3 series for a few versions already; but Steve Rihards took issue with the idea that IA64 boxes would be redeemable for one (1) third world nation. He said that on the contrary, they were intended to be priced much lower than Sparcs. Victor Khimenko came down on Steve, pointing out that Jeff had "obviously" been talking about early-access boxes, not retail sales; the reference to the second quarter of 2000 made that clear enough (In His Humble Opinion). Steve didn't reply.

Matti Aarnio, also replying to the original poster, offered the following fine history/summary:

"Trillium" is just one of latest efforts among others to bring one more architecture into Linux. http://www.linuxia64.org/ It has been widely touted only because the target processor is made by intel, and so far unavailable.

Linux began as i386 operating system in 1991. Linux 1.0 had only "i386" in 1994.

At Linux 1.2.* series (1995) mainline kernel already supported:

Now Linux 2.2 already supports:

Linux 2.3 (development series) carries already additionally:

In the works (but not in mainline source) I have heard also of:

Paul Jakma added to this list the IBM S/390 mainframe port, which had been merged into 2.2.1; and David Weinehall pointed out that the S/390 was only in 2.2.x, and hadn't yet been ported to 2.3; he added that 2.3 already supported (continuing Matti's list):

unported architectures that he felt were most needed, were:

There were several replies to this. Florian Weimer added that the Intel Paragon would be another good one to have. Paul Jakma, also replying to David, opined that a lot of work was being done on the Vax port; but James W. Laferriere said that his last visit to the web page had shown very little progress. Alan Cox also replied to David, pointing out that Zach Brown was working on the 88K. Zach replied:

I even had two of them :) But I couldn't find nice docs easily, so I didn't bother. I've now passed the machines on to a nutty NC guy (peter jones), maybe he'll have more luck.

The nicest way to do this port would be to get the motorola vme reference boards, I hear mot.de still has them in stock. Don't forget a very large barf-bag to deal with the 88k itself, and a time machine to find anyone else on the planet who knows anything about them.

In his same post, Alan also remarked:

The AS/400 is more like a toaster than a computer. They take an interesting approach of implementing a protected machine via software 'trust'. The compiler generates apps that are written in an interpreted layer and have been validated to some extent. The code is then executed JIT style with trust checks in the code. The real hardware is much akin to an MMUless 386.

You'd almost want to bootstrap the system with a JVM on the hardware and use the gcc compile into java byte code target to build your system.

Drew Sanford replied, "Having done quite a bit of work with banks, all of whom have used AS400's for transaction accounting, I developed a lot of respect for the way AS400's do things. It has more than once crossed my mind to get my hands on one of the systems the banks were phasing out just to see if I could port some form of linux to run in a VM on the AS400 much like the linux port to the S/390. Were it not for a lack of time and sometimes money, I think I would already have made a go at this. I think it would make quite a machine running linux. As a side note, when I read Alan's remark about the AS400 being more like a toaster than a computer, it made me stop and think for a second. I think if anyone else had said that, it would have created in a desire to dropkick the person that said it, and teach them a thing or two. Coming from Alan, however, it seems outrageously funny....it took me nearly 5 minutes to be able to hit reply. I don't know where the idea of it being like a toaster came from, but the technicality in the architecture is accurate. I've still got a maniacal smile on my face ...."

Matthew Wilcox also replied to David, regarding the possibility of porting to a Cray. He replied with mirth, "i heard they don't have an MMU. ucLinux on a Cray, baby! I've heard people talking about merging ucLinux into 2.5, but they're crazy people." Oliver Xymoron replied:

I've heard the hardware can do it, but the software folks didn't want to give up cycles to indirection. Cray was all about maximum bandwidth, minimum latency. Hence, no cache, no virtual memory, shared libs, etc. The OS's job on a dinosaur supercomputer was to get the hell out of the way.

For a while, I had a Convex C1 in my living room, which was a vector supercomputer. If you wanted, you could run UNIX, complete with TCP and NFS on it, as well as Cray binaries. Assuming you had three-phase power and good A/C. I think my VAIO probably kicks its ass in every respect.

Or you could take all the insides out and use the chassis for rackmount equipment like a good stereo..

In his original post, Matthew had also listed more machines to port to: "AT&T's Hobbit processor. Matsushita's MN10300. Fujitsu FR30. Sun MAJC. Notional Semidestructors 32016. Jeff Dike's usermode port. Linux/L4. mkLinux. There are some m68k machines which aren't included yet like sun3, next and apollo DN. pc532 (see www.netbsd.org). The i860 and i960. Transputers. Sun's 386-based workstations. Sequent SMP machines. Pyramid. IBM Romp. Sony NEWS."

Oliver replied soberly, "Retro-ports, while fun, don't make a whole lot of practical sense. You rapidly reach a point where maintaining an old machine is more expensive than buying a new one. Emulators, on the other hand..."

4. Cisco Routers And syncppp

6 Mar 2000 - 7 Mar 2000 (7 posts) Archive Link: "[PATCH] cisco-hdlc mode in syncppp.c"

People: Gergely Madarasz, Alan Cox, Paul Fulghum

Gergely Madarasz posted a 2-line patch to make the syncppp driver cooperate with Cisco routers in cisco-hdlc mode (such as the 3600 series). Paul Fulghum replied that he'd already submitted a similar patch that was rejected by Alan Cox, on the grounds that it should be "done right". Gergely replied, "Ah, I see... anyway, this fix should be urgent, because I have quite a few clients now who had this problem." Alan replied that he had investigated what it would take to do it right, and had been thoroughly disgusted. But he agreed that Gergely's patch was better than what existed currently. He concluded that he'd either take the patch or do it right himself for 2.2.15-final.

5. Accessing Parents Of Traced Processes

6 Mar 2000 - 9 Mar 2000 (10 posts) Archive Link: "Allow debuger to examine real parent"

People: Mike Coleman, Linus Torvalds, Alan Cox, Pavel Machek

Pavel Machek posted a short patch to allow a debugger to examine the original parent of a traced process. He said the information it provided was not available elsewhere, but Alan Cox said the feature was superfluous, since 'ps' obviously got the information from somewhere. But Mike Coleman replied:

ps reports the tracing process as the parent, rather than reporting the original parent as the parent. AFAIK, ps gets its info from /proc, and you can verify that every occurrence of 'pptr' in the proc code is 'p_pptr' (referring to the parent) rather than p_opptr (referring to the original parent).

[The only times when p_pptr != p_opptr is when a PTRACE_ATTACH happened, or when CLONE_PTRACE was used.]

Although I'm the author of the patch, I don't think it's really the ultimate correct solution to the problem. The correct solution, IMHO, is (within userland) to always report the original parent in places where the parent is currently being reported. This preserves the illusion that ptracing isn't happening, for processes that don't care or need to know about it. For processes that really *do* need to know, there should be a "special" way of finding out--a new file in proc, maybe, or a new syscall or new ptrace subcommand.

Since the full "correct" solution isn't required for SUBTERFUGUE, and since I've heard that small, simple patches are easier to get into the kernel, I just used a minimal solution.
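The distinction Mike draws between p_pptr and p_opptr can be shown with a small model. This is only an illustration of the behavior described in his email: the Task class and the helper functions are hypothetical Python stand-ins, and only the two field names come from the thread.

```python
class Task:
    """Toy task carrying the two parent pointers named in the thread."""

    def __init__(self, pid, parent=None):
        self.pid = pid
        self.p_opptr = parent  # original parent
        self.p_pptr = parent   # current parent (the tracer while being traced)


def ptrace_attach(tracer, child):
    # Attaching re-parents the child to the tracer; p_opptr is left alone.
    child.p_pptr = tracer


def ptrace_detach(child):
    # Detaching hands the child back to its original parent.
    child.p_pptr = child.p_opptr


def reported_ppid(task):
    """The parent that, per Mike's description, /proc and therefore ps report."""
    return task.p_pptr.pid


def original_ppid(task):
    """The parent that the patch would let a debugger examine."""
    return task.p_opptr.pid


shell = Task(100)
debugger = Task(200)
child = Task(300, parent=shell)
ptrace_attach(debugger, child)
print(reported_ppid(child), original_ppid(child))  # 200 100
```

While the child is traced, the two answers differ, which is exactly the case where a tool relying on the reported parent sees the debugger instead of the shell.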

Regarding Mike's description of /proc's behavior, Linus Torvalds replied, "Hmm.. That's probably a bug. It would probably be best to export the original parent in the /proc setup, and make the "debugging parent" possibly visible some way (but the original parent should be the normal one)." And regarding Mike's assessment of the proper fix, Linus replied:

I think that illusion should be maintained, and really only debugging should know about the so-called "real" parent.

The whole "re-parenting" thing was just a bad excuse for not re-doing the parent pointers entirely, and should probably be ripped out at some point. We should probably

The reason for the re-parenting is obviously signal handling at exit time, but that's just such a special case that it's not really worth the confusion between "original parent" and "real parent". But it works, and this is code that nobody really likes to touch because it is so fundamental, so..

6. Some Kernel Files Use BSD License

8 Mar 2000 (33 posts) Archive Link: "BSD Licensed files in Linux kernel."

Topics: BSD

People: Alan Cox, Linus Torvalds

A one-day thread. Darren Reed said that certain files in the kernel sources, such as linux/drivers/net/bsd_comp.c, were released under the BSD license, not the GPL, which he suspected amounted to a violation of the GPL. There were numerous replies. Among them, Alan Cox explained:

I talked to a genuine authentic (non-internet) lawyer about the GPL v BSD stuff a few years ago:

In general GPL + anything is a no go unless the 'anything' is extremely non restrictive. The GPL draws an arbitrary line and labels one side free the other not.

Since UC Berkeley dropped the advertising clause from their parts of the code it's possible this problem has actually vanished by default in the kernel case now. I've not checked the documentation to figure this out.

And Linus Torvalds also explained:

There are numerous files that are under a dual license, which is perfectly ok. I don't actively encourage it, because it creates problems if somebody else than the original author were to ever say "I want to patch the Linux GPL'd version of this file, but I do NOT want to make my changes available under the BSD license" or vice versa, but it hasn't been a problem so far, so it's not something I discourage either.

(And besides, if somebody were to send me a patch with those kinds of restrictions, my solution would probably be to just not accept the patch in the first place).

The one file you mention (bsd_comp.c) is the only real special case that I'm aware of, which is why it cannot be linked into the kernel directly (it only ever gets built as a module - not for any technical reasons, but simply due to the copyright issue).

7. Scheduling Difficulties Under Linux

8 Mar 2000 - 10 Mar 2000 (26 posts) Archive Link: "Linux responsiveness under heavy load"

Topics: BSD: FreeBSD

People: Rik van Riel, Jeff V. Merkey, Theodore Y. Ts'o, Victor Khimenko, Andrea Arcangeli, Helge Hafting

Nicolas MONNET reported that under heavy loads, Linux became less responsive. He'd heard that the BSDs did much better in that area, and asked if it was true. Rik van Riel replied:

We're working on it. Andrea Arcangeli is currently improving the disk scheduling algorithm, I have some (small) changes to the memory management subsystem and a number of other people are working on patches that aren't ready/public yet...

It's true that we lag somewhat behind FreeBSD in this point, but we're working on it :)

Victor Khimenko also replied to Nicolas, saying that the problem was systemic and very difficult to solve. Since the kernel was not multithreaded, userland processes could never interrupt it. And while you could limit a process's CPU time, limiting the kernel's CPU time was not possible. Linux also handled low-memory situations poorly, and that was also something difficult to change.
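As a minimal sketch of the per-process side of Victor's point, the following C function caps the CPU time charged to the calling process using the POSIX setrlimit() call with RLIMIT_CPU. The function name limit_own_cpu_seconds is hypothetical and was not part of the thread. Nothing comparable exists for capping the kernel's own CPU use, which is the gap Victor described.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical helper, for illustration only.
 * Sets the soft CPU-time limit of the calling process to `seconds`.
 * When the soft limit is exceeded the process receives SIGXCPU.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
int limit_own_cpu_seconds(rlim_t seconds)
{
    struct rlimit lim;

    assert(seconds > 0);

    /* Read the current limits so the hard limit is left untouched. */
    if (getrlimit(RLIMIT_CPU, &lim) != 0)
        return -1;

    /* An unprivileged process may not set a soft limit above the hard limit. */
    if (lim.rlim_max != RLIM_INFINITY && seconds > lim.rlim_max)
        seconds = lim.rlim_max;

    lim.rlim_cur = seconds;
    return setrlimit(RLIMIT_CPU, &lim);
}
```

A program would call limit_own_cpu_seconds() early, before starting its heavy work. The limit only covers time accounted to that one process.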

Elsewhere, Jeff V. Merkey pointed out that Windows NT had the same problem of poor interactivity under heavy load, and "the way they got around it was to "cheat" by cranking up the priority of the console and Windows GUI subsystem anytime someone hit a key or moved the mouse. In fact, an NT server is still heavily loaded, but by making the display look and feel "snappy", it presents the illusion that the server has great responsiveness under heavy load." He suggested working the same trick for bash and Xwindows. Helge Hafting added that OS/2 employed a similar method. Theodore Y. Ts'o replied to Jeff, explaining:

I wouldn't call this cheating at all. There are a number of classical tradeoffs that you have to contend with when doing scheduler design. One is the tradeoff between efficiency and responsiveness. Switching between tasks costs, and so if you have a very heavily loaded batch system, it's most efficient to set the scheduling quantum very high, so that a process runs for long periods of time before some other process runs. On the other hand, for a desktop system, where user interaction is important, you want to set the scheduling quantum as short as possible.

Linux does something similar, in terms of rewarding processes that give up their timeslice before their scheduling quantum is up, and giving less priority to CPU-bound processes. NT is simply taking it to the next level, and rewarding user-interaction more than other I/O bound processes. Since NT doesn't have an X server, they have to bump up the priority of the W32 subsystem when it's in the middle of doing graphics. The equivalent analogue for us on the Unix side is to run the X server at a slightly high priority. (Say, nice -4).

This isn't cheating; it's just making a policy statement that user interaction is more important than some of the other things that the machine might be doing. If a 10 minute kernel compile takes an extra 10 seconds to complete, it's not a big deal. But if mouse motion and keyboard response takes an extra 10 milliseconds, it is a very big deal as far as the user is concerned.
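The efficiency-versus-responsiveness tradeoff Ted describes can be put into numbers with a toy round-robin model. The sketch below is an illustration under simplifying assumptions (every task uses its full quantum, every switch has a fixed cost). It is not how the Linux scheduler works, and all function names and figures are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Toy round-robin model for illustration only; this is NOT the Linux
 * scheduler. All names and numbers here are hypothetical. */

/* Fraction of CPU time spent on useful work when every quantum of
 * quantum_ms is followed by a context switch costing switch_ms. */
double cpu_efficiency(double quantum_ms, double switch_ms)
{
    assert(quantum_ms > 0.0 && switch_ms >= 0.0);
    return quantum_ms / (quantum_ms + switch_ms);
}

/* Worst-case delay before a newly runnable task gets the CPU, assuming
 * the other ntasks - 1 tasks each run for a full quantum first. */
double worst_case_wait_ms(int ntasks, double quantum_ms, double switch_ms)
{
    assert(ntasks >= 1);
    assert(quantum_ms > 0.0 && switch_ms >= 0.0);
    return (ntasks - 1) * (quantum_ms + switch_ms);
}

/* Print both figures for a few quantum lengths to show the tradeoff. */
void print_tradeoff(int ntasks, double switch_ms)
{
    static const double quanta_ms[] = { 1.0, 10.0, 100.0 };
    size_t i;

    for (i = 0; i < sizeof quanta_ms / sizeof quanta_ms[0]; i++) {
        printf("quantum %6.1f ms: efficiency %.3f, worst-case wait %8.1f ms\n",
               quanta_ms[i],
               cpu_efficiency(quanta_ms[i], switch_ms),
               worst_case_wait_ms(ntasks, quanta_ms[i], switch_ms));
    }
}
```

Calling print_tradeoff() with a handful of tasks shows both figures growing with the quantum: a long quantum wastes less time on switching but makes an interactive task wait longer for its turn, which is the policy choice Ted refers to.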

8. Problems With Newest Compiler Snapshots

10 Mar 2000 - 13 Mar 2000 (8 posts) Archive Link: "[pre-2.3.51-2] memset problem on IA32"

Topics: Version Control

People: Philipp Thomas, Mike Galbraith, David S. Miller, Artur Frysiak

Artur Frysiak got the following linker error while compiling 2.3.51-pre2:

kernel/kernel.o: In function `check_free_space':
kernel/kernel.o(.text+0xb384): undefined reference to `memset'

David S. Miller posted a one-line patch to linux/include/linux/fs.h, to include linux/string.h; Artur replied with complete success, but Philipp Thomas objected that David's patch "doesn't help when using current CVS gcc and it decides to generate a memcpy libcall as those places won't see the inlined versions from asm/string.h, at least not on ia32." Mike Galbraith agreed that just calling memcpy() from the kernel sources would allow newer compiler snapshots to generate a library call, instead of using the inline memcpy() function as earlier compiler versions had. According to his research, none of the docs said that the compiler had to give preference to using inlined functions over library calls; and all the folks he'd asked had also confirmed that it was completely up to the compiler. He added, "even if you put memcpy() in a lib, nothing dictates that the compiler can't call bcopy().. or a whole slew of libc functions for that matter," and concluded, "If this is true, it means that it's a coding bug to use this kind of construct in the kernel."
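The kind of construct at issue can be illustrated with a small C sketch. Neither function below names memset or memcpy, yet a compiler is free to implement the struct zeroing and struct assignment as calls to those routines. GCC's documentation states that it expects memcpy, memmove, memset and memcmp to be available even in a freestanding environment. The struct and function names are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record type, large enough that a compiler may prefer a
 * library call over inline stores. Not taken from the kernel source. */
struct big_record {
    unsigned char payload[512];
    int tag;
};

/* No memset appears in the source, but the compiler may emit a call to
 * memset to implement the zero-initialisation and assignment. */
void clear_record(struct big_record *rec)
{
    struct big_record zero = { 0 };

    assert(rec != NULL);
    *rec = zero;
}

/* No memcpy appears in the source, but the compiler may emit a call to
 * memcpy to implement the struct assignment. */
void copy_record(struct big_record *dst, const struct big_record *src)
{
    assert(dst != NULL && src != NULL);
    *dst = *src;
}
```

In a userland program the C library satisfies any such call. In a kernel there is no libc, so an out-of-line memset or memcpy has to be linked in, or the build fails with an undefined reference like the one Artur saw. Whether a given compiler emits the call depends on its version and optimization flags, and inspecting the generated assembly is one way to check.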

9. More Yamaha Doc Problems

11 Mar 2000 (10 posts) Archive Link: "Yamaha Sounddrivers again"

Topics: Sound: SoundBlaster

People: Daniel Egger, Alan Cox

Daniel Egger was unable to activate SoundBlaster emulation on his soundcards, and couldn't find a solution in the linux-kernel archives. So he reported, "Having read the "documentation" of the YMF754 and YMF744 chips it seems to me that a soundcard containing these chips wouldn't even need bitfiddling to activate the sb-emulation because it is already activated." In practice, however, this didn't seem to be the case. Alan Cox replied simply, "you have to program the rest of the board to kick it into that mode. That phase is undocumented." Daniel was confused by this, since the documentation seemed to state that the board was already in that mode. But Alan replied bluntly, "I wrote the driver, tried the theory. The AC97 codec is not being initialised by the hardware and the DMA stuff seems to need some kind of support setup doing that isn't documented."

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.