Kernel Traffic #236 For 26 Oct 2003

By Zack Brown

Mailing List Stats For This Week

We looked at 1670 posts in 8990K.

There were 558 different contributors. 284 posted more than once. 186 posted last week too.

1. Patch To Support Many Groups
28 Sep 2003 - 2 Oct 2003 (22 posts) Archive Link: [PATCH] Many groups patch.
Topics: Clustering: Beowulf, FS: InterMezzo, FS: smbfs, Ioctls, Samba
People: Rusty Russell, Linus Torvalds, Pete Zaitcev, Tim Hockin

Rusty Russell posted a patch, written with help from Tim Hockin, to raise the number of possible groups above 200. Apparently Samba customers had a particular desire for that. He explained, "This version drops the internal groups array (it's so often shared that it's not worth it, and the logic becomes a bit neater), and does vmalloc fallback in case someone has massive number of groups." Linus Torvalds replied:

Why?

kmalloc() works fine. Anybody who needs 200 groups may be sane, but anybody who needs more than fits in a kmalloc() is definitely so far out that there is no point.

The vmalloc space is limited, and the code just gets uglier.

Have you been looking at glibc sources lately, or why do you believe that we should encourage insane usage?

Pete Zaitcev from Red Hat said that his company did "have some customers who run insane number of groups, with their own patches. This practice is popular in the Beowulf crowd for some reason. I should note this is not very mainstream." Tim Hockin said his company also had customers who wanted more groups than kmalloc could handle. He posted his own patch, saying, "My version uses a struct group_info which has an array of pages. The groups are sorted and bsearched, instead of linear. The performance is quite good. An older version against 2.6.0-test1 or something is attached. If this method will fly, I'll take some of Rusty's good ideas and finish this version of it." Linus felt this one was saner than Rusty's, though he still disapproved of supporting thousands of groups. Tim did some more work and submitted a new patch against 2.6.0-test6, with a changelog entry:

Summary: Get rid of the NGROUPS hard limit.

This patch removes all fixed-size arrays which depend on NGROUPS, and replaces them with struct group_info, which is refcounted, and holds an array of pages in which to store groups. groups_alloc() and groups_free() are used to allocate and free struct group_info, and set_group_info is used to actually put a group_info into a task. Groups are sorted and b-searched for efficiency. Because groups are stored in a 2-D array, the GRP_AT() macro was added to allow simple 1-D style indexing.

This patch touches all the compat code in the 64-bit architectures. These files have a LOT of duplicated code from uid16.c. I did not try to reduce duplicated code, and instead followed suit. A proper cleanup of those architectures code-bases would be fun. Any sysconf() which used to return NGROUPS now returns INT_MAX - there is no hard limit.

This patch also touches nfsd by imposing a limit on the number of groups in an svc_cred struct.

This patch modifies /proc/pid/status to only display the first 32 groups.

This patch removes the NGROUPS define from all architectures as well as NGROUPS_MAX.

This patch changes the security API to check a struct group_info, rather than an array of gid_t.

This patch totally horks Intermezzo.
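
To make the data structure in the changelog concrete, here is a minimal Python sketch. It is not Tim's code (which is kernel C); it only illustrates the idea: group IDs are kept sorted in fixed-size blocks standing in for pages, grp_at() plays the role of the GRP_AT() macro, and groups_search() is a binary search over the 1-D index. The block size, the class name and the helper names are assumptions made for illustration.

```python
PAGE_SIZE = 4096                            # assumed page size in bytes
GID_SIZE = 4                                # assumed size of one group ID
NGROUPS_PER_BLOCK = PAGE_SIZE // GID_SIZE   # group IDs that fit in one block


class GroupInfo:
    """Hypothetical stand-in for struct group_info: sorted IDs in page-sized blocks."""

    def __init__(self, gids):
        ordered = sorted(gids)
        self.ngroups = len(ordered)
        # Split the sorted IDs into a 2-D layout: one inner list per "page".
        self.blocks = [
            ordered[i:i + NGROUPS_PER_BLOCK]
            for i in range(0, self.ngroups, NGROUPS_PER_BLOCK)
        ]


def grp_at(gi, i):
    """1-D style indexing over the 2-D block layout (the role of GRP_AT())."""
    return gi.blocks[i // NGROUPS_PER_BLOCK][i % NGROUPS_PER_BLOCK]


def groups_search(gi, gid):
    """Binary search for gid; returns True if the ID is in the set."""
    lo, hi = 0, gi.ngroups
    while lo < hi:
        mid = (lo + hi) // 2
        current = grp_at(gi, mid)
        if gid > current:
            lo = mid + 1
        elif gid < current:
            hi = mid
        else:
            return True
    return False
```

With 2500 groups, a lookup takes at most about a dozen comparisons instead of a linear scan over every entry, and no single allocation needs to be larger than one block.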

Linus said he was sorry to keep complaining, but that following the code-duplication practice in the compat code just made the patch uglier than it had been before. He suggested extracting the duplicated code into a new kernel/gid16.c file, and adding a CONFIG_GID16 boolean config variable. Tim posted a patch, but then said a bit later:

So I dug deeper into the problem, and I think it can be solved relatively painlessly.

First a few observations, based on grep:

  • uid16_t is only used once: fs/smbfs/ioctl.c
  • gid16_t is never used
  • every arch defines uid16_t/gid16_t to unsigned short
  • some arches define old_uid_t/old_gid_t the same as uid_t/gid_t, some don't
  • ncpfs and smbfs use __kernel_old_uid_t
  • old_uid_t and old_gid_t are only used in highuid.h and uid16.c
  • every arch that defines UID16 defines old_uid_t and old_gid_t to ushort, except x86_64 (which I *think* is a bug)

So what I'm thinking is this:

  1. convert uid16.c to use uid16_t and gid16_t, and NOT use highuid.h
  2. build uid16.o iff CONFIG_UID16_SYSCALLS
  3. anywhere that defines CONFIG_UID16 adds CONFIG_UID16_SYSCALLS
  4. any 64-bit arch that wants uid16 stuff adds CONFIG_UID16_SYSCALLS

Now, the 16-bit forms of the syscalls are available to all the interested parties. Then we go through the arch stuff and remove all the duplicated uid16 stuff, where ever possible.

This will leave highuid.h unmolested, so all dependants of that will still work.

Here's the really simple patch, without removing any arch code, yet. What's the preferred way in Kconfig to identify the proper arrangement of this idea?

But there was no reply.
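
For readers unfamiliar with what the 16-bit syscall variants have to do, the following Python sketch shows the general idea of translating between 32-bit and 16-bit IDs: an ID too large for 16 bits is reported as an "overflow" ID, and the 16-bit value -1 (0xFFFF) is widened to the 32-bit -1. The function names echo helpers in highuid.h, but the code and the overflow value 65534 are illustrative assumptions, not taken from Tim's patch.

```python
OVERFLOW_UID = 65534        # assumed default "overflow" ID for IDs that do not fit
UID16_MINUS_ONE = 0xFFFF    # -1 as seen through a 16-bit unsigned type
UID32_MINUS_ONE = 0xFFFFFFFF  # -1 as seen through a 32-bit unsigned type


def high2lowuid(uid):
    """Narrow a 32-bit ID to 16 bits, substituting the overflow ID if it does not fit."""
    if uid & ~0xFFFF:       # any bit set above the low 16 bits
        return OVERFLOW_UID
    return uid


def low2highuid(uid16):
    """Widen a 16-bit ID to 32 bits, preserving the special value -1."""
    if uid16 == UID16_MINUS_ONE:
        return UID32_MINUS_ONE
    return uid16
```

The point of the overflow ID is to avoid a "uid mod 65536" effect, where a large ID would silently alias a small, unrelated one when truncated.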

 

2. Separating Kernel Headers From User-Space Headers (The Saga Continues)
30 Sep 2003 - 3 Oct 2003 (16 posts) Subject: "[PATCH] linuxabi"
Topics: BSD
People: Andries Brouwer, Bernd Eckenfels, Matthew Wilcox, J.A. Magallon, Eric W. Biederman

Continuing from (or in ignorance of) Issue #232, Section #3 (2 Sep 2003: Separating Kernel Headers From User-Space Headers), Andries Brouwer said:

Something we have talked about for a long time is separating out from the kernel headers the parts fit for inclusion in user space.

This is a very large project, and it will take a long time, especially if we want the user space headers to be a pleasure to look at, instead of just a cut-n-paste copy of whatever we find in the current headers.

Some start is required, and the very first step is making sure that you agree with the project. Immediately following is the choice of directory names.

Below

  1. a small textfile "linuxabi" describing the naming (subdirectories linuxabi and linuxabi-alpha etc of include),
  2. the file linuxabi/mountflags.h with definitions for MS_RDONLY and family,
  3. the file linux/mountflags.h that includes linuxabi/mountflags.h and moreover defines things like MS_RMT_MASK and IS_NOATIME(inode), and
  4. the patch on fs.h that removes these defines and adds an include line.

And the text of his linuxabi file:

The subdirectories linuxabi and linuxabi-$ARCH (linuxabi-alpha, linuxabi-arm, ...) of linux/include are meant for headers that are to be used both by the kernel and in user space. The symbolic link linuxabi-arch points at linuxabi-$ARCH for the current architecture.

Be careful not to pollute namespace.

Typical material for such headers are manifest constants and structures used by the kernel-userspace interface.

Make sure no symbolic types like dev_t, pid_t, ino_t and the like are used, but only explicit types like char and int, or even more explicit types like uint8_t and int64_t.

These headers are "append-only", in the sense that Linux tries to keep supporting old interfaces.

Bernd Eckenfels took issue with this last statement, saying that Linux did not necessarily insist on keeping old interfaces, nor would it insist on that in the future. Aside from that, he liked Andries' work. Elsewhere, J.A. Magallon suggested calling the file 'abi' instead of 'linuxabi', so other systems like the BSDs could follow the same convention.

Elsewhere, Eric W. Biederman said this whole thing was a 2.7 project. Andries disagreed, saying the restructuring was entirely unrelated to kernel development; but Eric replied that it would need a full development cycle to get the feature right.

No one mentioned the work done earlier by Matthew Wilcox.

 

3. Value Of DigSig Questioned By Developers
1 Oct 2003 (15 posts) Subject: "Re: [ANNOUNCE] DigSig 0.2: kernel module for digital signature verification for binaries"
People: Pavel Machek, Makan Pourzandi, Alexander Viro, Valdis Kletnieks, Willy Tarreau

Continuing from Issue #234, Section #10 (25 Sep 2003: New DigSig Module For Digital Signature Verification For Binaries), Pavel Machek asked for an explanation of why someone would want to use DigSig, since "if I want to exec something, I can avoid exec() syscall and do mmaps by hand..." Makan Pourzandi replied:

There are different answers to this question because there are many possible attack scenarios. I try to take the most realistic one and give a short answer.

For the attack described by you to be successful, one assumes that the intruder gained access to the system, he wrote his own code on the system (or brought it in), and compile it on the system (cannot execute its own code as it is not signed), produced the binary to mmap the malicious code to the memory, and run the code that call syscall mmap.

First digsig can help to avoid the access to the system by the intruder. as it aborts the execution of malicious code which often leads to a root access for the intruder.

Second, digsig can avoid the execution of the binary that allows to bring in the code or other malicious binaries. AFAIK, the intruders generally use their own binary to download malicious code. This is because in hardened systems, the use of ftp or other alike binaries, (when these binaries are not completely removed from the system for security reasons) is closely monitored and controlled through firewalling rules. Even in simple desktops, it is rather easy to control the use of ftp and alike to track down the intrusion source. therefore, the intruder needs to run his own binary to download the root kit which is avoided by the use of digsig.

Third, the intruder now has access to the system, he cannot execute the code he brought in with himself (not signed) or he cannot bring it in (c.f. above). So he needs to compile the code on the system. AFAIK, for the absolute majority of servers the admins remove all compilers (specially gcc) on all servers. this is for several different security reasons (I don't want to get there). therefore, the above hypothesis gets even more difficult to realize.

Last, but I believe the most important, the level of difficulty of execution of such an attack is much higher than the average knowledge level of many script kiddies. The absolute majority of attackers have little or absolutely not any knowledge of the operating systems in general and linux in particular, let aside the knowledge of writing a C program, calling mmaps in that program and run the malicious code to gain access as root, then remove the module to execute a classical attack.

There is no such thing as 100% secure system, digsig increases the level of security of the system as it just makes it much more difficult for the intruder to succeed in his/her attack.

A few posts down the line, Alexander Viro pointed out that DigSig might have a temporary impact on 'script kiddies', but that "in a month rootkits get updated and we are back to square 1, with additional mess from patch..." Willy Tarreau and Pavel agreed with this. Close by, Valdis Kletnieks added, "the only thing the patch does is raise the bar on a purely temporary basis, and that it provides little long-term benefit as soon as the rootkits start working around it. As has been pointed out, DigSig only secures one tiny part of the way that executable code gets into memory."

 

4. Linux 2.4.23-pre6 Released
1 Oct 2003 - 9 Oct 2003 (9 posts) Subject: "Linux 2.4.23-pre6"
Topics: Power Management: ACPI, USB
People: Marcelo Tosatti

Marcelo Tosatti put out 2.4.23-pre6, saying, "It contains several ACPI fixes (the USB "not working anymore" problems in -pre5 should be gone), support for the SCTP protocol, x86-64/PPC/SH merges, network drivers update (EMAC, e1000, sk98lin), megaraid update, amongst others."

 

5. Kernel Port-Availability Security Suggestion
1 Oct 2003 - 6 Oct 2003 (8 posts) Subject: "A new model for ports and kernel security?"
Topics: Backward Compatibility, FS: accessfs, Spam
People: John Lange, James Morris, Jesse Pollard, Valdis Kletnieks

John Lange proposed:

why do we have this requirement that only root processes can connect to low ports (under 1024) ?

My understanding is that this is a hold-over from ancient days gone past where it was meant to be a security feature. Since only root processes can listen on ports less than 1024, you could "trust" any connection made to a low port to be "secure". In other words, nobody could be "bluffing" on a telnet port that didn't have root access therefore it was "safe" to type in your password.

I don't know if the above is the real reason or not but if it is, clearly it has outlived its usefulness as a "security" feature.

Regardless, does not the requirement that only root can bind to low ports now create more of a security problem than it ever solved?

Are not processes forced to run as root (at least at startup) that have security holes in them not the leading cause of "remote root exploits"?

So if the root-low-port requirement isn't serving any purpose and is indeed the cause of security problems is it not time to do away with it?

Furthermore, while only root can bind to low-ports, ANY user can bind to high ports. This also causes a ton of security concerns as well!

So I would like to propose the following improvement to kernel security and I invite your comments.

Suggestion: A groups based port binding security system for both outgoing and incoming port binding.

For example, the group "root" is allowed to listen to ports "*" (all) and allowed outgoing connections to "*" (all) as well.

The group "www" would be allowed to bind to ports "80, 443" (http, https) and not allowed ANY outgoing connections.

The group "mail" (or postfix, or whatever) would be allowed to listen to port "25" (smtp) and connect to "25".

The group "users" would not be allowed to listen at all but might be allowed to connect to 20-21, 80, 443.

etc.

This accomplishes two major things:

  • no process ever needs to run as root.
  • no unauthorized process can ever listen on a port or make connections. On servers that allow remote users this prevents things like bots, spam relays, ftp drops and all sorts of mischief.

I envision a simplistic "/etc/ports" with the format, "<groupid>,<incoming ports>,<outgoing ports>"

I realize similar things can be accomplished in other ways (with iptables I believe) but it just seems to me that this should be a fundamental part of the systems security and therefore should be in the kernel by default (just as the root binding to low ports is currently).
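
As a rough illustration of how such a policy file could be interpreted, here is a small user-space Python sketch. John's post gives only the three-field format, so several details here are assumptions: port lists are space-separated (so that commas can separate the fields), ranges are written "20-21", "*" means all ports, an empty field means none, and the function names are invented. It models the lookup only, not any kernel mechanism.

```python
# Hypothetical policy text built from the examples in the proposal.
EXAMPLE_POLICY = """\
root,*,*
www,80 443,
mail,25,25
users,,20-21 80 443
"""


def parse_ports(spec):
    """Parse '*', '' or a space-separated list of ports and ranges.

    Returns None for 'all ports', otherwise a list of (low, high) tuples.
    """
    spec = spec.strip()
    if spec == "*":
        return None
    ranges = []
    for item in spec.split():
        low, _, high = item.partition("-")
        ranges.append((int(low), int(high or low)))
    return ranges


def parse_policy(text):
    """Map each group name to its (incoming, outgoing) port rules."""
    policy = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        group, incoming, outgoing = line.split(",")
        policy[group.strip()] = (parse_ports(incoming), parse_ports(outgoing))
    return policy


def _allowed(ranges, port):
    if ranges is None:          # '*' was given: every port is allowed
        return True
    return any(low <= port <= high for low, high in ranges)


def may_listen(policy, group, port):
    """True if the group may bind a listening socket to this port."""
    rule = policy.get(group)
    return rule is not None and _allowed(rule[0], port)


def may_connect(policy, group, port):
    """True if the group may open an outgoing connection to this port."""
    rule = policy.get(group)
    return rule is not None and _allowed(rule[1], port)
```

With the example policy, may_listen(parse_policy(EXAMPLE_POLICY), "www", 80) is allowed while may_connect(..., "www", 80) is not, and a group with no entry at all is denied everything.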

Valdis Kletnieks said he thought John's proposal was already written, as the grsecurity patch (http://www.grsecurity.org). Elsewhere, James Morris also said a similar feature was implemented in AccessFS (http://www.olafdietsche.de/linux/accessfs/). But James added, "We should keep the standard behavior as default in the core kernel. Other security models can be implemented via LSM, Netfilter, config options etc." John took it as a good sign that other folks had implemented various versions of his idea; he said to James:

I believe there are several compelling reasons why the standard behaviour should be changed.

  • patches are not a real solution. As a sysadmin I can't afford the extra headache of applying patches after the fact every time I need to upgrade the kernel. Also, if there is an urgent patch to the kernel then I would have to wait for the external patch to be updated before I could do a kernel compile. So generally external patches are problematic for your standard sysadmin.
  • If the functionality is not built into the standard behaviour then no one will code for it.
  • If it is generally agreed that the current behaviour is outdated and creates problems with security then we have to ask "Is there any compelling reason to keep it?" Would linux with the patch not be a better, more secure Linux?

Backward compatibility would not be a problem because most programs just try and grab the port and error if they can't get it. They would work fine under the /etc/ports idea.

Any other programs that might have problems (for example any which check to see if they are root before starting) can still be started as root. Again, no problem.

Jesse Pollard said that application porting compatibility would be an issue:

Right now all that is necessary is to recompile the application. With the patch, you also have to figure out how to apply appropriate ports to the application, and you don't know ahead of time how many ports to allocate. Grid applications tend to have one port for each node they communicate with. If two users generate a 5 way connection you have to give two sets of groups... If the user then wants a 10 way you have to reallocate again.

This is insufficient flexibility. What you want is to tie each port to a capability (or tie port allocation to a capability) and then grant the user the capability to allocate ports. This really calls for a LSM based module that can pass the request to a security daemon to permit/deny port allocation based on external rules.

This would be more flexible, maintainable, and is less intrusive of the kernel core.

 

6. Big Updates To HFS+ And HFS
2 Oct 2003 - 6 Oct 2003 (15 posts) Subject: "[ANNOUNCE] new HFS(+) driver"
People: Roman Zippel

Roman Zippel announced:

This is a rather big update to the HFS+ driver. It includes now also an updated HFS driver. Both drivers use now almost the same btree code and the general structure is very similar, so one day it will be possible to merge both drivers. The HFS driver got a major cleanup and a lot of broken options were removed, most notably if you want to continue using netatalk with this driver, you have to fix netatalk first.

The HFS+ driver has a number of improvements and fixes:

  • blocks are now preallocated.
  • allocation file is now in the page cache too
  • better extent caching
  • btrees are now able to grow arbitrarily
  • allocation block size can now be larger than a page
  • actual fs block size is adjusted to avoid alignment problems
  • cdrom session/partition support (note that this is a crutch and has problems)

This is basically the version I'd liked to get merged into 2.6 (minus lots of ifdefs and some debug prints). You can find the driver at http://www.ardistech.com/hfsplus/

 

7. New Xen Virtual Machine Monitor For x86
2 Oct 2003 - 3 Oct 2003 (24 posts) Subject: "[ANNOUNCE] Xen high-performance x86 virtualization"
Topics: BSD: FreeBSD, Microsoft, User-Mode Linux
People: Ian Pratt, Keir Fraser, Karim Yaghmour, Theodore Ts'o, John Bradford, Lars Marowsky-Bree

Ian Pratt announced:

We are pleased to announce the first stable release of the Xen virtual machine monitor for x86, and port of Linux 2.4.22 as a guest OS.

Xen lets you run multiple operating system images at the same time on the same PC hardware, with unprecedented levels of performance and resource isolation. Even under the most demanding workloads the performance overhead is just a few percent: considerably less than alternatives such as VMware Workstation and User Mode Linux. This makes Xen ideal for use in providing secure virtual hosting, or even just for running multiple OSes on a desktop machine.

Xen requires guest operating systems to be ported to run over it. Crucially, only the kernel needs to be ported, and all user-level application binaries and libraries can run unmodified. We have a fully functional port of Linux 2.4.22 running over Xen, and regularly use it for running demanding applications like Apache, PostgreSQL and Mozilla. Any Linux distribution should run unmodified over the ported kernel. With assistance from Microsoft Research, we have a port of Windows XP to Xen nearly complete, and are planning a FreeBSD 4.8 port in the near future.

Xen is brought to you by the University of Cambridge Computer Laboratory Systems Research Group. Visit the project homepage to find out more, and download the project source code or the XenDemoCD, a bootable `live iso' image that enables you to play with Xen/Linux 2.4 without needing to install it on your hard drive. The CD also contains full source code, build tools, and benchmarks. Our SOSP paper gives an overview of the design of Xen, and evaluates the performance against other virtualization techniques.

Work on Xen is supported by UK EPSRC grant GR/S01894, Intel Research Cambridge, and Microsoft Research Cambridge via an Embedded XP IFP award.

Home page : http://www.cl.cam.ac.uk/netos/xen
SOSP paper : http://www.cl.cam.ac.uk/netos/papers/2003-xensosp.pdf

Lars Marowsky-Bree was very happy to see this, and Karim Yaghmour asked if there were any plans to port Xen to other architectures. Keir Fraser replied, "Our aim was to implement an efficient VMM for commodity hardware, and that really means x86. We're considering a port to x86-64, but so far we're limited in man power (this is why *BSD is not yet available, for example)." Close by, John Bradford asked if Xen could run within itself, recursively; and Keir replied, "No --- Xen runs on x86 but exports a different 'x86-xeno' virtual architecture that OSes must be ported to (basically, privileged ops must go through Xen for validation). x86 != x86-xeno, so Xen will not run on Xen." Theodore Ts'o asked how hard it would be to port Xen to x86-xeno in that case, and Keir replied:

To allow efficient switching in and out of Xen we take a small amount of every virtual address space, and also grab ring 0. Since we don't hide that from overlying OSes, we couldn't do a full recursive implementation of Xen -- we'd run out of rings (quickly) and address space (eventually) :-)

Full recursion needs full virtualization. Our approach offers much better performance in the situations where full virtualization isn't required -- i.e., where it's feasible to distribute a ported OS.

Karim Yaghmour said:

I noticed that the SOSP Xen paper briefly mentions Jacques Gelinas' work on VServers (http://www.solucorp.qc.ca/miscprj/s_context.hc). While Jacques' work hasn't attracted as much public attention as other Linux virtualization efforts, I've personally found the approach and concepts quite fascinating. Among other things, most of the code implementing the contexts is architecture-independent (save for a few syscalls added to arch/*/kernel/entry.S). So, thinking aloud here, I'm wondering in what circumstances I'd prefer using something as architecture specific as Xen over something as architecture independent as Jacques' VServers? (Granted VServers can't run Windows, but I'm asking this from the angle of people looking for resource isolation in the Linux context.) Among other things, VServers are already in use by many ISPs to provide simultaneous hosting of many "virtual machines" on the same box while maintaining strict separation between machines and still providing a secure environment.

For those who aren't familiar with Jacques' stuff, have a look at this document here: http://www.solucorp.qc.ca/miscprj/s_context.hc?prjstate=1&nodoc=0. The actual concepts implemented in VServers are here: http://www.solucorp.qc.ca/miscprj/s_context.hc?s1=2&s2=0&s3=0&s4=0&full=0&prjstate=1&nodoc=0

Keir replied, "One of the main differences is that we provide resource isolation, so that each virtual machine only gets the resources that its sponsor paid for. This allows companies providing virtual servers to provide differentiated service according to the amount paid."


We Hope You Enjoy Kernel Traffic
 

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License, version 2.0.