Kernel Traffic #214 For 28 Apr 2003

By Zack Brown

Mailing List Stats For This Week

We looked at 1445 posts in 7326K.

There were 384 different contributors. 199 posted more than once. 156 posted last week too.

The top posters of the week were:

1. Static Versus Dynamic Device Numbering

26 Mar 2003 - 11 Apr 2003 (81 posts) Archive Link: "64-bit kdev_t - just for playing"

Topics: FS: devfs, FS: initramfs, FS: ramfs, FS: sysfs, Hot-Plugging, Ioctls

People: Roman Zippel, H. Peter Anvin, David Lang, Werner Almesberger, Kevin P. Fleming, Joel Becker, Alan Cox, Andries Brouwer, Andrew Morton, Linus Torvalds, Matt Aubury

Andries Brouwer posted a patch to increase the size of kdev_t (used for device numbers) from 16 bits to 64 bits. Roman Zippel very insistently asked for details on how the increased number space would be managed; and after a lot of wrangling, it came out that he was concerned that merely increasing the available device numbers would be just another way to continue to allow static device number assignments, while he felt the true solution was to avoid static assignments altogether and concentrate on implementing dynamic device number allocation. At one point he said, "We are sooo close to dynamic device numbers, so I really don't understand why people want to go back all the way back to static numbers. The kernel is mostly ready, what is missing now is the userspace and a driver audit." And added, "This 'new API' is _huge_ step backwards, because it puts the burden of managing static device numbers on the drivers again."

A lot of folks stood against him, including Alan Cox. Some other folks like Joel Becker were just desperate to find a way to give device numbers to large quantities of devices, and would be happy with a 'temporary' fix. But Roman said at one point, "The ones who ask now for a larger dev_t the loudest are likely the first to demand later not change anything for "compatibility", because they hardcoded certain assumptions about dev_t into their applications."

However, the concept of dynamic device numbers has remained controversial. Linus Torvalds has been pushing for them for quite a while, with the result that developers on the stable series have banded together to ignore him on that issue. At one point H. Peter Anvin said to Roman, "I have an idea, why don't you read the archives of this mailing list for the past eight years and learn, once again, why dynamic numbers are broken for nearly all applications (disks and ptys being, perhaps, the few case where they actually work.) This has been hashed and rehashed on this list so many times it's not even funny." Roman replied:

Ok, I checked the archives and found some interesting mails:

http://www.ussg.iu.edu/hypermail/linux/kernel/0105.1/1170.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0105.1/1180.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0105.1/1072.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0105.1/1310.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0105.1/1348.html

This is from the last dev_t discussion I was able to find and my apologies to Linus for dragging him into this, personally I prefer a discussion based on arguments, but you want to feel the whip of authority. (There are also some juicy mails from Al, but you can look for these yourself.)

Linus argues here for dynamic numbers and I was not able to find a single mail, where he completely changed his mind since then. If you know something I don't, I'd be really happy to hear about it (actually I found 9 (nine!) year old mails, where he argues for a more dynamic system).

In above discussion, Alan was one of the few who actually came up with reasonable arguments, some of his concerns were:

Am I now worthy of an answer, so you could please explain "why dynamic numbers are broken for nearly all applications"? What were I supposed to learn from the archives? Maybe you should read them yourself, because I didn't found a single discussion with a clear outcome.

At some point David Lang said:

the biggest problem I see with dynamic numbers is that it needs a userspace devfs type solution for creating and maintaining the device nodes that are then used. While this isn't rocket science it's also something that is hard to get people to agree to (remember the devfs names that everyone gripes about are not what richard started with it's what he switched to to get things into the kernel, they changed many times during that process)

I don't think many people will argue that dynamic assignments are evil, but I think you will find a lot of people very nervous about switching to them and the risk involved with doing so.

Werner Almesberger agreed that a devfs-type solution was the difficult part. He said, "This probably means that the kernel will have to come with a default initrd-like setup that is built and attached by "make bzImage" and the like. I thought that people were quite actively working towards something like this ?" Kevin P. Fleming replied, "Yes, this is being worked on actively. This will use the 2.5.x initramfs infrastructure, and when it's up and going there will be early-userspace tools included in the kernel tarball to do the basic things that need doing (essentially responding to hotplug events and creating/removing devices nodes as needed)."

Elsewhere, H. Peter said that dynamic device numbering was not a bad idea, just very hard to get right. He said:

So far, *none* of the schemes used for dynamics have gotten it right. They just ignore a fair number of the problems. People keep focusing on disks, and they are nearly uniformly the almost-trivial case in comparison with especially character devices, where you don't have the layer of indirection called /etc/fstab, persistent labels, etc.

It is also independent of the need to switch to a larger dev_t. Claiming that we can squeeze more out of the existing device scheme if we have an ideal-world dynamic scheme is unrealistic because:

a) There are, genuinely, systems with more than 65,536 devices or anonymous mounts. That rules out the current dev_t just by itself.

b) Despite the fact that people have tried since the mid-90's, we still don't have a sane way to manage such dynamicity.

c) We are now in what pretty much amounts to a crisis situation. We have needed to enlarge dev_t for well over half a decade. Therefore, it is too late to say "well, given X we wouldn't need it." We need something done in *this* kernel cycle.

Given that it has taken, literally, 8 years to get to this point, and based on collective global experience with numberspaces, I'm arguing for enlarging it far more than anyone can currently imagine being necessary.

dev_t is already 64 bits in glibc, and the glibc<->kernel interface needs to be fixed *anyway*. We have to take the pain of migration, we might as well go all the way.

Andrew Morton confirmed that this was Linus' plan, as far as he knew.
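
For context, the 65,536 ceiling in H. Peter's point (a) follows directly from the 16-bit layout. The sketch below assumes the traditional split of 8 bits for the major number and 8 bits for the minor number; the function names are illustrative only and are not kernel APIs.

```python
# Illustrative model of a 16-bit device number: 8-bit major, 8-bit minor.
# The function names are hypothetical, chosen for this sketch only.

MINOR_BITS = 8
MINOR_MASK = (1 << MINOR_BITS) - 1


def make_dev(major, minor):
    """Pack a major/minor pair into one 16-bit device number."""
    if not (0 <= major <= 0xFF and 0 <= minor <= 0xFF):
        raise ValueError("major and minor must each fit in 8 bits")
    return (major << MINOR_BITS) | minor


def major_of(dev):
    """Extract the major number (upper 8 bits)."""
    return dev >> MINOR_BITS


def minor_of(dev):
    """Extract the minor number (lower 8 bits)."""
    return dev & MINOR_MASK


# The first partition of the first SCSI disk is conventionally major 8, minor 1.
sda1 = make_dev(8, 1)
print(hex(sda1), major_of(sda1), minor_of(sda1))

# Every possible major/minor pair: 256 majors times 256 minors.
total_numbers = (0xFF + 1) * (0xFF + 1)
print(total_numbers)
```

With only 256 minors per major, a driver that manages thousands of disks or ptys has to claim several majors, which is part of why both sides of the thread agreed the type had to grow.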

Roman gave a pointer to a mailing list post (http://marc.theaimsgroup.com/?l=linux-kernel&m=95547434315472&w=2) by Linus from April 2000 in which he approved of a scheme by Matt Aubury. Roman also said to H. Peter, "Maybe you didn't notice, that only now the block device layer is clean enough to go dynamic. Maybe you didn't notice that scsi devices are already dynamically numbered and that there are already user space tools to translate them to constant device names." He also took exception to H. Peter's statement that Roman wanted to "squeeze more out of the existing device scheme". Roman said he agreed that kdev_t had to be made larger, but was concerned about the way this would be done, and what would come out of it. Joel said:

There are a couple things being discussed here. One is the size of dev_t. The other is dynamic numbers. It would seem that most folks agree with a larger dev_t and a more dynamic numbering system. Let's assume we want both for now (folks who don't, please keep out for a second). There are three courses of action that seem to be advocated.

  1. Ship 2.6 with 16bit dev_t, work on a larger dev_t and perfect dynamic devices in 2.7.
  2. Ship 2.6 with a (32|64)bit dev_t, work on a perfect dynamic scheme in 2.7.
  3. Hold 2.6 until it can ship with (32|64)bit dev_t and perfect dynamic devices.

Many folks, Peter and myself included, are claiming that choice (1) is absolutely untenable. We need more device space today, not in 3 years when 2.7 becomes 2.8.

If I understand you correctly (and here is why I mailed), you feel that choice (2) is the worst of the choices. You feel that we should either choose course (1) or course (3). I'm not sure which of those you prefer.

Roman replied:

That misunderstanding is hopefully easy to resolve:

(4) Ship 2.6 with a (32|64)bit dev_t with an experimental dynamic scheme and keep the device numbers below 0x10000 as they are now.
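
Roman's option (4) implies an encoding in which every value below 0x10000 keeps its current meaning. One way that could look is sketched below. The 12-bit major / 20-bit minor split, the bit placement, and the function names are all assumptions made for illustration; none of them was proposed in the thread.

```python
# Hypothetical 32-bit device number that leaves legacy 8:8 values untouched.
# Layout assumed for this sketch:
#   bits  0-7   low 8 bits of the minor
#   bits  8-19  12-bit major
#   bits 20-31  high 12 bits of the minor


def encode_dev(major, minor):
    """Pack a 12-bit major and a 20-bit minor into 32 bits."""
    if not (0 <= major < (1 << 12) and 0 <= minor < (1 << 20)):
        raise ValueError("major must fit in 12 bits, minor in 20 bits")
    return (minor & 0xFF) | (major << 8) | ((minor & ~0xFF) << 12)


def decode_dev(dev):
    """Recover (major, minor) from a value produced by encode_dev."""
    major = (dev & 0xFFF00) >> 8
    minor = (dev & 0xFF) | ((dev >> 12) & 0xFFF00)
    return major, minor


# A pair that fits the old layout encodes to the same number as before.
legacy = encode_dev(8, 1)
print(hex(legacy))

# A pair that needs the wider fields lands at or above 0x10000.
wide = encode_dev(300, 70000)
print(hex(wide), decode_dev(wide))
```

Under this assumed layout, existing device nodes would not need to be recreated, while anything allocated outside the old range is unambiguously new.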

There was no resolution of the debate during the thread.

2. Linux 2.4.21-pre7 Released

4 Apr 2003 - 10 Apr 2003 (41 posts) Archive Link: "Linux 2.4.21-pre7"

People: Marcelo Tosatti

Marcelo Tosatti announced 2.4.21-pre7 (http://www.uwsg.indiana.edu/hypermail/linux/kernel/0304.0/0955.html), saying this was hopefully the last -pre release before 2.4.21.

3. Radeon Framebuffer Code Fork

6 Apr 2003 - 11 Apr 2003 (8 posts) Archive Link: "[PATCH] New radeonfb fork"

Topics: Framebuffer

People: Benjamin Herrenschmidt

Benjamin Herrenschmidt announced:

As I told a while ago, I'm forking radeonfb for now, at least until Ani (current maintainer) either give me maintainership or gets all that stuff in the official version.

I need testers, and I'd appreciate any patches people may have for it as well since I know a bunch of ppl have been spreading various radeonfb patches around, I want to take over all of these and see what is worth getting in. For 2.5, I'm working on a complete rewrite (& split) of the driver.

So far, I already have something to play with that fixes a bunch of issues. Patches against 2.4.20 and 2.4.21-pre7 can be found here: (too big to inline). Note that I also bring in various other pci_ids.h updates but that shouldn't harm you and is easier that way for me ;)

http://penguinppc.org/~benh/radeonfb-040603-2.4.20.diff

http://penguinppc.org/~benh/radeonfb-040603-2.4.21-pre7.diff

NOTE: It's known that radeonfb is incompatible with ATI binary GL drivers (at least it crashes the machine on a friend's r300), I'm investigating.

Daniele Venzano tried this out, and said it was better than before, but still had problems. One problem was that the cursor was only visible at 8-bit depth: at 16- or 32-bit it disappeared. Benjamin said this was a "Known problem with fbdev's in 2.4. I have to find out if that can be fixed easily, though implementing HW cursor would cure it as well..."

4. New flink() System Call Shot Down

6 Apr 2003 - 11 Apr 2003 (51 posts) Archive Link: "[PATCH] new syscall: flink"

Topics: Microkernels: Hurd

People: Ulrich Drepper, Linus Torvalds, H. Peter Anvin

Ulrich Drepper proposed:

I got a couple of requests for a function which isn't supported on Linux so far. Also not supportable, i.e., cannot be emulated at userlevel. It has some history in other systems (QNX I think), though, and helps with some security issues. It's really not adding much new functionality and I hope I got it right with my "monkey see, monkey do" technique of looking up other places doing similar things.

The syscall I mean is

int flink (int fd, const char *newname)

Similar to link(), but the first parameter is a file descriptor. Using the file descriptor helps to avoid races in some situations.
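
To make the race concrete, here is a small demonstration of what can go wrong with the path-based link(). It is only a sketch: the file names are made up, and the "other process" is simulated inline with a rename. The program opens and checks one file, the name is then pointed at a different file, and link() ends up linking the replacement because it resolves the name again.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "report.txt")
    with open(target, "w") as f:
        f.write("checked contents\n")

    # The program opens the file and validates it through the descriptor.
    fd = os.open(target, os.O_RDONLY)
    checked_inode = os.fstat(fd).st_ino

    # Meanwhile, something else swaps a different file in under the same name.
    impostor = os.path.join(tmp, "impostor.txt")
    with open(impostor, "w") as f:
        f.write("something else\n")
    os.rename(impostor, target)

    # link() takes a path, so it looks the name up again and links the impostor.
    alias = os.path.join(tmp, "alias.txt")
    os.link(target, alias)
    linked_inode = os.stat(alias).st_ino
    os.close(fd)

    race_hit = linked_inode != checked_inode
    print("linked a different file than the one opened:", race_hit)
```

A descriptor-based call would link whatever the descriptor refers to, regardless of what the name points at by then, which is the property Ulrich was after.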

A number of folks had criticisms, which Linus Torvalds summed up eventually, with:

As others have pointed out, there is no way in HELL we can do this securely without major other incursions.

In particular, both flink() and funlink() require that you do all the same permission checks that a real link() or unlink() would do. And as some of them are done on the _source_ of the file, that implies that they have to be done at open() time.

One check in particular is "is the opener willing to let this be linked anywhere else in the namespace". Since the opener isn't necessarily the same agent as the one doing the flink().

If you really really think you need this (and not just do it because some random idiot-customer doesn't understand security), then I would suggest you add a O_CANLINK flag to open, and require that that flag is set in the file descriptor.

That way you get "flink()" behaviour, but you require that the opener be aware of the fact that the file may be linked into another position. That will fix the glaring security hole.

H. Peter Anvin pointed out that if Linus' objections were completely true, then there must be security problems already existing in the kernel, that needed to be addressed. And Ulrich also said:

there are two or three ways I can see:

I'm certainly not qualified to say whether this is viable or not. The safelink() idea certainly is implementable, just 3-4 more lines on top of the flink() patch. But this wouldn't be necessary if we'd have the more complete support with the new open() flag(s). Al mentioned to me some problems with network filesystem in the context of flink(). So somebody who understands these issues might want to comment. It seems there is some interest in this.

5. New Kernel Tree For Embedded Linux

7 Apr 2003 - 8 Apr 2003 (15 posts) Archive Link: "[ANNOUNCE] New kernel tree for embedded linux"

People: Joern Engel, Tom Rini

Joern Engel announced:

Some days ago, I've started a -je ("just embedded") tree which will focus on memory reduction for the linux kernel.

The RATIONALE is that on a ppc with some flash, memory, network and nothing much else, I don't feel like parsing MS-DOS partitions, offering IPX networking etc., but that junk is still included in 2.[45].current - unconditionally. And there is more...

My first GOAL is to add config options that rip the code out for any platform that doesn't need, yet keeps it in for everyone that does. If I don't know what the code is needed for, I'll just rip it out and wait for bug reports - hopefully.

If I feel that any particular patch is clean enough for mainline, I'll forward it to Linus/Marcelo.

WHO should use this tree:

Bugreports of any kind will help me to clean up the patches and get them included in mainline. I personally run them on my PIII notebook, right now, and things didn't break. (Yet?)

HOW can you help:

  1. Any patch that reduces the memory footprint on _any_ platform is welcome. Even the worst hacks should be cleaned up over time to work for everyone.
  2. Test the patches and:
  3. Send any other patches and convince me that they help embedded people by my definition (whichever that may be at that time).

WHAT patches will I ignore/reject:

Finally, WHERE can you get it:

http://wh.fh-wedel.de/~joern/software/kernel/je/24/patch-2.4.20-je1

http://wh.fh-wedel.de/~joern/software/kernel/je/25/patch-2.5.66-je1

DISCLAIMER:

No, the server does not support directory browsing, there is no mailing list and there are currently only three patches in the 2.4 tree and two in the 2.5 tree. 2.5 is untested, looks broken and I should put some work in it.

These patches may cost you time, money and precious hardware, I don't guarantee for anything and IANAL. Anything else?

A number of folks offered ideas for how to shrink the kernel; and Tom Rini pointed out that "everyone can benefit from _every change_ you want to make in your tree, and it's not just an 'embedded' issue." Joern replied, "Right. The purpose of this tree is not to keep changes out of mainline, but to test and enhance some of the uglier ones before they go in. In a perfect world, my tree would contain exactly zero patches. :)"

6. Cleaning Out Unused ioctls

10 Apr 2003 (4 posts) Archive Link: "[PATCH] kill two scsi ioctls"

Topics: Disks: SCSI, Ioctls

People: Andries Brouwer, John Levon, Michael Elizabeth Chastain

Andries Brouwer took two ioctls out of the kernel, saying, "The definition for SCSI_IOCTL_BENCHMARK_COMMAND was added in 1.1.2. The definition for SCSI_IOCTL_SYNC was added in 1.1.38. Neither of them has ever been used." John Levon noticed that this left a gap in the ioctl numbering, which might confuse some people. He suggested putting a comment in the code, to explain the jump in numbering. Andries replied:

I prefer a short and clean actual kernel source, and long historical explanations somewhere else, for example in Documentation/ioctl_list.

(Michael Elizabeth Chastain made ioctl_list.2 a man page, but nobody keeps it up-to-date. The current version is from 1.3.27. I am updating it and expect to submit it for the Documentation directory. Maybe more people will update it there.)

John said this would be fine.

7. Framebuffer Updates

10 Apr 2003 - 15 Apr 2003 (12 posts) Archive Link: "[FBDEV updates] Newest framebuffer fixes."

Topics: Framebuffer, Sound: i810, Version Control

People: James Simmons, Ani Joshi

James Simmons said, "Here are the latest framebuffer changes. Some driver updates and a massive cleanup of the cursor code. Tony please test it on the i810 chipset. I tested it on the Riva but there is one bug I can't seem to find. Please test this patch. It is against 2.5.67 BK. It should work against 2.5.67 as well." He gave a link to his patch (http://phoenix.infradead.org/~jsimmons/fbdev.diff.gz).

In the course of discussion, John Weber reported that he'd finally gotten RadeonFB working, but only with a separate driver by Ani Joshi. He asked if this would be included in the main kernel tree at some point. James confirmed that yes it would, at least when some more recent patches became available.

8. Status Of ext2/ext3 Fragment Support

10 Apr 2003 (3 posts) Archive Link: "ext2/3 fragments support"

Topics: Extended Attributes, FS: ext2, FS: ext3

People: Lorenzo Allegrucci, Andreas Dilger

Lorenzo Allegrucci asked, "Fragments support on ext2/3 filesystems seems disabled or non fully functional. Are there any plans to implement fragments?" Andreas Dilger replied:

They have never been enabled. The "goal" is to implement fragments as a type of extended attribute, so that they can be packed into a single block or inline in a larger inode (along with other EA data) instead of being fixed-size hunks.

The first thing that needs doing is fixing the current ext2/3 EA sharing scheme, which currently only shares blocks if they are identical and is therefore only really useful for ACLs.

The best proposal so far for EA sharing is to put them into a directory-like structure (maybe one dir per block group or something) and have the EA type and data be packed inline into the directory (like the inode number and filename are done with regular directories). Each inode would also have a "catalog" of the EAs that it has (itself an EA, either inline in a larger inode or in the directory pointed to by, say, i_faddr). Shared entries would be like hard links pointed to by multiple catalogs, with a refcount.

This was discussed on ext2-devel about a year ago, but no takers on the implementation yet (I might eventually need to implement it this year if nobody beats me to it, because we need better EAs than one per 4kB of disk).
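
The difference between the current block-level sharing and the proposed per-entry sharing can be modeled in a few lines. This is a toy sketch only: the class, its methods, and the attribute names are invented for illustration and say nothing about the on-disk format. Each inode keeps a catalog of entries, and each distinct entry is stored once with a reference count, much like a hard link.

```python
class SharedEAStore:
    """Toy model: one stored copy per distinct (type, data) attribute."""

    def __init__(self):
        self.refcount = {}  # (ea_type, data) -> number of catalogs using it

    def attach(self, catalog, ea_type, data):
        """Add an attribute to an inode's catalog, sharing an existing copy."""
        key = (ea_type, data)
        self.refcount[key] = self.refcount.get(key, 0) + 1
        catalog.append(key)

    def detach(self, catalog, ea_type, data):
        """Drop an attribute from a catalog and free it on the last reference."""
        key = (ea_type, data)
        catalog.remove(key)  # raises ValueError if the inode lacks this EA
        self.refcount[key] -= 1
        if self.refcount[key] == 0:
            del self.refcount[key]


store = SharedEAStore()
inode_a, inode_b = [], []
store.attach(inode_a, "acl", b"user::rw-")
store.attach(inode_b, "acl", b"user::rw-")
store.attach(inode_b, "user.comment", b"scratch")

# The two inodes have different attribute sets, yet the ACL is stored once.
print(store.refcount)
```

With sharing at whole-block granularity, the extra attribute on the second inode would make the two blocks differ and prevent any sharing at all, which is the limitation Andreas describes.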

9. Saving Space On Kernel Messages

10 Apr 2003 - 11 Apr 2003 (6 posts) Archive Link: "Painlessly shrinking kernel messages (Re: kernel support for non-english user messages)"

Topics: Compression

People: Timothy Miller, Alan Cox, David Lang

Timothy Miller had a suggestion on how to save space in the kernel:

To be brief, the idea I came up with was to identify the 128 most common words in kernel messages and replace them with single character values above 127 which printk would decode on the way out. Once the list was determined, there would be a header file people could use, at their leisure, to make substitutions. So, for instance, instead of having this:

printk("invalid: ...");

We would have this:

#define MSG_INVALID "\200"
...
printk(MSG_INVALID "...");

To judge the practicality of this, I used 'strings' on an uncompressed kernel image (2.4.20, IIRC) and then ran it through this:

tr '[:lower:]' '[:upper:]' | tr '[:blank:]' '\n' | sort | uniq -c | tr ' ' 0

This gave me a list of all words found in the kernel along with their counts. Then I ran it through a positively awful little C program which I wrote to determine not the 128 most frequent, but rather, the 128 that would result in the maximum shrinkage (maximize count * (length-1)). The results of that run are given below. The results of the test are that this approach might save up to 62424 bytes of kernel space which is only about 3% of the kernel image size I got the strings from, but it's nearly 27% of the total output I got from 'strings'. Is it worth it? Maybe not yet, but then again, there may be an even more intelligent approach to this compression that we could use, hopefully one which wouldn't require any more effort to use.
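
Timothy's selection rule and the substitution itself can be sketched as follows. This is not his C program, which was not posted in the summary. The function names and the sample messages are made up, and unlike his pipeline the sketch does not fold case.

```python
from collections import Counter


def pick_tokens(messages, limit=128):
    """Choose the words whose replacement by one byte saves the most space."""
    counts = Counter(word for msg in messages for word in msg.split())
    # Replacing a word of length n with a single byte saves n - 1 bytes per use.
    savings = {w: c * (len(w) - 1) for w, c in counts.items() if len(w) > 1}
    ranked = sorted(savings, key=lambda w: (-savings[w], w))
    return ranked[:limit]


def build_tables(tokens):
    """Assign each chosen word one byte value in the range 128-255."""
    encode = {w: bytes([128 + i]) for i, w in enumerate(tokens)}
    decode = {code[0]: w for w, code in encode.items()}
    return encode, decode


def compress(message, encode):
    """Replace known words with their byte code; messages must be ASCII."""
    out = bytearray()
    for i, word in enumerate(message.split(" ")):
        if i:
            out += b" "
        out += encode.get(word, word.encode("ascii"))
    return bytes(out)


def expand(data, decode):
    """What printk would do on the way out: map high bytes back to words."""
    return "".join(decode.get(b, chr(b)) for b in data)


messages = ["invalid argument", "invalid device number", "device not ready"]
tokens = pick_tokens(messages, limit=2)
encode, decode = build_tables(tokens)
packed = [compress(m, encode) for m in messages]
saved = sum(len(m) for m in messages) - sum(len(p) for p in packed)
print(tokens, saved)
```

Ranking by count times (length - 1) instead of by raw frequency is what keeps short, very common words from crowding longer ones out of the 128 available codes.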

Alan Cox replied, "Not a totally crazy idea. You could also do 5pack and some of the other string tricks people have used in time. You also don't need to do word boundaries. For embedded at least this is far from ludicrous as a concept. The tricky piece for all of these is working out how to grab each printk format string and do things to it. That lets you do compression, removal, internationalisation, cataloguing .." Timothy asked for a refresher on what 5pack was, and Alan said, "It's a thing from the old 8bit gaming world. You code in 5bit chunks with a leading length marker. 5bits is enough for a-z and some bits of punctuation, plus capital implying space and 'escape' for an 8bit sequence block. Gets you a bit under 40% compression with real life data and takes about 200 bytes to decode." Timothy tried again with an algorithm that ignored word boundaries, and this time reported, "The results are that the kernel messages are reduced from 232690 to 154365, which is a savings of 33%. Not bad, but it's probably still not worth it yet; the pain is still greater than the benefit." But David Lang said, "this is definitely something that wouldn't make sense to do manually, but if someone can figure out how to do this as part of the build process the 80K saved can't hurt."
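
The core of the 5-bit idea Alan describes can be sketched briefly. This is a simplified illustration and not the historical 5pack format: the 32-symbol alphabet is an assumption, and the "capital implies space" and 8-bit escape mechanisms are left out, so unsupported characters simply raise an error. Five bits per character in place of eight is a 37.5% reduction before the length marker, which is consistent with his "a bit under 40%".

```python
# Assumed 32-symbol alphabet: a-z plus six punctuation characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz .,:-%"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def pack5(text):
    """Encode text as a one-byte length marker followed by 5-bit codes."""
    if len(text) > 255:
        raise ValueError("the length marker is a single byte in this sketch")
    bits = 0
    for ch in text:
        bits = (bits << 5) | INDEX[ch]  # KeyError for unsupported characters
    nbits = 5 * len(text)
    pad = (-nbits) % 8                  # zero bits to fill the last byte
    body = (bits << pad).to_bytes((nbits + pad) // 8, "big")
    return bytes([len(text)]) + body


def unpack5(data):
    """Decode a buffer produced by pack5."""
    count = data[0]
    pad = (len(data) - 1) * 8 - 5 * count
    bits = int.from_bytes(data[1:], "big") >> pad
    chars = []
    for i in range(count):
        shift = 5 * (count - 1 - i)
        chars.append(ALPHABET[(bits >> shift) & 0x1F])
    return "".join(chars)


message = "device not ready: %d"
packed = pack5(message)
print(len(message), len(packed), unpack5(packed))
```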

10. udev Replacement For devfs

10 Apr 2003 - 12 Apr 2003 (111 posts) Archive Link: "[ANNOUNCE] udev 0.1 release"

Topics: Disks: SCSI, FS: devfs, FS: sysfs, Hot-Plugging

People: Greg KH

Greg KH announced:

I'd like to finally announce the previously vapor-ware udev program that I've talked a lot about with a lot of people over the past months. The first, very rough cut is at:

kernel.org/pub/linux/utils/kernel/hotplug/udev-0.1.tar.gz

But what is it? I've included an initial design document below that was originally written by Dan Stekloff, and hacked up a bit by me. But in short, udev is a userspace replacement for devfs. It will create and destroy /dev entries based on the current system configuration. It does this by watching the /sbin/hotplug events on the system, and reading information about these events from sysfs.

Right now the program is only in 1 piece, not the 3 pieces that the design document talks about, but it does work with the default Linux /dev naming scheme that almost everyone uses. It can only work for devices that create a dev file in sysfs, exposing their major/minor number, so this is limited (currently only block and usb-serial devices do this.)

If you want to test this with block devices, you will need the kobject hotplug patches previously posted here for 2.5.67, which are also available at:

kernel.org/pub/linux/kernel/people/gregkh/misc/kobject-hotplug-?-2.5.67.patch

Anyway, this works for me, on my machines, and I am very interested in feedback from everyone about both this concept, and the implementation of this. I've cced a lot of different lists, as they have all expressed interest in this project.

Yes, I know there's still a lot of work to do (serialization, symlinks, hooking hotplug so that others can also use it, etc.) but it's a first step :)

I'd like to thank Dan Stekloff for constantly badgering me about this project and for writing lots of good design documentation, it is greatly appreciated. Also, thanks to Pat Mochel for coming up with sysfs which allows this project to be able to work at all.

There was a lot of interest, and some skepticism on whether Greg's code could scale properly, or be a true improvement over devfs. There was also some confusion because Greg had taken a couple temporary shortcuts to get the thing working. So it wasn't clear at first, whether certain things were genuine bugs or just expedients of the moment. Here is the design document included in Greg's post:

We've got a couple goals for udev:

  1. dynamic replacement for /dev
  2. device naming
  3. API to access info about current system devices

Splitting these goals into separate subsystems:

  1. udev - dynamic replacement for /dev
  2. namedev - device naming
  3. libsysfs - a standard library for accessing device information on the system.

Udev
------

Udev will be responsible for responding to /sbin/hotplug on device events. It will receive the device class information along with device's sysfs directory. Udev will call the name_device function from the naming device subsystem with that information and receive a unique device name in return. Udev will then query sysfs through the libsysfs for specific device information required for creating the /dev node like major and minor number. Once it has the important information, udev will create a /dev entry for the device, add the device to the in memory table of current devices, and send notification of the successful event through a D-BUS message. On a remove call, udev will remove the /dev entry, remove the device from the in memory table, and send notification.

Udev will consist of a command udev - to be called from /sbin/hotplug. It will require the in memory dynamic database/table for keeping track of current system devices, and a library of routines for accessing that database/table. Udev will not care about "how" devices are named, that will be separated into the device naming subsystem. It's presented a common device naming API by the device naming subsystem to use for naming devices.
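
The add/remove flow described in this section can be sketched as below. It is a rough illustration, not udev's code: the function names are invented, the 'dev' attribute is assumed to hold text of the form major:minor, the D-BUS notification is omitted, and the handler returns the /dev operation it would perform so the sketch can run without root.

```python
import os
import tempfile


def read_dev_numbers(sysfs_root, devpath):
    """Read the device's 'dev' attribute, assumed to hold 'major:minor'."""
    attr = os.path.join(sysfs_root, devpath.lstrip("/"), "dev")
    with open(attr) as f:
        major, minor = f.read().strip().split(":")
    return int(major), int(minor)


def handle_hotplug(action, devpath, sysfs_root, name_device, table):
    """Return the /dev operation to perform for one hotplug event."""
    name = name_device(devpath)  # naming is delegated, as in the design
    node = "/dev/" + name
    if action == "add":
        major, minor = read_dev_numbers(sysfs_root, devpath)
        table[name] = (major, minor)  # in-memory table of current devices
        return ("mknod", node, major, minor)
    if action == "remove":
        table.pop(name, None)
        return ("unlink", node)
    raise ValueError("unknown action: " + action)


def kernel_name(devpath):
    """Simplest possible naming policy: the last path component."""
    return os.path.basename(devpath)


# Exercise the flow against a throwaway directory standing in for sysfs.
with tempfile.TemporaryDirectory() as fake_sysfs:
    os.makedirs(os.path.join(fake_sysfs, "block", "sda"))
    with open(os.path.join(fake_sysfs, "block", "sda", "dev"), "w") as f:
        f.write("8:0\n")

    devices = {}
    added = handle_hotplug("add", "/block/sda", fake_sysfs, kernel_name, devices)
    after_add = dict(devices)
    removed = handle_hotplug("remove", "/block/sda", fake_sysfs, kernel_name, devices)

print(added)
print(removed)
```

Passing the naming function in as a parameter mirrors the document's point that udev itself should not care how devices are named.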

namedev
----------

From comments people have made, the device naming part of udev has been pushed into its own "subsystem". The reason is to make this as flexible and pluggable as possible. The device naming subsystem, or namedev, will present a standard interface for udev to call for naming a particular device. Under that interface, system administrators can plug in their own methods for device naming.

We would provide a default naming scheme. The first prototype implementation could simply take the sysfs directory passed in with the device name function, query sysfs for the major and minor numbers, and then look up in a static device name mapping file the name of the device. The static device naming file could look just like devices.txt in the Linux kernel's Documentation directory. Obviously, this isn't a great implementation because eventually we'd like major an minor numbers to be dynamic.

The default naming scheme in the future would have a set of policies to go through in order to determine the name of the device. The device naming subsystem would get the sysfs directory of the to be named device and would use the following information in order to map the device's name:

  1. Label info - like SCSI's UUID
  2. Bus Device Number
  3. Topology on Bus
  4. Kernel Name - DEFAULT

System administrators could use the default naming system or enterprise computing environments could plug in their Universal Unique Identifier (UUID) policies. The idea is to make the device naming as flexible and pluggable as possible.

The device naming subsystem would require accessing sysfs for device information. It will receive the device's sysfs directory in the call from udev and use it to get more information to determine naming. The namedev subsystem will include a standard naming API for udev to use. The default naming scheme will include a set of functions and a static device naming file, which will reside in /etc or /var.
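
The ordered policy list above lends itself to a simple chain: try each policy in turn and fall back to the kernel name. The sketch below assumes the device information has already been collected into a dictionary; the keys and the name formats are hypothetical and are not part of the design document.

```python
# Each policy returns a name, or None to pass the decision along.
# The dictionary keys used here are hypothetical.

def by_label(info):
    return info.get("label")


def by_bus_number(info):
    bus_id = info.get("bus_id")
    return "bus-" + bus_id if bus_id else None


def by_topology(info):
    path = info.get("topology")
    return "path-" + path if path else None


def by_kernel_name(info):
    return info["kernel_name"]  # the default, always present


DEFAULT_POLICIES = [by_label, by_bus_number, by_topology, by_kernel_name]


def name_device(info, policies=DEFAULT_POLICIES):
    """Return the name from the first policy that produces one."""
    for policy in policies:
        name = policy(info)
        if name:
            return name
    raise LookupError("no policy produced a name")


print(name_device({"kernel_name": "sda"}))
print(name_device({"kernel_name": "sda", "label": "backup-disk"}))
```

An administrator who wants a site-specific scheme would, in this model, pass a different list of policies without touching the caller.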

libsysfs
--------

There is a need for a common API to access device information in sysfs. The device naming subsystem and the udev subsystem need to take the sysfs directory path and query device information. Instead of copying code so each one will have to readdir, etc., splitting this logic of sysfs calls into a separate library that will sit atop sysfs makes more sense. Sysfs callbacks aren't standard across devices, so this is another reason for creating a common and standard library interface for querying device information.

11. Compressing RAM Instead Of Swapping

12 Apr 2003 (7 posts) Archive Link: "Page compression in lieu of swap?"

Topics: Compression, Real-Time, Virtual Memory

People: Timothy Miller, Barry K. Nathan, Inaky Perez-Gonzalez, Con Kolivas, Jan Knutar

Timothy Miller had a suggestion for a better way to handle swap:

I did some searching of the kernel archives and the only things related to the forthcoming idea had to do with compressing pages when writing to swap and doing compressed disks. Here's a different idea...

Inspired by my recent experiments in compressing kernel messages, I started to wonder what else might benefit from compression, and the following idea occurred to me:

Given the hideous amount of time required to access a disk, especially when something else wants to access it, could there be a benefit to "swapping" pages by compressing them to somewhere else in memory? If we could achieve, even say, 30% compression on pages, on average, then we could free up RAM without having to do any I/O. This would be the first line of defense against a low-memory situation, finally resorting to actual disk access when that becomes unworkable or for pages which can't be compressed enough for it to help (which has a penalty worse than just writing to disk). And furthermore, if we were to swap first memory containing compressed pages, we can reduce the total amount of I/O for swapping.

This would, of course, suck a lot of CPU, and in the case of a server running many services where the CPU usage is pegged even when there's a lot of swapping, it would be better to just swap as normal. But in any case where swapping is causing an increase in idle time, I would expect a considerable benefit from being able to free up pages by making LRU pages simply take up less space in RAM when they're not being used.
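
The policy Timothy outlines, compress first and fall back to disk for pages that do not shrink enough, can be modeled in user space. This is only a sketch of the decision logic: zlib stands in for whatever compressor a kernel would use, the class and method names are invented, and treating his 30% figure as a hard threshold is an assumption.

```python
import os
import zlib

PAGE_SIZE = 4096
MIN_SAVING = 0.30  # assumed cutoff, borrowed from the 30% figure above


class CompressedStore:
    """Toy model of a compressed in-memory tier in front of swap."""

    def __init__(self):
        self.compressed = {}  # page number -> compressed bytes kept in RAM
        self.to_disk = {}     # page number -> raw page (stands in for swap)

    def evict(self, page_no, page):
        """Keep the page compressed in RAM if it shrinks enough, else swap it."""
        packed = zlib.compress(page)
        if len(packed) <= len(page) * (1 - MIN_SAVING):
            self.compressed[page_no] = packed
            return "compressed"
        self.to_disk[page_no] = page
        return "swapped"

    def fault_in(self, page_no):
        """Bring a page back, decompressing it if it never left RAM."""
        if page_no in self.compressed:
            return zlib.decompress(self.compressed.pop(page_no))
        return self.to_disk.pop(page_no)


store = CompressedStore()
text_page = (b"kernel message " * 300)[:PAGE_SIZE]
random_page = os.urandom(PAGE_SIZE)  # effectively incompressible

text_result = store.evict(1, text_page)
random_result = store.evict(2, random_page)
print(text_result, random_result)
```

The incompressible page illustrates the penalty he mentions: the CPU time spent trying to compress it is wasted before the page goes to disk anyway.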

Barry K. Nathan said, "This has been done before, on (Classic) Mac OS (the program's name was RAM Doubler). It was *far* faster than Apple's swapping implementation, although I don't know how much of that was due to the compression and how much was due to Apple's horrid virtual memory implementation back in the day. It also had some stability problems, but that could have been due to the implementation quality rather than to the overall approach." And Inaky Perez-Gonzalez also said to Timothy, "I tried this sometime ago (2.2.x timeframe) for canning mozilla into an small amount of memory and it was kind of doable - not too complicated, in fact - the only thing is it would reduce the machine to a crawl some times (I guess I did not know how to throttle the swap) - I even got it working with bzip2 -9 [this was a pure exercise]." Elsewhere, Jan Knutar asked if the Compressed Caching Project (http://linuxcompressed.sourceforge.net/) was the same as what Timothy described, and Con Kolivas replied, "Yes it is and works very well. However it isn't smp or preemptible aware yet. I have a patch against -ck* as well, but it isn't popular because of preempt incompatibility." Timothy was very excited to see this, and asked if anyone had tested it out. Jan Knutar replied, "Some benchmarks on the site. Seems to have a negative effect on kernel compiles atleast, on machines with lots of memory... Looks like my 24 meg gateway might just be on the border to benefit from it, unless its 133Mhz overdrive processor (33Mhz isa bus.. wee) makes compression too expensive to be beneficial..."

12. Unwinding vsyscall Code

12 Apr 2003 - 14 Apr 2003 (4 posts) Archive Link: "unwinding for vsyscall code"

People: Ulrich Drepper

Ulrich Drepper said:

Now that the kernel provides code user programs are executing directly (I mean the vsyscall code on x86) it is necessary to add unwind information for that code as well. The unwind information is used not only in C++ code. The new thread code also uses it for the cancellation handling. If we have no such information available we would have to resort to using int $0x80 for all syscalls which are also cancellation points (read, write, ...).

Providing the information from outside the kernel is problematic. First, we would have to recognize when a process starts which code is actually used and install the appropriate unwind table. Second, we would always have to keep libc and kernel in sync. There might be new code sequences in future and once the kernel is changed you'd need a new libc. Not good at all.

Instead the best way I've found is to provide the info in the kernel. Fortunately this is associated with almost no cost. The unwind table is just a block of static data which is copied at system boot time into the vsyscall page just like the normal vsyscall code.

To advertise the existence and location of the unwind table I've added one more AT_* constant. It might in theory be possible to reuse AT_SYSINFO and add a fixed offset but I'd rather not do this. If/When more code is added to the vsyscall page the unwind table gets larger and you want to have the liberty to move it out of the way.

This also brings up an important point: even if more entry points into the vsyscall page are defined, there will always have to be only one unwind table. It is not necessary to add more and more AT_* values to advertise more tables.

The attached patch just adds the AT_* value, makes sure the AT_* value is passed to applications, defines the static data for the unwind blocks (two, one for int80 and the other for sysenter), and finally adds code to copy the data in place. Very simple and unintrusive. The patch is verified to work nicely and unwind now works even when I use vsyscall.

I've added documentation of all the unwind info but it might still be not easy to generate new data if the code sequences are changed or new sequences are added. If you want, you can add a comment somewhere which instructs people to contact me to make the changes.

The patch also includes a bonus: so far all the tables in the sysenter_setup() function will stay behind even if the function gets removed after startup. They are not marked appropriately. I've added __initdata marker, but actually it should be __initrodata if this would exist. Not that it makes much of a difference, the data is gone right away after booting.
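For readers unfamiliar with the AT_* mechanism Ulrich refers to: the kernel hands each new process an auxiliary vector, a list of (type, value) word pairs ending with an AT_NULL entry, and AT_SYSINFO is one such type. The sketch below parses a raw vector. The name and value of the new constant added by the patch are not given above, so it is deliberately left out; the sample bytes and the helper name parse_auxv are made up for illustration.

```python
import struct

AT_NULL = 0      # end-of-vector marker
AT_PAGESZ = 6    # system page size
AT_SYSINFO = 32  # vsyscall entry point, as defined in Linux's <elf.h>


def parse_auxv(raw, word_size=4, byteorder="<"):
    """Parse a raw auxiliary vector into a {type: value} dict.

    Each entry is a pair of machine words (type, value); parsing
    stops at the first AT_NULL entry.
    """
    fmt = byteorder + {4: "II", 8: "QQ"}[word_size]
    entries = {}
    for a_type, a_val in struct.iter_unpack(fmt, raw):
        if a_type == AT_NULL:
            break
        entries[a_type] = a_val
    return entries


# Synthetic 32-bit little-endian vector; the address is invented sample data.
sample = struct.pack("<6I", AT_PAGESZ, 4096, AT_SYSINFO, 0xFFFFE400, AT_NULL, 0)
auxv = parse_auxv(sample)
print(hex(auxv[AT_SYSINFO]))
```

On a Linux system that exposes /proc/self/auxv, the bytes read from that file can be passed to parse_auxv with the word size of the running process. Whether AT_SYSINFO appears there depends on the architecture.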

13. Linux On Aquanta Clusters

14 Apr 2003 (5 posts) Archive Link: "Linux on Unisys Aquanta HR/6 ?"

Topics: SMP

People: Alan Cox, Meelis Roos

Meelis Roos asked if anyone had gotten Linux working on the Unisys Aquanta HR/6 (http://www.unimetrix.com/hr6.html) or any other Aquanta; he had a chance to get a cheap 6-processor PPro SMP machine, and wanted advice. Alan Cox replied, "I can't help thinking a single AMD duron would outrun it. For Linux support the big thing you need to know is if the system is "Intel MP 1.1/1.4 compliant". A lot of the ppro boxes were, but 6 ways can be a bit strange (the ALR 6x6 does work)" Meelis replied, "it has been hinted that ALR 6x6 and this box actually use the same mainboard, co-developed by ALR and Unisys: http://www.newfangled.san-jose.ca.us/ALR Revolution 6x6/alr.html. So it may actually run Linux. The peripherals are supported."

14. Fix For PCMCIA Boot Deadlocks

14 Apr 2003 (3 posts) Archive Link: "[CFT] Hopefully fix PCMCIA boot deadlocks"

Topics: Hot-Plugging, PCI

People: Russell King, Felipe Alfaro Solana, Valdis Kletnieks

Russell King said:

Here's my latest patch against 2.5.67 which introduces a proper state machine into the PCMCIA layer for handling the sockets. Unfortunately, I fear that this isn't the answer for the following reasons:

That said, it seems to work for me.

The patch can be found at

http://patches.arm.linux.org.uk/pcmcia/pcmcia-1.diff

Now, thing is, I can't test this patch on its own; I can test it on ARM boxen with yenta cardbus bridges, or statically mapped PCMCIA-only sockets, but the former requires several other patches to the PCMCIA resource subsystem to be functional.

Hence I need other people's feedback on this patch before I push it Linus-wards.

Felipe Alfaro Solana was very happy with this. He said, "Well, maybe it's not the answer, but it's working for me with 2.5.67-mm3. Besides being too verbose, I have tried booting with the card plugged, booting with the card unplugged and then plugging it, and plugging/unplugging it several times to check that hotplug is working. Haven't found any problems, although I'm testing right now on my main system (my everyday use laptop)." Valdis Kletnieks also confirmed that the patch seemed to be working well, and thanked Russell heartily.

15. What To Expect From 2.5

14 Apr 2003 - 15 Apr 2003 (16 posts) Archive Link: "2.5 'what to expect' document."

Topics: Forward Port, SMP, Software Suspend

People: Dave Jones, Nigel Cunningham, Randy Dunlap, Sam Ravnborg, Michael Buesch

Dave Jones said, "A few people mailed me recently telling me that they'd stumbled upon this doc," [The Post-Halloween Document] "and wished they'd found it a lot sooner, and it's been a while since I last posted it (and naturally, lots of stuff changes) so here's a repost..." He gave a link to the latest version (http://www.codemonkey.org.uk/post-halloween-2.5.txt), updated to at least 2.5.67. Randy Dunlap posted a patch with a lot of little corrections, which Dave applied; and other folks had questions. Michael Buesch quoted from the "Extra Tainting" section of the doc: "Running certain AMD processors in SMP boxes is out of spec, and will taint the kernel with the 'S' flag. Running 2 Athlon XPs for example may seem to work fine, but may also introduce difficult to pin down bugs. In time it's likely this tainting will be extended to cover other out of spec cases." Michael asked if the kernel might one day be tainted if it found the CPU overclocked. Dave replied:

Theoretically possible on most CPUs, but it's not that simple.

Which leaves those that do have the necessary info.. Which is different per vendor, per family, per model. That's a lot of tests, and it's not a walk in the park to get it all right, which is probably why no-one has done it yet.

Alan tried it in the 2.4.early-ac stage, but gave up on it after a while, after getting lots of reports of it not working out as planned..
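As background for the 'S' flag mentioned above: the kernel records taint as a bitmask, with each bit reported as a letter. The sketch below decodes the lowest bits. The 'S' flag comes from the quoted document; the 'P' and 'F' entries and the exact bit positions are assumptions drawn from general kernel documentation rather than from this discussion, and the function name decode_taint is hypothetical.

```python
# (bit value, letter, meaning); bit positions are an assumption to be
# checked against the kernel's own taint documentation.
TAINT_FLAGS = [
    (1 << 0, "P", "proprietary module loaded"),
    (1 << 1, "F", "module load was forced"),
    (1 << 2, "S", "SMP with CPUs not designed for SMP"),
]


def decode_taint(mask):
    """Return the letters of the known taint flags set in mask."""
    return "".join(letter for bit, letter, _ in TAINT_FLAGS if mask & bit)


print(decode_taint(4))  # only the out-of-spec SMP bit set
```

On a running system the mask can usually be read from /proc/sys/kernel/tainted. An overclocking check of the kind Michael asked about would need its own bit, which this sketch does not define.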

Elsewhere, Nigel Cunningham quoted the section on power management: "software suspend is still in development, and in need of more work. It is unlikely to work as expected currently." Nigel suggested, "If you wish, your comment could reflect the fact that a more advanced version is available under 2.4 and is being actively maintained and enhanced. It has been ported to 2.5 and is being kept in sync with a view to inclusion in 2.5 or 2.6. It may not be in the kernel tree by the time 2.6 is released, but I will do my utmost to ensure the patches are maintained, since I plan on using it! :>" But Dave replied, "The document is for people wanting to try out 2.5, not 2.4+addons. If the swsusp stuff does get forward ported to 2.5, I'll document it when it arrives there." Nigel reiterated that a 2.5 forward-port was underway; and Randy Dunlap added, "Once it's in 2.4, it can (should) be listed in Dave's doc as a regression if it's not also in 2.5/2.6."

Elsewhere, Sam Ravnborg also had several comments for Dave. For the kernel build section, he pointed out that 'make gconfig' was a GTK-based alternative to 'make xconfig'. He also mentioned that "'make help' provides a list of typical targets, including debugging targets such as allnoconfig etc." He also said that Dave could use stronger wording against 'make dep'. In his doc, Dave had said that 'make dep' was no longer necessary, but Sam amended, "'make dep' is actually deprecated and for no use since one or two months ago."

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.