Kernel Traffic #118 For 14 May 2001

By Zack Brown

linux-kernel FAQ (http://www.tux.org/lkml/) | subscribe to linux-kernel (http://www.tux.org/lkml/#s3-1) | linux-kernel Archives (http://www.uwsg.indiana.edu/hypermail/linux/kernel/index.html) | kernelnotes.org (http://www.kernelnotes.org/) | LxR Kernel Source Browser (http://lxr.linux.no/) | All Kernels (http://www.memalpha.cx/Linux/Kernel/) | Kernel Ports (http://perso.wanadoo.es/xose/linux/linux_ports.html) | Kernel Docs (http://jungla.dit.upm.es/~jmseyas/linux/kernel/hackers-docs.html) | Gary's Encyclopedia: Linux Kernel (http://members.aa.net/~swear/pedia/kernel.html) | #kernelnewbies (http://kernelnewbies.org/)

Introduction

This issue is dedicated to my sister Laetitia, whose birthday is today. Happy birthday! She and her husband Dave recently had their first child, Vincent (../ktimages/vince7.jpg). Yay!!!

Mailing List Stats For This Week

We looked at 1239 posts in 4875K.

There were 445 different contributors. 211 posted more than once. 148 posted last week too.

The top posters of the week were:

1. Response To Shutdown Events

4 Apr 2001 - 27 Apr 2001 (79 posts) Archive Link: "Let init know user wants to shutdown"

People: Andrew Grover, Pavel Machek, John Fremlin, Miquel van Smoorenburg, Mike Castle

Pavel Machek posted a patch to let init know that the power button had been pressed, so it could shut down gracefully. Andrew Grover objected, "This is not correct, because we want the power button to be configurable. The user should be able to redefine the power button's action, perhaps to only sleep the system. We currently surface button events to acpid, which then can do the right thing, including a shutdown -h now (which I assume notifies init)." Pavel replied, "There's no problem with configurability -- you can configure init as well. I saw it pretty much analogic to situation with Ctrl-Alt-Del: it also sends signal to init. Init then decides what to do. [I believe requiring acpid for such easy stuff is not neccessary...]" But John Fremlin pointed out, "Using a signal to hit init with is a bit dubious because most signals are hooked up for something else already. For example, SIGTERM sent to my init (http://john.snoop.dk/programs/linux/jinit) would shutdown and start sulogin, which is probably not what you want when you press the off button." Elsewhere, Miquel van Smoorenburg added, "SIGTERM is a bad choise. Right now, init ignores SIGTERM. For good reason; on some (many?) systems, the shutdown scripts include "kill -15 -1; sleep 2; kill -9 -1". The "-1" means "all processes except me". That means init will get hit with SIGTERM occasionally during shutdown, and that might cause weird things to happen."

Mike Castle suggested that perhaps there should be a special mechanism to allow user-space to communicate with the kernel, in order to handle all these situations. Pavel replied, "init _is_ the tool which is right for defining policy on such issues. Take a look how UPS managment is handled."

There followed a long, meandering discussion on the ins and outs of various policies and configurations; but no conclusive decision came out of it.
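The signal behaviour discussed in this thread can be sketched in a few lines of C. This is only an illustration under stated assumptions, not code from any real init: POWER_BUTTON_SIG is a hypothetical stand-in (SIGUSR1 here) for whatever signal a kernel might deliver for the power button, and install_init_like_handlers() and last_event are names invented for the example. It ignores SIGTERM, as Miquel describes, so that a "kill -15 -1" during shutdown has no effect, and it merely records SIGINT (the Ctrl-Alt-Del case Pavel mentions) and the hypothetical power button signal, leaving the policy decision to the main loop.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>

/* Hypothetical: the signal a kernel might send to init for the power button. */
#define POWER_BUTTON_SIG SIGUSR1

/* Last event signal seen; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t last_event = 0;

/* Handler only records the event; policy is decided later, outside the handler. */
static void on_event(int signo)
{
    last_event = signo;
}

/*
 * Install init-like dispositions: SIGTERM is ignored, while SIGINT and the
 * hypothetical power button signal are recorded in last_event.
 * Returns 0 on success, -1 if any sigaction() call fails.
 */
static int install_init_like_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);

    sa.sa_handler = SIG_IGN;
    if (sigaction(SIGTERM, &sa, NULL) != 0)
        return -1;

    sa.sa_handler = on_event;
    if (sigaction(SIGINT, &sa, NULL) != 0)
        return -1;
    if (sigaction(POWER_BUTTON_SIG, &sa, NULL) != 0)
        return -1;
    return 0;
}
```

A main loop would call install_init_like_handlers() once, then inspect last_event to decide whether to shut down, sleep, or do nothing, which is the configurability both Andrew and Pavel were asking for.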

2. 2.4.4 Sluggish Under Fork Load

28 Apr 2001 - 3 May 2001 (21 posts) Archive Link: "2.4.4 sluggish under fork load"

People: Peter Osterlund, Linus Torvalds, Mohammad A. Haque, J.A. Magallon

Peter Osterlund reported that under fork load, 2.4.4 seemed less responsive than 2.4.3. He said, "For example, when running the gcc configure script, the X mouse pointer is very jerky. The configure script itself runs approximately as fast as in 2.4.3. Another thing is that the bash loop "while true ; do /bin/true ; done" is not possible to interrupt with ctrl-c. A third thing I noticed is that starting a gnome session in redhat 7.0 takes longer. (It takes more time for the gnome splash screen to appear.)" He attributed this behavior to the recent patch that makes a child run before its parent after a fork, and posted a patch to revert that feature. J.A. Magallon could not confirm the problem, but Mohammad A. Haque said he had seen identical behavior. Rene Puls reported the same, and found that Peter's patch fixed it.

At one point, Linus Torvalds said, "The new "run the child first" approach has advantages, but it is entirely possible that the advantages unfairly prioritize things that do a lot of forking." He and several others pointed out that the uninterruptible while loop Peter had described was just a bash bug, but he agreed that there was definitely a problem. He said, "Reverting it outright may be an acceptable approach. I'll think about it: the arguments _for_ the patch are true and real, and it shows up as real improvements on some things.. An alternative approach might be to not give the child the _whole_ timeslice, but give it more than half. Partition it out 66% - 33% or something."
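The partitioning Linus floats can be illustrated with a little arithmetic. The following is a minimal sketch, not the scheduler's actual code: split_timeslice() and struct slice_split are names invented here, and the timeslice is modelled simply as a count of remaining ticks. The child receives the requested percentage, rounded down, and the parent keeps whatever is left, so no ticks are created or lost at fork time.

```c
#include <assert.h>

/* Result of dividing a parent's remaining timeslice at fork time. */
struct slice_split {
    unsigned int child;
    unsigned int parent;
};

/*
 * Give the child child_pct percent of the remaining ticks (rounded down);
 * the parent keeps the rest, so child + parent always equals ticks.
 */
static struct slice_split split_timeslice(unsigned int ticks,
                                          unsigned int child_pct)
{
    struct slice_split s;

    assert(child_pct <= 100);
    /* Widen before multiplying so large tick counts cannot overflow. */
    s.child = (unsigned int)(((unsigned long long)ticks * child_pct) / 100);
    s.parent = ticks - s.child;
    return s;
}
```

With 12 ticks remaining and child_pct set to 66, the child gets 7 ticks and the parent 5. Setting child_pct to 100 corresponds to handing the child the whole timeslice, and 50 to an even split.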

3. ISO9660 Endianness Cleanup

30 Apr 2001 - 3 May 2001 (18 posts) Archive Link: "iso9660 endianness cleanup patch"

People: H. Peter Anvin, Martin Dalecki, Pavel Machek

H. Peter Anvin posted a patch, and explained, "I was looking over the iso9660 code, and noticed that it was doing endianness conversion via ad hoc *functions*, not even inlines; nor did it take any advantage of the fact that iso9660 is bi-endian (has "all" data in both bigendian and littleendian format.) The attached patch fixes both. It is against 2.4.4, but from the looks of it it should patch against -ac as well." Martin Dalecki sounded a word of warning, "Please beware: There is a can of worms you are openning up here, since there are many broken CD producer programms out there, which only provide the little endian data and incorrect big endian entries. I had some CD's of this form myself. So the endian neutrality of the iso9660 is only in the theory present..." Pavel Machek remarked with a smirk, "It might be funny to *deliberately* create different filesystems; one on little endian side and one on big endian side. That way windows users would see "macs suck" and mac users "PCs suck", and that with just one cd ;-)."
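To make the bi-endian layout concrete, here is a small sketch of reading one 32-bit "both byte orders" field, which iso9660 stores as four little-endian bytes followed by the same value in four big-endian bytes. This is not H. Peter's patch; iso_read_both32() and its helpers are names invented for illustration. In light of Martin's warning about discs with bad big-endian entries, the sketch assumes the little-endian half is the one to trust and only reports whether the two halves agree.

```c
#include <stdint.h>
#include <stddef.h>

/* Assemble a 32-bit value from four little-endian bytes. */
static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Assemble a 32-bit value from four big-endian bytes. */
static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Read an 8-byte both-endian field: little-endian copy first, big-endian
 * copy second. Returns the little-endian copy (an assumption of this
 * sketch, following the warning about broken big-endian data) and, if
 * halves_agree is non-NULL, stores 1 there when both copies match.
 */
static uint32_t iso_read_both32(const uint8_t *field, int *halves_agree)
{
    uint32_t le = get_le32(field);
    uint32_t be = get_be32(field + 4);

    if (halves_agree != NULL)
        *halves_agree = (le == be);
    return le;
}
```

Because the helpers build the value byte by byte, the same code works unchanged on little-endian and big-endian hosts, with no ad hoc conversion step.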

4. Maximum Number Of Directories In A Directory

1 May 2001 - 5 May 2001 (14 posts) Archive Link: "Maximum files per Directory"

Topics: FS: ReiserFS, FS: ext2

People: Andreas Dilger, Chris Mason, H. Peter Anvin, Daniel Phillips, Ingo Oeser

Andreas Rogge wanted to create 100,000 mailboxes in a single directory, only to find that after 32768 of them the directory refused to accept any new subdirectories. H. Peter Anvin replied that he believed 2^15 directories in a single directory was an ext2 limitation. Andreas Dilger replied:

This is imposed by a number of issues:

For stat (old interface) the st_nlinks count is also an unsigned short, so we _should_ be able to increase EXT2_LINK_MAX to 65500 or so safely. The VFS will have problems if you increase the max link count over 65535 because __kernel_nlink_t is __u16.

I see that reiserfs plays some tricks with the directory i_nlink count. If you exceed 64536 links in a directory, it reverts to "1" and no longer tracks the link count.

You will have problems with performance for directories this large on stock ext2, unless you use Daniel Phillips' indexed directory patch. I have tested 100k+ _files_ in a single directory without problems (Daniel has tested 1M _files_ without problems). I would NOT reccommend doing this on your production mail server at this time, but it may be worth testing at least... It does not (yet) address the issue of lots of subdirectories, but that is something that can be worked on at least.

http://kernelnewbies.org/~phillips/htree/

Chris Mason confirmed Andreas D.'s observations on reiserfs, "The link count isn't used at all when deciding if the directory is empty (we use the size instead), so we can just lie to VFS if someone tries to make tons of subdirs." Andreas D. replied, "For that matter, ext2 doesn't use the link count on directories to determine if they are empty either, so it shouldn't be too hard to do the same with the ext2 indexed-directory code."
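The link-count trick described for reiserfs can be modelled in a few lines. This is a toy sketch, not filesystem code: struct sketch_dir, its functions, and SKETCH_LINK_MAX are invented for the example, and the threshold is simply the 64536 figure from Andreas D.'s message. A directory starts with a link count of 2; once the count would pass the threshold it is pinned at 1 and no longer tracked, while emptiness is decided from the entry count, standing in for the directory size Chris mentions.

```c
/* Threshold taken from the figure quoted in the thread; an assumption here. */
#define SKETCH_LINK_MAX 64536u

/* Toy directory: a reported link count plus the real number of entries. */
struct sketch_dir {
    unsigned int nlink;     /* what would be reported to the VFS */
    unsigned long entries;  /* stands in for the on-disk directory size */
};

/* A fresh directory has two links: its own "." and its parent's entry. */
static void sketch_dir_init(struct sketch_dir *d)
{
    d->nlink = 2;
    d->entries = 0;
}

/* Add a subdirectory; past the threshold the link count is pinned at 1. */
static void sketch_dir_add_subdir(struct sketch_dir *d)
{
    d->entries++;
    if (d->nlink == 1)
        return;                 /* already gave up tracking */
    if (d->nlink >= SKETCH_LINK_MAX)
        d->nlink = 1;           /* "lie to VFS" from here on */
    else
        d->nlink++;
}

/* Emptiness never consults nlink, so pinning it is harmless. */
static int sketch_dir_is_empty(const struct sketch_dir *d)
{
    return d->entries == 0;
}
```

The point of the model is that nothing in sketch_dir_is_empty() depends on nlink, which is why both Chris and Andreas D. consider it safe to stop tracking the count for very large directories.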

Ingo Oeser suggested that any application trying to create so many directories was completely broken, but H. Peter said, "The application is using the VFS the way it is advertised to work. If you think doing ls on an extrememly large directory is painful, you have never seen the droppings of an application which tries to do load-balancing between directories by doing real hashing. THAT is painful! At least in the first case you can use grep."

5. Designing API To Enforce Good Coding Practices

2 May 2001 - 4 May 2001 (21 posts) Archive Link: "unsigned long ioremap()?"

Topics: PCI

People: Geert Uytterhoeven, Jonathan Lundell, David S. Miller

Geert Uytterhoeven suggested:

Since you're not allowed to use direct memory dereferencing on ioremapped areas, wouldn't it be more logical to let ioremap() return an unsigned long instead of a void *?

Of course we then have to change readb() and friends to take a long as well, but at least we'd get compiler warnings when someone tries to do a direct dereference.

Jonathan Lundell replied:

Better yet, seems to me, its own type. Say: typedef unsigned long io_ref_t;

It's already done for dma_addr_t, and this seems like an analogous case.

The bigger job would be to fix all the direct dereferences (a worthwhile thing, I guess; a quick scan shows at least a few), as well as to fix uncast assignments of ioremap(). Or ideally to get rid of the casts (most that I see are casts to unsigned long) and type the receiving buffer appropriately.

It'd be a big job. And Linus further suggests that ioremap's first argument is an architecture-specific object, not necessarily either a physical CPU address or a PCI address (though it's typically both in many (most?) i386 implementations). Now *there'd* be a cleanup.

There was some discussion of various possibilities, and at one point David S. Miller remarked, "I suppose the point is that there is a fine line wrt. using APIs to influence people to "do the right thing", and this has been exemplified in several threads I've been involved in wrt. PCI dma and other topics. :-)"
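Geert's and Jonathan's idea can be tried out in user space. The sketch below is a simulation under stated assumptions, not kernel code: fake_device is an ordinary array standing in for device memory, and sketch_ioremap(), sketch_readb() and sketch_writeb() are invented stand-ins for the real interfaces. Only the io_ref_t typedef comes from Jonathan's message. Because io_ref_t is an integer type, writing "*ref" is a compile error, so every access has to go through an accessor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Jonathan's suggested typedef: an integer handle, not a pointer. */
typedef unsigned long io_ref_t;

/* Hypothetical stand-in for a device's memory-mapped registers. */
static uint8_t fake_device[16];

/* Return a handle for the byte at offset; "*handle" would not compile. */
static io_ref_t sketch_ioremap(size_t offset)
{
    /* Assumes pointers fit in unsigned long, as they do on Linux. */
    assert(sizeof(unsigned long) >= sizeof(void *));
    assert(offset < sizeof(fake_device));
    return (io_ref_t)(uintptr_t)&fake_device[offset];
}

/* Read one byte through the handle; the cast lives only in the accessor. */
static uint8_t sketch_readb(io_ref_t ref)
{
    return *(volatile uint8_t *)(uintptr_t)ref;
}

/* Write one byte through the handle. */
static void sketch_writeb(uint8_t value, io_ref_t ref)
{
    *(volatile uint8_t *)(uintptr_t)ref = value;
}
```

A driver written against these accessors never holds a dereferenceable pointer, so the compiler catches the direct dereferences that Jonathan's quick scan turned up. A plain typedef still allows arithmetic on the handle and silent mixing with other integers, which may be part of the fine line David alludes to.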

6. Disabling The PC Speaker

4 May 2001 (6 posts) Archive Link: "added a new feature: disable pc speaker"

People: Oystein Viggen, Simon Richter, Keith Owens

Nico Schottelius posted a patch to create a compile-time option to disable the speaker on the PC. Simon Richter said it would be nice if this were configurable at runtime via something like sysctl. Keith Owens said the entire problem was user-space, since setterm and xset could both disable the speaker. But Oystein Viggen replied, "Well, some buggy programs don't care about you turning off beeping in X. I think gnome-terminal or such has its own checkbox for turning beeps on or off." Nico said that he'd also first thought the problem was user-space only. But he also agreed with Simon, that a runtime option would be best of all. He asked where to find sysctl documentation, but no one gave any links.

7. Hot-Swapping CPUs And RAM

5 May 2001 (20 posts) Archive Link: "[PATCH] CPU hot swap for 2.4.3 + s390 support"

Topics: Clustering: Mosix, Real-Time, Samba, Virtual Memory

People: Anton Blanchard, Dwayne C. Litzenberger, Chris Wedgwood, Rik van Riel, David Woodhouse, Bruce Harada, Peter Rival, Jakob Ostergaard

Anton Blanchard announced:

You can find a new version of the hot swap cpu patch at:

http://samba.org/~anton/patches/cpu_hotswap-2.4.3-patch

The version for s390 (you need to first apply the 2.4.3 kernel patch available on the IBM s390 Linux website) is at:

http://samba.org/~anton/patches/cpu_hotswap-2.4.3-patch-s390

Many thanks to Heiko Carstens <Heiko.Carstens@de.ibm.com> for adding s390 support and fixing a few bugs in the initial implementation. You should be able to attach and detach CPUs depending on workload in your s390 Linux guest images :)

One of the advantages of this patch is that it removes cpu_logical_map() and cpu_number_map() which people had a tendency to get wrong.

It should also be easy to support more than BITS_PER_LONG cpus as there is no concept of online_cpu_map any more.

Dwayne C. Litzenberger started salivating onto the floor, and asked, "How far away is the capability to "teleport" processes from one machine to another over the network? Think of the uptime!" A couple of people pointed him to MOSIX (http://www.mosix.org), but Jakob Ostergaard replied that process migration under MOSIX would not actually give long uptimes. He pointed out that processes on MOSIX clusters were always tied to their home node, and would die if the home node died, no matter how far they had migrated. Bruce Harada gave a pointer to Heterogeneous Process Migration: The Tui System (http://citeseer.nj.nec.com/299905.html).

There was no reply to that, but elsewhere, Peter Rival asked when hot-swap or hot-add RAM would be supported, and Chris Wedgwood replied, "Adding memory probably isn't going to be too hard... but taking existing memory off line is tricky. You have to find some way of finding all the pages that are in use and then dealing with them appropriately, and when some are locked or contain kernel data this would be extremely difficult I should think." Later he added, "It's hard with current memory allocation and management paradigms, if we wanted to abstract things more and make (break) certain rules, I'm sure it can me made to work -- the only thing is, we would loose _MUCH_ speed and efficiency (and waste much more space), so much so I doubt anyone would serious want to know about it. We would also have to violate certain assumptions of RT applications."

At one point, Rik van Riel said that hot-remove RAM wouldn't really be so difficult, because:

1. the kernel uses virtual memory itself and accesses its data structures through page tables

2. reverse mapping stuff is easy (though it costs 8 bytes of overhead per mapped pte, probably doubling page table overhead)

This only leaves two issues, the first is device drivers and the second is the question whether we'd want the overhead needed to implement the (fairly easy) memory relocation.

Chris asked how Rik would handle relocating pages which were mlocked, without violating RT constraints. Rik replied, "Fuck RT constraints. Linux doesn't have infinitely small scheduling latencies, it's easy to copy a page without increasing the scheduling latencies much." David Woodhouse gave his take:

You have to copy the page, then map it into the same virtual address (be that userspace or kernelspace) as the old one. Mark the page readonly when you start to copy it, and have a fault handler which immediately marks it writable and returns. If the source is writable by the time you've finished the copy, repeat.

If you have to repeat yourself more than $n times, you're probably experiencing livelock. At that point, do what Rik said - to hell with the RT constraints, disable interrupts and do the copy. At least your cache is warm :)
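David's copy-and-retry scheme can be sketched as a user-space simulation. Everything here is invented for illustration: struct sketch_page models a page with a "writable" flag that the fault handler would set, the writer_hook callback simulates a writer touching the source while the copy is in progress, and relocate_page() is not a kernel function. The fallback branch stands in for "disable interrupts and do the copy"; in this model nothing can write during it.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Toy page: contents plus the flag a write-fault handler would set. */
struct sketch_page {
    unsigned char data[SKETCH_PAGE_SIZE];
    int writable;
};

/* Simulated concurrent writer, called once per copy attempt. */
typedef void (*writer_hook)(struct sketch_page *src, int attempt);

/* Example writer that dirties the source during the first two attempts. */
static void writer_twice(struct sketch_page *src, int attempt)
{
    if (attempt <= 2) {
        src->data[0]++;
        src->writable = 1;  /* what the fault handler would do */
    }
}

/* Example writer that dirties the source on every attempt (livelock). */
static void writer_always(struct sketch_page *src, int attempt)
{
    (void)attempt;
    src->data[0]++;
    src->writable = 1;
}

/*
 * Copy src to dst, retrying while a writer interferes. Returns the number
 * of the attempt that succeeded, or -1 if max_tries attempts all failed
 * and the exclusive fallback copy was used instead.
 */
static int relocate_page(struct sketch_page *src, struct sketch_page *dst,
                         int max_tries, writer_hook hook)
{
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        src->writable = 0;                       /* mark read-only */
        memcpy(dst->data, src->data, SKETCH_PAGE_SIZE);
        if (hook != NULL)
            hook(src, attempt);                  /* writer may fault here */
        if (!src->writable)
            return attempt;                      /* copy is consistent */
    }
    /* Probable livelock: copy with all writers excluded. */
    src->writable = 0;
    memcpy(dst->data, src->data, SKETCH_PAGE_SIZE);
    return -1;
}
```

With writer_twice and a limit of five tries, the third attempt succeeds; with writer_always the loop gives up after the limit and takes the fallback path. In both cases the destination matches the source when relocate_page() returns.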

8. KDB Wishlist

8 May 2001 (10 posts) Archive Link: "kdb wishlist"

Topics: Big Memory Support, FS: sysfs

People: Keith Owens, Tigran Aivazian, Manfred Spraul, George Anzinger, Juan J. Quintela, Vamsi Krishna S.

Keith Owens requested:

This is part of my kdb wishlist, does anybody fancy writing the code to add any of these features? It would be a nice project for anybody wanting to start on the kernel. Replies to kdb@oss.sgi.com please. Current patches at http://oss.sgi.com/projects/kdb/download/

  1. Change kdb invocation key from ^A to ^X^X^X within 3 seconds. ^A is used by emacs, bash, minicom etc.
  2. Command history. Handle up/down/left/right/delete keys. Each kdba_io routine is responsible for recognising the arch specific keys, with a common history and editting routine.
  3. Clean up repeating commands. Pressing enter at the kdb prompt repeats the previous command, no matter what the previous command was. Some commands it makes no sense to repeat (bp in particular), for other commands you want to repeat the command but without the parameter (md in particular).
  4. Embed width and count options in md and mm commands. Some hardware requires that accesses be a specific width, this can be achieved by setting BYTESPERWORD but it is awkward. We want md1 to read one byte, md2, md4, md8 commands. All can have a count field, e.g. md1c8 reads 8 bytes one at a time. mm1, mm2, mm4, mm8 to set memory no count field.
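As an illustration of item 4, here is one way the proposed md command names could be parsed. This is a sketch of the naming scheme only, not kdb code: parse_md() and struct md_cmd are invented names, and treating a missing count as 1 is an assumption of this example rather than something stated in the wishlist.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parsed form of a command such as "md1c8". */
struct md_cmd {
    int width;            /* bytes per access: 1, 2, 4 or 8 */
    unsigned long count;  /* number of accesses */
};

/*
 * Parse "md<width>" optionally followed by "c<count>".
 * Returns 0 and fills *out on success, -1 on any malformed input.
 * A missing count defaults to 1 (an assumption of this sketch).
 */
static int parse_md(const char *s, struct md_cmd *out)
{
    unsigned long count = 1;
    char *end;
    int width;

    if (strncmp(s, "md", 2) != 0)
        return -1;
    s += 2;

    if (*s != '1' && *s != '2' && *s != '4' && *s != '8')
        return -1;
    width = *s - '0';
    s++;

    if (*s == 'c') {
        s++;
        if (!isdigit((unsigned char)*s))
            return -1;
        count = strtoul(s, &end, 10);
        if (*end != '\0' || count == 0)
            return -1;
    } else if (*s != '\0') {
        return -1;
    }

    out->width = width;
    out->count = count;
    return 0;
}
```

Under this scheme "md1c8" parses to width 1 and count 8, matching Keith's example of reading 8 bytes one at a time, while names such as "md3" or "md2c" are rejected.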

Vamsi Krishna S. volunteered to work on item 4. Tigran Aivazian suggested adding to the wishlist:

make it possible (it is trivial but a pain to have to do it manually every time I upgrade to your latest version!) for those extra "modules" to be statically linked in. So that one doesn't have to keep these lines in the rc.local

if [ -f /proc/sys/kernel/kdb ]
then
    insmod kdbm_pg > /dev/null 2>&1
    insmod kdbm_vm > /dev/null 2>&1
fi

and then discover that the modules are from the compilation corresponding to a different tweak in page.h or highmem or whatever (let him who readeth understand ;)

Long time ago I suggested removing the infrastructure for these "modules" completely (justification being -- it is not useless _only_ in a very exotic case of the need to teach kdb new features on a running kernel without permission to reboot) but you objected and that is fine, but at least making it optionally possible would be _very nice_, please.

There was no reply to this, but Manfred Spraul also suggested, "'ss' and especially 'ssb' could print the new value of the overwritten register/memory address in each line, perhaps both the old and new value." Keith exhorted people to actually code up the existing wishes before adding new ones.

There was no reply to this, but elsewhere, George Anzinger replied to item 1 of Keith's original list. He pointed out, "^X^X swaps point and mark in emacs. One (well, I) often will do ^X^X^X^X to examine where mark is and then return to point." Someone suggested using the break condition instead, and Juan J. Quintela replied, "kdb uses BREAK in the serial port (that minicom uses C-a for sending a break is an anecdote :) But the problem at hang is the console. I vote for the ^X^X^X as I a think that it is not a difficult shortcut. (and yes, I also use emacs and ^X^X all the time, but I think that this combination is not specially bad, and I suppose that the pet aplication of other people will have problems with something like: ^A^A^A that I never use)."

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.