Kernel Traffic #228 For 17 Aug 2003

By Zack Brown

Mailing List Stats For This Week

We looked at 1990 posts in 9347K.

There were 539 different contributors. 279 posted more than once. 218 posted last week too.

 

1. Kernel 2.6 Size Increase Troubling For Embedded Developers
23 Jul 2003 - 8 Aug 2003 (77 posts) Archive Link: "Kernel 2.6 size increase"
Topics: Disks: IDE, FS: ROMFS, FS: ramfs, FS: smbfs, FS: sysfs, Kexec, Real-Time: RTAI, Small Systems
People: Bernardo Innocenti, Ihar Philips Filipau, Mike Fedyk, Willy Tarreau, Nicolas Pitre, Miles Bader, David Woodhouse, Tom Rini, Randy Dunlap, David S. Miller, Christoph Hellwig, Bill Davidsen, Francois Romieu, Richard B. Johnson, Karim Yaghmour

Bernardo Innocenti was concerned about the increase in size between 2.4 and 2.6-test, especially for the impact this would have on embedded systems. He broke the kernel down into subsystems and analyzed the size increase of each. For networking, it appeared that the bulk of the increase was taken up by xfrm, an IPsec user configuration interface. He suggested that the module be made optional, and a patch to do this was finally accepted on August 9th. Bernardo also noted that among drivers, the block drivers had generally become smaller, while character drivers had grown. He also said that among filesystems, "almost all modules have got 30-40% bigger, therefore bloat is probably caused by inlines and macros getting more complex."
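
Bernardo's per-subsystem comparison boils down to simple arithmetic on two size tables. A minimal Python sketch of that bookkeeping follows; the function name `growth_report` and any byte counts fed to it here are hypothetical placeholders, not figures from the thread.

```python
def growth_report(old_sizes, new_sizes):
    """Return (name, old, new, percent_growth) rows, largest absolute growth first.

    Both arguments map a subsystem name to its size in bytes. Only names
    present in both tables are compared.
    """
    rows = []
    for name in sorted(set(old_sizes) & set(new_sizes)):
        old, new = old_sizes[name], new_sizes[name]
        if old == 0:
            # A percentage relative to zero is undefined, so skip the entry.
            continue
        percent = (new - old) * 100.0 / old
        rows.append((name, old, new, percent))
    # Put the subsystems that gained the most bytes at the top.
    rows.sort(key=lambda row: row[2] - row[1], reverse=True)
    return rows
```

With made-up inputs such as `growth_report({"net": 1000, "fs": 2000}, {"net": 1500, "fs": 2200})`, the `net` row comes first (500 bytes, 50%) and the `fs` row second (200 bytes, 10%).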

Richard B. Johnson thought at least some of the general increase might be due to the kernel explicitly initializing variables to 0, causing the compiled binary to take up space storing these empty values; but David S. Miller pointed out that a lot of explicit 0-initializations were removed in the 2.5 timeframe, which would reduce the compiled kernel size. And Randy Dunlap added that there were yet more patches (http://developer.osdl.org/ogasawara/kj-patches/uninit_static/) available to catch cases still lingering in the kernel code.

Addressing the filesystem growth, Christoph Hellwig wanted to test Bernardo's assertion that the bloat was probably caused by inlines and macros. He posted a horrifying hack to try to expose the problem at the expense of various hardware platforms (David S. Miller started to speak up against this, until he realized that Christoph didn't want to submit that code for inclusion anywhere; he just wanted Bernardo to test with it). The discussion meandered along for a bit with no conclusion, and eventually Ihar Philips Filipau remarked, "Linus repeating 'small is beatiful' sounds more and more like crude joke... As for embedded market - it is already in deep fork and far far away from vanilla kernels... Vanilla really not that relevant to real world..." Mike Fedyk replied:

Vanilla will be what people put into it. And I have seen more messages from embedded people complaining, than actually doing and submitting patches for merging.

So the embedded trees are a deep fork huh? Did you or anyone else do anything to merge during 2.5?!

And now you see why there is a "deep" fork...

Ihar replied:

  • Real-time stuff is a must - something like RTAI.
  • Things like Linux Trace Toolkit - soone or later you have to start using them to tune performace.
  • Patches to remove mandatory (for 2.2/2.0) PCI/IDE support were pretty common too.
  • Patch to shrink network hashes - norm of life.
  • Patch to kill PCI names database.
  • And this is only things I was using personally (and I remember about) in my short 4 years carrier.

CONFIG_TINY - http://lwn.net/Articles/14186/ - got something like this merged? - so I'm the first guy in the download queue on ftp.kernel.org (ftp://ftp.kernel.org) !

Kernel heavily tuned for servers and workstations (read - modern PCs).

At my previous position company was using kernel prepared by Karim Yaghmour and right now we using kernels from MontaVista.

Far from vanillas.

Francois Romieu pointed out the CONFIG_EMBEDDED option that did exist in the official kernel, and asked Ihar to publish the patches he'd mentioned. Ihar admitted he hadn't known about CONFIG_EMBEDDED, but he also couldn't get it working. He didn't provide any details about his problem, however, so no one could help him. But Francois at least confirmed that CONFIG_EMBEDDED seemed to work fine on his own system.

Elsewhere, Bernardo did try Christoph's ugly hack, and found that not much improved. 2.6-test was still pretty big. Willy Tarreau said:

I did the same observation a few weeks ago on 2.5.74/gcc-2.95.3. I tried to track down the responsible, to the point that I completely disabled every driver, networking option and file-system, just to see, and got about a 550 kB vmlinux compiled with -Os. 550 kB for nothing :-(

I don't have the config nor the exact numbers right here now, but I can redo the tests on 2.6.0-test1 if needed.

I was interested in using a very minimal pre-boot kernel with kexec which would automatically select a valid image among several ones. But 500 kB overhead for a boot loader quickly refrained me...

Bernardo pointed out that a lot of 2.6 features, like SysFS and the various I/O schedulers, could not be configured out of the kernel. He added, "There are probably many other things mostly useless for embedded systems that I'm not aware of." Bill Davidsen replied that currently, the push was to get a stable 2.6.0 released in the near future, and that a better time to consider the embedded issue would be after a number of releases of the stable series had already come out. But he agreed that definitely some parts of the kernel could be made optional at that point. Nicolas Pitre also suggested that "Being able to remove the block layer entirely, just as for the networking layer, should be considered too, since none of ramfs, tmpfs, nfs, smbfs, jffs and jffs2 just to name those ones actually need the block layer to operate. This is really a big pile of dead code in many embedded setups." Bernardo and Miles Bader loved this idea; and Miles remarked, "When I've used a debugger to trace through the kernel reading a block on a system using only romfs, it's utterly amazing how much completely unnecessary stuff happens. Of course it's a lot harder to find a clean way to make it optional than it is to complain about it ... :-)" Bernardo asked how difficult it would be to remove the block layer, and David Woodhouse said, "Depends how thorough you want to be. I wanted to go the whole way and actually remove the definition of struct buffer_head, then remove all the code which would no longer compile at all."

Elsewhere, Bernardo actually tried stripping SysFS out of the kernel, just to see what would happen. He said, "I just saved 7KB and got a kernel that couldn't boot because root device translation depends on sysfs ;-)" Tom Rini replied, "Now that someone has gone down the path (and, thanks for doing it), we know how much is saved and what needs to be done to get it to work. Lets just hope it doesn't grow that much more." Bernardo posted his patch for anyone interested, cautioning everyone that it was intended only for testing purposes because, as he had said, the resulting kernel wouldn't boot.

 

2. Framebuffer Client Notification Mechanism
29 Jul 2003 - 7 Aug 2003 (24 posts) Subject: "[PATCH] Framebuffer: client notification mecanism & PM"
Topics: Framebuffer
People: Benjamin Herrenschmidt, James Simmons

Benjamin Herrenschmidt posted a patch that "adds a mecanism for in-kernel "clients" of a framebuffer device to get notified of events on this framebuffer device. It adds some basic Power Management callbacks based on this, and implements support in fbcon. This allows fbdev low level drivers to instruct clients like "fbcon" to stop touching the framebuffer as the hardware is going to be suspended, and to restore the display after resume." There were some comments, and a few posts down the line, James Simmons said, "I knew it was a matter of time before "client" management would happen. Is this a 2.6.X thing tho or shoudl we wait for the next developement cycle. I don't mind working on experimental stuff." And Benjamin replied, "We need that now for proper power management." This suited James, and the thread went on for several days in technical areas before petering out.

 

3. Filesystem Errors In 2.6.0-test2
4 Aug 2003 - 10 Aug 2003 (11 posts) Subject: "ext3 badness in 2.6.0-test2"
Topics: Disk Arrays: RAID, FS: ext2, FS: ext3
People: Andrew Morton, Neil Brown, Frank van de Pol, Daniel Jacobowitz, Linus Torvalds

Daniel Jacobowitz got an ext3 error of the form, "EXT3-fs error (device md0) in start_transaction: Journal has aborted". Unfortunately, after this, the disk was completely inaccessible for reading or writing, so any system logs that might have helped were unavailable. Andrew Morton said that without the log data, it was hard to tell for certain. But he said, "Could have been an IO error, or the block/MD/device layer returned incorrect data. ext3 used to go BUG a lot in the latter case, but nowadays we try to abort the journal and go read-only." Neil Brown also saw the same problem, and figured it was probably with ext3. Frank van de Pol also confirmed seeing the same problem; Neil managed to get some log entries and posted them, and Andrew said there was definitely an ext3 bug in there somewhere. But he also added, "I find it distinctly fishy how this happens so much with ext3-on-md, and so little with ext3-on-just-a-disk." Neil replied:

I can reproduce this easily with various configurations of ext3 over raid5, and get a similar problem with ext2 over raid5 (corrupt inodes rather than directory entries) but ext3 over raid0 is rock-solid.

So I guess the finger points generally in the direction of raid5. Now I've just got to figure if it is a bug in r5, or some assumption that it makes that is no longer valid (I was briefly suspicious of PF_READAHEAD which could have made a real mess of raid5, but that wouldn't have this symptom)

In their initial reports, Daniel and Frank had both also said they'd been using RAID. Andrew was happy the thing was at least reproducible, and folks started bandying patches around to try to isolate a fix. Andrew said that PF_READAHEAD was actually a likely candidate in spite of Neil's reservations, and posted a fix that had already made it into Linus Torvalds' tree. Neil tried it, and it did seem to fix the problem. There were a couple more posts after that, and the thread petered out.

 

4. Setting Per-User Resource Limits
5 Aug 2003 - 7 Aug 2003 (7 posts) Archive Link: "Is it possible to add this feature."
People: Mike Fedyk, Martin Pool, Patrick McLean, Alan Cox, Rik van Riel

Someone asked if it were possible to limit memory usage and CPU usage on a per-user basis, and Kurt Wall suggested 'ulimit -m' for memory limiting and 'ulimit -t' for CPU limiting. But Mike Fedyk countered, "This is per session, and the user can have many sessions. Unless you limit the number of sessions a user can have..." And Martin Pool added, "Mike is correct that you cannot have system-wide per-user limits at the moment, at least in the standard kernel. However, it would be possible to add it, if you find somebody to develop it for you." Patrick McLean announced:

I am going to be working on this feature with a friend starting in September as a term project (we are both undergrads in computer science), and a way to get into kernel hacking :) Send me a mail if you want more info, or if you want us to keep you up to date on our progress.

We will also be loking for ways to specify the limits in a fairly simple, but scalable way, and we will be happy for any suggestions.

And Alan Cox suggested, "Google for two things - firstly Rik van Riel's bits of work (I think it was Rik anyway) on a fair share scheduler, also "beancounter" which was a patch long ago that started to attack the limits issues)".
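
Mike's objection is easier to see at the system-call level: 'ulimit -t' sets the RLIMIT_CPU resource limit, which belongs to a single process and is inherited by its children, so nothing adds up usage across a user's separate sessions. A minimal Python sketch, assuming a Unix-like system with the standard `resource` module; the helper name `cpu_limit_seen_by_child` is hypothetical:

```python
import resource
import subprocess
import sys

# One-line child program: print the soft CPU-time limit the child runs under.
CHILD_CODE = "import resource; print(resource.getrlimit(resource.RLIMIT_CPU)[0])"


def cpu_limit_seen_by_child(seconds):
    """Run a child with a soft CPU-time limit and return the limit it reports."""

    def lower_limit():
        # Runs in the child only, just before it executes the new program.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        # The soft limit may not exceed the hard limit.
        soft = seconds if hard == resource.RLIM_INFINITY else min(seconds, hard)
        resource.setrlimit(resource.RLIMIT_CPU, (soft, hard))

    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        preexec_fn=lower_limit,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)
```

The caller's own limit is unchanged after the call, and a second login session would start with its own untouched budget, which is why per-process limits cannot cap a user's total usage.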

 

5. Converting One Filesystem To Another
6 Aug 2003 - 8 Aug 2003 (15 posts) Archive Link: "reiserfs4"
Topics: FS: ReiserFS, FS: ext3
People: Andreas Dilger, Ivan Gyurdiev, Oleg Drokin, Tomas Szepe, Hans Reiser

Vladimir Lazarenko wanted to upgrade to ReiserFS 4, but asked if there were a converter that would allow him to migrate to the new version without data loss. He really wanted to avoid the whole 'backup, create new partition, and restore' scenario. Hans Reiser gave a pointer to ConvertFS (http://tzukanov.narod.ru/convertfs/), a third-party work in progress. But Andreas Dilger said the whole concept of a converter was broken. He said:

If you are converting your current filesystem to an _experimental_ filesystem, wouldn't you want to have a backup in case the new filesystem had a bug in it?

Considering that such a conversion tool would be used only very rarely, wouldn't you want to make a backup in case the conversion tool was broken?

The safest conversion is to make a backup with tar or similar, and then restore it after a formatting the new filesystem.

Ivan Gyurdiev replied that sometimes "people don't have the resources (hard disk space, tape drives, money) to backup their data, and might still be interested in testing a new filesystem. They might be willing to take a risk with the new fs and converter. Amazing as it may sound, people do that. I am such a tester, and I'd find a converter to be a useful tool. But since the previous discussion on the subject concluded it'd be really hard to impossible to write one, I guess I'll have to settle for new hard drive(s)." Oleg Drokin gave another link to the ConvertFS page, saying that its "only requirement seems to be that both fs types should have read/write support in Linux."

Ivan gave it a whirl, but found that it required having 50% of the disk free for use by the program. This, for him, defeated the purpose of using the tool at all, since he could just copy the data himself in that case. Tomas Szepe also said, "I'm afraid I cannot recommend using this tool. A test conversion from reiserfs to ext3 (inside a vmware machine) screwed up the data real horrorshow: directory structure seems ok but file contents are apparently shifted." Hans replied that VMWare might be interfering, and Tomas agreed, but added, "Nevertheless I'm inclined to believe this is rather a FIBMAP related kernel bug that has been introduced after the current version of the convertfs toolset was released in March 2002."

 

6. Linux 2.6.0-test3 Released
8 Aug 2003 - 13 Aug 2003 (21 posts) Archive Link: "Linux 2.6.0-test3"
Topics: USB
People: Linus Torvalds

Linus Torvalds announced 2.6.0-test3, saying:

The bulk of the diff by far is various architecture updates, and in particular bringing MIPS[64] a bit closer to being up-to-date for 2.6.x But there's arm, alpha, h8300 and ia64 updates too.

Merging the SELinux security architecture also ends up growing the patch, even though it may not be all that noticeable for most normal users.

For most x86 users, the CPU frequency updates, network driver updates, and some USB updates are most likely to matter.

And this should fix the PCMCIA lock-up that a number of people have seen happening since 2.5.71 or so. Thanks for people involved in testing and fixing that one.

Also, Andrew fixed a read-ahead bug that was introduced in test2 that could cause (non-readahead) IO failures under load.

 

7. Hyperthreading Configuration Problems In 2.6
9 Aug 2003 - 12 Aug 2003 (11 posts) Subject: "[2.6.0-test3] Hyperthreading gone"
Topics: Hyperthreading, Power Management: ACPI
People: Greg Norris, Florian Weimer, Hugh Dickins, Len Brown, Marcelo Tosatti

Florian Weimer noticed that hyperthreading didn't work for him under 2.6.0-test3: only one CPU would ever be activated during a run. He said recent 2.4 kernels (starting with 2.4.20) did successfully support hyperthreading on his machine. Gabor Micsko confirmed the problem on his own system; Greg Norris asked, "Did you select CPU Enumeration Only, or "normal" ACPI? If the former, did you specify the "acpismp=force" parameter at bootup?" Florian said he'd selected enumeration only, but not "acpismp=force" at bootup. He explained, "Previous experience (with some 2.5.x versions) indicates that Linux does not support full ACPI on this machine. The documentation suggests that the command line option enables full ACPI, so I hesitate to do this." Greg added, "According to the 2.6.0-test3 menuconfig help text, the parameter is required when CPU Enumeration Only is selected, and enables only limited ACPI support. For whatever it's worth, that matches my experience." And Florian replied, "I don't think it's clear from the description. It's certainly unexpected that a compile-time option doesn't activate a feature, but merely adds a boot option to do so."

Close by, Hugh Dickins told Florian that the "acpismp=force" specification at bootup was necessary, even though ACPI was not supported on his hardware. He agreed it was confusing, "and the ACPI guys agree it's wrong and to be fixed." He asked Len Brown about the status of this, since as far as he knew Len had been about to submit a patch to deal with this very problem, just four weeks before. Len replied, "My changes go to Marcelo via Andy" [Grover]. "This one has been waiting in his staging area while he was out on vacation. Now that he is back -- unless something broke in his tree -- I assume he'll be sending it along to Marcelo shortly." And Marcelo Tosatti said, "Good to know. Andrew, I'll wait for you on those updates to release -rc3."

 

8. Status Of Netconsole In 2.6
11 Aug 2003 - 12 Aug 2003 (5 posts) Subject: "[PATCH][RFC] Netconsole debugging tool for 2.6"
Topics: SMP
People: Matt Mackall, Jon Burgess, Jeff Garzik

Matt Mackall announced:

Because my development box makes the room it's in uncomfortably warm, I've decided to take a stab at resurrecting Ingo's netconsole patch.

For those who missed it the first time around (for 2.4.10), this module is a "serial console over networks" which lets you catch kernel messages, oopses and so on that can't be caught by syslog.

Since I thought the biggest problem with the first version was configuration, I went ahead and wrote some reasonable option parsing and made it also work as built-in so you can now boot with:

linux netconsole=2525@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc

or just

linux netconsole=@10.0.0.1/,@10.0.0.2/

I've also added support for a third NIC (TLAN). Accepting patches for other cards (only about 10 lines of code each).

Issues:

  • Probably better ways to handle device locking these days
  • SMP-safe?
  • Would like to get logging up much earlier in the boot process
  • Need support for more cards
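
The option strings in Matt's message suggest a layout of port@address/device for the local side and port@address/MAC for the remote side, with empty fields left to defaults. The Python sketch below parses that shape. The field meanings are inferred from the two examples above rather than taken from the patch, the function names are hypothetical, and omitted fields are reported as None because the message does not state the defaults.

```python
def parse_endpoint(text):
    """Split 'port@address/extra' into its parts; empty parts become None."""
    port_part, sep, rest = text.partition("@")
    if not sep:
        raise ValueError(f"missing '@' in {text!r}")
    address, _, extra = rest.partition("/")
    return {
        "port": int(port_part) if port_part else None,
        "address": address or None,
        "extra": extra or None,
    }


def parse_netconsole(option):
    """Parse 'netconsole=<local>,<remote>' (the 'netconsole=' prefix is optional)."""
    prefix = "netconsole="
    value = option[len(prefix):] if option.startswith(prefix) else option
    local_text, sep, remote_text = value.partition(",")
    if not sep:
        raise ValueError("expected '<local>,<remote>'")
    local = parse_endpoint(local_text)
    remote = parse_endpoint(remote_text)
    return {
        "local_port": local["port"],
        "local_ip": local["address"],
        "device": local["extra"],        # assumed: trailing field names the NIC
        "remote_port": remote["port"],
        "remote_ip": remote["address"],
        "remote_mac": remote["extra"],   # assumed: trailing field is a MAC address
    }
```

Feeding it the short form, `parse_netconsole("netconsole=@10.0.0.1/,@10.0.0.2/")`, yields the two IP addresses with every other field set to None.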

Jon Burgess asked:

Is this different from the netdump patch which RedHat include in their kernel?

The RH kernel patch is at http://www.kernelnewbies.org/kernels/rh9/SOURCES/linux-2.4.18-netdump.patch

The tools are shipped in netdump-*.rpm with the distribution.

Matt hadn't known about the Red Hat patch, and after some analysis offered a comparison:

Theirs:

  • does crashdumps
  • does syslog without levels
  • has hooks for receive

Mine:

  • works in 2.6
  • has non-appalling configuration
  • works as a built-in and is available earlier in boot
  • does syslog with levels (haven't posted this though)

Jeff Garzik also added:

netconsole does syslog with levels, too. I agree netdump/netconsole have complete awful configuration. I was thinking netlink would be a good configurator.

The kernel printk <foo> prefixes map into syslog quite nicely.

In any case, there is my own active effort into cleaning up netdump to be less x86-specific, and get it ready for mainline.

Maybe we can start discussing converging all these implementations on netdev@oss.sgi.com? (that's where the networking developers live)

Matt didn't like that idea, and the thread ended.
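
Jeff's point about printk prefixes can be made concrete: the kernel tags each message with a leading <N>, where N runs from 0 (KERN_EMERG) to 7 (KERN_DEBUG), and syslog uses the same eight severities in the same order. A small Python sketch of the mapping, with a hypothetical helper name `split_printk_level`:

```python
import re

# Syslog severity names, indexed by the number inside the printk prefix.
SYSLOG_LEVEL_NAMES = (
    "emerg", "alert", "crit", "err", "warning", "notice", "info", "debug",
)

_PREFIX = re.compile(r"<([0-7])>")


def split_printk_level(line):
    """Return (severity_name, message); severity_name is None without a prefix."""
    match = _PREFIX.match(line)
    if match is None:
        return None, line
    level = int(match.group(1))
    return SYSLOG_LEVEL_NAMES[level], line[match.end():]
```

A line without a prefix is passed through untouched, leaving the choice of a default severity to the caller.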

 

We Hope You Enjoy Kernel Traffic
 

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License, version 2.0.