Kernel Traffic #209 For 16 Mar 2003

By Zack Brown

Mailing List Stats For This Week

We looked at 1988 posts in 10143K.

There were 479 different contributors. 261 posted more than once. 201 posted last week too.

 

1. Linux 2.5 Compiled Binaries Larger Than 2.4
4 Mar 2003 - 7 Mar 2003 (16 posts) Archive Link: "Kernel bloat 2.4 vs. 2.5"
Topics: FS: NFS, Networking
People: Daniel Egger, Andrew Morton, Chris Wedgwood

Daniel Egger reported:

I've seen surprisingly few messages about the dramatic size increase between a simple 2.4 and a 2.5 kernel image.

I just decided to check back with the 2.5 series again after my last try with 2.5.53 (which wouldn't even boot) but had to dramatically cut down the kernel featurewise to keep it below 1MB because I can't boot it over tftp otherwise.

909824 Feb 14 20:02 vmlinuz-192.168.11.3-2.4.20
954880 Mar 4 17:01 vmlinuz-192.168.11.3-2.5.63

What you see here is a 2.4 kernel with almost everything needed to run the machine built in and a (rsync'ed) 2.5.63 kernel with everything but the basic stuff + ipv4 + NIC + NFS (+ other necessary features not builtable as modules) built as modules.

Are there any patches I've missed to get that down? A slight tad bigger and I couldn't even work with recent kernels if modules actually worked... :/

Andrew Morton replied, "2.4 has magical size reduction tricks in it which were not brought into 2.5 because we expect that gcc will do it for us," and asked Daniel which compiler he used. Daniel said he used gcc 3.2.3 from Debian, and saw huge growth in the kernel image from 2.4 to 2.5; elsewhere, Chris Wedgwood said he was using GCC 2.95.4 from Debian, and also saw a large discrepancy between the two kernels. Andrew replied, "2.4 has hacks to make it smaller. iirc they were worth ~200 kbytes, or around 10%. gcc-3.x string sharing was supposed to make those hacks unnecessary. However a quick test here shows gcc-3.2.1 generating a 10% larger 2.5 image than gcc-2.95.3, so a club may need to be taken to 2.5 as well."
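To put Daniel's two file sizes in perspective, here is a small Python sketch of the arithmetic. It assumes that his "1MB" TFTP limit means 1024 * 1024 bytes; the function names are made up for illustration. Keep in mind that the 2.5.63 image had most features moved out into modules, so the raw size difference understates the real growth.

```python
# Image sizes in bytes, copied from Daniel's directory listing.
SIZE_2_4_20 = 909824
SIZE_2_5_63 = 954880

# Assumption: the "1MB" TFTP limit means 1024 * 1024 bytes.
TFTP_LIMIT = 1024 * 1024


def growth_percent(old: int, new: int) -> float:
    """Relative size increase from old to new, in percent."""
    return (new - old) / old * 100.0


def headroom(size: int, limit: int = TFTP_LIMIT) -> int:
    """Bytes left before an image of the given size reaches the limit."""
    return limit - size


if __name__ == "__main__":
    print(f"growth: {growth_percent(SIZE_2_4_20, SIZE_2_5_63):.2f}%")
    print(f"headroom: {headroom(SIZE_2_5_63)} bytes")
```

Under that assumption the stripped-down 2.5.63 image is roughly 5% larger than the fully featured 2.4.20 one, and has a bit over 90 KB to spare before it stops being bootable over TFTP.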

 

2. Linux 2.5.64 Released
4 Mar 2003 - 6 Mar 2003 (13 posts) Archive Link: "Linux 2.5.64"
Topics: FS: NFS, USB
People: Linus Torvalds

Linus Torvalds announced 2.5.64 (http://www.kernel.org/pub/linux/kernel/v2.5/ChangeLog-2.5.64) , saying:

Hmm.. Stuff all over, and a lot of spelling fixes. But there's a fair number of "real" things here too, merges with Andrew, Dave etc.

As already reported on linux-kernel, this should fix the htree problems (apart from some potential NFS export issues, apparently), and it also has the nicer hash chain code from Andi (no actual changes in the hash algorithms themselves, just the list changes).

Sparc, USB, networking updates.

And as I've been a bit snowed under by a lot of email, if you sent me stuff and it isn't here, double-check and re-send (but even better, try to see if you can find one of my "merge points" and check with them too)

 

3. Moving Swap Configuration Into The General Setup Menu
4 Mar 2003 - 7 Mar 2003 (11 posts) Archive Link: "[PATCH] move SWAP option in menu"
People: Randy Dunlap, Gabriel Paubert, Tomas Szepe, Tom Rini

Randy Dunlap suggested, "Please apply this patch (option B of 2 choices) from Tomas Szepe to move the SWAP option into the General Setup menu." Gabriel Paubert noticed that Randy's patch only had an effect on x86 machines. He asked, "Why restrict it to Intel only? I don't know if it works properly on other architectures, but at least it would give people the opportunity to test it on embedded PPC/Arm/MIPS/CRIS/whatever." Randy said he'd accept a patch, and Tom Rini sent one along.

 

4. Kernel Boot Speed
5 Mar 2003 - 6 Mar 2003 (5 posts) Archive Link: "Kernel Boot Speedup"
Topics: BSD, Disks: IDE, FS: ext2, Kexec
People: Andy Pfiffer, Adam Sulmicki, Helge Hafting, John Bradford

Someone asked how to get the kernel to boot as fast as possible, and Andy Pfiffer replied:

To get to that kind of boot-up speed, the best way is to never shutdown.

On a StrongArm platform I worked on, we managed to put the CPU to sleep and the DRAM controller into self-refresh mode and a few other housekeeping chores (like checksumming our saved CPU state to be able to verify it on resumption), and could spring back to life with the press of a power button in about the same amount of time it took for the cold-cathode back-light to warm up enough to see the built-in screen.

On a modern laptop, it may be possible, in theory, to accomplish the same kind of thing. The key is to be able to not lose the contents of memory. I'm not well versed on current state-of-the-art power-management on commodity x86 platforms, so your mileage may vary.

If you want cold-start boot on a PC, you'll probably need to completely skip the BIOS (have a look at LinuxBIOS and/or kexec), skip the probing of devices on reboot, and drastically shorten (or run later) any user-mode scripts that are invoked.

On the machines that I have measured (p3-800 and p4-1.7Xeon), a well-configured kernel, after subtracting out BIOS time and stupid scsi reprobing, is up and open for business in about 10 seconds after the LILO handoff. The *system* however, isn't often available for another 30 or 40 seconds, perhaps longer.

Adam Sulmicki added, "Also, when you are using LinuxBIOS then time for the hard disk to spin up actually becomes significant. And it is of order of several seconds (and up to 30 seconds according to specs for ATA). To counter this problem you may want to put kernel and root stuff on Compact Flash and then use CF<->IDE adapter to use CF as primary boot device. (As side benefit it allows you to easily get around 256KiB limitation of most eeprom (bios) sockets on your typical motherboard)"

Helge Hafting also suggested to the original poster:

As a first step, compile the kernel yourself. Include only drivers for stuff you actually have and use, drop everything else. That should give you a kernel that boots in a few seconds, unless you have some really slow piece of hardware.

Of course the kernel boot time is only part of what we perceive as "boot time", i.e. time from power-on till you can use the machine.

A normal pc boot goes like this:

1. The bios does its stuff. No amount of kernel tweaking can help you with this, because this happens before the kernel is loaded. You can tweak bios options or get a better bios or motherboard though. Many bioses are really slow - I'm lucky and have one that gets to the kernel loading stage so fast that the flat panel screen doesn't have time to keep up. (The bios starts briefly with some graphics mode, then turns to 80x25 text for compatibility reasons. I don't get to see that transition unless I pause it at the lilo stage)

2. The bios loads a linux loader, typically lilo. Lilo then loads the kernel of choice. Lilo may be configured with a keypress timeout of several seconds - I have shortened that to 0.2 seconds, you may remove it entirely. You may also want to configure lilo with the compact option, it loads a little faster that way.

3. The kernel boots. This is what you may shorten by being clever. Leave out everything you don't need, compile into the kernel anything needed during boot. (I.e. don't use modules unnecessarily, they cause extra disk accesses) You know the kernel boot has started when it prints something like "Linux hh 2.5.63-mm2" or similar. The kernel boot has ended when it prints

VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 320k freed

or something like that. This is usually pretty quick. The machine isn't ready for use yet though.

4. Various init scripts run - depending a lot on your distribution. This typically involves lots of disk access and may be slow. You can trim down the init scripts a lot if you know what you're doing - they're general-purpose but you may have something more specific in mind.

The easy tip is to uninstall anything that has a boot script but you don't use. Such as unnecessary servers. This also makes the machine safer on a network - less stuff to break into.

You may be able to speed the boot scripts up by creating some _huge_ initrd containing as much of the boot scripts and related executables as possible. This works because an initrd is loaded by sequential disk accesses while the boot process uses time-consuming seeking.

If all you care about is to login fast, move the script that enables login earlier in the boot process. Similarly, if you use X, move xdm (or whatever starts X) earlier. There's no reason to wait for webservers and similar to start before running X, but that's what distributors usually do.

And if you want to get into X fast - use a lightweight window manager! Something like icewm, twm or similar. You will particularly want to throw away KDE and gnome. (This is not as drastic as it sounds, because you can run your kde/gnome apps under plain icewm just fine.) KDE alone adds 40 seconds or so to my startup time, more than everything else taken together. So of course I don't use it.

Unless you have a special machine, most of your startup waiting will be waiting for the bios, or for disk seeks. Having 256-512M of RAM helps, because the cache _won't_ run out during boot. 10000RPM disks help too. And you definitely want more than one drive. Having /usr and /var on separate spindles speeds up the boot because the programs load from /usr and tend to use data on /var. Having the root with /etc on some third spindle is even better, because /etc is where the programs read their configuration. This division will avoid a lot of seeking around as all your software starts up.

As for window managers, John Bradford recommended FVWM2. And for Helge's item 4 (init scripts), John added, "you can save a *lot* of boot time like this - my main box runs just three init scripts, (I use a BSD-style init script layout, and the main script calls two others). The main script is 2095 bytes, and the others are 5144 and 2713 bytes. Most of that is taken up with comments. I've booted a 486, 4Mb laptop in to 2.2.13 in around 30 seconds, (from power on), by cutting the init scripts down to almost nothing."
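Helge's description of where the kernel phase starts and ends can be turned into a rough measurement. The Python sketch below assumes you have a boot log with a timestamp on each line, for instance captured by a serial console logger (kernels of this era do not print timestamps themselves). The start marker, the function name, and the sample timings are invented for illustration; only the "Freeing unused kernel memory" end marker comes from Helge's post.

```python
from typing import Iterable, Optional, Tuple

LogLine = Tuple[float, str]  # (seconds since capture started, message text)


def kernel_boot_seconds(
    lines: Iterable[LogLine],
    start_marker: str = "Linux ",  # assumption: banner text varies by kernel
    end_marker: str = "Freeing unused kernel memory",
) -> Optional[float]:
    """Return the time between the kernel banner and the end-of-boot message.

    Returns None if either marker is missing from the log.
    """
    start = None
    for stamp, text in lines:
        if start is None:
            if start_marker in text:
                start = stamp
        elif end_marker in text:
            return stamp - start
    return None


# Made-up sample log, not a real measurement.
SAMPLE_LOG = [
    (0.0, "Linux hh 2.5.63-mm2"),
    (2.5, "VFS: Mounted root (ext2 filesystem) readonly."),
    (2.75, "Freeing unused kernel memory: 320k freed"),
    (31.0, "init scripts finished"),
]

if __name__ == "__main__":
    print(kernel_boot_seconds(SAMPLE_LOG))
```

Everything after the end marker in such a log is init-script and user-space time, which is where both Helge and John suggest most of the savings are to be found.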

 

5. Adding New System Calls
5 Mar 2003 - 7 Mar 2003 (6 posts) Archive Link: "[PATCH] Making it easy to add system calls"
People: George Anzinger, Linus Torvalds, Jamie Lokier

George Anzinger posted a patch to create a new sys_calls.h file, to define the names and numbers of all system calls. He explained, "Of course we will be adding no more system calls, but it does make things a _lot_ easier." Linus Torvalds replied:

Me no likee.

The fact is, we add system calls maybe a few times a year. Having to update two places instead of just one is not very onerous.

More importantly, the system call numbers should be something you have to think about, and a setup that makes it easy to merge new system calls with different numbers "by mistake" is a _bad_ setup.

Don't get me wrong - I'm not saying that "system calls should be hard to add, because only real men should add system calls". But I _am_ saying that it should be damn hard to add a system call by mistake at the wrong number, which is something your patch makes a lot easier than the current situation.

And I do believe that adding system calls should inherently be something that you have to think twice about, even if one of the thoughts is just literally writing out the number that you decided on. So while I don't think it should be "hard", it should definitely not be made any easier than it already is.

George was fine with that. Jamie Lokier put in, "It would be nice to have a single list of non-architecture-specific system calls though. Think of the number of times a system call has been added to many architectures but left out, quite by accident, of one or two until someone notices."
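Both concerns in this thread, a call merged at a clashing number and a call forgotten on one architecture, are things a script could check for. The following Python sketch is only an illustration of that idea, not anything proposed on the list; the architecture names, call names, and numbers are all made up.

```python
from typing import Dict, List

SyscallTable = Dict[str, int]  # syscall name -> number

# Hypothetical per-architecture tables; names and numbers are invented.
TABLES: Dict[str, SyscallTable] = {
    "arch_a": {"foo": 1, "bar": 2, "baz": 3},
    "arch_b": {"foo": 1, "bar": 2},            # "baz" was left out
    "arch_c": {"foo": 1, "bar": 3, "baz": 3},  # two calls share number 3
}


def missing_calls(tables: Dict[str, SyscallTable]) -> Dict[str, List[str]]:
    """For each architecture, list calls that exist elsewhere but not there."""
    all_names = set()
    for table in tables.values():
        all_names.update(table)
    result = {}
    for arch, table in tables.items():
        absent = sorted(all_names - set(table))
        if absent:
            result[arch] = absent
    return result


def duplicate_numbers(table: SyscallTable) -> Dict[int, List[str]]:
    """Return numbers that are assigned to more than one call in a table."""
    by_number: Dict[int, List[str]] = {}
    for name, number in table.items():
        by_number.setdefault(number, []).append(name)
    return {n: sorted(names) for n, names in by_number.items() if len(names) > 1}


if __name__ == "__main__":
    print(missing_calls(TABLES))
    for arch, table in TABLES.items():
        print(arch, duplicate_numbers(table))
```

A check like this leaves the numbers written out by hand in each architecture's table, which is the property Linus wanted to keep, while still catching the accidental omissions Jamie described.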

 

6. Linux 2.5.64-mm1 Released
5 Mar 2003 - 11 Mar 2003 (8 posts) Archive Link: "2.5.64-mm1"
Topics: Kernel Release Announcement
People: Andrew Morton

Andrew Morton announced 2.5.64-mm1 and said:

http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.5/2.5.64/2.5.64-mm1/

  • Included Ingo's file-offset-in-pte patch which allows pages which are in nonlinear mappings to be reestablished by the kernel's pagefault handler. This is enabled against all mappings for testing purposes.
  • No functional changes to the anticipatory scheduler this time. Just stabilisation work. It doesn't seem to oops any more.
  • A bunch of bugfixes plus the usual sweepings off the factory floor.

 

7. Linux Test Project 20030206 Released
6 Mar 2003 (1 post) Archive Link: "[ANNOUNCE] LTP-20030206"
Topics: Bug Tracking, Disk Arrays: LVM, POSIX, Version Control
People: Robert Williamson

Robert Williamson announced:

The Linux Test Project test suite ltp-full-20030206.tgz has been released. Visit our website (http://ltp.sourceforge.net) to download the latest version of the testsuite that contains 1000+ tests for the Linux OS. Our site also contains other information such as: test results, a Linux test tools matrix, technical papers and HowTos on Linux testing, and a code coverage analysis tool. A new area for keeping up with fixes for known blocking problems in 2.5 kernel releases has been added as well, and can be found at http://ltp.sourceforge.net/errata .

The list of test cases that are expected to fail is located at: http://ltp.sourceforge.net/expected-errors.php

The highlights of this release are:

  • All tests from the Open POSIX* Testsuite have been ported and merged in.
  • New test scripts have been added for system stress and LVM testing.
  • Entire suite has been updated to support non-root 'make install'
  • New API added to allow test creators to easily query the kernel version of the test machine.
  • Changes were implemented to support GCC 3.3 standards.
  • All patches, fixes, and updates accepted into CVS have been included.

We encourage the community to post results, patches, or new tests on our mailing list, and to use the CVS bug tracking facility to report problems that you might encounter. More details available at our web-site.

 

8. Getting Rid Of ipconfig.c
6 Mar 2003 - 9 Mar 2003 (32 posts) Archive Link: "Make ipconfig.c work as a loadable module."
Topics: FS: NFS, Version Control
People: Robin Holt, Alan Cox, Jeff Garzik

Robin Holt said:

The patch at the end of this email makes ipconfig.c work as a loadable module under the 2.5. The diff was taken against the bitkeeper tree changeset 1.1075.

Currently ipconfig.o must get statically linked into the kernel. I have a proprietary driver which the supplier will not provide a GPL version or info. In order to mount root over NFS, I need to get the vendor's driver loaded via a ramdisk.

But Alan Cox rejoined:

The right fix is to delete ipconfig.c, it has been the right fix for a long long time. There are initrd based bootp/dhcp setups that can also then mount a root NFS partition and they do *not* need any kernel helper.

Indeed probably the biggest distro using nfs root (LTSP) doesn't use ipconfig even on 2.4.

DaveM can you just remove the thing. See http://www.ltsp.org for initrds that don't need it in

Jeff Garzik added, "Many have wanted to delete ipconfig.c for a while now..."

 

9. IPsec-Tools 0.2 Released
6 Mar 2003 (1 post) Archive Link: "ANNOUNCE: updated ipsec-tools (0.2) package available"
People: Derek Atkins

Derek Atkins announced:

I've released an updated IPsec-Tools package (0.2) which fixes the build problems people have had with glibc-2.3 and openssl-0.9.7 systems. For full information read the NEWS file in the release.

The package is available from http://ipsec-tools.sourceforge.net/

 

10. klibc Licensing Discussion
6 Mar 2003 - 12 Mar 2003 (63 posts) Archive Link: "[BK PATCH] klibc for 2.5.64 - try 2"
Topics: BSD, Klibc, Power Management: ACPI, Version Control
People: H. Peter Anvin, Roman Zippel, Linus Torvalds, Matthew Wilcox, Alan Cox, Greg KH

Greg KH released a new patch for klibc, which included a license file. Roman Zippel noticed that the license chosen was not the GPL, and asked for an explanation. H. Peter Anvin replied, "it's the MIT license, which differs from the (new) BSD license only in the no-endorsement clause, which seemed superfluous. It was chosen because klibc is a non-dynamic library, and it would otherwise be extremely awkward to link proprietary code against it if someone would like to do so. Furthermore, I'm the author of most of the code in there, and if someone really wants to rip it off it's not a huge deal to me." A couple posts down the line, Roman said, "If something goes into the kernel, the kernel license would be the obvious choice. Granting additional rights or using a dual license is a relatively small problem." He asked for a better explanation, and Linus Torvalds said:

The reasoning is very simple:

  • klibc is small. It would be pointless to make it a shared library, because the infrastructure to do so and the indirection required would likely be bigger than klibc itself (unless klibc is eventually bloated up)
  • klibc is potentially useful outside just standard kernel initrd images, and in fact for development it is nice to use it that way.

Put the two together, and the GPL really doesn't look like a very good license for klibc. Yeah, you can disagree about what the actual exceptions are, but clearly there has to be _some_ exception to the license.

Also, since the kernel GPL thing doesn't taint user space apps (very much documented since day 1), there really isn't any _reason_ to use the GPL in the first place. klibc wouldn't ever get linked into the kernel, only into apps.

As such, and since Peter is the main author, I don't see your argument, Roman.

Matthew Wilcox remarked, "klibc is losing at least some potential developers by virtue of its licence. I'm not willing to release code under the BSD licence and would prefer full GPL. I'm willing to compromise on LGPL, but Peter isn't. He came out with some nonsense about wanting proprietary apps in early userspace (which seems like a ludicrous thing to _favour_, but...) which LGPL doesn't prevent you from doing, even with a non-shared library." And Alan Cox said, "Well you can use the BSD klibc code in your LGPL library, Peter just can't get the changes back ;)"

Close by, Linus went on:

I don't personally think that BSD/MIT is any better than GPL with restriction, and the real issue boils down to "what license do people want to work on". Since Peter so far has been one of the major developers, his opinion (regardless of _why_ he holds that opinion) matters more than most to me.

Of course, some people have said that _they_ would want to work on it only if it's GPL, but hey, the proof is in the pudding, and "A bird in the hand is worth two in the bush". In other words, talk is cheap, and code rules. Right now that means that hpa rules, methinks.

However, I also have to say that klibc is pretty late in the game, and as long as it doesn't add any direct value to the kernel build the whole thing ends up being pretty moot right now. It might be different if we actually had code that needed it (ie ACPI in user space or whatever).

Elsewhere, he also said:

Guys, which part of "he who writes the code gets to choose the license" do you not _get_?

I find few things more morally offensive than whiners who whine about other people's choice of license.

I found it totally inappropriate when some of the crazier BSD guys were whining about the use of the GPL in Linux for _years_. They seem to finally have shut up lately, or maybe I've just gotten sufficiently good at ignoring them.

But I find it _equally_ offensive when somebody whines the other way. I can understand it from rms, if only because I _expect_ it. But why the hell people who didn't actually DO anything whine about Peter's choice of license FOR THE CODE HE WROTE, I don't see.

This is the "shut up and put up" philosophy of software licensing. Either you do the work, or you sit quietly and watch others do it. If you do the work, you get to impact the license. If you don't, you had better SHUT THE F*CK UP!

Btw, the same goes for every single BK whiner out there.

Later, he added:

Hey, I'm a GPL user myself, obviously. I don't much like the BSD license, and no project _I_ start is likely to ever be under that license. In fact, I seriously doubt that I'd ever really even want to get seriously involved with a project that could just be hijacked without source at any time.

However, that doesn't make pressuring hpa about it ok.

Also, you guys should think about what this whole project was about: it's about the smallest possible libc. This is NOT a project that should live and prosper and grow successful. That's totally against the whole point of it, it's not _supposed_ to ever be a glibc-like thing. It's supposed to be so damn basic that it's not even _interesting_. It's one of those projects that is better off ignored, in fact. It's like a glorified header file.

(At this point hpa asks me to shut up, since I've now depressed him more than any of the GPL bigots ever did ;)

I can _totally_ see hpa's point that he would be perfectly happy with people "stealing" parts of it - the code in question is not something that anybody should _ever_ have to re-create, even if he's the most evil person on earth and hates the GPL and wants to kill us all. Because it's not _worth_ recreating.

 

11. Status Of perfctr
6 Mar 2003 - 9 Mar 2003 (5 posts) Archive Link: "perfctr and Linus' tree?"
Topics: Profiling, Scheduler
People: Hiro Yoshioka, Mikael Pettersson, Albert Cahalan

Hiro Yoshioka asked:

I have a question. Is there any progress on merging the perfctr patch to Linus' kernel tree?

http://www.uwsg.iu.edu/hypermail/linux/kernel/0303.0/0647.html

I found the DCL patch set includes the perfctr patch. http://lists.osdl.org/pipermail/dcl_developer/2003-March/000009.html

Mikael Pettersson replied:

No progress since Linus totally ignored it, but at least two perfctr-patched trees exist. OSDL does one for the development kernel, and Jack Perdue has pre-patched RedHat kernel .rpms. (For Jack's stuff, check out PAPI -> Links -> Related Software.)

I'm planning to simplify the kernel <--> user-space interface in perfctr-2.6 (drop /proc/pid/perfctr and go back to /dev/perfctr), and then I _think_ I can do a version that doesn't require patching kernel source. (It will do binary code patching at module load-time instead. Horrible as that sounds, it's easier to deal with for users.)

Albert Cahalan felt that these changes would not make perfctr more likely to be accepted into the kernel, but Mikael replied:

I don't like patching kernel object code at all. But the #1 usability problem I'm facing is that to use the stuff, people _must_ patch and rebuild their kernels, due to the callbacks from switch_to and a few other places, and the task_struct layout change. That scares away some people, and some try it but get it wrong with confusing (and hard to debug) results.

(Besides, patching and recompiling the kernel doesn't always work. There are examples of binary-only HW vendor modules that are specific to certain versions of certain vendors' binary kernels.)

Naturally, the normal procedure of rebuilding the kernel from patched sources would remain as the default. The object code patching approach (which is what I'd use for a plug-and-play binary .rpm for example) would basically use System.map and a glue module to do the patching (installing callbacks), and then the stock driver module would be inserted as usual.

With Linus ignoring the patch, the driver needing callbacks from process scheduling/fork/exit/execve points, and users having problems with kernel recompiles, what do you expect me to do?
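Mikael mentions using System.map to find the places where callbacks would be installed. As a rough illustration of just the lookup step (not the patching itself, and not Mikael's actual code), the Python sketch below parses the usual three-column System.map layout of hexadecimal address, symbol type, and symbol name. The sample addresses are invented, and the symbol names are only plausible examples.

```python
from typing import Dict

# Invented sample in System.map layout: "<hex address> <type> <name>".
SAMPLE_SYSTEM_MAP = """\
c0100000 T _stext
c0105a10 T __switch_to
c011b2c0 T do_fork
c011d480 T do_exit
"""


def parse_system_map(text: str) -> Dict[str, int]:
    """Map each symbol name to its address, skipping malformed lines."""
    symbols: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        address, _kind, name = parts
        symbols[name] = int(address, 16)
    return symbols


if __name__ == "__main__":
    table = parse_system_map(SAMPLE_SYSTEM_MAP)
    print(hex(table["__switch_to"]))
```

The lookup only works if the System.map file matches the running kernel exactly, which is one reason the approach is fragile.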

 

12. Minutes From March 7 LSE Conference Call
7 Mar 2003 (3 posts) Archive Link: "Minutes from LSE Call March 7"
Topics: Assembly, Clustering, Ottawa Linux Symposium, Virtual Memory
People: Hanna Linder, William Lee Irwin III, Paul Larson, John Hawkes, Muli Ben-Yehuda, Andi Kleen

Hanna Linder reported:

LSE Con Call Minutes March 7th

Minutes compiled by Hanna Linder. All mistakes are my own. Please send corrections/comments to the list. And if you start a huge thread with hundreds of responses, please change the subject ;)

-------

I. Martin Bligh - hlist hash patch from Andi Kleen

Andi Kleen changed the hlist hash to be a singly linked list from a doubly linked list. The code is smaller but the performance is the same. It has been accepted by Linus and is in 2.5.64.

II. Hanna Linder - hiding projects on lse

Going to hide the scheduler and the apic routing projects in the lse.sf.net site. Also planning on moving the 2.5 lse work to an osdl site since the existing sourceforge site is filled up with a lot of 2.4 stuff.

III. Hanna Linder - lockmeter port beta ready

Hanna got the lockmeter port to 2.5.64 booted and working. But she only ported the i386 architecture and needs to finish the basic port for the other archs. She is going to do more testing today and will send out the code later. John Hawkes (the author) said he is about half way done with kernprof and will look at the lockmeter port when Hanna is done.

IV. Paul Larson - gcov patch

http://sourceforge.net/project/showfiles.php?group_id=3382&release_id=108054

Resync of the gcov patch that Hubertus Franke did. Neat little profiling program that provides better granularity than some other profilers. It gives you per-file and per-line code coverage info. This is good for the LTP project so we can see how much of the code our tests hit. Martin - are there big performance impacts with this code? Paul - not sure really.

Problem with config mod versions so make sure to turn it off. Also profiling of loaded modules isn't working correctly, so if you want to profile it, compile it into the kernel. Other than that it is working pretty well.

lcov - another tool on the LTP site that goes out and looks at all the gcov data and pulls it into a web page to let you browse the source tree. Both user and kernel level info. This is where it will be when it is packaged up: http://sourceforge.net/project/showfiles.php?group_id=3382&release_id=108054

Bill asked what the scheme is for accounting: sampling or incrementing/decrementing counters? Are the counters per cpu? No. By default that is not handled well. You won't run into an issue where it says you didn't run something because it was on another processor, but there are some locking issues. Nigel Hines is working on this problem. The preferred solution is a hack using an awk script to look at the assembly and add locks where needed.

Martin - why don't you just use per cpu locks? Paul - need to look into it. But the problem is with the compiler not the kernel so it might not help.

Martin is going to include it in his -mjb tree since it is a config option you can turn on and off.

V. Bill Irwin - Page Clustering

ftp://ftp.kernel.org/pub/linux/kernel/people/wli/vm/pgcl/

If we have a 16 byte struct page for every 512 bytes we are wasting a lot of memory. So try to keep track of every 1 kb or larger region. To make sure we have our accounting straight. The end result is when we are walking page tables you walk them in hardware page size and everything else ignores it and keeps track of sw page size. End up with an interesting relationship where every struct page is pointed to by a factor of ptes.

Bill has made a lot of recent progress. Things like swap have been restored to functionality. It boots on most of the machines he has tried it on. Also working on performance aspects of it right now. The real danger of all this is that you get internal (not external) fragmentation. Basically allocate a whole sw page of pte page size and don't get to use all of it because you are missing some logging somewhere to take advantage of it. Bill has been keeping some strategies in the back of his mind to interoperate with some things that are being kicked around.

Bill is using some simplified heuristics to search for pages to fault in. Turns out those heuristics suck so he needs to go in and do a different set. The ones originally done in Hugh's patch did something in the order of scanning across an entire vma looking for pte's pointing to a particular page. It didn't have any alignment restrictions. Bill does have alignment restrictions and Hugh's solution would break down pretty quickly (kernel compiles swapping).

The one that hurts is the one that crosses page table pages. Shared pages don't really like that. The way it works with rmap is it is on a per page basis. Still crossing the page table page is bad.

That is the main gist of it. Currently Bill is working on tweaking the heuristics.

hanna- how much memory do you need to get the benefit of this?

bill - two benefits- larger page table size. and the arrays of all the struct pages in the system is smaller.

hanna asked if akpm put it in his tree and bill said it is not the kind of thing akpm is going to hang on to right now. Bill wants to get it all working first. He is going to break it off in chunks and send it in piecewise.

The OLS talk is going to be mainly about this work. There are some pretty hairy bugs in there to wrestle with. Should make for an interesting talk.

The people testing it right now are mainly Muli Ben-Yehuda, Zwane, Badari, Paul Larson. Bill is interested in having more people look at it or run it.

Bill thinks he is at critical mass now for main changes.

Muli - is going to work with Bill on a bug he found.

Bill - Testing it as far as it being effective. He has shown it reduces the core map on a 48 gig machine by a factor of 16. Well I picked the factor, you do it at compile time. The dmesg are in his ftp dir on kernel.org so you can look at the difference between zone normal and high mem.

William Lee Irwin III corrected, "The bit about Hugh's heuristics is backward; the heuristics he used for 2.4.x were very effective. It's my homegrown heuristics that are breaking down very quickly wrt. performance and fragmentation." And elsewhere, about lcov, Paul Larson said, "lcov-1.0 (since it's only been available from cvs before) is packaged up and available now for anyone interested."
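The memory saving Bill describes for the core map (the array of struct pages) is straightforward arithmetic: one struct page per software page, so a software page 16 times larger means 16 times fewer entries. The Python sketch below works this through for the 48 gig machine. The 4 KiB hardware page size and the 40-byte struct page are assumptions chosen for illustration, and "48 gig" is read as 48 GiB; none of those figures come from the minutes.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def core_map_bytes(ram_bytes: int, page_bytes: int, struct_page_bytes: int) -> int:
    """Size of the struct page array: one entry per tracked page."""
    return (ram_bytes // page_bytes) * struct_page_bytes


RAM = 48 * GIB            # assumption: "48 gig" means 48 GiB
HW_PAGE = 4096            # assumption: 4 KiB hardware pages
STRUCT_PAGE = 40          # assumption: illustrative sizeof(struct page)
CLUSTER_FACTOR = 16       # the compile-time factor mentioned in the minutes

before = core_map_bytes(RAM, HW_PAGE, STRUCT_PAGE)
after = core_map_bytes(RAM, HW_PAGE * CLUSTER_FACTOR, STRUCT_PAGE)

if __name__ == "__main__":
    print(f"without clustering: {before // MIB} MiB")
    print(f"with factor {CLUSTER_FACTOR}: {after // MIB} MiB")
```

Whatever the real struct page size is, the ratio between the two results is the clustering factor, since the per-entry size cancels out.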

 

13. Status Of Device Number Allocation
7 Mar 2003 - 10 Mar 2003 (62 posts) Archive Link: "[PATCH] register_blkdev"
Topics: Access Control Lists, Disks: SCSI, Hot-Plugging, Modems
People: Alan Cox, Linus Torvalds, Christoph Hellwig

In the course of discussion, Christoph Hellwig referred to the fact that Linus Torvalds had said there would be no new allocations of major device numbers; and that the kernel would be migrating to a system of dynamic allocation of device numbers. Alan Cox replied, "No vendor I have spoken to intends to care what Linus thinks about it. Linus tried this in 2.4. We all got together to create a numbering repository instead of letting Linus do it." Linus replied:

I was right, though. Look at how useless the fixed numbers are getting.

I certainly agree that we'll need to open up the number space, but I really do think that the way to _manage_ it is something like what Greg pointed to - dynamic tools with "rules" on allocation, instead of the stupid static manual assignment thing.

We're pretty close to it already. I thought some Linux vendors are already starting to pick up on the hotplugging tools, simply because there are no real alternatives.

And once you do it that way, the static numbers are meaningless. And good riddance.

Alan replied, "Static naming/permissions management is currently simply the best of available evils for many things. With stuff like modem arrays on serial ports it's also necessary to know what goes where. I'm all for moving to setups when possible where things like SCSI volumes carry a volume name and permission/acl data in the label."
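The difference between the two approaches can be shown with a toy model. In the Python sketch below, a driver either asks for a specific major number (static assignment) or passes 0 and takes whatever is free (dynamic allocation). The class, the number range, and the top-down search order are assumptions made for illustration, and the sketch does not reproduce the kernel's actual registration code.

```python
from typing import Dict


class MajorRegistry:
    """Toy registry of major device numbers; not the kernel's implementation."""

    def __init__(self, lowest: int = 1, highest: int = 254) -> None:
        # Assumption: the usable range is illustrative only.
        self._lowest = lowest
        self._highest = highest
        self._owners: Dict[int, str] = {}

    def register(self, name: str, major: int = 0) -> int:
        """Claim a major number; 0 means "pick any free one for me"."""
        if major == 0:
            for candidate in range(self._highest, self._lowest - 1, -1):
                if candidate not in self._owners:
                    major = candidate
                    break
            else:
                raise RuntimeError("no free major numbers")
        elif major in self._owners:
            raise ValueError(f"major {major} already owned by {self._owners[major]}")
        self._owners[major] = name
        return major

    def owner(self, major: int) -> str:
        """Return the name of the driver that holds the given major."""
        return self._owners[major]


if __name__ == "__main__":
    registry = MajorRegistry()
    print(registry.register("static_driver", 8))   # hypothetical fixed number
    print(registry.register("dynamic_driver"))     # whatever is free
```

With dynamic numbers, user space can no longer hard-code which number belongs to which device, which is why Linus points at hotplug tools with allocation rules and why Alan wants names and permissions to travel with the device itself.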

 

14. Lustre Lite 1.0 beta 1 Released
12 Mar 2003 (1 post) Archive Link: "[ANNOUNCE] Lustre Lite 1.0 beta 1"
Topics: Bug Tracking, Networking, POSIX
People: Peter Braam

Peter Braam announced:

Summary
-------

We're pleased to announce that the first Lustre Lite beta (0.6) has been tagged and released. Seven months have passed since our last major release, and Lustre Lite is quickly approaching the goal of being stable, consistent, and fast on clusters up to 1,000 nodes.

Over the last few months we've spent thousands of hours improving and testing the file system, and now it's ready for a wider audience of early adopters, in particular Lustre users on ia32 and ia64 Linux systems running 2.4.19 and Red Hat 2.4.18-based kernels. Lustre may work on other Linux platforms, but it has not been extensively tested there and may require some additional porting effort.

We expect that you will find many bugs that we are unable to provoke in our testing, and we hope that you will take the time to report them to our bug system (see Reporting Bugs below).

Features
--------

Lustre Lite 0.6:

  • has been tested extensively on ia32 and ia64 Linux platforms
  • supports TCP/IP and Quadrics Elan3 interconnects
  • supports multiple Object Storage Targets (OSTs) for file data storage
  • supports multiple Metadata Servers (MDSs) in an active/passive failover configuration (requires shared storage between MDS nodes). Automatic failover requires an external failover package such as Red Hat's clumanager.
  • provides a nearly POSIX-compliant filesystem interface (some areas remain non-compliant; for example, we do not synchronize atimes)
  • aims to recover from any single failure without loss of data or application errors
  • scales well; we have tested with as many as 1,100 clients and 128 OSTs
  • is Free Software, released under the terms and conditions of the GNU General Public License

Risks
-----

As with any beta software, but especially kernel modules, Lustre Lite carries the real risk of data loss or system crashes. It is very likely that users will test situations which we have not, and provoke bugs which crash the system. We must insist that you

BACKUP YOUR DATA

prior to installing Lustre, and that you understand that

we make NO GUARANTEES about Lustre.

Please read the COPYING file included with the distribution for more information about the licensing of Lustre.

Known Bugs
----------

Although Lustre is for the most part stable, there are some known bugs with this current version that you should be particularly aware of:

  • Some high-load situations involving multiple clients have been known to provoke a client crash in the lock manager (bug 984)
  • Failover support is incomplete; some access patterns will not recover correctly
  • Recovery does not gracefully handle multiple services present on the same node
  • Failures can lead to unrecoverable states, which require the system to be umounted and remounted (and, in some cases, nodes may require a reboot)
  • Unmounting a client while an MDS is failed may hang the "umount" command, which will need to be "kill"ed manually (bug 978)
  • Metadata recovery will time out and abort if there are clients which hold uncommitted requests, but which do not detect the death and failover of the MDS. Running a metadata operation on quiescent clients will cause them to join recovery. (bug 957)

Getting Started
---------------

<https://projects.clusterfs.com/lustre/LustreHowto> contains instructions for downloading, building, configuring, and running Lustre. If you encounter problems, you can seek help from others in the lustre-discuss mailing list (see below).

Reporting Bugs
--------------

We are eager to hear about new bugs, especially if you can tell us how to reproduce them. Please visit <http://bugzilla.lustre.org/> to report problems.

The closer that you can come to the ideal described in <https://projects.clusterfs.com/lustre/BugFiling>, the better.

Mailing Lists
-------------

See <http://www.lustre.org/lists.html> for links to the various Lustre mailing lists.

Acknowledgement
---------------

The US government has funded much of the Lustre effort through the National Laboratories. In addition to money they have provided invaluable experience and fantastic help with testing both in terms of equipment and people. We thank them all, but in particular Mark Seager, Bill Boas and Terry Heidelberg's team at LLNL which went far beyond the call of duty, Lee Ward (Sandia) and Gary Grider (LANL), Scott Studham (PNNL). We have also had the benefit of partnerships with UCSC, HP, Intel, BlueArc and DDN and we are grateful to them.

 

We Hope You Enjoy Kernel Traffic
 

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License, version 2.0.