Kernel Traffic #153 For 11 Feb 2002

By Zack Brown

Table Of Contents

Mailing List Stats For This Week

We looked at 3201 posts in 14822K.

There were 753 different contributors. 375 posted more than once. 273 posted last week too.

The top posters of the week were:

1. Maximum Number Of Anonymous Filesystem Mounts In 2.4

17 Jan 2002 - 25 Jan 2002 (27 posts) Archive Link: "2.4.17:Increase number of anonymous filesystems beyond 256?"

Topics: FS: NFS, FS: autofs

People: Andi Kleen, Pete Zaitcev, Rainer Krienke, H. Peter Anvin

Rainer Krienke needed to increase the maximum number of anonymous filesystems in 2.4.17, so he changed the default value of the unnamed_dev_in_use array from 256 to 1024; however, he still hit a ceiling of 800, after which he got errors when trying to mount more NFS filesystems. Andi Kleen felt this was an NFS problem, saying, "Some NFS servers only accept connections from 'secure ports' < 1024. You need one local port per connection. Some ports <1024 are already used by other services. This is a natural limit with secure ports. If you can configure your NFS server to not insist on secure ports (it's usually an export option, unfortunately defaulting to on with many OSes) you can just use any ports." He posted a patch to accomplish this, and added, "Another way, if you wanted to avoid source patching, would be to make sure that different connections go via different source IP addresses; e.g. by defining multiple IP aliases on the server and the client and defining different routes for the server aliases with different local IP addresses as the preferred from address (=from option in iproute2). This way the ports would be unique again because they're for multiple local IP addresses; you would have 800 ports per local IP. Using the patch is probably cleaner though."
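On a Linux NFS server, the export option Andi refers to is spelled `insecure` in /etc/exports (a generic illustration; the path and network here are hypothetical, and option names vary by OS):

```
# /etc/exports: allow this export to accept client source ports >= 1024
/export/home  192.168.0.0/24(rw,insecure)
```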

Pete Zaitcev also replied to Rainer's initial post with a patch, saying that he too had changed unnamed_dev_in_use, but that he'd also had to modify the 'mount' command to accept a "-o nores" argument. He explained, "Initially I did a sysctl, but Trond M. asked for a mount argument, in case you have to mount from several servers, some of which require reserved ports, some do not." He also remarked that anyone needing so many mounted filesystems would probably be better off redesigning their setup. Rainer defended his setup, saying:

At our site we store all user (~4000 users) data centrally on several NFS servers (running Solaris up to now). In order to ease administration we chose the approach of mounting each user directory directly (via automount, configured by NIS) on an NFS client where the user wants to access his data. The most important effect of this is that each user's directory is always reachable under the path /home/<user>. This proved to be very useful (from the administrator's point of view) when moving users from one server to another, installing additional NFS servers etc, because the only path the user knows about and sees when e.g. issuing pwd is /home/<user>. The second advantage is that there is no need to touch the client system: no need for NFS mounts in /etc/fstab to mount the server's export directory, and so there are no unneeded dependencies from any client to the NFS servers.

Now think of a setup where no user directory mounts are configured but the whole directory of an NFS server with many users is exported. Of course this makes things easier for the NFS system, since only one mount is needed, but on the client you need to create link trees or something similar so the user can still access his home under /home/<user> and not something like /home/server1/<user>. Moreover, even if you create link trees, when you issue commands like pwd you see the real path (e.g. /server1/<user>) instead of the logical one (/home/<user>). Such paths are soon written into scripts etc, so that if the user is moved sometime later, things will be broken. You simply lose a layer of abstraction if you do not mount the user's dir directly. The only other solution I know of would be amd. Amd automatically places a link. But since we come from the Sun world, we simply use Sun's automounter, and there were no problems up to now.

As another not uncommon setup you might think of NAS storage in a larger company. Again you have the choice to mount on a per-user basis or to mount the parent directory containing many users. In the latter case you again lose the /home abstraction to some degree.

So I think it would be really good to have at least the option of more than 256 NFS mounts, even if one has to use insecure ports for this purpose.

H. Peter Anvin replied that this "can easily be resolved with vfsbinds. Even Sun has a specific syntax in their automounter to deal with this (server:common_root:tail). If I ever do another autofs v3 release I will probably try to incorporate that via vfsbinds."

2. Latest O(1) Scheduler Patch For 2.5

21 Jan 2002 - 26 Jan 2002 (8 posts) Archive Link: "[patch] O(1) scheduler, -J4"

Topics: Big Memory Support, Big O Notation, Real-Time, SMP, Scheduler

People: Ingo Molnar, Martin J. Bligh, Robert Love, Erich Focht

Ingo Molnar announced:

the -J4 scheduler patch is available:

http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.3-pre2-J4.patch
http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-J4.patch

there are no open/reported bugs, and no bugs were found since -J2. The scheduler appears to be stabilizing steadily.

-J4 includes two changes to further improve interactiveness:

  1. the introduction of 'super-long' timeslices, max timeslice is 500 msecs - it was 180 msecs before. The new default timeslice is 250 msecs, it was 90 msecs before.

    the reason for super-long timeslices is that IMO we now can afford them. The scheduler is pretty good at identifying true interactive tasks these days, so we can increase timeslice length without risking the loss of good interactive latencies. Long timeslices have a number of advantages:

    • nice +19 CPU hogs take up less CPU time than they used to.
    • interactive tasks can gather a bigger 'reserve' timeslice they can use up to do bursts of processing.
    • CPU hogs will get better cache affinity, due to longer timeslices and less context-switching.

    Long timeslices also have a disadvantage:

    • under high load, if an interactive task manages to fall into the CPU-bound hell then it will take longer for it to get the next slice of processing.

    i have measured the pros to beat the cons under the workloads i tried, but YMMV - more testing by more people is needed, comparing -J4's interactive feel (and nice behavior, and kernel compilation performance) against -J2's interactive feel/performance.

  2. slight shrinking of the bonus/penalty range a task can get.

    i've shrunk the bonus/penalty range from +-19 priority levels to +-14 priority levels (from 90% of the full range to 70% of the full range). The reason this can be done without hurting interactiveness is that it's no longer a necessity to use the maximum range of priorities - the interactiveness information is stored in p->sleep_avg, which is not sensitive to the range of priority levels.

    The shrinking has two benefits:

    • slightly denser priority arrays, slightly better cache utilization.
    • more isolation of nice levels from each other. E.g. nice -20 tasks now have a 6-priority-level 'buffer zone' which cannot be reached by normal interactive tasks. nice -20 audio daemons should benefit from this. Also, normal CPU hogs are better isolated from nice +19 CPU hogs, with the same 6-priority-level 'buffer zone'.

    (by shrinking the bonus/penalty range, the -3 rule in the TASK_INTERACTIVE definition was shrunk as well, to -2.)
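The bonus/penalty mechanism Ingo describes can be sketched as a toy model. Everything here is an assumption for illustration (the constant names, the 0..1000 scale of a hypothetical p->sleep_avg, the linear mapping); it is not the actual -J4 code:

```c
/* Toy model of a +/-14 bonus/penalty band around a task's static
 * priority.  MAX_SLEEP_AVG and the linear mapping are assumptions for
 * illustration, not Ingo's actual -J4 implementation. */
#define MAX_SLEEP_AVG  1000   /* hypothetical scale of p->sleep_avg  */
#define PRIO_RANGE       14   /* -J4 shrinks the band to +/-14 levels */

static int effective_prio(int static_prio, int sleep_avg)
{
        /* sleep_avg == MAX_SLEEP_AVG: fully interactive, maximum boost;
         * sleep_avg == 0: pure CPU hog, maximum penalty. */
        int bonus = (2 * PRIO_RANGE * sleep_avg) / MAX_SLEEP_AVG - PRIO_RANGE;
        return static_prio - bonus;   /* lower value = higher priority */
}
```

In this model a fully interactive task ends up 14 priority levels above its static priority and a pure CPU hog 14 below it, matching the shrunken +-14 range described above.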

Changelog:

Martin J. Bligh reported that a fix he'd submitted to Ingo, and that had been accepted, was now partially missing from Ingo's current patch. Without it, his 8-way NUMA box wouldn't boot. They went back and forth on it a bit, and evidently resolved the discrepancy, because Martin later reported:

Measuring the performance of a parallelized kernel compile with warm caches on an 8-way NUMA-Q box. Highmem support is turned OFF so I'm only using the first 1GB or so of RAM (it's much faster without HIGHMEM).

prepare:
make -j16 dep; make -j16 bzImage; make mrproper; make -j16 dep;

measured:
time make -j16 bzImage

2.4.18-pre7

330.06user 99.92system 1:00.35elapsed 712%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (411135major+486026minor)pagefaults 0swaps

2.4.18-pre7 with J6 scheduler

307.19user 88.54system 0:57.63elapsed 686%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (399255major+484472minor)pagefaults 0swaps

Seems to give a significant improvement, not only giving a shorter elapsed time, but also lower CPU load.

3. VM Update And Benchmarks

21 Jan 2002 - 29 Jan 2002 (28 posts) Archive Link: "2.4.18pre4aa1"

Topics: Big Memory Support, Raw IO, Virtual Memory

People: Andrea Arcangeli, Daniel Phillips, Randy Hron

Andrea Arcangeli announced:

This is the first release moving the pagetables in highmem. It only compiles on x86 and it is still a bit experimental. I couldn't reproduce problems yet though. The new pte-highmem patch can be downloaded from here:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.18pre4aa1/20_pte-highmem-6

Next relevant things to do are the non-x86 archs compilation, and I'd like to sort out the vary-IO for rawio and the hardblocksize-O_DIRECT patch.

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.18pre4aa1.bz2
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.18pre4aa1/

Randy Hron posted a Changelog (http://home.earthlink.net/~rwhron/kernel/2.4.18pre4aa1.html) for these patches, as well as some benchmarks (http://home.earthlink.net/~rwhron/kernel/k6-2-475.html) comparing many kernels, including these. Andrea thanked him kindly for putting up those pages. Daniel Phillips took exception to Randy's use of dbench in the benchmarks, saying that it produced "flakey" results. But Andrea said, "dbench has a value. The only problem with dbench is that you can trivially cheat and change the kernel in a broken way, but optimal _only_ for dbench, just to get stellar dbench numbers, but this is definitely not the case with the -aa tree, -aa tree is definitely not optimized for dbench." Daniel insisted that dbench results could vary wildly between identical tests, and were not useful as benchmarks. But Andrea replied, "Anyways dbench tells you mostly about elevator etc... it's a good test to check the elevator is working properly, the ++ must be mixed with the dots etc... if the elevator is aggressive enough. Of course that means the elevator is not perfectly fair but that's the whole point about having an elevator. It is also an interesting test for page replacement, but with page replacement it would be possible to write a broken algorithm that produces good numbers, that's the thing I believe to be bad about dbench (oh, like tiotest fake numbers too of course). 
Other than this it just shows rmap12a has an elevator not aggressive enough which is probably true, I doubt it has anything to do with the VM changes in rmap (of course rmap's significant design overhead is helping to slow it down too though), more likely the bomb_segments logic from Andrew that Rik has included, in fact the broken page replacement that leaves old stuff in cache might generate more unfairness that should generate faster dbench numbers for rmap, but on this last bit I'm not 100% sure (AFAIK to get a fast dbench by cheating with the vm you need to make sure to cache lots of the readahead as well (also the one not used yet), but I'm not 100% sure on the effect of leaving old pollution in cache rather than recycling it, I never attempted it)." Daniel was impressed with this analysis, and added, "It's a hint at how hard the elevator problem really is." But he was still suspicious of dbench results.

Elsewhere, even Randy agreed that "dbench results are not perfectly repeatable. I agree that dbench results that vary by 20% or so may not be meaningful. I think dbench is of some value though. In some cases the difference between kernels is 200% or more." He posted some numbers, showing the discrepancies between identical runs, and also showing that the fundamental conclusions were unchanged in spite of that.

4. VM Subsystem Documentation

23 Jan 2002 - 25 Jan 2002 (12 posts) Archive Link: "White Paper on the Linux kernel VM?"

Topics: Virtual Memory

People: Dave Jones, Denis Vlasenko, Daniel Phillips, Olaf Dietsche, Andrea Arcangeli, Rik van Riel, Derek Glidden

D. L. Beaman asked where to find docs on Andrea Arcangeli's new VM code. David Beckett gave a link to a doc by Derek Glidden (http://www.nks.net/linux-vm.html). Rik van Riel gave a link to a lecture on the old 2.4 VM (http://surriel.com/lectures/). Dave Jones remarked, "probably the closest thing to documentation for the newer 2.4 VM is in some slides from Andrea.. ftp://ftp.suse.com/pub/people/andrea/talks/2001/pluto-dec-pub-0.tar.gz. Some bits may have changed a little since those slides were done, (especially in -aa VM) but for the most part they make interesting reading."

Elsewhere, Denis Vlasenko said that the source code was all the documentation anyone should need. He added, "Writing docs wastes developer's time: they will write how they want VM to operate or how they think it operates (while some bug can make actual VM operate differently) instead of improving/debugging current VM code." Daniel Phillips replied:

No you're wrong. Not writing docs wastes the time of other developers. It sends the message 'my time is more important than yours'. In the case of Linus, that may be true, but it's very definitely not true for any other core developer.

It is the responsibility of each developer to prepare at least minimal - but sufficient - documentation for the work they do. Others who specialize in preparing documentation can then use that material as a starting point for preparing clearer, more extensive documentation. But if core developers shirk their responsibility in this area, we will continue to suffer from a chronic shortage of good, current kernel documentation.

Olaf Dietsche added, "I don't want to judge whether writing docs wastes developers' time. But when I first tried understanding file systems, it was of tremendous help having docs like Documentation/filesystems/vfs.txt and the like. I couldn't have done it without them."

5. New Features On kernel.org

24 Jan 2002 - 25 Jan 2002 (7 posts) Archive Link: "If you haven't seen it yet..."

People: H. Peter Anvin, Erik Andersen

H. Peter Anvin announced, "If you haven't noticed yet, there are two new links on the kernel.org (http://kernel.org) front page: V and VI (View and View Incremental.) Both take you to a diff viewer application and display the full diff and incremental diff (when available), respectively." Erik Andersen replied, "Wonderful, thanks! Adding diffstat output would also be very nice so I can easily see if the stuff I'm messing with is or isn't changed."

6. Catching Boot Messages

24 Jan 2002 - 26 Jan 2002 (11 posts) Archive Link: "Linux console at boot"

People: George Bonser, Keith Owens

George Bonser wanted to pause the console during boot, so he could take note of error messages before they scrolled by. He couldn't wait until after boot, because his box was crashing during startup. Folks suggested using shift-pgup to scroll back to the earliest text; using ctrl-s and ctrl-q to pause and unpause the display; changing the source to add a delay between each console message; and taking photographs of the screen.

Keith Owens and others said that using a serial console was really the best bet.
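For reference, pointing the kernel at a serial console is a boot-parameter matter. A standard example (not taken from this thread) from the kernel's serial-console documentation:

```
console=ttyS0,115200n8 console=tty0
```

With that on the kernel command line, boot messages go to the first serial port at 115200 baud (and still to the VGA console), so they can be captured verbatim on a second machine.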

7. Tuning Scheduler Parameters At Run-Time

24 Jan 2002 - 25 Jan 2002 (4 posts) Archive Link: "[PATCH]: O(1) 2.4.17-J6 tuneable parameters"

Topics: Big O Notation, FS: sysfs, Scheduler

People: Jack F. Vogel, Ingo Molnar

Jack F. Vogel posted a patch, and said to Ingo Molnar (author of the new O(1) scheduler), "Our group has a customer request for a scheduler that will give them some tuneable parameters, and your changes have actually had some parameters change through the deltas you've made. It seemed like it might be useful to take them and make them tweakable on a running system. I am not 100% convinced of the goodness of this, but I wanted to submit it for your consideration. The current code performs great btw, thanks for all your hard work." Ingo replied that it would slow the code down too much to include run-time tuneable parameters, but that as a development tool that patch could be useful. He said:

i'd suggest to name the /proc/sys/sched/ values the same way the constants are called, e.g. /proc/sys/sched/CHILD_FORK_PENALTY. This makes it easier to communicate suggested parameter changes.

i have a script that dumps the current sched-parameters state:

[root@mars root]# ./getsched
echo 95 > /proc/sys/kernel/CHILD_FORK_PENALTY
echo 3 > /proc/sys/kernel/EXIT_WEIGHT
echo 3 > /proc/sys/kernel/INTERACTIVE_DELTA
echo 200 > /proc/sys/kernel/MAX_SLEEP_AVG
echo 30 > /proc/sys/kernel/MAX_TIMESLICE
echo 100 > /proc/sys/kernel/PARENT_FORK_PENALTY
echo 70 > /proc/sys/kernel/PRIO_BONUS_RATIO
echo 60 > /proc/sys/kernel/PRIO_CPU_HOG_RATIO
echo 20 > /proc/sys/kernel/PRIO_INTERACTIVE_RATIO
echo 200 > /proc/sys/kernel/STARVATION_LIMIT

the script is very simple:

cd /proc/sys/kernel

for N in *[A-Z]*; do echo "echo "`cat $N`" > /proc/sys/kernel/$N"; done

otherwise our approach is identical. This patch would always stay separate, but could be readily applied by people who want more control over the scheduler for development or whatever other reasons.

Jack liked this, and they took the discussion off-line.

8. Doing 64-Bit Arithmetic In Kernel Modules

24 Jan 2002 - 30 Jan 2002 (12 posts) Archive Link: "unresolved symbols __udivdi3 and __umoddi3"

Topics: Assembly

People: Tim Schmielau, Simon Haynes, Daniel Phillips, Andreas Dilger, Jeff Garzik, Richard B. Johnson, Horst von Brand

Simon Haynes needed to do 64-bit arithmetic in a module he was working on, but the code to allow this was in libc, which can't be linked from the kernel. Andreas Dilger and Richard B. Johnson suggested just doing left and right bit-shifts to accomplish multiplication and division, and Tim Schmielau added, "If 64-bit arithmetic cannot be avoided, the do_div64() macro defined in include/asm/div64.h comes in handy." Simon replied that bit-shifting wouldn't work for him, because he was dealing with numbers that were not restricted to powers of 2. The div64 solution seemed to work well, but he added:

I have looked at this header file and I do not understand the asm syntax.

In particular, the only x86 div instruction I know returns only a 32-bit result. Because I don't understand the div64 header, I cannot see how a 64-bit result is calculated.

Tim and Daniel Phillips replied that this actually couldn't be done with that macro, because of platform-dependent restrictions. Tim said, "I think do_div(x,y) should work for all platforms if y < 2^16 and x/y < 2^32, but I may stand corrected. Actually, Momchil Velikov just pointed out that some archs only do 32 bit divs in do_div64, where at least the C code from asm-parisc/div64.h should be used." And Daniel added, "64bits/32bits = 64bits is easily calculated with two 64/32 hardware divides, in assembly."
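Daniel's observation (a 64/32 = 64-bit divide built from two narrower hardware divides) can be sketched in portable C. The function name and the do_div()-style calling convention (quotient left in *n, remainder returned) are ours for illustration; real kernel code should use the actual macro from include/asm/div64.h:

```c
#include <stdint.h>

/* Divide a 64-bit value by a 32-bit divisor: the quotient is left in
 * *n, the remainder is returned.  Each of the two C divisions below
 * maps to one 64/32 hardware divide on x86, because in both cases the
 * quotient is guaranteed to fit in 32 bits. */
static uint32_t div64_32(uint64_t *n, uint32_t base)
{
    uint32_t high = (uint32_t)(*n >> 32);
    uint64_t rem  = *n & 0xffffffffULL;
    uint64_t quot = 0;

    if (high) {
        quot = (uint64_t)(high / base) << 32;  /* upper-half divide    */
        rem |= (uint64_t)(high % base) << 32;  /* carry remainder down */
    }
    *n = quot | (rem / base);                  /* lower-half divide    */
    return (uint32_t)(rem % base);
}
```

The second divide is safe because rem < base * 2^32, so its quotient always fits in 32 bits.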

Simon had also, in the same post, asked if there were any documentation available on that code, and Daniel said, "It's painful, isn't it? And no, I don't know where it's documented." But Mark Zealey dredged from the misty past, and found GCC_INLINE_ASM_HOWTO (http://pkl.net/~mark/GCC_INLINE_ASM_HOWTO) and rmiyagi-inline-asm.txt (http://pkl.net/~mark/rmiyagi-inline-asm.txt). Momchil Velikov also gave a link to the spot in Using and Porting the GNU Compiler Collection (http://gcc.gnu.org/onlinedocs/gcc-3.0.3/gcc_5.html#SEC103) that covered it as well, although Daniel wondered if there were any supplemental docs covering the assembly-language instructions themselves. Momchil and Horst von Brand pointed him to Intel's reference manuals, but Daniel replied:

I was afraid somebody would say that.

I need look no further than the shelf at my right hand for a full set of Pentium documentation. I do not consider that an adequate substitute for a document expressing the syntax of all machine instructions of a particular architecture in the GNU syntax.

The only excuse I can think of for not having such a document is "we're all so busy we couldn't write it, please use the Intel documentation". Please don't suggest that we have no need for our own documentation, written in a form familiar to us.

Jeff Garzik suggested writing a script to extract the needed information from the data provided in binutils, and dump that into a document of some kind, but there was no reply.
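To give a concrete taste of the syntax Daniel found undocumented, here is a minimal GNU inline-assembly example (our own illustration, x86-only, not from the thread): a single `divl` instruction dividing the 64-bit value in edx:eax by a 32-bit operand, which is exactly the hardware primitive behind Daniel's two-divide construction. Note it faults if the quotient overflows 32 bits, so the caller must guarantee hi < base:

```c
#include <stdint.h>

/* 64/32 -> 32-bit divide via one x86 "divl".
 * Constraint "a" pins an operand to eax, "d" to edx; "rm" lets the
 * compiler place the divisor in a register or in memory.
 * Precondition: hi < base, otherwise the CPU raises a divide fault. */
static uint32_t div_64_by_32(uint32_t hi, uint32_t lo,
                             uint32_t base, uint32_t *rem)
{
    uint32_t quot;
    __asm__("divl %4"
            : "=a" (quot), "=d" (*rem)        /* outputs: eax, edx  */
            : "a" (lo), "d" (hi), "rm" (base)
            : "cc");                          /* flags are clobbered */
    return quot;
}
```

The constraint letters ("a", "d", "rm") and the output/input/clobber sections are the parts of the syntax the HOWTOs linked above explain.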

9. Linus Gives BitKeeper A Test Run

28 Jan 2002 - 6 Feb 2002 (397 posts) Archive Link: "A modest proposal -- We need a patch penguin"

Topics: Compression, FS: ext3, Forward Port, Hot-Plugging, Networking, PCI, USB, Version Control

People: Rob Landley, Linus Torvalds, Rik van Riel, Greg KH, Alan Cox, Larry McVoy, Ingo Molnar, Georg Nikodym, Eli Carter, Alexander Viro, Tom Rini, Dave Jones, Jeff Garzik, Kai Germaschewski, Alexey Kuznetsov, Andrew Morton, Ben Collins

Rob Landley proposed the creation of a new office in kernel development: the "patch penguin", whose duties would be to "make Linus's life easier by doing what Alan Cox used to do, and what Dave Jones is unofficially doing right now. (In fact, I'd like to nominate Dave Jones for the office, although it's Linus's decision and it would be nice if Dave got Alan Cox's blessing as well.)" He added that this was needed because "Linus doesn't scale, and his current way of coping is to silently drop the vast majority of patches submitted to him onto the floor. Most of the time there is no judgement involved when this code gets dropped." A number of people agreed with this assessment and the proposal, and there was some discussion about it. But Linus Torvalds said:

One "patch penguin" scales no better than I do. In fact, I will claim that most of them scale a whole lot worse.

The fact is, we've had "patch penguins" pretty much forever, and they are called subsystem maintainers. They maintain their own subsystem, ie people like David Miller (networking), Kai Germaschewski (ISDN), Greg KH (USB), Ben Collins (firewire), Al Viro (VFS), Andrew Morton (ext3), Ingo Molnar (scheduler), Jeff Garzik (network drivers) etc etc.

If there are problems with certain patches, it tends to be due to one or more of:

In short, if you have areas or patches that you feel have had problems, ask yourself _why_ those areas have problems.

A word of warning: good maintainers are hard to find. Getting more of them helps, but at some point it can actually be more useful to help the _existing_ ones. I've got about ten-twenty people I really trust, and quite frankly, the way people work is hardcoded in our DNA. Nobody "really trusts" hundreds of people. The way to make these things scale out more is to increase the network of trust not by trying to push it on me, but by making it more of a _network_, not a star-topology around me.

In short: don't try to come up with a "patch penguin". Instead try to help existing maintainers, or maybe help grow new ones. THAT is the way to scalability.

There was a ton of discussion on this, but the main highlight of the thread was the attention given to BitKeeper (http://www.bitkeeper.com/). At one point, Rik van Riel remarked in a different context, "bitkeeper is a really nice tool to help me carry the patch from version to version." Elsewhere but shortly thereafter, Linus said in a completely different subthread:

One thing intrigued me in this thread - which was not the discussion itself, but the fact that Rik is using bitkeeper.

How many other people are actually using bitkeeper already for the kernel? I know the ppc guys have, for a long time, but who else is? bk, unlike CVS, should at least be _able_ to handle a "network of people" kind of approach.

Greg KH replied:

I am, for the USB and PCI hotplug stuff:

http://linuxusb.bkbits.net/

It makes tracking what patches got applied, and which didn't, and forward porting those that didn't to the next release, a breeze.

My trees are world readable, and people are welcome to send me patches against it, or even bitkeeper changesets, but I have yet to receive one of those :)

Alan Cox asked, "can bitkeeper easily be persuaded to take "messages" back all the way to the true originator of a change. I.e. if a diff gets to Linus he can reject a given piece of change and, without passing messages back down the chain, ensure they get the reply as to why it was rejected, and even if nobody filled anything in, that it was looked at and rejected by xyz at time/date?" Larry McVoy (BitKeeper creator) replied:

It's certainly possible and there are changes we could make to make it more useful. Right now, there is no record of a change if it goes and then gets rejected right back out; it's as if you patched and then you did a reverse patch.

The good news is that each change (patch) has an identifier, they look like

awc@bitmover.bitmover.com|ChangeSet|20011230212716|39200

and if we kept a record of those that were rejected, it would be trivial for a developer to track whether his change was in/not seen/rejected.

Slightly elsewhere and down the road, but ultimately also in Linus' same subthread, Rik put in:

Bitkeeper also seems to have some problems applying out-of-order changesets or applying them partially.

Changesets sent by 'bk send' are also much harder to read than unidiffs ;)

I think for bitkeeper to be useful for the kernel we really need:

  1. 'bk send' format Linus can read easily
  2. the ability to send individual changes (for example, the foo_net.c fixes from 1.324 and 1.350) in one nice unidiff
  3. the ability for Linus to apply patches that are slightly "out of order" - a direct consequence of (2)

Larry said that item 1 was already done, though it wasn't the default. He felt that item 3 was really the main problem, and he wanted to discuss it. He said:

There are two issues, one is out of order and the other is what we call "false dependencies". I think it is the latter that bites you most of the time. The reason you want out of order is because of the false dependencies, at least that is my belief, let me tell you what they are and you tell me.

Suppose that you make 3 changes, a driver change, a vm change, and a networking change. Suppose that you want to send the networking change to Linus. With patch, you just diff 'em and send and hope that patch can put it together on the other end. With BK as it stands today, there is a linear dependency between all three changes and if you want to send the networking change, you also have to send the other 2.

How much of the out-of-order stuff goes away if you could send changes out of order, as long as they did not overlap (touch the same files)?

Tom Rini thought this would help in a lot of cases, and Larry explained his corporate requirements:

here's the deal on this topic. The out of order thing is a much bigger deal for a large open source project than it is for our commercial customers. It's also hard to do correctly, it would take at least a month. Linus has flirted with using BitKeeper multiple times and then gotten distracted. If we had dropped everything and fixed the issues he needs fixed rather than the issues our commercial customers need fixed, we'd be out of business and you'd have the lovely task of trying to make this work in the BitKeeper source. You can believe me or not, but you'd have very little chance of getting it to work, the BK source base is a lot larger and more complex than the generic part of the Linux kernel.

If enough of the kernel people start using BitKeeper and demanding the out of order stuff, we'll do it. The lead time is about a month, so you have to deal with that. On the other hand, if this turns into yet another kernel "she loves me, she loves me not, she loves me, she loves me not" about BK, we're not going to fix the out of order stuff until it is the most important issue for our commercial customers.

Hopefully, you'll take this in the spirit in which it is intended. We want to help out the kernel team, that was the reason for writing BitKeeper. We have to survive, however, and that means paying attention to the commercial needs as well. If/when it looks like Linus & Co are serious about using BK, I'll staff a couple of people to address the out of order stuff and commit to a delivery date. It's clear that it is the biggest "gotcha" about BK & the kernel work flow.

Ingo Molnar also replied to Larry's question, asking in turn, "could this be made: 'as long as they do not touch the same lines of code, taking 3 lines of context into account'? (ie. unified diff definition of 'collisions' context.)" Rik thought this would be excellent, and would actually fix a problem he experienced with BitKeeper. But Larry said to Ingo, "What you described is diff/patch. We have that already and if it really worked in all the cases there would be no need for BitKeeper to exist. I'll be the first to admit that BK is too pedantic about change ordering and atomicity, but you need to see that there is a spectrum and if we slid BK over to what you described it would be a meaningless tool, it would basically be a lot of code implementing what people already do with diff/patch." Ingo replied with a concrete example:

i sent 8 different scheduler update patches 5 days ago:

[patch] [sched] fork-fix 2.5.3-pre5
[patch] [sched] yield-fixes 2.5.3-pre5
[patch] [sched] SCHED_RR fix, 2.5.3-pre5
[patch] [sched] set_cpus_allowed() fix, 2.5.3-pre5
[patch] [sched] entry.S offset fix, 2.5.3-pre5.
[patch] [sched] cpu_logical_map fixes, balancing, 2.5.3-pre5
[patch] [sched] compiler warning fix, 2.5.3-pre3
[patch] [sched] unlock_task_rq() cleanup, 2.5.3-pre3

these patches, while many of them are touching the same file (sched.c) are functionally orthogonal, and can be applied in any order. Linus has applied all of them, but he might have omitted any questionable one and still apply the rest.

how would such changes be expressed via BK, and would it be possible for Linus to reject/accept an arbitrary set of these patches?

Larry replied:

There is a way to do this in BK that would work just fine. It pushes some work back onto the developer, but if you are willing to do it, we have no problem doing what you want with BK in its current form and I suspect that the work is very similar to what you are already doing.

In your list above, all of the patches are against 2.5.3-pre5. If you did the work for each patch against the 2.5.3-pre5 baseline, checking it in, saving the BK patch, removing that changeset from the tree, and then going on to the next change, what you'd have is exactly the same set of patches as you have now. Literally. You could type the appropriate command to BK and you should be able to generate a bit-for-bit identical patch.

In BK's mind, what you have done is to make a very "bushy" set of changes, they are all "side by side". If you think of a graph of changes, you started with

[older changes]
v
[2.5.3-pre4]
v
[2.5.3-pre5]

and then you added one change below that, multiple times. If you were to combine all of those changes in a BK tree, it would look like

[older changes]
v
[2.5.3-pre4]
v
[2.5.3-pre5]
[sched1] [sched2] [sched3] [sched4] [sched5] [sched6] [sched7]

and BK would be happy to allow you to send any subset of the sched changes to Linus and it would work *exactly* how you want it to work. If we could get people to work like this, there are no problems. Just to make it really clear you would do this

for p in patch_list
do
    bk vi sched.c        # that will lock it if it isn't
    hack, debug, test
    bk citool            # check it in and make a changeset
    bk makepatch -r+ > ~/bk-patches/patch.$p
    bk undo -fr+         # get back to the same baseline
done

Here's what people actually do. They make the first change, then make the second change in the same tree that already has the first change, and so on. BitKeeper faithfully records the linear sequence of changes and enforces that the changes propagate as that linear sequence. You can skip some at the end but you can't skip any in the middle.

In your particular case, we really need out of order changesets to allow the second type of work flow and cherry picking. However, a fairly common case is that the changes are all in unrelated files and *even then* BitKeeper enforces the linearity. That's the problem I think we need to fix first, it's not a complete solution, but it is the 80-90% solution. Until we give you a 100% solution, you have to realize that you are making side by side changes and actually do it that way.

If any of this is not clear to anyone, please speak up and I'll try and draw some pictures and add some explanation.

Linus had some criticisms, but proposed a case in which he would test out BitKeeper in preparation for adopting it for the long term. He said:

The thing is, bk should be able to do this on its own.

The rule on check-in should be: if the resultant changeset can be automatically merged with an earlier changeset, it should be _parallel_ to that changeset, not linear.

And note the "automatically merged" part. That still guarantees your database consistency at all levels.

Let us assume that you have a tree that looks like

        a -> b -> c

together with modifications "d". Now, "bk ci" (or, because this is more expensive than a plain "ci", you can call it "bk backmerge" or something, and all sane people will just stop using "ci") should instead of doing a plain

        a -> b -> c -> d

it would see how far back it can go with an automatic merge and add "d" at the _furthest_ point possible. So let's assume that "d" really cannot be merged with "b" but doesn't clash with "c", so what you'd create with "bk backmerge" is the "maximally parallel" version:

        a -> b -> c
                \
                 > d

Now, you'll say that this is exponentially expensive to do, and I agree. But CPU-time is cheap, and notice that this "backmerge" would be done not in one centralized location, but in parallel at different developers.

(Yeah, my example may look "cheap", but if you do exclusively backmerging, then after a while your trees will have lots of very wide development, and the more interesting backmerge is when you already have

        a -> b -> c -> f

                -> d

                -> e

and you're adding "g", and it doesn't merge with "c", but also not with "d" and "e", so you have to test every path backwards to see where you can push it. In this case you'd end up with

        a -> b  -> c -> f

                     -> g

                -> d

                -> e

kind of tree.)

Just how expensive would that be? I'd be willing to make my machine sweat a bit, if it would just automatically generate the most parallel changesets possible.. And I bet you could do _this_ fairly easily if you didn't care too much about trying to be too efficient.

You're saying your merges are perfect. USE that knowledge, and make it do "bk backmerge".

(Once you do backmerge, the false dependencies no longer exist, AND suddenly "bk" actually gives people information that they didn't have before: the revision history actually tells you the _true_ dependencies in the tree, not some stupid "this is the order in which things were checked in" stuff that doesn't add any value that can't be seen from the changeset directly)

Larry, give me that, and I'll promise to use bk exclusively for two months to shake it out..
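Linus' placement rule (slide a new changeset back past every changeset it can automatically merge with, and attach it at the furthest point) can be sketched roughly as follows. This is purely illustrative Python, not BitKeeper code, and "merges automatically" is crudely approximated here as "touches no common files":

```python
def backmerge_parent(chain, new_files):
    """Sketch of the 'bk backmerge' idea, not real BitKeeper logic.

    chain: list of (changeset_name, files_touched) from oldest to newest.
    new_files: set of files the new changeset touches.
    Returns the name of the changeset the new one should hang off.
    """
    idx = len(chain)                      # candidate position: after the tip
    while idx > 1 and not (chain[idx - 1][1] & new_files):
        idx -= 1                          # no clash, so slide one step back
    return chain[idx - 1][0]

# Linus' example: a -> b -> c, with "d" clashing with b but not with c.
chain = [("a", {"x"}), ("b", {"y"}), ("c", {"z"})]
print(backmerge_parent(chain, {"y"}))     # "b": d becomes parallel to c
```

With a real merge test in place of the file-overlap check, a change that depends on nothing slides all the way back, which is exactly the "maximally parallel" tree Linus describes.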

Georg Nikodym was impressed with Linus' analysis, but pointed out that 'g' in BitKeeper's world was not just a change, but was:

actually a label that implies a point in time as well as all the change that came before it. Stated differently, it is a set.

Using your graph:

        a -> b -> c -> f

                -> d

                -> e

and the way that people currently think (and thus speak) of these things, saying that you're using a version 'e' kernel is ambiguous because it may or may not have 'c' or 'd'. This ambiguity also complicates the task of reproducing a tree at some known state later.

You, as the center of the known universe, may not need to concern yourself with such trifles, but speaking as one of those fabled commercial customers, the ability to unambiguously say what's been changed (read: fixed) is really important.

All that said, I like your idea about revision graph compression. My concerns might simply be mitigated by being able to insert "release" points (simply a tag that implies/includes all the preceding changesets).

Linus replied:

I disagree.

"g" is really just one thing: it is "the changes".

However, you are obviously correct that any changes are inherently dependent on the context those changes are in. And there are multiple contexts.

One context is simply the "when in time" context, which is interesting, as when you serialize all changesets you see the time-wise development. And this is really the _only_ context that BK (and most other source control tools) really think about.

However, in another sense, the "when in time" context is completely meaningless. Should you care what the _temporal_ relationship between two independent patches is? I say "Absolutely not!". The temporal relationship only hides the _real_ revision information, which is what the patch actually depends on.

And note that my suggestion to have a "bk backmerge" does not remove the temporal relationships. All changesets already have a timestamp (they clearly have to have it, just so that you can see when they happened and say "what did this tree look like one month ago?"). So we already _have_ the temporal information, and encoding that temporal information into the "relationship" information actually ends up costing you quite dearly.

I'd say that most (maybe all) of the complaints about not being able to apply changesets in any random order comes exactly from the fact that developers _know_ that temporal relationships are often not relevant. From a development standpoint, temporal relationships are only relevant if they match the _causal_ relationships, and even then you can often find patches that are _caused_ by previous patches, but that are valid on their own (and can indeed be more important, or even completely obviate the need for the original patch).

So what I'm saying is that from a patch revision standpoint, temporal information is useless. You still have enough information to recreate the tree "at time 'g'" by just doing the equivalent of "bk get -c<date-of-g>".

Regarding the need to know exactly what's been changed, Linus added:

you don't lose that ability. You still have all the information you used to have, you just have even _more_ information. You have the information on not just when the change was checked in (temporal), you have the information on what files/changes it really _depended_ on.

Now, after those arguments, I'll just finish off with saying that I actually agree with you to some degree - as I already said in private email to Larry, I would definitely also want to have a way of stopping back-merging at some point. In particular, when I'd tag a release (i.e. something like Linux-2.5.3), I would potentially also want to set a "backmerge marker tag" which basically tells the backmerge logic that it's not worth it to try to backmerge past that version tag.

That would decrease the chance of confusion considerably, and it would also avoid the exponential complexity problem. Let's face it, exponential algorithms can be really useful, but you do want to have some way of making sure that the bad behaviour never happens in practice. A way of limiting the set of changelogs to be considered for backmerging not only means that humans don't get as confused, the computer also won't have to work insanely to try to go back to Linux version 0.01 if a small patch happened to apply all the way back.

Eli Carter gave a hypothetical situation, in which a "Changeset adds driver for device Q. Now, let's further suppose that in 2.5.x we have perfect modularity for drivers and at that time Q is added... we just add a directory, linux/drivers/Qdrv. It won't conflict with 2.5, 2.4.x, 2.2.x, etc.. However, because 2.2.x does not have the hooks necessary to see it, Q won't work on those kernels. There is a design-level dependency in changeset q." Linus replied:

Not so hypothetical, and entirely true. Which is, why I suggest that such "deep merging" wouldn't actually go past tags.

Let's face it, the source control tool cannot know what works and what doesn't. The only thing it can ever know about is "what applies". It can take the approach that everything only applies to the last branch - which is the traditional approach, but which introduces dependencies that simply aren't there, and that _cannot_ be cut. This is what bk does now.

But the other approach is to say "whatever applies, applies, and as long as I don't lose revision information I'll move it backwards". That has other problems (as you point out), but now those problems are manageable, and have solutions.

I'd rather take the problem that can be solved over the problem that cannot.

The fact is, development _often_ is in the situation where somebody does a quick-and-dirty fix, and then only later goes in and digs deeper and does the right fix that makes the original fix completely unnecessary.

The way BK works now, if we call the quick-and-dirty fix "A", and the real fix "B", the developer has a really hard time just sending "B" to me. He'd have to re-clone an earlier BK tree, re-do B in that tree, and then send me the second version.

I'm suggesting that he just send me B, and get rid of his tree. There are no dependencies on A, and I do not want _crap_ in my tree just because A was a temporary state for him.

Larry replied that this would lose useful information, but Linus said, "No. If the useless crap ends up hiding the real points in the revision history, getting rid of crud is _good_. Remember how I asked for a way to "batch up" revision history? Same issue. Crap is crap, and it more often hides the real issues than enlightens anything at all."

The discussion skewed off toward flames at this point, but a little ways further down the subthread, Alexander Viro said:

I can't speak for Linus, but my main problem with BK is similar to what you'd described. Here's what I'm usually doing and what I'd like to be able to do with BK:

Suppose I have 5 deltas - A, B, C, D, E. I want to kill A.

I add a branch that consists of B' (B backported to original) and ABB'^{-1}. It joins the original at AB.

I backport C to B'. Now I've got B', C', ABC(B'C')^{-1}. Again, it joins the original branch.

Repeat for D and E. Now I've got the following picture (apologies for BUAG):

* -B'-> * -C'-> * -D'-> * -E'-> *
|                              /
A                            crap
V                            V
* -B-> * -C-> * -D-> * -E-> *

_Now_ I change the direction of the last arrow. Yes, it's more or less reverted A. And now I want to consider the top branch as the main history.

IOW, what I want is the ability to play revisionist. And it's not limited to removing patches - if I've found a bug in A, I want to be able to add A-fix and move it all the way back to right after A. And merge them. B, C, D and E might have changed from that, but that's what I want. Moreover, I might have some junk left in the end (i.e. ABCDEA-fix == (AA-fix)B'C'D'E'noise) and I'd really like to be able to say that (AA-fix)B'C'D'E' is the main history now and the other path (ABCDE A-fix noise^{-1}) is buried.

If you can give a way to do that - I'm happy.
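Alexander's request, two converging paths through the revision graph with one marked as the preferred history, can be modeled with a toy graph. Everything here (the RevGraph class and its method names) is invented for illustration and is not BitKeeper's actual data model:

```python
class RevGraph:
    """Toy revision graph allowing converging branches, one preferred."""

    def __init__(self):
        self.edges = {}        # node -> list of (parent, delta_name)
        self.preferred = {}    # node -> index of the preferred parent edge

    def add(self, node, parent, delta, prefer=False):
        """Record that `delta` leads from `parent` to `node`."""
        self.edges.setdefault(node, []).append((parent, delta))
        if prefer or node not in self.preferred:
            self.preferred[node] = len(self.edges[node]) - 1

    def main_history(self, node):
        """Follow preferred parents to the root; deltas oldest-first."""
        deltas = []
        while node in self.edges:
            parent, delta = self.edges[node][self.preferred[node]]
            deltas.append(delta)
            node = parent
        return list(reversed(deltas))

# Viro's picture: the old path A,B,C,D,E and the revised path
# B',C',D',E' both end at the same tip; the new one is preferred.
g = RevGraph()
for node, parent, delta in [("n1", "root", "A"), ("n2", "n1", "B"),
                            ("n3", "n2", "C"), ("n4", "n3", "D"),
                            ("tip", "n4", "E")]:
    g.add(node, parent, delta)
for node, parent, delta in [("m1", "root", "B'"), ("m2", "m1", "C'"),
                            ("m3", "m2", "D'")]:
    g.add(node, parent, delta)
g.add("tip", "m3", "E'", prefer=True)

print(g.main_history("tip"))   # the revisionist history, without A
```

The old path stays in the graph (replication keeps working), but anyone reading the main history sees A edited out, which is exactly the "two sets of changesets with the same composition" Viro asks for below.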

Larry didn't quite get what Alexander was saying, and Alexander tried again:

I don't want A (or entire old path) to disappear. What I want is ability to have two paths leading to the same point + ability to mark one of them as "more interesting".

I.e. the result I want is _two_ sets of changesets with the same compositions.

And _that_ is compatible with replication - I simply want the new path in revision graph to be propagated. Along with the "this path is more interesting" being written on the new one.

Can that be done? I want a way to re-split the set of deltas. I'm perfectly happy with the old one staying around, as long as we remember that the results of old and new are the same object and that new is the preferred way to look at the damn thing.

I suspect that it could be doable with something as simple as "if you ask to merge two identical nodes, I'll just mark them as such and ask which is more interesting". IIRC, BK doesn't require tree topology on the graph - it can have converging branches.

_If_ that is feasible - the rest can be scripted around it.

Larry replied:

Ahh, you want LODs. And they neatly solve the problem you described. And a bunch of others. Think of a LOD as a revision history graph. Imagine being able to create a new, empty (or partially populated) "container". That container is a LOD. You can do set operations from one LOD to the other. They are a lot like branches except that they themselves can branch & merge.

The way that we'd do what you wanted is you'd create a new LOD, stick B, C, D, E into it, and make it the default LOD in your repository.

LODs have some very nice attributes - each change is a set element, the LOD is nothing more than a recorded history of what set elements are in this LOD, and you can cherry pick from one LOD to the other. Out of order, sparsely, whatever.

The only restriction is that you have to have all the changes in your graph. There is no concept of a sparse graph. You can trim off stuff that happens after some point but you can't remove points in the middle, even if they are in the other LOD. Is that OK?

Linus first sounded like he'd accept this as an answer and then later it fell out of favor because even though he could hide a bad changeset in another LOD, he didn't want it in the graph at all. I don't know how to do that.

The other gotcha is that LODs are only partially implemented and are going to stay that way until we achieve consensus on how BK should work for you.
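Treating a LOD as nothing more than a recorded set of changesets, as Larry describes, makes the cherry-picking and the "no sparse graph" restriction easy to picture. This is only a set-based caricature (the names below are invented), not how LODs were actually stored:

```python
# The full graph: every changeset must live here; there is no sparse graph.
all_changes = {"sched1", "sched2", "sched3", "sched4", "sched5"}

# Viro's scenario: a new LOD containing everything except the change
# he wants to bury.
main_lod = all_changes - {"sched1"}

# Cherry-picking from one LOD to another is just set union, and it can
# be done out of order and sparsely:
linus_lod = {"sched2", "sched4"}   # Linus accepts a subset
linus_lod |= {"sched5"}            # later picks one more, out of order

# Larry's one restriction: a LOD's members all come from the full graph.
assert main_lod <= all_changes
assert linus_lod <= all_changes
```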

Elsewhere, in a "Linus should use BitKeeper" subthread, Linus said, "The thing is, I actually _want_ to use BK (as opposed to CVS, which I really don't think cuts it)." And again elsewhere, he remarked:

right now I'm trying to see if it makes any difference if I try to use BK for a month or two. I seriously doubt it will really "fix" everything, but neither do I think big re-organizations and patch-lists will. But I'd be stupid if I wasn't willing to try something.

(So far, trying out BK has only meant that I have spent _none_ of my time merging patches and reading email, and most of my time writing helper scripts and emails to Larry to make it possible to use BK in sane ways that suit me. And I doubt you'll see any real productivity increase from me for a while ;)

Larry replied, "Hey, we're hacking away as well, we'll have some fixes for you ASAP. I'm just adding a few touches to Wayne's diff change you wanted."

10. Booting Multiple OSes From Linux

30 Jan 2002 - 4 Feb 2002 (52 posts) Archive Link: "[RFC] x86 ELF bootable kernels/Linux booting Linux/LinuxBIOS"

Topics: Executable File Format, Kexec, Networking, SMP

People: Eric W. BiedermanAndrew Morton

Eric W. Biederman announced:

I have been working on this quietly and now my pieces generally work, and I am down to dotting the i's, and crossing the t's. As my start of pushing for inclusion I want to get some feedback (in case I have missed something), and to give some feedback so people understand what I am doing.

First the connections.

What a bootable ELF formatted kernel is.

My next step is to integrate all of my pieces and do some cleanup but I have functioning code for everything.

An x86 ELF bootable kernel: ftp://download.lnxi.com/pub/src/linux-kernel-patches/linux-2.4.17.elf_bootable.diff

A patch to boot an arbitrary static ELF executable. ftp://download.lnxi.com/pub/src/linux-kernel-patches/linux-2.4.17.eb-kexec-apic-lb-mtd2.diff

A kernel fix to do proper SMP shutdown so that you can kexec on a SMP kernel ftp://download.lnxi.com/pub/src/linux-kernel-patches/linux-2.4.17.eb-apic-lb-mtd2.diff

A kernel patch that implements some minimal LinuxBIOS support. ftp://download.lnxi.com/pub/src/linux-kernel-patches/linux-2.4.18-pre7.linuxbios.diff

A standalone executable to adapt an existing x86 bzImage to be an ELF bootable kernel. ftp://download.lnxi.com/pub/src/mkelfImage/mkelfImage-1.12.tar.gz

A first draft of a specification that goes into detail about how an ELF image is interpreted, and extensions that can be added so the bootloader name, the bootloader version, and similar interesting but functionally unnecessary information can be passed to the loaded image, and so the bootloader can find out similar kinds of information about the ELF executable it is loading. ftp://download.lnxi.com/pub/src/linuxbios/specifications/draft1_elf_boot_proposal.txt

For what it is worth I have gotten fairly positive feedback from both the Etherboot and the LinuxBIOS communities so far. And etherboot and memtest86 have both been modified already to be ELF bootable. And there is current work that gets a long ways with Plan9.

My kexec work is in direct competition with Two Kernel Monte, bootimage, and LOBOS. I have been using it in production for about a year, and I haven't encountered real problems. The biggest issue I have had is with the kernel not properly shutting down devices.

In the short term shutting down devices is trivially handled by unmounting filesystems, downing ethernet devices, and calling the reboot notifier chain. Long term I need to call the module_exit routines, but they need a little sorting out before I can use them during reboot. In particular, calling any module_exit routine that clears pm_power_off is a no-no.

Also, while it should work in most cases, any loaded ELF image that starts using BIOS/firmware services to drive the hardware is on its own. Putting all devices back into a state that matches the firmware's cached state is a poorly defined problem. However, normal firmware calls that ask for the memory size or IRQ routing information should work correctly.

More on etherboot can be found at: http://www.etherboot.org

More on LinuxBIOS can be found at: http://www.linuxbios.org
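What a bootloader has to do with an ELF bootable kernel follows from the ELF format itself: read the header, copy each PT_LOAD program segment to its physical address, and jump to the entry point. A minimal sketch of the parsing half, assuming a little-endian ELF32 image as on x86 (this is a generic ELF reader written for illustration, not Eric's actual code):

```python
import struct

PT_LOAD = 1  # loadable program segment


def elf32_load_plan(image):
    """Return (entry_point, segments) for a little-endian ELF32 image.

    Each segment is (physical_address, file_bytes, memory_size); a real
    loader would copy the bytes to that address, zero the remainder of
    memory_size, then jump to entry_point.
    """
    assert image[:4] == b"\x7fELF", "not an ELF image"
    e_entry, e_phoff = struct.unpack_from("<II", image, 24)
    e_phentsize, e_phnum = struct.unpack_from("<HH", image, 42)
    segments = []
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        p_type, p_offset, _vaddr, p_paddr, p_filesz, p_memsz = \
            struct.unpack_from("<IIIIII", image, off)
        if p_type == PT_LOAD:
            segments.append((p_paddr,
                             image[p_offset:p_offset + p_filesz],
                             p_memsz))
    return e_entry, segments
```

This is why the scheme is attractive to etherboot and LinuxBIOS: the loader needs nothing kernel-specific, just a few dozen lines of ELF header walking.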

Andrew Morton threw his hat up in the air and did a little jig, saying, "Great work, and thanks! I look forward to 2-second SMP reboots." A bunch of folks followed up with some technical suggestions and discussion. There was some idea that the design was flawed, and folks discussed alternatives.

11. LVM Rewrite

30 Jan 2002 - 5 Feb 2002 (11 posts) Archive Link: "[ANNOUNCE] LVM reimplementation ready for beta testing"

Topics: Disk Arrays: EVMS, Disk Arrays: LVM

People: Joe ThornberHeinz MauelshagenJeff GarzikAndreas DilgerDaniel PhillipsPatrick Caulfield

Joe Thornber announced that he, Patrick Caulfield, Alasdair Kergon, and Heinz Mauelshagen, of Sistina, had rewritten LVM. He said:

the LVM2 software is ready for beta testing.

This is a complete reimplementation of the existing LVM system, both driver and userland tools.

We encourage you to give it a try and feed back your test results, bug-fixes, enhancement requests etc. through the normal lists linux-lvm@sistina.com and lvm-devel@sistina.com.

The new kernel driver (known as "device-mapper") supports volume management in general and is no longer Linux LVM specific. As such it is a separate package from LVM2 which you will need to download and install before building LVM2.

ftp://ftp.sistina.com/pub/LVM2/device-mapper/device-mapper-beta1.tgz

The userland tools are available from here:

ftp://ftp.sistina.com/pub/LVM2/tools/LVM2.0-beta1.tgz

This release does not support snapshots or pvmove. These features will go into a subsequent beta release, hopefully within the next fortnight.

This is Beta software which is *not* meant to be running on your production systems. If necessary, keep backups of your data and LVM metadata (/etc/lvmconf/*).

Daniel Phillips was very happy about the new code, but Andreas Dilger wanted to know what the license of the new code would be. Joe replied, "LVM2 is GPL/LGPL-licensed - just like the original version of LVM. This means the whole Linux community benefits from this aspect of Sistina's work. The device-mapper and LVM2 packages will *always* be GPL/LGPL." He encouraged anyone who wanted to, to submit patches and other contributions.

Close by, however, Heinz added, "OTOH we need to survive as a company and therefore will implement commercial enhancements which will BTW enable us to do support and further development of the above free software." Jeff Garzik also asked for a clarification of "GPL/LGPL". He asked, "Are certain parts GPL and other parts LGPL? If so, which parts?" Heinz replied:

The LVM2 software no longer uses a particular driver which is just usable for its own purpose. It rather accesses a different, so-called 'device-mapper' driver, which implements a generic volume management service for the Linux kernel by supporting arbitrary mappings of address ranges to underlying block devices. Because this is a generic service rather than an application within the kernel, it is open to being used by multiple LVM implementations (e.g. EVMS could be ported to use it :-)

The device-mapper driver is under the GPL and our Beta1 release dated Wednesday, which included the LVM2 tools as well, supports 2.4 kernels. We are aiming to get it integrated into the stock kernel and are implementing the necessary changes (bio interface) for 2.5 now. We released a device-mapper library (implements a generic API for the device-mapper) which is under the LGPL with it.

The LVM2 tools have a library with routines to, for example, access the device-mapper library, deal with LVM metadata (VGDA), support different metadata formats, and offer configuration file support; it is under the LGPL as well. The tools themselves (vgcreate, lvcreate, ...) are under the GPL.

To sum up, he said, the LVM2 tools and device-mapper driver were GPL, while the LVM2 library and device-mapper library were LGPL.
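The generic service Heinz describes, mapping arbitrary address ranges onto underlying block devices, boils down to a per-sector table lookup, much like device-mapper's "linear" target. A toy model (the DeviceMapper class here is invented for illustration and is not the real kernel API):

```python
import bisect


class DeviceMapper:
    """Toy model of a device-mapper style table of linear mappings."""

    def __init__(self, table):
        # table: list of (start_sector, length, device, dev_offset),
        # assumed non-overlapping; sorted here by start sector.
        self.table = sorted(table)
        self.starts = [row[0] for row in self.table]

    def map_sector(self, sector):
        """Translate a logical sector to (device, device_sector)."""
        i = bisect.bisect_right(self.starts, sector) - 1
        start, length, device, dev_offset = self.table[i]
        assert start <= sector < start + length, "sector not mapped"
        return device, dev_offset + (sector - start)


# A logical volume spanning two disks: the first 1000 sectors on sda1,
# the next 500 on sdb2 starting at sector 200.
dm = DeviceMapper([(0, 1000, "sda1", 0), (1000, 500, "sdb2", 200)])
print(dm.map_sector(1200))   # lands on sdb2
```

Because nothing in the table is LVM-specific, any volume manager that can express its layout as such ranges (LVM2, or a ported EVMS) can sit on top of the same driver.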

Sharon And Joy
 

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.